Lecture 12 - 188 200 discrete mathematics and linear algebra
Bayesian spam filtering for multiple keywords
Definition: If A and B are events in a sample space S, then A and B are independent iff P(A ∩ B) = P(A) · P(B).
What we are saying is that the outcome of A doesn’t depend in any way on the outcome of B, and conversely.
Example 1: Suppose a coin is tossed twice.
The first toss could turn up H or T, and would not depend on the outcome of the second toss.
The second toss could also turn up H or T, and would not depend on the outcome of the first toss.
Does knowing that the coin comes up tails on the first toss help you predict the second toss?
S = {HH, HT, TH, TT}
Let A = “Coin was tail on the first toss” = {TH, TT}
Let B = “Coin was tail on the second toss” = {HT, TT}
P(B) = 1/2
P(B | A) = P(A ∩ B) / P(A) = (1/4) / (1/2) = 1/2 = P(B)
Conclusion: Knowing the outcome of the first toss does not help you guess the outcome of the second toss.
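The check in Example 1 can be reproduced by enumerating the sample space. This is a minimal sketch, assuming all four outcomes are equally likely; the helper names `prob` and `is_independent` are illustrative and not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: HH, HT, TH, TT.
S = {"".join(t) for t in product("HT", repeat=2)}

def prob(event, space):
    """Probability of an event when all outcomes in `space` are equally likely."""
    return Fraction(len(event & space), len(space))

def is_independent(a, b, space):
    """Check the definition P(A ∩ B) = P(A) · P(B)."""
    return prob(a & b, space) == prob(a, space) * prob(b, space)

A = {s for s in S if s[0] == "T"}  # tail on the first toss
B = {s for s in S if s[1] == "T"}  # tail on the second toss

# Conditional probability P(B | A) = P(A ∩ B) / P(A).
p_b_given_a = prob(A & B, S) / prob(A, S)
print(prob(B, S), p_b_given_a, is_independent(A, B, S))
```

Both P(B) and P(B | A) come out as 1/2, which is another way of saying that conditioning on A tells us nothing about B.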
It would be natural to think that mutually disjoint events would be independent; in fact, almost the opposite is true: disjoint events with non-zero probabilities are dependent.
Example 2: Let A and B be events on S, and suppose A ∩ B = ∅, P(A) ≠ 0 and P(B) ≠ 0. Show that P(A ∩ B) ≠ P(A) · P(B).
Because A ∩ B = ∅, P(A ∩ B) = 0.
But P(A) · P(B) ≠ 0 because P(A) ≠ 0 and P(B) ≠ 0.
Hence P(A ∩ B) ≠ P(A) · P(B), so A and B are dependent.
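A concrete instance may help. The sketch below uses one roll of a fair six-sided die with two disjoint events; the die and the particular sets A and B are illustrative choices, not taken from the lecture.

```python
from fractions import Fraction

S = set(range(1, 7))  # one roll of a fair six-sided die (illustrative choice)
A = {1, 2}            # hypothetical event with non-zero probability
B = {3, 4}            # disjoint from A, also with non-zero probability

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

p_joint = prob(A & B)          # A ∩ B is empty, so this is 0
p_product = prob(A) * prob(B)  # (1/3) · (1/3) = 1/9
print(p_joint, p_product, p_joint == p_product)
```

Knowing that A occurred rules out B entirely, which is as strong a dependence as two events can have.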
Example 3: Suppose that A is the event that a randomly generated bit string of length four begins with a 1, and B is the event that this bit string contains an even number of 1s. Are A and B independent if all 4-bit strings are equally likely to occur?
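A = {1111, 1110, 1101, 1011, 1100, 1010, 1001, 1000}, so P(A) = 8/16 = 1/2
B = {0000, 0011, 0101, 0110, 1001, 1010, 1100, 1111}, so P(B) = 8/16 = 1/2
A ∩ B = {1111, 1100, 1010, 1001}, so P(A ∩ B) = 4/16 = 1/4
Since P(A ∩ B) = 1/4 = (1/2) · (1/2) = P(A) · P(B), A and B are independent events.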
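The sets in Example 3 are small enough to generate by brute force. The following sketch, assuming all 16 bit strings are equally likely, rebuilds A, B and A ∩ B and compares the two sides of the independence condition; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 16 bit strings of length four.
strings = ["".join(bits) for bits in product("01", repeat=4)]

A = {s for s in strings if s[0] == "1"}            # begins with a 1
B = {s for s in strings if s.count("1") % 2 == 0}  # even number of 1s

p_a = Fraction(len(A), len(strings))
p_b = Fraction(len(B), len(strings))
p_ab = Fraction(len(A & B), len(strings))

print(sorted(A & B), p_ab, p_a * p_b)
```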
Example 4: Assume that each of the four ways that a family can have two children is equally likely. Are the events E, that a family with two children has two boys, and F, that a family with two children has at least one boy, independent?
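E = {BB}, so P(E) = 1/4
F = {BB, BG, GB}, so P(F) = 3/4
E ∩ F = {BB}, so P(E ∩ F) = 1/4, while P(E) · P(F) = (1/4) · (3/4) = 3/16
Since 1/4 ≠ 3/16, E and F are not independent.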
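Dependence can also be seen through conditional probability: if learning F changes the probability of E, the events are not independent. The sketch below, assuming the four birth orders are equally likely, compares P(E) with P(E | F); the names are illustrative.

```python
from fractions import Fraction

outcomes = {"BB", "BG", "GB", "GG"}    # birth orders, assumed equally likely
E = {"BB"}                             # two boys
F = {o for o in outcomes if "B" in o}  # at least one boy

def prob(event):
    """Probability of an event over the four equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

# Conditional probability P(E | F) = P(E ∩ F) / P(F).
p_e_given_f = prob(E & F) / prob(F)
print(prob(E), p_e_given_f)
```

Here P(E) = 1/4 but P(E | F) = 1/3, so knowing that the family has at least one boy raises the probability of two boys.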
Product Rule to Determine Probability of Combinations of Events
If events are independent, we can use the product rule to determine the probabilities of combinations of events.
Example 5: What is the probability of flipping heads 4 times in a row using a fair coin?
P(H) = 1/2, so P(HHHH) = P(H) · P(H) · P(H) · P(H) = (1/2)^4 = 1/16, because the outcomes of the individual flips are independent.
Example 6: What is the probability of rolling the same number 3 times in a row using an unbiased 6-sided die?
First roll agrees with itself with probability 1.
Second roll agrees with the first with probability 1/6.
Third roll agrees with the first two with probability 1/6.
So the probability of rolling the same number 3 times is 1 · (1/6) · (1/6) = 1/36.
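The value 1/36 can be confirmed by counting. This sketch, assuming a fair die and independent rolls, enumerates all 216 outcomes of three rolls and compares the count with the product rule, where the first roll is free and only the second and third must match it.

```python
from fractions import Fraction
from itertools import product

# All 6^3 = 216 equally likely outcomes of three rolls of a fair die.
rolls = list(product(range(1, 7), repeat=3))

# Outcomes in which all three rolls show the same number.
all_same = [r for r in rolls if r[0] == r[1] == r[2]]

p_enumerated = Fraction(len(all_same), len(rolls))
# Product rule: first roll is unconstrained, the next two must match it.
p_product_rule = Fraction(1) * Fraction(1, 6) * Fraction(1, 6)
print(p_enumerated, p_product_rule)
```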
Problem: We want to create a spam filter using keywords.
Specifically, we want to develop a Bayesian filter for two keywords that tells us P[A | (B1 ∩ B2)], where
A = “an email is spam”
B1 = “an email contains the first questionable keyword”
B2 = “an email contains the second questionable keyword”
Assumptions:
1. Events B1 and B2 are independent.
2. The events B1|A and B2|A are independent.
3. P(A) = P(Ā) = 0.5, where Ā is the complement of A.
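By Bayes’ theorem:
P[A | (B1 ∩ B2)] = P[(B1 ∩ B2) | A] P(A) / ( P[(B1 ∩ B2) | A] P(A) + P[(B1 ∩ B2) | Ā] P(Ā) )
= P[(B1 ∩ B2) | A] / ( P[(B1 ∩ B2) | A] + P[(B1 ∩ B2) | Ā] ), because P(A) = P(Ā) = 0.5
= P(B1|A) P(B2|A) / ( P(B1|A) P(B2|A) + P(B1|Ā) P(B2|Ā) ), because B1 and B2, and B1|A and B2|A, are independent (the same is assumed for B1|Ā and B2|Ā).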
Example 7: Suppose that we train a Bayesian spam filter on a set of 2000 spam emails and 1000 emails that are not spam. The word “viagra” appears in 400 spam emails and 60 good emails, and the word “discount” appears in 200 spam emails and 25 good emails. Estimate the probability that a message containing the words “viagra” and “discount” is spam. Will we reject this message if our spam threshold is set at 0.9?
B1 = email contains the word “viagra”
B2 = message contains the word “discount”
P(B1|A) = 400/2000 = 0.2
P(B1|Ā) = 60/1000 = 0.06
P(B2|A) = 200/2000 = 0.1
P(B2|Ā) = 25/1000 = 0.025
P[A | (B1 ∩ B2)] = 0.2(0.1) / ( 0.2(0.1) + 0.06(0.025) ) = 0.02 / 0.0215 ≈ 0.9302
Conclusion: Since the probability that our email is spam given that it contains the strings “viagra” and “discount” is approximately 0.9302 > 0.9, we will flag this email as spam.
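The arithmetic of Example 7 can be checked with exact fractions. This is a minimal sketch using the training counts from the example and the lecture's assumptions (equal priors for spam and non-spam, keywords independent); the names `counts`, `conditional_estimates` and `THRESHOLD` are illustrative.

```python
from fractions import Fraction

# Training-set sizes and keyword counts taken from Example 7.
N_SPAM, N_GOOD = 2000, 1000
counts = {"viagra": (400, 60), "discount": (200, 25)}  # (in spam, in good)

def conditional_estimates(word):
    """Estimate P(word | spam) and P(word | not spam) from the counts."""
    in_spam, in_good = counts[word]
    return Fraction(in_spam, N_SPAM), Fraction(in_good, N_GOOD)

p1_spam, p1_good = conditional_estimates("viagra")
p2_spam, p2_good = conditional_estimates("discount")

# Two-keyword estimate, assuming equal priors and independent keywords.
numerator = p1_spam * p2_spam
p_spam = numerator / (numerator + p1_good * p2_good)

THRESHOLD = Fraction(9, 10)
print(p_spam, round(float(p_spam), 4), p_spam > THRESHOLD)
```

The exact value is 40/43, which rounds to 0.9302 and exceeds the 0.9 threshold.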
What about a formula for more than two keywords?
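One natural extension, assuming the same independence and equal-prior assumptions hold for every keyword B1, ..., Bk, is to multiply the per-keyword conditional probabilities: P[A | (B1 ∩ ... ∩ Bk)] = Π P(Bi|A) / ( Π P(Bi|A) + Π P(Bi|Ā) ). The sketch below implements that expression; the function name `spam_probability` and its argument names are hypothetical, not from the lecture.

```python
from math import prod

def spam_probability(p_given_spam, p_given_good):
    """Estimate P(spam | all keywords present) for k keywords.

    p_given_spam[i] is P(Bi | A) and p_given_good[i] is P(Bi | not A).
    Assumes equal priors and independent keywords, as in the
    two-keyword case.
    """
    if len(p_given_spam) != len(p_given_good):
        raise ValueError("both lists need one entry per keyword")
    numerator = prod(p_given_spam)
    denominator = numerator + prod(p_given_good)
    if denominator == 0:
        raise ValueError("keywords never observed together in training data")
    return numerator / denominator

# With k = 2 this reduces to the two-keyword formula (numbers from Example 7).
print(spam_probability([0.2, 0.1], [0.06, 0.025]))
```

In practice the products can become very small for large k, so implementations often sum logarithms instead; that refinement is outside what the lecture covers.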
Advantage: it can be trained on a per-user basis.
• A scientist who is researching Viagra won’t have emails containing the word “Viagra” flagged as spam, because “Viagra” will show up often in their good emails.
Disadvantages:
• It assumes that keywords are independent.
• It can’t filter images.