Machine Learning 02: MLE, MAP, and Naïve Bayes
 Probabilistic learning
 Bayes Rule
 Joint Probability Distribution
 MLE and MAP
 MLE: Maximum Likelihood Estimation
 MAP: Maximum a Posteriori Probability Estimation
 Classifiers based on probability
 Naive Bayes
 K-Fold Cross Validation
 Another way to view Naive Bayes
 Naïve Bayes: classifying text documents
 Resources
Course notes from Professor Tom Mitchell's Machine Learning course @ CMU
Probabilistic learning
Probabilistic function approximation: learn the conditional distribution P(Y | X) rather than a deterministic function f: X → Y
Definition of conditional probability: P(A | B) = P(A ∧ B) / P(B)
The chain rule: P(A ∧ B) = P(A | B) P(B)
Bayes Rule
we call P(A) the “prior” and P(A | B) the “posterior”
Other forms of Bayes Rule, expanding P(B) by the law of total probability:
P(A | B) = P(B | A) P(A) / (P(B | A) P(A) + P(B | ~A) P(~A))
E.g. of using Bayes Rule:
A = you have the flu, B = you just coughed
Assume:
P(A) = 0.05, P(B | A) = 0.80, P(B | ~A) = 0.20
Then P(A | B) = P(B | A) P(A) / P(B) = P(B | A) P(A) / (P(B | A) P(A) + P(B | ~A) P(~A)) = 0.8 × 0.05 / (0.8 × 0.05 + 0.2 × 0.95) ≈ 0.174
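As a quick check of the arithmetic above, a minimal Python sketch (the function name `posterior` is just an illustrative choice):

```python
# Bayes rule, with P(B) expanded by the law of total probability.
def posterior(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|~A)P(~A))."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# P(A) = 0.05, P(B|A) = 0.80, P(B|~A) = 0.20 from the flu/cough example.
p_flu_given_cough = posterior(0.05, 0.80, 0.20)
print(round(p_flu_given_cough, 3))  # 0.174
```

Note that despite the cough (an 80% symptom of flu), the posterior probability of flu is only about 17%, because the prior P(A) = 0.05 is so low.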
Joint Probability Distribution
The awesome Joint Probability Distribution: P(X1,X2,...,Xn)
from which we can calculate P(X1 | X2, ..., Xn)
and every other probability we desire over subsets of X1…Xn
 Make a table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
 For each combination of values, say how probable it is.
 If you subscribe to the axioms of probability, those numbers must sum to 1.
Once you have the Joint Distribution, you can ask for the probability of any logical expression involving these variables. E.g., to find P(A | B), read P(A ∧ B) and P(B) off the JD table and apply the conditional probability rule: P(A | B) = P(A ∧ B) / P(B).
But the main problem: learning P(Y | X) can require more data than we have. E.g., a JD over 100 Boolean attributes has 2^100 rows in its table.
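To make the table idea concrete, here is a sketch with an invented joint distribution over three Boolean variables (all the probabilities are made up for illustration):

```python
# Hypothetical joint distribution over three Boolean variables A, B, C
# (2^3 = 8 rows); the numbers are invented and sum to 1.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.30,
}

def prob(pred):
    """P(event) = sum of the joint's rows where the event holds."""
    return sum(p for row, p in joint.items() if pred(row))

# P(A | B) = P(A ^ B) / P(B), read straight off the table.
p_a_and_b = prob(lambda r: r[0] == 1 and r[1] == 1)  # 0.55
p_b = prob(lambda r: r[1] == 1)                      # 0.70
print(p_a_and_b / p_b)                               # ≈ 0.786
```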

Be smart about how we estimate probabilities from sparse data
 maximum likelihood estimates
 maximum a posteriori estimates

Be smart about how to represent joint distributions
 Bayes networks, graphical models, conditional independencies
MLE and MAP
E.g.
 Flip a coin X, and ask you to estimate the probability that it will turn up heads (X = 1) or tails (X = 0).
 You flip it repeatedly, observing
 It turns up heads α1 times
 It turns up tails α0 times
The resulting maximum likelihood estimate is θ̂ = α1 / (α1 + α0). But this estimate will be inaccurate if the training data is scarce, so in that case we need MAP.
When data are sparse, we might bring in prior assumptions to bias our estimate
 e.g., represent priors by “hallucinating” γ1 heads, and γ0 tails, to complement sparse observed α1, α0
MLE: Maximum Likelihood Estimation
θ̂_MLE = arg max_θ P(D | θ); for the coin, this gives θ̂_MLE = α1 / (α1 + α0).
MAP: Maximum a Posteriori Probability Estimation
θ̂_MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ); for the coin, with the prior below, this gives θ̂_MAP = (α1 + γ1) / (α1 + γ1 + α0 + γ0).
In the coin-flipping case, the data are generated by multiple i.i.d. draws of a Bernoulli random variable, so our prior over θ is the Beta distribution, with probability density function:
P(θ) = θ^(β1−1) (1−θ)^(β0−1) / B(β1, β0)
Here, β1 − 1 = γ1 and β0 − 1 = γ0, the “hallucinated” counts we mentioned before.
We say P(θ) is the conjugate prior for P(D | θ) if the posterior P(θ | D) has the same form as the prior P(θ).
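A minimal sketch contrasting the two estimators for the coin, with the prior represented by the hallucinated counts γ1, γ0 (the sample numbers below are invented):

```python
# MLE vs MAP estimates of the coin's P(heads) = θ.
# g1/g0 are the "hallucinated" prior counts (a Beta prior with
# beta1 = g1 + 1, beta0 = g0 + 1; the MAP estimate is the posterior mode).
def theta_mle(a1, a0):
    return a1 / (a1 + a0)

def theta_map(a1, a0, g1, g0):
    return (a1 + g1) / (a1 + g1 + a0 + g0)

# Scarce data: 3 heads, 0 tails.
print(theta_mle(3, 0))        # 1.0 (overconfident)
print(theta_map(3, 0, 5, 5))  # 8/13 ≈ 0.615, pulled toward the prior 0.5
```

As the number of real observations grows, the hallucinated counts are swamped and the MAP estimate converges to the MLE.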
Classifiers based on probability
E.g.
Get probability of wealth / poor by other data:
P(Y | X1, X2): the wealth probability can be calculated from the given 2^2 = 4 lines of data.
What if we have 100 X attributes instead of 2?
We would need 2^100 lines of data to calculate P(Y | X1, …, X100).
By Bayes Rule:
P(Y | X) = P(X | Y) P(Y) / P(X)
And Naive Bayes, which assumes:
P(X1, ..., Xn | Y) = ∏_i P(Xi | Y)
i.e., that Xi and Xj are conditionally independent given Y, for all i≠j
Conditional Independence
Naive Bayes uses assumption that the Xi are conditionally independent, given Y. E.g.,
P(X1 | X2, Y) = P(X1 | Y)
Given this assumption, then:
P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y)
= P(X1 | Y) P(X2 | Y)
in general: P(X1, ..., Xn | Y) = ∏_i P(Xi | Y)
Note that we reduce the number of parameters from 2^n (where n is the number of attributes Xi) to 2 × n: assuming Y is true/false, we need one parameter P(Xi = 1 | Y = y) per attribute and per value of Y. (We do not need 2 × 2 × n parameters, one per value of Xi as well, because in the binomial case once P(Xi = 1 | Y = y) has been estimated, P(Xi = 0 | Y = y) = 1 − P(Xi = 1 | Y = y).) That is big progress.
Naive Bayes
The classification rule is: Y_new ← arg max_{y_k} P(Y = y_k) ∏_i P(X_i = x_i | Y = y_k), in which y_k ranges over the possible values of the class Y, x_i is the observed value of attribute X_i, and i runs over the X attributes.
Here is an example in course:
Note that 0.026 and 0.091 are unnormalized "scores" under the assumptions S = 0 and S = 1 respectively. The actual posterior probabilities are 0.026 / (0.026 + 0.091) ≈ 0.22 and 0.091 / (0.026 + 0.091) ≈ 0.78.
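The normalization step can be sketched as:

```python
# Normalizing the two naive Bayes "scores" from the course example:
# score(s) = P(S = s) * prod_i P(x_i | S = s), unnormalized.
scores = {0: 0.026, 1: 0.091}
total = sum(scores.values())
posteriors = {s: v / total for s, v in scores.items()}
print(round(posteriors[0], 2), round(posteriors[1], 2))  # 0.22 0.78
```

Dividing by the sum of the scores is exactly dividing by P(X), the denominator of Bayes rule, which the arg max alone never needs to compute.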
K-Fold Cross Validation
Idea: train multiple times, leaving out a disjoint subset of data each time for test. Average the test set accuracies.
Partition data into K disjoint subsets
For k=1 to K
testData = kth subset
h ← classifier trained on all data except for testData
accuracy(k) = accuracy of h on testData
end
FinalAccuracyEstimate = mean of the K recorded testset accuracies
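The pseudocode above can be sketched in Python; `train` and `accuracy` are hypothetical stand-ins for whatever learner and metric you plug in:

```python
import random

def k_fold_accuracy(data, k, train, accuracy):
    """Average test accuracy over K disjoint held-out folds."""
    data = list(data)
    random.Random(0).shuffle(data)           # fixed seed for reproducibility
    folds = [data[i::k] for i in range(k)]   # K disjoint subsets
    accs = []
    for i in range(k):
        test_data = folds[i]
        train_data = [x for j, f in enumerate(folds) if j != i for x in f]
        h = train(train_data)                # train on everything but fold i
        accs.append(accuracy(h, test_data))  # test on the held-out fold
    return sum(accs) / k

# Toy usage: a majority-label "classifier" on invented labeled points (x, y).
data = [(i, i % 2) for i in range(20)]
train = lambda d: max(set(y for _, y in d), key=[y for _, y in d].count)
accuracy = lambda h, d: sum(y == h for _, y in d) / len(d)
print(k_fold_accuracy(data, 5, train, accuracy))
```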
Another way to view Naive Bayes
Another way to view Naive Bayes (for Boolean Y and Xi) is via the log-odds:
log [ P(Y = 1 | X) / P(Y = 0 | X) ] = log [ P(Y = 1) / P(Y = 0) ] + ∑_i log [ P(Xi | Y = 1) / P(Xi | Y = 0) ]
Note that this is a linear function of the inputs, which means the naive Bayes classifier is a linear classifier.
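A small sketch (with invented parameters) confirming that the naive Bayes log-odds reduces to a linear function w0 + ∑_i w_i x_i of Boolean features:

```python
import math

# Invented naive Bayes parameters for three Boolean features:
# p1[i] = P(X_i = 1 | Y = 1), p0[i] = P(X_i = 1 | Y = 0), prior = P(Y = 1).
prior, p1, p0 = 0.3, [0.8, 0.6, 0.9], [0.2, 0.5, 0.4]

def log_odds_direct(x):
    """log P(Y=1|X) / P(Y=0|X), computed straight from the NB model."""
    num = math.log(prior) + sum(math.log(p if xi else 1 - p) for xi, p in zip(x, p1))
    den = math.log(1 - prior) + sum(math.log(p if xi else 1 - p) for xi, p in zip(x, p0))
    return num - den

# The same quantity rewritten as w0 + sum_i w_i * x_i.
w0 = math.log(prior / (1 - prior)) + sum(math.log((1 - a) / (1 - b)) for a, b in zip(p1, p0))
w = [math.log(a / b) - math.log((1 - a) / (1 - b)) for a, b in zip(p1, p0)]

x = [1, 0, 1]
linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
print(abs(log_odds_direct(x) - linear) < 1e-9)  # True: the two forms agree
```

The weights w_i depend only on the per-feature conditionals, which is why the decision boundary is a hyperplane.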
Naïve Bayes: classifying text documents

Classify which emails are spam?

Classify which emails promise an attachment?
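A toy bag-of-words spam classifier along these lines, with an invented six-document "corpus" and add-1 (Laplace) smoothing standing in for the hallucinated counts:

```python
import math
from collections import Counter

# Invented toy corpus for illustration only.
spam = ["win money now", "free money offer", "win free prize"]
ham = ["meeting notes attached", "see notes from the meeting", "lunch today"]

vocab = set(w for d in spam + ham for w in d.split())
c_spam = Counter(w for d in spam for w in d.split())
c_ham = Counter(w for d in ham for w in d.split())
n_spam, n_ham = sum(c_spam.values()), sum(c_ham.values())

def log_score(doc, counts, total, prior):
    # Add-1 (Laplace) smoothing: one hallucinated count per vocabulary word.
    s = math.log(prior)
    for w in doc.split():
        s += math.log((counts[w] + 1) / (total + len(vocab)))
    return s

def classify(doc):
    p_spam = len(spam) / (len(spam) + len(ham))
    s = log_score(doc, c_spam, n_spam, p_spam)
    h = log_score(doc, c_ham, n_ham, 1 - p_spam)
    return "spam" if s > h else "ham"

print(classify("free money"))    # "spam"
print(classify("meeting notes")) # "ham"
```

Working in log space avoids underflow: a real email multiplies thousands of small per-word probabilities together.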
Resources
 The given reading material has the detailed equations: PDF
 CMU, Bag of Words Classification