Machine Learning 02: MLE, MAP, and Naïve Bayes
- Probablistic learning
- Bayes Rule
- Joint Probability Distribution
- MLE and MAP
- MLE - Maximum Likelihood Estimation
- MAP - Maximum a Posteriori Probability Estimation
- Classifiers based on probability
- Naive Bayes
- K-Fold Cross Validation
- Another way to view Naive Bayes
- Naïve Bayes: classifying text documents
Course Notes of Professor Tom Mitchell Machine Learning @ CMU
Probablistic function approximation:
Definition of conditional probability:
The chain rule
we call P(A) the “prior” and P(A|B) the “posterior”
Other forms of Bayes Ruls
E.g. of using Bayes:
A = you have a flu, B = you just coughed
P(A) = 0.05, P(B|A) = 0.80, P(B|~A) = 0.20
P(A|B) = P(B|A)P(A) / P(B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|~A)P(~A)) = 0.8 x 0.05 / (0.8 x 0.05 + 0.2 x 0.95)
Joint Probability Distribution
The awesome Joint Probability Distribution:
from which we can calculate
P(X1|X2...Xn) and every other probability we desire over subsets of
- Make a table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2M rows).
- For each combination of values, say how probable it is.
- If you subscribe to the axioms of probability, those numbers must sum to 1.
Once you have the Joint Distribution, you can ask for the probability of any logical expression involving these variables. E.g. To know P(A|B), we can get the data of P(A^B), P(B) from JD data table and calculate P(A|B) by conditional probability rule.
But the main problem: learning P(Y|X) can require more data than we have. E.g. JD with 100 attributes, then there will be 2^100 # of rows in the table.
Be smart about how we estimate probabilities from sparse data
- maximum likelihood estimates
- maximum a posteriori estimates
Be smart about how to represent joint distributions
- Bayes networks, graphical models, conditional independencies
MLE and MAP
- Flip a coin X, and ask you to estimate the probability that it will turn up heads (X = 1) or tails (X = 0).
- You flip it repeatedly, observing
- It turns up heads α1 times
- It turns up tails α0 times
But this will be not accurate if the training data is scarce. So we need MAP in this case.
When data sparse, might bring in prior assumptions to bias our estimate
- e.g., represent priors by “hallucinating” γ1 heads, and γ0 tails, to complement sparse observed α1, α0
MLE - Maximum Likelihood Estimation
MAP - Maximum a Posteriori Probability Estimation
In the case of flipping coin, the data is generated by mutiple i.i.d. draws of a Bernoulli random variable. So our prior assumption of the distribution is the Beta Distribution give the probability distribution function as following:
β1 - 1 = γ1 as we mentioned before.
We say P(θ) is the conjugate prior for P(D|θ) if P(D|θ) has the same form as P(θ)
Classifiers based on probability
Get probability of wealth / poor by other data:
P(Y|X1, X2), the wealth probability can be calculated by given 4 lines of data.
What if we have 100 X parameters instead of 2?
We will need
2^n lines of data to calculate P(Y|X1…X100)
By Bayes Rule:
And Naive Bayes, which assumes:
P(X1...Xn|Y) = ∏_i P(Xi|Y) i.e., that Xi and Xj are conditionally independent given Y, for all i≠j
Naive Bayes uses assumption that the Xi are conditionally independent, given Y. E.g.,
P(X1|X2,Y) = P(X1|Y)
Given this assumption, then:
P(X1,X2|Y) = P(X1|X2,Y)P(X2|Y) = P(X1|Y)P(X2|Y) in general: P(X1...Xn|Y) = ∏_i P(Xi|Y)
Note that, we reduce the # of parameters from
2^n (n is the parameter # of attributes X) to
2 x n (assume Y is true/false, we do not do this with 2 x 2^n because if y1 has been calculated, the y2 = 1 - y1 in the case of binomial). It is a big progress.
y_k is the possible value of Y classes,
x_j is the possible value of X attribute, and i is the number of X attributes.
Here is an example in course:
Note that 0.026 and 0.091 is the "score" given assumption that S = 0 / 1. The probability should be calculated by 0.026/(0.026+0.091) and 0.091/(0.026+0.091), which are 0.22 and 0.78 correspondingly.
K-Fold Cross Validation
Idea: train multiple times, leaving out a disjoint subset of data each time for test. Average the test set accuracies.
Partition data into K disjoint subsets For k=1 to K testData = kth subset h <- classifier trained on all data except for testData accuracy(k) = accuracy of h on testData end FinalAccuracyEstimate = mean of the K recorded testset accuracies
Another way to view Naive Bayes
Another way to view Naive Bayes:
Note that the function above is a linear function, which means the naive bayes classifier is a linear classifier.
Naïve Bayes: classifying text documents
Classify which emails are spam?
Classify which emails promise an attachment?