I am
Wang Hongyang

Wechat Official Account
Find fun things here!

Machine Learning 02: MLE, MAP, and Naïve Bayes

Course Notes of Professor Tom Mitchell Machine Learning @ CMU

Probablistic learning

Probablistic function approximation:

instead of F: X -> Y, learn P(Y | X)

Definition of conditional probability:

P(A | B) = P(A ^ B) / P(B)

The chain rule

P(A ^ B) = P(A | B) P(B)

Bayes Rule

P(B | A) = P(A | B) P(B) / P(A)

we call P(A) the “prior” and P(A|B) the “posterior”

Other forms of Bayes Ruls

E.g. of using Bayes:

A = you have a flu, B = you just coughed


P(A) = 0.05, P(B|A) = 0.80, P(B|~A) = 0.20

what is P(flu | cough) = P(A|B) ?

P(A|B) = P(B|A)P(A) / P(B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|~A)P(~A)) = 0.8 x 0.05 / (0.8 x 0.05 + 0.2 x 0.95)

Joint Probability Distribution

The awesome Joint Probability Distribution: P(X1,X2,...,Xn)

from which we can calculate P(X1|X2...Xn) and every other probability we desire over subsets of X1…Xn

Recipe for making a joint distribution of M variables:

  1. Make a table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2M rows).
  2. For each combination of values, say how probable it is.
  3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Once you have the Joint Distribution, you can ask for the probability of any logical expression involving these variables. E.g. To know P(A|B), we can get the data of P(A^B), P(B) from JD data table and calculate P(A|B) by conditional probability rule.

Sounds like the solution to learn: F: X->Y, or P(Y|X)?

But the main problem: learning P(Y|X) can require more data than we have. E.g. JD with 100 attributes, then there will be 2^100 # of rows in the table.

What to do?

  1. Be smart about how we estimate probabilities from sparse data

    • maximum likelihood estimates
    • maximum a posteriori estimates
  2. Be smart about how to represent joint distributions

    • Bayes networks, graphical models, conditional independencies



  • Flip a coin X, and ask you to estimate the probability that it will turn up heads (X = 1) or tails (X = 0).
  • You flip it repeatedly, observing
    • It turns up heads α1 times
    • It turns up tails α0 times

But this will be not accurate if the training data is scarce. So we need MAP in this case.

When data sparse, might bring in prior assumptions to bias our estimate

  • e.g., represent priors by “hallucinating” γ1 heads, and γ0 tails, to complement sparse observed α1, α0

MLE - Maximum Likelihood Estimation

MAP - Maximum a Posteriori Probability Estimation

In the case of flipping coin, the data is generated by mutiple i.i.d. draws of a Bernoulli random variable. So our prior assumption of the distribution is the Beta Distribution give the probability distribution function as following:

Here, β1 - 1 = γ1 as we mentioned before.

We say P(θ) is the conjugate prior for P(D|θ) if P(D|θ) has the same form as P(θ)

Classifiers based on probability


Get probability of wealth / poor by other data:

P(Y|X1, X2), the wealth probability can be calculated by given 4 lines of data.

What if we have 100 X parameters instead of 2?

We will need 2^n lines of data to calculate P(Y|X1…X100)

How can we avoid such large of data?

By Bayes Rule:


And Naive Bayes, which assumes:

P(X1...Xn|Y) = ∏_i P(Xi|Y)
i.e., that Xi and Xj are conditionally independent given Y, for all i≠j

Conditional Independence

Naive Bayes uses assumption that the Xi are conditionally independent, given Y. E.g.,

P(X1|X2,Y) = P(X1|Y)

Given this assumption, then:

P(X1,X2|Y) = P(X1|X2,Y)P(X2|Y)
           = P(X1|Y)P(X2|Y)
in general: P(X1...Xn|Y) = ∏_i P(Xi|Y)

Note that, we reduce the # of parameters from 2^n (n is the parameter # of attributes X) to 2 x n (assume Y is true/false, we do not do this with 2 x 2^n because if y1 has been calculated, the y2 = 1 - y1 in the case of binomial). It is a big progress.

Naive Bayes

In which, y_k is the possible value of Y classes, x_j is the possible value of X attribute, and i is the number of X attributes.

Here is an example in course:

Note that 0.026 and 0.091 is the "score" given assumption that S = 0 / 1. The probability should be calculated by 0.026/(0.026+0.091) and 0.091/(0.026+0.091), which are 0.22 and 0.78 correspondingly.

K-Fold Cross Validation

Idea: train multiple times, leaving out a disjoint subset of data each time for test. Average the test set accuracies.

Partition data into K disjoint subsets
For k=1 to K
    testData = kth subset
    h <- classifier trained on all data except for testData
    accuracy(k) = accuracy of h on testData
FinalAccuracyEstimate = mean of the K recorded testset accuracies

Another way to view Naive Bayes

Another way to view Naive Bayes:

Note that the function above is a linear function, which means the naive bayes classifier is a linear classifier.

Naïve Bayes: classifying text documents

  • Classify which emails are spam?

  • Classify which emails promise an attachment?

How shall we represent text documents for Naïve Bayes?

Bag of Words Classification


  1. The given reading material has the detailed equations: PDF
  2. CMU, Bag of Words Classification