I am
Wang Hongyang

Wechat Official Account
Find fun things here!

Machine Learning 03: Gaussian Naïve Bayes and Logistic Regression

Course Notes of Professor Tom Mitchell Machine Learning @ CMU

Naïve Bayes with Continuous X

In order to train a Naive Bayes classifier with continuous X, we must estimate the mean and standard deviation of each of these Gaussians:

We must also estimate the priors on Y as well.

Again, we can use either maximum likelihood estimates (MLE) or maximuma posteriori (MAP) estimates for these parameters. The maximum likelihood estimator for μ_ik is:

And the maximum likelihood estimator for σ_ik^2 is:

This maximum likelihood estimator is biased, so the minimum variance unbiased estimator (MVUE) is sometimes used instead. It is:

Gaussian Naive Bayes

considera GNB based on the following modeling assumptions:

  • Y is boolean, governed by a Bernoulli distribution, with parameter π = P(Y = 1)
  • X = ⟨X1 … Xn⟩, where each Xi is a continuous random variable
  • For each Xi, P(Xi|Y = yk) is a Gaussian distribution of the form N(μik,σi)
  • For all i and j ≠ i, Xi and Xj are conditionally independent given Y

Derivation of GNB

Logistic Regression


  • Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
  • Why not learn P(Y|X) directly?

Consider learning f: X -> Y, where:

  • X is a vector of real-valued features, < X1 … Xn >
  • Y is boolean
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(µik,σi)
  • model P(Y) as Bernoulli (π)

What does that imply about the form of P(Y|X)?

Logistic regression more generally: Softmax

Logistic regression:


Training Logistic Regression: MCLE

For training data D = {, … }

With the MCLE, Maximum Conditional Likelihood Estimate and the conditional probability of Y given X, W, we can get the goal function we want to minimize:

So Gradient Descent is acquired to get the global optimization:

Calculate partial derivative for each w_i, and iterativly optimaze w_i.

The P in the function is the logitic regression of Y. Use the equations above to calculate. This equation is also intuitively making sense: w is optimized to the direction of minimizing the distance between the observed value Y and the expected value P(Y|X, W)

MAP for logistic regression

  • For MAP, need to define prior on W
    • given W = <w1 ... wn> let's assume prior P(w_i) = N(0, σ)
    • i.e., assume zero mean, Gaussian prior for each w_i
  • A kind of Occam's razor (simplest is best) prior
  • Helps avoid very large weights and overfitting

The process of getting MAP is very much like what we've done before.

Note that comparing with MCLE, the only difference is we add a penalty item at the end. Sometimes it is called regularization item, which can help to avoid overfitting because it reduce the speed of gradient descent.

MLE vs MAP in logistic regression

Generative vs. Discriminative Classifiers

Or say, when to use GNB, and when to use Logistic Regression?

Training classifiers involves estimating f: X -> Y, or P(Y|X)

  • Generative classifiers (e.g., Naïve Bayes)
    • Assume some functional form for P(X|Y), P(X)
    • Estimate parameters of P(X|Y), P(X) directly from training data
    • Use Bayes rule to calculate P(Y|X= xi)
    • Number of parameters: 4n + 1
    • Convergence rates faster, because parameters are uncoupled. n = O(log d)
  • Discriminative classifiers (e.g., Logistic regression)
    • Assume some functional form for P(Y|X)
    • Estimate parameters of P(Y|X) directly from training data
    • Number of parameters: n + 1
    • Convergence rates slower, because parameters are coupled. n = O(d)


  1. Ng & Jordan, NB vs. LR, 2002
  2. Naive Bayes and Logistic Regression, Mitchell