# Machine Learning 02: MLE, MAP, and Naïve Bayes

- Probabilistic learning
- Bayes Rule
- Joint Probability Distribution
- MLE and MAP
- MLE - Maximum Likelihood Estimation
- MAP - Maximum a Posteriori Probability Estimation
- Classifiers based on probability
- Naive Bayes
- K-Fold Cross Validation
- Another way to view Naive Bayes
- Naïve Bayes: classifying text documents
- Resources

Course notes for Professor Tom Mitchell's Machine Learning @ CMU

## Probabilistic learning

Probabilistic function approximation: instead of learning the target function F: X → Y directly, learn the conditional distribution `P(Y|X)`.

Definition of conditional probability: `P(A|B) = P(A,B) / P(B)`

The chain rule: `P(A,B) = P(A|B) P(B)`

## Bayes Rule

Bayes rule: `P(A|B) = P(B|A) P(A) / P(B)`

We call P(A) the “prior” and P(A|B) the “posterior”.

Other forms of **Bayes Rule**, e.g. expanding P(B) by the law of total probability:

`P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|~A)P(~A))`

E.g. of using Bayes:

A = you have a flu, B = you just coughed

Assume:

P(A) = 0.05, P(B|A) = 0.80, P(B|~A) = 0.20

P(A|B) = P(B|A)P(A) / P(B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|~A)P(~A)) = 0.8 × 0.05 / (0.8 × 0.05 + 0.2 × 0.95) ≈ 0.17
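The calculation above can be sketched in Python (a minimal check, not course code):

```python
# Computing P(flu | cough) with Bayes rule, using the numbers
# from the example above.
p_a = 0.05              # P(A): prior probability of flu
p_b_given_a = 0.80      # P(B|A): probability of coughing given flu
p_b_given_not_a = 0.20  # P(B|~A)

# Total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes rule: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # → 0.174
```

Note that even with a strong symptom (cough), the posterior stays low because the prior P(flu) is small.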

## Joint Probability Distribution

The awesome Joint Probability Distribution: `P(X1,X2,...,Xn)`

from which we can calculate `P(X1|X2...Xn)`

and every other probability we desire over subsets of `X1…Xn`

- Make a table listing all combinations of values of your variables (if there are M Boolean variables, the table will have 2^M rows).
- For each combination of values, say how probable it is.
- If you subscribe to the axioms of probability, those numbers must sum to 1.

**Once you have the Joint Distribution, you can ask for the probability of any logical expression involving these variables.** E.g. to compute P(A|B), read P(A∧B) and P(B) off the joint distribution table and apply the conditional probability rule: P(A|B) = P(A∧B) / P(B).
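A minimal sketch of querying a joint distribution table in Python; the table values below are made-up illustrative numbers, not from the notes:

```python
# Joint distribution over two Boolean variables A and B,
# stored as a table; entries sum to 1.
joint = {
    (True,  True):  0.10,   # P(A=1, B=1)
    (True,  False): 0.15,
    (False, True):  0.25,
    (False, False): 0.50,
}

# Marginal: P(B) = sum over rows where B is true
p_b = sum(p for (a, b), p in joint.items() if b)

# Conditional probability rule: P(A|B) = P(A, B) / P(B)
p_a_and_b = joint[(True, True)]
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 0.10 / 0.35
```

Any other query over subsets of the variables can be answered the same way, by summing the relevant rows.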

But the main problem: learning P(Y|X) can require more data than we have. E.g. a joint distribution over 100 Boolean attributes has 2^100 rows in its table.

- Be smart about how we estimate probabilities from sparse data
  - maximum likelihood estimates
  - maximum a posteriori estimates
- Be smart about how to represent joint distributions
  - Bayes networks, graphical models, conditional independencies

## MLE and MAP

E.g.

- Flip a coin X, and estimate the probability that it turns up heads (X = 1) or tails (X = 0).
- You flip it repeatedly, observing:
  - It turns up heads α1 times
  - It turns up tails α0 times

But this estimate will be inaccurate if the training data is scarce, so we need MAP in that case.

When data is sparse, we might bring in **prior assumptions** to bias our estimate:

- e.g., represent priors by “hallucinating” γ1 heads, and γ0 tails, to complement sparse observed α1, α0
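The two estimators can be sketched in Python; the observed and hallucinated counts below are illustrative assumptions:

```python
# alpha1/alpha0 are observed head/tail counts; gamma1/gamma0 are
# the "hallucinated" prior counts mentioned above.
alpha1, alpha0 = 3, 1   # sparse observed data: 3 heads, 1 tail
gamma1, gamma0 = 5, 5   # prior pseudo-counts biasing toward 0.5

# MLE: follows the observed counts exactly
theta_mle = alpha1 / (alpha1 + alpha0)

# MAP: observed counts plus hallucinated counts
theta_map = (alpha1 + gamma1) / (alpha1 + gamma1 + alpha0 + gamma0)

print(theta_mle)  # 0.75 -- follows the sparse data exactly
print(theta_map)  # ~0.57 -- pulled toward the prior of 0.5
```

As more real data arrives, the α counts dominate the fixed γ counts and the MAP estimate converges to the MLE.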

## MLE - Maximum Likelihood Estimation

Choose the θ that maximizes the probability of the observed data:

`θ_MLE = argmax_θ P(D|θ)`

For the coin, `P(D|θ) = θ^α1 (1 − θ)^α0`, and maximizing over θ gives:

`θ_MLE = α1 / (α1 + α0)`

## MAP - Maximum a Posteriori Probability Estimation

In the coin-flipping case, the data is generated by multiple i.i.d. draws of a **Bernoulli** random variable, so our prior assumption on θ is the **Beta distribution**, with probability density function:

`P(θ) = θ^(β1 − 1) (1 − θ)^(β0 − 1) / B(β1, β0)`

Here `β1 − 1 = γ1` and `β0 − 1 = γ0`, the hallucinated counts mentioned before. The MAP estimate is then:

`θ_MAP = (α1 + γ1) / (α1 + γ1 + α0 + γ0)`

We say P(θ) is the conjugate prior for the likelihood P(D|θ) if the posterior P(θ|D) has the same functional form as the prior P(θ).
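As a quick check (a standard result, not spelled out in the notes), multiplying the Bernoulli likelihood by the Beta prior gives a posterior of the same Beta form:

```latex
P(\theta \mid D) \;\propto\; P(D \mid \theta)\,P(\theta)
\;\propto\; \theta^{\alpha_1}(1-\theta)^{\alpha_0}\cdot\theta^{\beta_1-1}(1-\theta)^{\beta_0-1}
\;=\; \theta^{(\alpha_1+\beta_1)-1}(1-\theta)^{(\alpha_0+\beta_0)-1}
```

i.e. the posterior is `Beta(β1 + α1, β0 + α0)`, the same family as the prior, with the observed counts simply added to the prior's pseudo-counts.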

## Classifiers based on probability

E.g.

Get the probability that a person is wealthy or poor from other data:

P(Y|X1, X2), the wealth probability, can be estimated from a table with 2^2 = 4 rows of data (one row per combination of X1, X2).

*What if we have 100 X attributes instead of 2?*

We would need `2^100` rows of data to estimate P(Y|X1…X100) directly.

By Bayes Rule:

`P(Y|X)=P(X|Y)P(Y)/P(X)`

And Naive Bayes, which assumes:

```
P(X1...Xn|Y) = ∏_i P(Xi|Y)
i.e., that Xi and Xj are conditionally independent given Y, for all i≠j
```

### Conditional Independence

Naive Bayes uses the assumption that the Xi are conditionally independent given Y. E.g.,

`P(X1|X2,Y) = P(X1|Y)`

Given this assumption, then:

```
P(X1,X2|Y) = P(X1|X2,Y) P(X2|Y)
           = P(X1|Y) P(X2|Y)
in general: P(X1...Xn|Y) = ∏_i P(Xi|Y)
```

Note that we reduce the number of parameters from `2^n` (where n is the number of attributes Xi) to `2 × n` (assuming Y is true/false; *it is 2 × n rather than 2 × 2^n because each P(Xi|Y) is binomial: once P(Xi = 1|Y) is estimated, P(Xi = 0|Y) = 1 − P(Xi = 1|Y)*). This is a big improvement.

## Naive Bayes

The classification rule:

`Y_new ← argmax_{y_k} P(Y = y_k) ∏_j P(Xj = x_j | Y = y_k)`

in which `y_k` ranges over the possible values (classes) of Y, `x_j` is the observed value of attribute Xj, and j indexes the X attributes.

Here is an example in course:

Note that 0.026 and 0.091 are the unnormalized "scores" under the assumptions S = 0 and S = 1 respectively, not probabilities. The probabilities are obtained by normalizing: 0.026/(0.026 + 0.091) ≈ 0.22 and 0.091/(0.026 + 0.091) ≈ 0.78.
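The normalization step can be sketched in Python, reusing the two scores from the example:

```python
# Unnormalized Naive Bayes "scores" from the course example:
score_s0 = 0.026   # P(S=0) * prod_j P(Xj=xj | S=0)
score_s1 = 0.091   # P(S=1) * prod_j P(Xj=xj | S=1)

# Normalize so the two posteriors sum to 1
total = score_s0 + score_s1
p_s0 = score_s0 / total
p_s1 = score_s1 / total
print(round(p_s0, 2), round(p_s1, 2))  # → 0.22 0.78
```

For classification alone, normalization is unnecessary: the argmax over the raw scores already picks the same class.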

## K-Fold Cross Validation

Idea: train multiple times, leaving out a disjoint subset of data each time for test. Average the test set accuracies.

```
Partition data into K disjoint subsets
For k = 1 to K
    testData = kth subset
    h <- classifier trained on all data except testData
    accuracy(k) = accuracy of h on testData
end
FinalAccuracyEstimate = mean of the K recorded test-set accuracies
```
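The pseudocode above can be sketched as runnable Python; `train` and `accuracy_of` are hypothetical stand-ins for a real learner and its evaluation function:

```python
def k_fold_accuracy(data, k, train, accuracy_of):
    # Partition data into K disjoint subsets (strided slices here)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test_data = folds[i]
        # Train on all data except the held-out fold
        train_data = [x for j, f in enumerate(folds) if j != i for x in f]
        h = train(train_data)
        accuracies.append(accuracy_of(h, test_data))
    # Final estimate: mean of the K recorded test-set accuracies
    return sum(accuracies) / k

# Toy usage with a majority-label "learner" (illustrative only):
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
train = lambda d: max(set(d), key=d.count)                 # predict majority label
accuracy_of = lambda h, d: sum(x == h for x in d) / len(d)
print(k_fold_accuracy(data, 5, train, accuracy_of))  # → 0.7
```

Each example is held out for testing exactly once, so the final number uses all the data without ever testing on training examples.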

## Another way to view Naive Bayes

Taking logs of the Naive Bayes decision rule (for Boolean Xi) yields a decision function of the form

`log [P(Y=1|X) / P(Y=0|X)] = w0 + Σ_i wi Xi`

for weights wi derived from the estimated probabilities. This is a linear function of the Xi, which means the **Naive Bayes classifier is a linear classifier**.

## Naïve Bayes: classifying text documents

- Classify which emails are spam?
- Classify which emails promise an attachment?
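A minimal bag-of-words Naive Bayes spam classifier as a hedged sketch; the tiny corpus and the add-one (Laplace) smoothing are illustrative assumptions, not from the notes:

```python
import math
from collections import Counter

# Made-up training corpus of (text, label) pairs
docs = [("win money now", "spam"),
        ("meeting schedule attached", "ham"),
        ("win a free prize", "spam"),
        ("project attached for review", "ham")]

labels = {y for _, y in docs}
doc_counts = Counter(y for _, y in docs)          # for the prior P(Y)
word_counts = {y: Counter() for y in labels}      # for P(word | Y)
for text, y in docs:
    word_counts[y].update(text.split())
vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    best_label, best_score = None, -math.inf
    for y in labels:
        total = sum(word_counts[y].values())
        # log P(Y=y) + sum over words of log P(w | Y=y), add-one smoothed
        score = math.log(doc_counts[y] / len(docs))
        for w in text.split():
            score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = y, score
    return best_label

print(predict("free money now"))  # → spam
```

Working in log space avoids underflow from multiplying many small word probabilities, and the add-one smoothing plays the same role as the hallucinated counts in MAP estimation.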

## Resources

- The given reading material has the detailed equations: PDF
- CMU, Bag of Words Classification