# Machine Learning 01: Decision Tree, Entropy, and Overfitting

Course notes for Professor Tom Mitchell's Machine Learning course @ CMU

Machine Learning is the study of algorithms that:

- Improve their performance P
- At some task T
- With experience E

The well-defined learning task: `<P,T,E>`


## Function Approximation and Decision Tree Learning

[Wikipedia] A **decision tree** is a flowchart-like structure in which each internal **node** represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf **node** represents a class label (**decision** taken after computing all attributes).

- Set of possible instances `X`
- Unknown target function `f : X -> Y`
- Set of candidate hypotheses `H = { h | h : X -> Y }`
- Training examples `{<x^(i), y^(i)>}` of unknown target function `f`
- Hypothesis `h ∈ H` that best approximates target function `f`

Note that for decision trees:

- `Y` must take discrete values
- each instance `x` in `X` is a feature vector `x = <x1, x2 … xn>`
- each hypothesis `f` is a tree

*How to generate a decision tree?*

A top-down induction of decision trees given by [ID3, C4.5, Quinlan] is a greedy method:

Main loop:

- `A` <- the "best" decision attribute for the next `node`
- Assign `A` as the decision attribute for `node`
- For each value of `A`, create a new descendant of `node`
- Sort training examples to leaf nodes
- If training examples are perfectly classified, then STOP; else iterate over new leaf nodes
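The loop above can be sketched as a short Python program (a minimal illustration in the spirit of ID3, not Quinlan's full algorithm; the dict-based tree representation and the toy attribute names are choices made for this sketch):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum p*log2(p) over the label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, target="label"):
    """Expected reduction in entropy of the target from splitting on attr."""
    labels = [e[target] for e in examples]
    n = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes, target="label"):
    labels = [e[target] for e in examples]
    # STOP when examples are perfectly classified (or no attributes remain).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedily pick the "best" attribute by information gain.
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)
    return tree
```

On a toy dataset where one attribute perfectly predicts the label, the greedy loop picks that attribute at the root and stops immediately.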

The best-known way to choose the "best" attribute is to calculate **Entropy**.

Entropy measures *how non-uniform the data are*.

The formula to calculate Entropy is:

`H(X) = - Σ_{i=1..n} P(X = x_i) log2 P(X = x_i)`

Note that the `n` in the formula is the number of possible values of `X`.

Also, **H(X) is the expected number of bits needed to encode a randomly drawn value of X** (under the most efficient code).

The entropy of `X` conditioned on a specific value `Y = v`, and its expectation over the values of `Y`, are:

`H(X | Y = v) = - Σ_{i=1..n} P(X = x_i | Y = v) log2 P(X = x_i | Y = v)`

`H(X | Y) = Σ_v P(Y = v) H(X | Y = v)`

The third formula above is the expectation of the conditional entropy.

When applying **Entropy** and **Information Gain** to decide the attribute choice for constructing a decision tree, we should choose the attribute that most decreases the non-uniformity of the labels.

We define Information Gain here as the mutual information `I(A, Y)` between the node's attribute `A` and the target label variable `Y`.

Information gain is the expected reduction in entropy of the target label variable `Y` for data sample `S`, due to sorting on attribute variable `A`:

`Gain(S, A) = H_S(Y) - H_S(Y | A)`

For the left tree above, we can compute the entropy of the parent node `[29+,35-]` as well as of the children nodes `[21+,5-]` and `[8+,30-]`. Subtracting the weighted children's entropy from the parent's gives the information gain of choosing attribute `A1` at the parent node. In the same way, we can calculate the information gain of choosing `A2`, and greedily take the attribute with the larger information gain to construct the tree.

(In the lecture's humidity-vs-wind example, the answer is the left one: humidity.)
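The arithmetic for the `[29+,35-]` split can be checked directly (a small sketch; the counts come from the lecture figure, and the helper name is invented here):

```python
import math

def entropy2(pos, neg):
    """Binary entropy of a node with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

parent = entropy2(29, 35)                       # H of [29+,35-], close to 1 bit
left, right = entropy2(21, 5), entropy2(8, 30)  # children [21+,5-] and [8+,30-]
# Weighted children entropy, i.e. H(Y|A1) over the 64 examples
remainder = 26 / 64 * left + 38 / 64 * right
gain_a1 = parent - remainder                    # information gain of splitting on A1
```

Repeating this for `A2` and comparing the two gains is exactly the greedy attribute choice from the main loop.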

A common practice is to use Occam's razor: prefer the simplest hypothesis that fits the data.

An interesting idea mentioned in the Machine Learning book by [Zhihua Zhou] is that the simplest hypothesis may not be the best one; the one that matches the problem is. The different models produced by machine learning training correspond to different inductive biases, i.e., which kind of model the algorithm considers a better fit for the data. In practice, the model whose inductive bias fits the problem itself gives better results, so it is meaningless to say an algorithm is good without specifying the problem.

Another point in the book is also interesting: even Occam's razor itself is not easy to apply. There are situations where two different models have exactly the same number of parameters; then which one is better?

## Big Map

Assume each instance `x` in the input space `X` is a vector of `n` boolean features (i.e., each feature has only two possible values), and the output value `Y` is also boolean. So *how many distinct instances `x` can we get?* And *what is the upper bound on the number of possible decision trees **(hypotheses)** that fit the data?*

**2^n, and 2^(2^n).**

This is the hypothesis space **H = {f: X->Y}** in the big map. It is **important** and needs to be remembered.
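These counts can be verified by brute force for small `n` (a quick sanity check that enumerates every distinct boolean instance over `n` boolean features):

```python
from itertools import product

n = 3
# All distinct instances x = <x1, ..., xn> with boolean features.
inputs = list(product([0, 1], repeat=n))
num_instances = len(inputs)           # 2^n
# A boolean function f: X -> {0, 1} is one label assignment per instance,
# so the number of distinct hypotheses is 2^(2^n).
num_hypotheses = 2 ** num_instances
```

For `n = 3` this gives 8 distinct instances and 256 distinct hypotheses, matching the closed forms above.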

For the big picture:

Evolution takes care of it. Tom's hypothesis about evolution building smart brains: imagine that evolution wants to build a very smart brain but is limited in brain size, so it cannot develop very complex sensors for us to use. Since "simple" is what it has to work with, combinations of simple sensors must be able to represent what complex sensors would, and then building a smart brain within limited space becomes possible. **It's kind of a self-fulfillment**.

## Overfitting in Decision Tree

Consider adding noisy training example #15:

**Overfitting.**

Consider a hypothesis `h` and its

- error rate over training data: `error_train(h)`
- true error rate over all data: `error_true(h)`

We say `h` overfits the training data if `error_true(h) > error_train(h)`.

Amount of overfitting = `error_true(h) - error_train(h)`
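A memorizing hypothesis makes this gap concrete (a toy simulation, not from the lecture: labels here are pure coin flips, so anything beyond memorization is fitting noise):

```python
import random

random.seed(0)
# Instances are integers; labels are random coin flips (pure noise).
train = {x: random.choice([0, 1]) for x in range(100)}

def h(x):
    """Memorize the training set; guess 0 on anything unseen."""
    return train.get(x, 0)

error_train = sum(h(x) != y for x, y in train.items()) / len(train)
# Estimate the true error rate on fresh draws from the same noise distribution.
fresh = [(x, random.choice([0, 1])) for x in range(100, 10100)]
error_true = sum(h(x) != y for x, y in fresh) / len(fresh)
overfit = error_true - error_train   # a positive gap means h overfits
```

Memorization drives `error_train` to zero while `error_true` stays near 0.5, so the overfitting amount is large and positive.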

### Avoid Overfitting

- Stop growing when the data split is not statistically significant
- Grow the full tree, then prune it

### Reduced-Error Pruning

- Split data into a **training and validation set**
- Learn a tree that classifies the training set correctly
- Do until further pruning is harmful:
  - For each non-leaf node, evaluate the impact on the validation set of converting it to a leaf node
  - Greedily select the node whose conversion most improves validation set accuracy, and convert it to a leaf
- This produces the smallest version of the most accurate (over the validation set) subtree

The success of pruning further supports the idea that small trees generalize better than large trees.
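The pruning loop can be sketched as follows (a simplified illustration: trees are nested dicts `{attr: {value: subtree}}`, and instead of using the training-set majority label at each pruned node, this sketch just tries every label and keeps whichever most improves validation accuracy):

```python
import copy

def classify(tree, x):
    """Walk the nested-dict tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][x[attr]]
    return tree

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(tree, path=()):
    """Yield the path (attr, value, attr, value, ...) to each internal node."""
    if isinstance(tree, dict):
        yield path
        attr = next(iter(tree))
        for value, child in tree[attr].items():
            yield from internal_nodes(child, path + (attr, value))

def replace_node(tree, path, leaf):
    """Destructively turn the node at path into a leaf; return the root."""
    if not path:
        return leaf
    attr, value = path[0], path[1]
    tree[attr][value] = replace_node(tree[attr][value], path[2:], leaf)
    return tree

def prune(tree, val_data, labels):
    """Greedy reduced-error pruning driven by validation accuracy."""
    best_acc = accuracy(tree, val_data)
    while isinstance(tree, dict):
        best_cand = None
        for path in list(internal_nodes(tree)):
            for label in labels:
                cand = replace_node(copy.deepcopy(tree), path, label)
                acc = accuracy(cand, val_data)
                if acc > best_acc:   # keep pruning only while it is not harmful
                    best_acc, best_cand = acc, cand
        if best_cand is None:
            break
        tree = best_cand
    return tree
```

Each round converts the single node whose removal most improves validation accuracy, and stops as soon as no conversion helps, which is exactly the "until further pruning is harmful" condition above.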

## Decision Forests

- **Learn a collection of many trees**
- **Classify by taking a weighted vote of the trees**

Empirically successful. Widely used in industry.

- human pose recognition in Microsoft kinect
- medical imaging – cortical parcellation
- classify disease from gene expression data

Key ideas:

- Train on different random subsets of the data
- Randomize the choice of decision nodes

This methodology is **effective at overcoming overfitting.**
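Both key ideas can be sketched compactly (a toy illustration, not the Kinect-scale method: each "tree" is a one-level stump on a randomly chosen attribute, trained on a bootstrap sample, and the vote is unweighted; the dataset and names are invented):

```python
import random
from collections import Counter

def train_stump(examples, attrs, target="label"):
    """A one-level tree splitting on a randomly chosen attribute
    (this randomizes the choice of decision nodes)."""
    attr = random.choice(attrs)
    branches = {}
    for value in {e[attr] for e in examples}:
        labels = [e[target] for e in examples if e[attr] == value]
        branches[value] = Counter(labels).most_common(1)[0][0]
    default = Counter(e[target] for e in examples).most_common(1)[0][0]
    return attr, branches, default

def train_forest(examples, attrs, n_trees=25):
    """Each tree sees a different bootstrap sample of the data."""
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(examples) for _ in examples]
        forest.append(train_stump(sample, attrs))
    return forest

def vote(forest, x):
    """Classify by a majority vote of the trees."""
    preds = [branches.get(x[attr], default) for attr, branches, default in forest]
    return Counter(preds).most_common(1)[0][0]
```

Because each tree sees different data and different attributes, their individual errors tend to differ, and the vote averages them out.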

## You should know

- Well-posed function approximation problems:
  - Instance space, `X`
  - Sample of labeled training data `{<x^(i), y^(i)>}`
  - Hypothesis space, `H = { f: X->Y }`
- Learning is a search/optimization problem over `H`
  - Various objective functions to define the goal
    - minimize training error (0-1 loss)
    - minimize validation error (0-1 loss)
    - among hypotheses that minimize error, select the smallest (?)

- Decision tree learning
- Greedy top-down learning of decision trees (ID3, C4.5, ...)
- Overfitting and post-pruning
- Extensions… to continuous values, probabilistic classification
- Widely used commercially: decision forests

## Probabilistic Function Approximation

Instead of

`F: X -> Y`

Learn:

`P(Y|X)`
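A minimal way to get `P(Y|X)` from data is the conditional relative frequency (a sketch with an invented weather/play example; real estimators would smooth these counts):

```python
from collections import Counter, defaultdict

def estimate_conditional(pairs):
    """Estimate P(Y|X) as count(x, y) / count(x) from (x, y) observations."""
    joint = Counter(pairs)
    marginal = Counter(x for x, _ in pairs)
    p = defaultdict(dict)
    for (x, y), c in joint.items():
        p[x][y] = c / marginal[x]
    return p

data = [("sunny", "play"), ("sunny", "play"), ("sunny", "stay"),
        ("rain", "stay"), ("rain", "stay")]
p = estimate_conditional(data)
```

Unlike a deterministic `F: X -> Y`, the learned object assigns a distribution over `Y` to each `x`, e.g. both `play` and `stay` get nonzero probability given `sunny`.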

Informally, `A` is a **random variable** if `A` denotes something about which we are uncertain, perhaps the outcome of a randomized experiment.

Examples

- A = True if a randomly drawn person from our class is female
- A = The hometown of a randomly drawn person from our class
- A = True if two randomly drawn persons from our class have same birthday

Define `P(A)` as "the fraction of possible worlds in which `A` is true" or "the fraction of times `A` holds, in repeated runs of the random experiment".

- the set of possible worlds is called the **sample space**, `S`
- a random variable `A` is a function defined over `S`, `A: S -> {0,1}`

More formally, we have

- a __sample space__ `S` (e.g., the set of students in our class), aka the set of possible worlds
- a __random variable__, a function defined over the sample space
  - Gender: S -> { m, f }
  - Height: S -> Reals
- an __event__, a subset of S
  - e.g., the subset of S for which Gender=f
  - e.g., the subset of S for which (Gender=m) AND (Height > 2m)
- we're often **interested in probabilities of specific events**
- and of **specific events conditioned on other specific events**

### The Axioms of Probability

- 0 <= P(A) <= 1
- P(True) = 1
- P(False) = 0
- P(A or B) = P(A) + P(B) - P(A and B)
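The axioms can be sanity-checked by counting over a small finite sample space (an invented four-person roster; events are predicates over worlds, matching the definitions above):

```python
# Sample space S: an invented class roster of "possible worlds" (students).
S = [
    {"gender": "f", "height": 1.6},
    {"gender": "f", "height": 1.9},
    {"gender": "m", "height": 1.7},
    {"gender": "m", "height": 2.1},
]

def P(event):
    """P(A) = fraction of possible worlds in which event A is true."""
    return sum(event(w) for w in S) / len(S)

A = lambda w: w["gender"] == "f"     # event: Gender = f
B = lambda w: w["height"] > 1.8      # event: Height > 1.8m

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
lhs = P(lambda w: A(w) or B(w))
rhs = P(A) + P(B) - P(lambda w: A(w) and B(w))
```

With these four worlds, both sides of the fourth axiom come out to the same fraction, and the trivially true/false events hit the boundary axioms.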

The Chain Rule: