I am
Wang Hongyang


Machine Learning 01: Decision Tree, Entropy, and Overfitting

Course notes on Professor Tom Mitchell's Machine Learning course @ CMU

Machine Learning is the study of algorithms that:

  • Improve their performance P
  • At some task T
  • With experience E

The well-defined learning task: <P,T,E>.

Function Approximation and Decision Tree Learning

[Wikipedia] A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes).

Problem Setting:

  • Set of possible instances X
  • Unknown target function f : X->Y
  • Set of candidate hypotheses H={ h | h : X->Y }

Input:

  • Training examples {<x^(i), y^(i)>} of the unknown target function f

Output:

  • Hypothesis h ∈ H that best approximates the target function f

Note that for a decision tree:

  • Y must take discrete values.
  • Each instance x in X is a feature vector x = <x1, x2 … xn>.
  • Each hypothesis h is a tree.

Decision Tree Learning

How to generate a decision tree?

Top-down induction of decision trees ([ID3, C4.5] by Quinlan) is a greedy method:

Main loop:

  1. A <- the "best" decision attribute for the next node
  2. Assign A as decision attribute for node
  3. For each value of A, create new descendant of node
  4. Sort training examples to leaf nodes
  5. If training examples perfectly classified, Then STOP, Else iterate over new leaf nodes
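The main loop above can be sketched in Python. This is a minimal, hypothetical implementation: examples are `(feature_dict, label)` pairs, and the ID3/C4.5 refinements for continuous attributes, missing values, and pruning are omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i p_i log2 p_i over the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(examples, attributes):
    """Step 1: pick the attribute with the largest information gain."""
    def gain(a):
        n = len(examples)
        groups = Counter(x[a] for x, _ in examples)
        remainder = sum(cnt / n * entropy([y for x, y in examples if x[a] == v])
                        for v, cnt in groups.items())
        return entropy([y for _, y in examples]) - remainder
    return max(attributes, key=gain)

def id3(examples, attributes):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attributes:       # perfectly classified, or no tests left
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    a = best_attribute(examples, attributes)          # steps 1-2: assign A to this node
    tree = {a: {}}
    for v in set(x[a] for x, _ in examples):          # step 3: one branch per value of A
        subset = [(x, y) for x, y in examples if x[a] == v]   # step 4: sort examples
        tree[a][v] = id3(subset, [b for b in attributes if b != a])  # step 5: iterate
    return tree
```

On a toy one-attribute dataset, `id3([({'Wind': 'Strong'}, 'No'), ({'Wind': 'Weak'}, 'Yes'), ({'Wind': 'Weak'}, 'Yes')], ['Wind'])` returns `{'Wind': {'Strong': 'No', 'Weak': 'Yes'}}`.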

So how can we know the selection of attribute A is the "best" one?

The best-known solution is to calculate Entropy.

Entropy measures how non-uniform the data are.

The formula to calculate Entropy is:

H(X) = - Σ_{i=1..n} P(X = i) log2 P(X = i)

Note that the n in the formula is the number of possible values for X.

Also, H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).

The conditional entropy for a specific value Y = v, and its expectation over all values of Y, are:

H(X | Y = v) = - Σ_{i=1..n} P(X = i | Y = v) log2 P(X = i | Y = v)

H(X | Y) = Σ_v P(Y = v) H(X | Y = v)

The third formula above is the expectation of the conditional entropy.
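As a quick sanity check, the entropy H(X) can be computed directly from a probability distribution. A small sketch (the coin examples are not from the course materials):

```python
import math

def entropy(probs):
    """H(X) = -sum_i p_i log2 p_i over a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit, the most uncertain case
print(entropy([0.9, 0.1]))   # biased coin: ≈ 0.469 bits, less uncertain
assert entropy([1.0]) == 0   # a deterministic outcome needs 0 bits
```

Zero-probability outcomes are skipped, following the usual convention that 0·log 0 = 0.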

When applying Entropy and Information Gain to the choice of attributes while constructing a decision tree, we should choose the attribute that decreases the non-uniformity of the labels the most.

We define Information Gain here as the mutual information I(A, Y) between the node's attribute variable A and the target label variable Y.

Information gain is the expected reduction in entropy of target label variable Y for data sample S, due to sorting on attribute variable A.

For the left tree above, we can compute the entropy of the parent node [29+,35-] as well as of the children nodes [21+,5-] and [8+,30-]. Subtracting the size-weighted average of the children's entropies from the parent's entropy gives the Information Gain of choosing attribute A1 at the parent node. In the same way, we can calculate the Information Gain of choosing A2, and then take the attribute with the larger Information Gain, so that the tree becomes pure as quickly as possible.
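Plugging in those numbers (parent [29+, 35-] split by A1 into [21+, 5-] and [8+, 30-]), a small sketch of the computation:

```python
import math

def entropy(pos, neg):
    """Binary entropy of a node containing `pos` positive and `neg` negative examples."""
    h, total = 0.0, pos + neg
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# Parent [29+, 35-] split by A1 into children [21+, 5-] and [8+, 30-].
parent = entropy(29, 35)                                          # ≈ 0.994
children = (26 / 64) * entropy(21, 5) + (38 / 64) * entropy(8, 30)
gain_a1 = parent - children          # information gain of choosing A1
print(round(gain_a1, 2))             # 0.27
```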

The answer for the example in the slides: the left tree, which splits on humidity.

A question: when there are multiple trees that perfectly fit our classification needs, which one should we use?

A common practice is to use Occam's razor: prefer the simplest hypothesis that fits the data.

Why is the simplest the best in most cases?

An interesting idea mentioned in the Machine Learning book by [Zhihua Zhou] is that the simplest model may not be the best; the best-matched one is. The different models obtained from machine learning training correspond to different inductive biases, i.e., what kind of model the algorithm believes fits the data better. In practice, the model whose inductive bias matches the problem itself gives better results. So it is meaningless to call an algorithm good without specifying the problem.

Another point in the book is also interesting: even Occam's razor itself is not easy to apply. There are situations in which two different models have exactly the same number of parameters; which one is simpler then?

Big Map

Assume each instance x in the input space X is a vector of n boolean features (each with only two possible values), and the output value Y is also boolean. How many distinct instances x can we get? And *what is the upper bound on the number of possible decision trees (hypotheses) that fit the data?*

2^n, and 2^(2^n).

This is the size of the hypothesis space H = { f: X->Y } in the big map. It is important and worth remembering.
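These counts can be verified by brute force for a small n (a sketch; `n = 3` is an arbitrary choice):

```python
from itertools import product

n = 3
# Every distinct instance x is one assignment to n boolean features: 2^n of them.
instances = list(product([0, 1], repeat=n))
# Every hypothesis f: X -> Y assigns one output bit per instance: 2^(2^n) of them.
hypotheses = list(product([0, 1], repeat=len(instances)))
print(len(instances), len(hypotheses))   # 8 256
```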

Why should we choose the simplest tree? Or why Occam's razor works?

For the big picture:

Evolution takes care of it. Tom's hypothesis about evolution building smart brains: imagine that evolution wants to build a very smart brain, but it has a limit on brain size, so it cannot develop very complex sensors for us to use. Since "simple" is what it has to work with, combinations of simple sensors must be able to represent the complex ones; only then does building a smart brain within a limited space become possible. It's kind of a self-fulfilling arrangement.

Overfitting in Decision Trees

Consider adding noisy training example #15:

Sunny, Mild, Normal, Strong, PlayTennis = No

What will happen to the trees learned earlier?


Consider a hypothesis h and its

  • Error rate over training data: error_train(h)
  • True error rate over all data: error_true(h)

We say h overfits the training data if error_true(h) > error_train(h)

Amount of overfitting = error_true(h) - error_train(h)

Avoid Overfitting

How can we avoid overfitting?

  • Stop growing when the data split is not statistically significant
  • Grow the full tree, then prune it

Reduced-Error Pruning

  1. Split data into training and validation set
  2. Learn a tree that classifies training set correctly
  3. Do until further pruning is harmful:

    1. For each non-leaf node, evaluate impact on validation set of converting it to a leaf node
    2. Greedily select the node that would most improve validation set accuracy, and convert it to a leaf

      • this produces the smallest version of the most accurate subtree (over the validation set)

The success of pruning further supports the idea that small trees are better than large trees.
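A minimal sketch of the pruning procedure, assuming trees in the nested-dict form `{attribute: {value: subtree}}` with string labels at the leaves. This is a simplified bottom-up variant of the greedy loop above; `classify`, `accuracy`, and the `default` label are hypothetical helpers, not from the course materials.

```python
from collections import Counter

def classify(tree, x, default='Yes'):
    """Walk nested {attribute: {value: subtree}} dicts down to a leaf label."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute].get(x.get(attribute), default)
    return tree

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def prune(tree, train, val):
    """Replace internal nodes with majority-label leaves whenever that does
    not hurt accuracy on the validation set (bottom-up)."""
    if not isinstance(tree, dict):
        return tree                                   # already a leaf
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():    # prune the children first
        subset = [(x, y) for x, y in train if x.get(attribute) == value]
        tree[attribute][value] = prune(subtree, subset or train, val)
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    if accuracy(majority, val) >= accuracy(tree, val):
        return majority                               # convert this node to a leaf
    return tree
```

The `>=` comparison encodes the Occam's-razor preference: when a leaf and a subtree score equally on the validation set, keep the smaller one.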

Decision Forests

Key idea:

  1. learn a collection of many trees
  2. classify by taking a weighted vote of the trees

Empirically successful. Widely used in industry.

  • human pose recognition in Microsoft Kinect
  • medical imaging – cortical parcellation
  • classify disease from gene expression data

How to train different trees?

  1. Train on different random subsets of data
  2. Randomize the choice of decision nodes

This methodology is effective at overcoming overfitting.

You should know

  • Well-posed function approximation problems:
    • Instance space, X
    • Sample of labeled training data {}
    • Hypothesis space, H = { f: X->Y }
  • Learning is a search/optimization problem over H
    • Various objective functions to define the goal
    • minimize training error (0-1 loss)
    • minimize validation error (0-1 loss)
    • among hypotheses that minimize error, select smallest (?)
  • Decision tree learning
    • Greedy top-down learning of decision trees (ID3, C4.5, ...)
    • Overfitting and post-pruning
    • Extensions… to continuous values, probabilistic classification
    • Widely used commercially: decision forests

Probabilistic Function Approximation

Instead of F: X -> Y

Learn: P(Y|X)

Informally, A is a random variable if

  • A denotes something about which we are uncertain
  • perhaps the outcome of a randomized experiment

Examples:

  • A = True if a randomly drawn person from our class is female
  • A = The hometown of a randomly drawn person from our class
  • A = True if two randomly drawn persons from our class have the same birthday

Define P(A) as “the fraction of possible worlds in which A is true” or “the fraction of times A holds, in repeated runs of the random experiment”

  • the set of possible worlds is called the sample space, S
  • A random variable A is a function defined over S, A: S -> {0,1}

More formally, we have

  • a sample space S (e.g., set of students in our class)
    • aka the set of possible worlds
  • a random variable is a function defined over the sample space
    • Gender: S -> { m, f }
    • Height: S -> Reals
  • an event is a subset of S
    • e.g., the subset of S for which Gender=f
    • e.g., the subset of S for which (Gender=m) AND (Height > 2m)
  • we’re often interested in probabilities of specific events
  • and of specific events conditioned on other specific events

The Axioms of Probability

  • 0 <= P(A) <= 1
  • P(True) = 1
  • P(False) = 0
  • P(A or B) = P(A) + P(B) - P(A and B)

The Chain Rule:

P(A ^ B) = P(A|B) P(B)
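Both the axioms and the chain rule can be checked by literally counting possible worlds in a tiny sample space. A sketch with made-up data; exact arithmetic via `Fraction` avoids floating-point noise:

```python
from fractions import Fraction

# A tiny hypothetical sample space: each "possible world" is one student,
# recorded as (gender, height in meters).
S = [('m', 1.9), ('m', 1.7), ('m', 2.1), ('f', 1.6), ('f', 1.8), ('f', 2.0)]

def P(event):
    """P(A) = the fraction of possible worlds in which A is true."""
    return Fraction(sum(1 for w in S if event(w)), len(S))

A = lambda w: w[0] == 'f'   # the event Gender = f
B = lambda w: w[1] > 1.75   # the event Height > 1.75m

# Axiom: P(A or B) = P(A) + P(B) - P(A and B)
assert P(lambda w: A(w) or B(w)) == P(A) + P(B) - P(lambda w: A(w) and B(w))

# Chain rule: P(A ^ B) = P(A | B) P(B), where P(A | B) is obtained by
# restricting the sample space to the worlds in which B holds.
P_A_given_B = Fraction(sum(1 for w in S if A(w) and B(w)),
                       sum(1 for w in S if B(w)))
assert P(lambda w: A(w) and B(w)) == P_A_given_B * P(B)
```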


References

  1. [Slides] Introduction, Decision Trees, Entropy, Mutual Information
  2. [Slides] The Big Picture, Overfitting, Probabilistic Learning
  3. [PDF] Machine Learning: Trends, Perspectives, and Prospects. Jordan and Mitchell
  4. [PDF] Estimating Probabilities. Mitchell