# Machine Learning 06: PAC Theory

Course notes from Professor Tom Mitchell's Machine Learning course @ CMU, 2017

## Computational Learning Theory

- What general laws constrain inductive learning?
- Want theory to relate:
  - Number of training examples
  - Complexity of hypothesis space
  - Accuracy to which target function is approximated
  - Manner in which training examples are presented
  - Probability of successful learning

## Function Approximation: The Big Picture

The hypothesis `h(·)` is the approximation of the unknown target function `f(·)`, with which we can get `y = f(x) ≈ h(x)`.

Relations between hypothesis and instance:

- A hypothesis shadows a subset of all instances.
- Given hypothesis space `H` and data `D`, it is always true that `Σ_h P(H=h|D=d) = 1`.

## Sample Complexity

How many examples suffice to learn the target concept?

__If learner proposes instances as queries to teacher?__

- learner proposes x, teacher provides f(x)

__If teacher (who knows f(x)) generates training examples?__

- teacher proposes a sequence of labeled examples { …, … }

__If some random process (e.g., nature) generates instances, and teacher labels them?__

- instances drawn according to P(X)

### Case One

For the first question above: the learner chooses x, and the teacher labels it.

Example:

X = points

H = rectangles, classify x positive if inside rectangle

h = if a<x1<b and c<x2<d then Y = 1, else Y = 0

The learner starts with the smallest rectangle consistent with the positive examples, which grows toward the true rectangle as the number of samples grows.

After the learner gets the two "+" examples shown in the diagram, it infers the inner rectangle. After it gets the four "−" examples, it infers the outer rectangle. To narrow down the most likely hypothesis, the learner should query samples outside the inner rectangle but within the outer rectangle.

**Best case**: the learner plays 20 questions: it chooses each x so that half of the remaining hypotheses label it positive and half negative. —> It learns in `log2 |H|` queries in the best case (like binary search).
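The 20-questions idea can be sketched on a 1-D threshold class, a simpler stand-in for the rectangle example (the function and variable names here are illustrative): each query halves the set of consistent hypotheses, so about log2 of the number of distinguishable hypotheses queries suffice.

```python
# A minimal sketch of query learning by halving, assuming a 1-D
# threshold class H = { h_a : h_a(x) = 1 iff x > a } with a hidden
# threshold a in [0, 1). Each query halves the interval of hypotheses
# still consistent with the answers, just like binary search.

def query_learn_threshold(teacher, lo=0.0, hi=1.0, n_queries=20):
    """Propose the midpoint x as a query; the teacher answers f(x);
    keep the half of the interval still consistent with the answer."""
    for _ in range(n_queries):
        x = (lo + hi) / 2.0
        if teacher(x) == 1:   # x is past the hidden threshold
            hi = x
        else:                 # x is at or before the hidden threshold
            lo = x
    return (lo + hi) / 2.0    # estimate of the hidden threshold a

hidden_a = 0.3141
teacher = lambda x: 1 if x > hidden_a else 0
estimate = query_learn_threshold(teacher)
# after 20 queries the remaining interval has width 2**-20, so the
# estimate is within about 1e-6 of the hidden threshold
```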

### Case Two

For the second question: the teacher proposes instances and labels them.

Example:

X = points

H = rectangles, classify x positive if inside rectangle

h = if a<x1<b and c<x2<d then Y = 1, else Y = 0

How many samples do we need to recover the ground truth? *6*

This is natural: since the teacher giving the labels knows the ground truth, fewer samples are needed to pin down the rectangle.

- Set of instances X
- Set of hypotheses H = {h: X —> {0,1}}
- Set of possible target functions C = {c: X —> {0,1}}
- Sequence of training instances drawn at random from P(X), teacher provides noise-free label c(x)

The learner outputs the hypothesis with minimum training error: `h = argmin_{h ∈ H} error_train(h)`
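The argmin above is empirical risk minimization over H. A minimal sketch over a small finite hypothesis space (the names and the toy data are illustrative, not from the notes):

```python
# A minimal empirical-risk-minimization sketch: over a finite
# hypothesis space H (1-D thresholds on a grid), return the h with
# the fewest mistakes on the training data.

def train_error(h, data):
    """Fraction of training examples (x, y) that h misclassifies."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def erm(H, data):
    """h = argmin over h in H of error_train(h)."""
    return min(H, key=lambda h: train_error(h, data))

# H: threshold classifiers "1 iff x > a" for a in {0.0, 0.1, ..., 0.9}
H = [(lambda a: (lambda x: 1 if x > a else 0))(a)
     for a in [i / 10 for i in range(10)]]
data = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
h_best = erm(H, data)
# h_best makes zero mistakes on this toy training set
```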

## True Error and Overfitting

### The Error of a Hypothesis

The true error of h is the probability that it will misclassify an example drawn at random from P(X): `error_true(h) = Pr_{x ~ P(X)}[c(x) ≠ h(x)]`.

### Two Notions of Error

According to the comparison above: the true error measures the disagreement between the hypothesis and the target over the whole distribution P(X), while the training error measures the disagreement only over the training samples.

### Overfitting

Consider a hypothesis h and its

- Error rate over training data: error_train (h)
- True error rate over all data: error_true (h)

We say h overfits the training data if

`error_true (h) > error_train (h)`

Amount of overfitting =

`error_true (h) - error_train (h)`

Can we bound `error_true(h)` given `error_train(h)`? Yes, we can:

if D were a set of examples drawn from P(X) and __independent__ of h, then we could use standard statistical confidence intervals to determine that, with 95% probability, error_true(h) lies in the interval:

`error_train(h) ± 1.96 * sqrt(error_train(h) * (1 - error_train(h)) / m)`

but D is the **training data** for h, so this interval does not apply directly…

### Version Spaces

__Definition__: The **version space** VS_H,D is the subset of hypotheses in H consistent with the training data D: `VS_H,D = {h ∈ H : h(x) = c(x) for all x in D}`.

### Exhausting the Version Space

Definition: The version space VS_H,D with respect to training data D is said to be ϵ-**exhausted** if every hypothesis in VS_H,D has **true** error less than ϵ.

Interesting! This bounds the probability that __any consistent learner__ will output a hypothesis h with error_true(h) >= ϵ.

Specifically:

```
Hypothesis space H
Target function c(X)
Number of training examples: m
Let h1, h2, ..., hk be the hypotheses in H with true error >= ϵ
```

To get an ϵ-exhausted version space, we need to eliminate all of h1 through hk.

*What's the probability h1 will be consistent with one training example?*

`<= (1 - ϵ)`

Because the probability that h1 mislabels a random training example is >= ϵ, the probability that it labels the example correctly is <= (1 - ϵ).

*What's the probability that h1 will be consistent with m training examples?*

`<= (1 - ϵ)^m`

Since the training examples are drawn independently.

*What's the probability that at least one of h1…hk will be consistent with m examples?*

`<= k * (1 - ϵ)^m`

Since we do not know whether h1…hk are independent, we can only apply the union bound, multiplying by k.

Also, we know that k is at most the number of hypotheses in H: k <= |H|. So we get:

`<= |H| * (1 - ϵ)^m`

It is true that if 0 <= ϵ <= 1, then (1 - ϵ) <= e^(-ϵ). So we proved:

`P(VS_H,D is not ϵ-exhausted) <= |H| * e^(-ϵm)`

Setting this bound at most δ and solving for m gives the sample complexity: `m >= (1/ϵ)(ln|H| + ln(1/δ))`.
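The resulting sample-complexity bound m >= (1/ϵ)(ln|H| + ln(1/δ)) is easy to evaluate numerically; a minimal sketch (function name illustrative), using |H| = 3^4 = 81 as in Example 1 below:

```python
import math

# Sample-complexity bound for any consistent learner: setting
# |H| * e^(-eps*m) <= delta and solving for m gives
#   m >= (1/eps) * (ln|H| + ln(1/delta)).

def pac_sample_bound(H_size, eps, delta):
    """Number of examples sufficient for any consistent learner to be
    probably (1 - delta) approximately (eps) correct."""
    return math.ceil((1.0 / eps) * (math.log(H_size) + math.log(1.0 / delta)))

# e.g. |H| = 3^4 = 81 (conjunctions over 4 boolean features),
# eps = 0.1, delta = 0.05
m = pac_sample_bound(H_size=3**4, eps=0.1, delta=0.05)
```

Note the bound is only logarithmic in |H| and 1/δ, but linear in 1/ϵ.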

#### Example 1

Note that in the example, each of X1,…,X4 can be constrained to 0, to 1, or left as "don't care". So the size of H is |H| = 3^4 = 81.

Plugging |H| into the bound above gives the number of training examples sufficient for given ϵ and δ.

#### Example 2

Note that in this example, to construct the tree structure there are N * (N-1) / 2 possible ways to choose the 2 variables of a depth-2 tree. Each of the 4 leaves is a boolean value, giving 2^4 labelings. In all, the size of H is |H| = N * (N-1) / 2 * 2^4 = 8N(N-1).

#### Example 3: Depth n Decision Trees

What is |H|, the number of hypotheses, in this case?

Note that the number of possible variable combinations is |X| = 2^n. Then the number of hypotheses is |H| = 2^|X| = 2^(2^n).

Also note that in all these examples, the assumption is that we can find a hypothesis with __zero__ training error.
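The depth-n decision-tree count makes the bound explode: ln|H| = 2^n * ln 2, so the sufficient sample size grows exponentially in n. A small sketch (function name illustrative):

```python
import math

# Sample bound for Example 3: with n boolean variables there are
# |X| = 2^n inputs, hence |H| = 2^(2^n) depth-n decision trees.
# We use ln|H| = 2^n * ln(2) directly, avoiding the huge number 2^(2^n).

def tree_sample_bound(n, eps, delta):
    ln_H = (2 ** n) * math.log(2)          # ln|H| for |H| = 2^(2^n)
    return math.ceil((ln_H + math.log(1 / delta)) / eps)

# already at n = 10, ln|H| = 1024 * ln 2, so thousands of examples
# are needed even for modest eps and delta
m10 = tree_sample_bound(n=10, eps=0.1, delta=0.05)
```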

## Agnostic Learning

So far we assumed c ∈ H; in the agnostic learning setting, we don't assume c ∈ H.

- What do we want then?
  - The hypothesis h that makes fewest errors on training data
- What is sample complexity in this case?

`m >= (1/(2ε²)) (ln|H| + ln(1/δ))`

This conclusion can be derived from the Hoeffding bound, which for a single hypothesis gives `P[error_true(h) > error_train(h) + ε] <= e^(-2mε²)`; a union bound over H and solving for m yields the sample complexity above.

But the output `h` with lowest __training error__ might not give us the `h*` with lowest true error. How far can the true error of `h` be from that of `h*`?
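The 1/ε² dependence makes the agnostic bound noticeably more demanding than the realizable 1/ε bound. A small comparison sketch (function names illustrative):

```python
import math

# Agnostic-learning sample bound (from the Hoeffding inequality):
#   m >= 1/(2*eps^2) * (ln|H| + ln(1/delta))
# versus the realizable (consistent-learner) bound:
#   m >= 1/eps * (ln|H| + ln(1/delta)).

def agnostic_sample_bound(H_size, eps, delta):
    return math.ceil((math.log(H_size) + math.log(1 / delta))
                     / (2 * eps ** 2))

def realizable_sample_bound(H_size, eps, delta):
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# same |H| = 81, eps = 0.1, delta = 0.05 as before
m_realizable = realizable_sample_bound(81, eps=0.1, delta=0.05)
m_agnostic = agnostic_sample_bound(81, eps=0.1, delta=0.05)
# the agnostic setting needs roughly 1/(2*eps) times more examples
```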

## VC Dimensions

__Question__: How can we measure the complexity of a hypothesis space H, even when H is infinite?

__Answer__: By the size of the largest subset of X for which H can __guarantee__ zero training error (regardless of how it is labeled). This is the **VC dimension of H**.

### Shattering a Set of Instances

__Definition__: a **dichotomy** of a set S is a partition of S into two disjoint subsets.

__Definition__: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

__Equivalent definition__: H shatters S if H can guarantee zero training error on S regardless of how the instances are labeled.

For example:

Assume each ⊙ is an instance in S, as in the diagram on the right. There are eight possible dichotomies of S, as circled. No matter how the three ⊙ are labeled, the labeling matches some hypothesis in H. In this way, a hypothesis space H containing these eight hypotheses shatters S.

### The Vapnik-Chervonenkis Dimension

__Definition__: The **Vapnik-Chervonenkis dimension**, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large sets of X can be shattered by H, then VC(H) ≡ ∞.

In the diagram above, VC(H) = 3.

### Sample Complexity based on VC dimension

How many randomly drawn examples suffice to ε-exhaust VS_H,D with probability at least (1-δ)?

i.e., to guarantee that any hypothesis that perfectly fits the training data is probably (1-δ) approximately (ε) correct:

`m >= (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))`

Compare to our earlier result based on |H|: `m >= (1/ε)(ln|H| + ln(1/δ))`.
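The VC-based sufficient sample size m >= (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)) replaces ln|H| with VC(H), so it also applies to infinite hypothesis spaces. A small sketch (function name illustrative):

```python
import math

# VC-based sufficient sample size:
#   m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps)).

def vc_sample_bound(vc_dim, eps, delta):
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / eps)) / eps)

# e.g. linear separators in the plane have VC dimension 3
# (see Example 2 below), so even though |H| is infinite the
# bound is finite:
m = vc_sample_bound(vc_dim=3, eps=0.1, delta=0.05)
```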

### VC Dimension Example 1:

Consider X = set of real numbers, want to learn c: X —> {0, 1}. What is the VC dimension of:

- Open intervals:
  - H1: if x > a then y = 1 else y = 0
    - *VC(H1) = 1*. Two points cannot be shattered: the labeling ———(x1 = 1)———(x2 = 0)———> is unrealizable, since any a that labels x1 positive must also label x2 positive.
  - H2: if x > a then y = 1 else y = 0, or if x > a then y = 0 else y = 1
    - *VC(H2) = 2*. For two points, whatever the labeling, (1, 1), (1, 0), (0, 1), or (0, 0), we can always find an a that realizes it.
- Closed intervals:
  - H3: if a < x < b then y = 1 else y = 0
    - *VC(H3) = 2*. Given three points labeled (1, 0, 1), at most two of them can be fit by a single interval.
  - H4: if a < x < b then y = 1 else y = 0, or if a < x < b then y = 0 else y = 1
    - *VC(H4) = 3*.
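Shattering can be checked by brute force on a finite approximation of a hypothesis class. A sketch for H3 (closed intervals labeling the inside 1), using a grid of candidate endpoints; the names and grid are illustrative:

```python
# Brute-force shattering check, assuming a finite set of candidate
# hypotheses. H3 = intervals (a, b) with h(x) = 1 iff a < x < b:
# it shatters any 2 points, but no 3 points, since the labeling
# (1, 0, 1) needs two disjoint positive regions.

def shatters(hypotheses, points):
    """True iff every one of the 2^|points| labelings is realized
    by some hypothesis."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# candidate intervals with endpoints on a small grid
grid = [i / 10 for i in range(-10, 21)]
H3 = [(lambda a, b: (lambda x: 1 if a < x < b else 0))(a, b)
      for a in grid for b in grid if a < b]

two_ok = shatters(H3, [0.25, 0.75])        # 2 points: shattered
three_ok = shatters(H3, [0.2, 0.5, 0.8])   # 3 points: (1, 0, 1) fails
```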

### VC Dimension Example 2:

What is VC dimension of lines in a plane?

H2 = { ((w0 + w1x1 + w2x2) > 0 —> y=1) }

See the 3 points on the left: there are 8 possible labelings, and for any one of them we can always find a line that separates them perfectly.

See the 3 collinear points on the right: there is no line that realizes the labeling (0, 1, 0).

But VC(H2) = 3 nonetheless. According to the definition, we only need *some* set of 3 points that can be shattered, not every set of 3 points.

**For H_n = linear separating hyperplanes in n dimensions, VC(Hn) = n + 1.**

Also note that this is exactly the decision boundary of logistic regression, so this conclusion applies to logistic regression as well!

__Question__: For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H|?

__Answer__: *Yes!* Suppose VC(H) = k. —> H can guarantee zero training error on k instances regardless of labeling. —> H must contain at least 2^k hypotheses. So: VC(H) <= log2(|H|)

## Tightness of Bounds on Sample Complexity

How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1-δ) approximately (ε) correct?

How tight is this bound?

Note that the equation above gives a sufficient upper bound: how many examples are always enough (the worst case). But there is also a lower bound saying how many examples are never enough (the best case).

**Lower bound on sample complexity** (Ehrenfeucht et al., 1989):

Consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 0.01. Then there exists a distribution D = P(x) and a target concept in C such that if L observes fewer examples than

`max[ (1/ε) log(1/δ), (VC(C) - 1) / (32ε) ]`

then with probability at least δ, L outputs a hypothesis with error_D(h) > ε.

## Agnostic Learning: VC Bounds

Schölkopf and Smola, 2002

With probability at least (1 - δ), every h ∈ H satisfies

`error_true(h) <= error_train(h) + sqrt( (VC(H)(ln(2m/VC(H)) + 1) + ln(4/δ)) / m )`

Note that the second term on the right is an upper bound on the amount of overfitting. What if we double the number of training examples? The overfitting bound decreases, so the training-error and true-error (test-error) curves get closer to each other.
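The effect of doubling m on the overfitting term error_true(h) - error_train(h) <= sqrt((VC(H)(ln(2m/VC(H)) + 1) + ln(4/δ))/m) can be checked numerically; a small sketch with illustrative values for the VC dimension and m:

```python
import math

# The VC confidence term in the agnostic bound:
#   sqrt( (d*(ln(2m/d) + 1) + ln(4/delta)) / m ),  d = VC(H).
# Doubling m shrinks this overfitting bound roughly by 1/sqrt(2).

def vc_confidence(d, m, delta):
    return math.sqrt((d * (math.log(2 * m / d) + 1)
                      + math.log(4 / delta)) / m)

gap_1x = vc_confidence(d=10, m=1000, delta=0.05)
gap_2x = vc_confidence(d=10, m=2000, delta=0.05)
# gap_2x < gap_1x: more data tightens the bound on overfitting
```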

## What you should know about PAC

- Sample complexity varies with the learning setting
  - Learner actively queries trainer
  - Examples arrive at random
- Within the PAC learning setting, we can bound the probability that the learner will output a hypothesis with true error within ε of training error
  - For ANY consistent learner (case where c ∈ H, zero training error)
  - For ANY "best fit" hypothesis (agnostic learning, where perhaps c not in H)
- VC dimension as a measure of the complexity of H
- More results coming this semester
  - Conference on Learning Theory
  - ML Department course on Machine Learning Theory