
Computational Learning Theory

Computational Learning Theory gives us a formal way of addressing three important questions:

  1. Defining a learning problem,
  2. Showing that specific algorithms work, and
  3. Showing that some problems are fundamentally hard (or unsolvable).
  • The tools and analyses used in learning theory are the same ones used in the theory of computing:
    • The theory of computing analyzes how algorithms use resources like time and space; algorithms may be O(n log n) or O(n²), for example.
    • In computational learning theory, algorithms additionally use samples (data, examples, etc.) alongside time and space.

Defining Inductive Learning

  1. Probability of successful training:
    • 1 − δ
  2. Number of examples to train on:
    • m
  3. Complexity of hypothesis class:
    • complexity of H
  4. Accuracy to which target concept is approximated:
    • ϵ
  5. Manner in which training examples are presented:
    • batch / online
  6. Manner in which training examples are selected:
    • Learner asks questions of teacher
      • c(x) = ?
    • Teacher gives examples to help learner
      • Teacher chooses x, tells c(x)
    • Fixed distribution
      • x chosen from D by nature
    • Worst possible distribution

20 Questions Example

Consider teaching via the 20 Questions example, where:

  • H: set of possible answers
  • X: set of questions
  • Assume the teacher knows the answer, but the learner does not

If the teacher chooses the questions, then it may take as few as one question to determine the answer, since the teacher already knows the answer and can pick the most informative question.

If the learner chooses the questions, then it will take log₂|H| questions to determine the answer, since the best a yes/no question can do is halve the number of remaining hypotheses.
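As a concrete illustration of the learner-chooses case, here is a minimal binary search sketch (my own, not from the lecture):

```python
from math import ceil, log2

# Each yes/no question at best halves the set of remaining answers,
# so binary search needs about log2(|H|) questions.
def twenty_questions(candidates, answer):
    questions = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        questions += 1  # ask: "is the answer in the first half?"
        candidates = candidates[:mid] if answer in candidates[:mid] else candidates[mid:]
    return candidates[0], questions

H = list(range(1024))  # |H| = 1024 possible answers
found, q = twenty_questions(H, answer=137)
print(found, q, ceil(log2(len(H))))  # 137 10 10
```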

Teacher with Constrained Queries

x: x1, x2, …, xk, k-bit inputs
h: conjunctions of literals, where a literal is a variable or its negation, for example:

h: x1 and x3 and ¯x5

In the lecture, a table of example inputs and the corresponding outputs of such a hypothesis is given.

It is also possible to move in the other direction: given these outputs, what is the exact hypothesis?

Approaching this requires a two-stage process:

  1. Show which variables are irrelevant (in the example above, x1 and x3 were irrelevant, because their values varied across examples while the output stayed the same).
  2. Show which variables are relevant (in the example above, x2, x4, and x5 are all relevant).

The specific hypothesis is:

¯x2 and x4 and ¯x5

There are 3^k possible hypotheses, since each variable can appear positively, appear negated, or be absent. These could be distinguished with log₂(3^k) = k·log₂3 questions, which is linear in k.
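Here is a sketch of the teacher's two-stage strategy for the hypothesis above; the concrete input rows are my own, since the lecture's table is not reproduced in these notes:

```python
def recover_conjunction(examples):
    """examples: (bits, label) pairs chosen by a helpful teacher.
    Stage 1: a positive example fixes the polarity of every candidate literal.
    Stage 2: each negative example that flips a single bit of the positive
    example proves that bit is relevant; unproven bits are treated as absent."""
    k = len(examples[0][0])
    positive = next(bits for bits, label in examples if label == 1)
    relevant = set()
    for bits, label in examples:
        if label == 0:
            diffs = [i for i in range(k) if bits[i] != positive[i]]
            if len(diffs) == 1:          # one flipped bit changed the output,
                relevant.add(diffs[0])   # so that bit must be in the conjunction
    return {i: positive[i] for i in sorted(relevant)}

# Target from the example above: ¯x2 and x4 and ¯x5, over k = 5 bits.
examples = [
    ((1, 0, 1, 1, 0), 1),  # positive example from the teacher
    ((1, 1, 1, 1, 0), 0),  # flip x2 -> output changes, so x2 is relevant
    ((1, 0, 1, 0, 0), 0),  # flip x4 -> relevant
    ((1, 0, 1, 1, 1), 0),  # flip x5 -> relevant
]
print(recover_conjunction(examples))  # {1: 0, 3: 1, 4: 0} = ¯x2 and x4 and ¯x5
```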

Learner with Constrained Queries

In the example of a learner that does not have a teacher choosing helpful questions, the learner may have to ask about essentially all possible data points before even finding one that outputs 1. In particular, it could take as many as 2^k queries.

Learner with Mistake Bounds

The mistake bounds learning formalism involves the following loop:

  1. Input arrives,
  2. Learner guesses answer,
  3. If wrong answer, add to a mistake counter,
  4. Repeat.

The total number of mistakes the learner makes is bounded by some value, even over an infinite run.

A more precise algorithm:

  1. Assume each variable is present in the hypothesis in both its positive and negative forms (i.e., x1 and ¯x1 and …),
  2. Given an input, compute the output,
  3. If wrong, set all positive variables that were 0 to "absent," and all negative variables that were 1 to "absent,"
  4. Go to step 2.

In this scheme, the learner never makes more than k+1 mistakes: the first mistake (on a positive example) eliminates k of the 2k literals, and every subsequent mistake eliminates at least one more.
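A minimal sketch of this elimination algorithm, with an assumed target concept for the demonstration:

```python
def make_learner(k):
    # Start by assuming every literal, positive and negative, is present.
    literals = {(i, sign) for i in range(k) for sign in (0, 1)}  # (index, required value)

    def predict(x):
        # Current hypothesis: the conjunction of all remaining literals.
        return int(all(x[i] == sign for i, sign in literals))

    def update(x, y):
        # Called only on a mistake; prune literals contradicted by x.
        if y == 1:
            literals.difference_update({(i, 1 - x[i]) for i in range(k)})

    return predict, update

# Hypothetical target: x1 AND NOT x3 over k = 3 bits (0-indexed bits 0 and 2).
target = lambda x: int(x[0] == 1 and x[2] == 0)

predict, update = make_learner(3)
mistakes = 0
for x in [(1, 0, 0), (1, 1, 0), (0, 1, 0), (1, 0, 1)]:
    y = target(x)
    if predict(x) != y:
        mistakes += 1
        update(x, y)
print("mistakes:", mistakes)  # 2 here, and never more than k + 1
```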

Definitions

Computational Complexity: how much computational effort is needed for a learner to converge.
Sample Complexity (batch): how many training examples are needed for a learner to create a successful hypothesis.
Mistake Bounds (online): how many misclassifications can a learner make over an infinite run.

Recall that there are four ways of choosing the inputs to the learner: the learner chooses, the teacher chooses, nature chooses, or the inputs come from the worst possible distribution. The nature-chooses setting will be considered next. It is unique in that it requires taking into account the space of possible distributions.

Hypothesis space: H,
True hypothesis: c ∈ H, also called a concept
Training set: S ⊆ X, a subset of the possible inputs, along with c(x) ∀ x ∈ S
Candidate hypothesis: h ∈ H
Consistent learner: produces h such that c(x) = h(x) for all x ∈ S; a consistent learner is able to learn the true hypothesis. It is called consistent because its output is consistent with the data.
Version space: VS(S) = {h | h ∈ H, h consistent with S}, i.e., "the version space for a given set of data, S, is the set of hypotheses from the hypothesis set that are consistent with S"

Version Space Example

An example target concept is the XOR function on two inputs:

x1 x2 | c(x)
 0  0 |  0
 0  1 |  1
 1  0 |  1
 1  1 |  0

The training data, S, consists of two labeled examples:

x1 x2 | c(x)
 0  0 |  0
 1  0 |  1

The hypothesis set, H, is given below.

H = {x1, ¯x1, x2, ¯x2, T, F, OR, AND, XOR, EQUIV}

Which elements in H are in the version space, VS?

VS = {x1, OR, XOR}
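This can be checked mechanically; here is a minimal sketch that enumerates H against S:

```python
# Each hypothesis maps the two input bits to an output bit.
H = {
    "x1":    lambda a, b: a,
    "¬x1":   lambda a, b: 1 - a,
    "x2":    lambda a, b: b,
    "¬x2":   lambda a, b: 1 - b,
    "T":     lambda a, b: 1,
    "F":     lambda a, b: 0,
    "OR":    lambda a, b: a | b,
    "AND":   lambda a, b: a & b,
    "XOR":   lambda a, b: a ^ b,
    "EQUIV": lambda a, b: 1 - (a ^ b),
}

S = [((0, 0), 0), ((1, 0), 1)]  # training set: (input, label)

# A hypothesis is in the version space iff it matches the label on every example.
VS = [name for name, h in H.items() if all(h(*x) == y for x, y in S)]
print(VS)  # ['x1', 'OR', 'XOR']
```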

PAC Learning

Training Error: fraction of training examples misclassified by h.
True Error: fraction of examples that would be misclassified on samples drawn from D.

The error of h is defined as follows. This formula captures the notion that it is okay to misclassify examples that will rarely be seen; the penalty is proportional to the probability of drawing an example that h gets wrong.

error_D(h) = Pr_{x∼D}[c(x) ≠ h(x)]

PAC stands for Probably Approximately Correct, where "probably" is a stand-in for 1 − δ, "approximately" is a stand-in for ϵ, and "correct" is a stand-in for error_D(h) = 0.

C: concept class
L: learner
H: hypothesis space
n: |H|, the size of the hypothesis space
D: distribution over the inputs
0 ≤ ϵ ≤ 1/2: error goal; the error should be at most ϵ
0 ≤ δ ≤ 1/2: certainty goal; the certainty is 1 − δ

PAC-learnable: a concept class, C, is PAC-learnable by L using H if and only if the learner, L, will, with probability 1 − δ, output a hypothesis h ∈ H such that error_D(h) ≤ ϵ, in time and samples polynomial in 1/ϵ, 1/δ, and n.

ϵ-exhausted version space: VS(S) is ϵ-exhausted iff ∀ h ∈ VS(S), error_D(h) ≤ ϵ. In other words, a version space is called ϵ-exhausted exactly in the case when every hypothesis that might still be chosen has true error at most ϵ.
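As a direct transcription of this definition (the function and argument names are mine):

```python
# A version space is epsilon-exhausted when every hypothesis still in it
# has true error at most epsilon.
def is_eps_exhausted(version_space, true_error, eps):
    """version_space: iterable of hypotheses; true_error: maps h to error_D(h)."""
    return all(true_error(h) <= eps for h in version_space)
```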

ϵ-exhausted Version Space Example

Given the following training set and associated distribution, find the smallest ϵ such that we have ϵ-exhausted the version space.

The hypothesis space is H = {x1, ¯x1, x2, ¯x2, T, F, OR, AND, XOR, EQUIV}. The target concept is XOR and the training set is the same S as above. The distribution, D, gives the probability of drawing each of the four possible inputs. Recall from the previous question that the version space is VS = {x1, OR, XOR}.

Hypothesis | True Error
x1         | 0.5
OR         | 0.0
XOR        | 0.0

The hypothesis x1 has a true error of 0.5 because, under D, the inputs on which it disagrees with XOR (those with x2 = 1) are drawn half the time. OR and XOR disagree with the target only on inputs with zero probability, so their true error is 0. So, the minimum ϵ for which the version space is ϵ-exhausted is the maximum true error over the VS, or 0.5.
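A quick check of these numbers; the distribution D below is an illustrative assumption (the lecture's probability table is not reproduced in these notes), chosen to match the stated errors:

```python
D = {(0, 0): 0.3, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 0.0}  # assumed Pr of each input
c = lambda a, b: a ^ b                                     # target concept: XOR
VS = {
    "x1":  lambda a, b: a,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

for name, h in VS.items():
    # True error: total probability of inputs where h disagrees with c.
    err = sum(p for (a, b), p in D.items() if h(a, b) != c(a, b))
    print(name, err)  # x1 0.5, OR 0.0, XOR 0.0
```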

Haussler Theorem - Bound True Error

Let h1, …, hk be the hypotheses in H with error_D(h_i) ≥ ϵ. These hypotheses have "high true error." How much data do we need to ensure that, with high probability, none of these high-error hypotheses remains consistent with the data (i.e., remains in the version space)?

The following formula says that if you draw an input x from the distribution D, the probability that a "bad" hypothesis h_i matches the true concept on x is at most 1 − ϵ; i.e., there is a low probability of a match.

Pr_{x∼D}(h_i(x) = c(x)) ≤ 1 − ϵ

Pr(h_i remains consistent with c on m independent examples) ≤ (1 − ϵ)^m

Pr(at least one of h1, …, hk remains consistent with c on m examples)

≤ k(1 − ϵ)^m ≤ |H|(1 − ϵ)^m

where the first inequality is the union bound and the second follows from k ≤ |H|.

Using the fact that ln(1 − ϵ) ≤ −ϵ, we get (1 − ϵ)^m ≤ e^(−ϵm). Then |H|(1 − ϵ)^m ≤ |H|e^(−ϵm), and we require

|H| e^(−ϵm) ≤ δ

This is an upper bound on the probability that the version space is not ϵ-exhausted after m samples. Recall that δ is the failure probability. Rewriting this in terms of m:

ln|H| − ϵm ≤ ln δ

m ≥ (1/ϵ)(ln|H| + ln(1/δ))

This is called the sample complexity bound. It is important in that m is linear in 1/ϵ and only logarithmic in the size of the hypothesis space, |H|, and in 1/δ.
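The bound is straightforward to compute; a small helper (the function name is mine):

```python
from math import ceil, log

# m >= (1/eps) * (ln|H| + ln(1/delta)): samples sufficient to
# epsilon-exhaust the version space with probability at least 1 - delta.
def sample_complexity(eps, delta, h_size):
    return ceil((log(h_size) + log(1.0 / delta)) / eps)
```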

PAC-Learnable Example

H = {h_i : h_i(x) = x_i}, where x is a 10-bit input

So, the hypothesis space is the set of functions that take a 10-bit input and return one of its bits; i.e., one hypothesis returns the first bit, one returns the second, and so on.

ϵ = 0.1, δ = 0.2, D is uniform

That is to say that we want to return a hypothesis whose error is at most 0.1, we want the failure probability to be at most 0.2, and the input distribution is uniform. How many samples do we need to PAC-learn this hypothesis set?

m ≥ (1/0.1)(ln 10 + ln(1/0.2)) ≈ 39.12, or 40 samples
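The same arithmetic, checked directly:

```python
from math import ceil, log

# (1/eps) * (ln|H| + ln(1/delta)) with eps = 0.1, delta = 0.2, |H| = 10
m = (log(10) + log(1 / 0.2)) / 0.1
print(m, ceil(m))  # 39.12..., so 40 samples suffice
```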

Note that this calculation did not use the fact that D is uniform; the bound holds for any distribution over the inputs.

Final Notes and Limitations

  • If the target concept is not in the hypothesis space, the learning scenario is called “agnostic,” and the formula for the sample bound must be calculated differently. Generally speaking, in this scenario, more samples are required.
  • If the size of the hypothesis space, |H|, is infinite, then the formula for the sample bound breaks down. This scenario will be discussed in a separate lecture.

These notes are continued in my notes on VC Dimensions.