Computational Learning Theory
Computational Learning Theory gives us a formal way of addressing three important questions:
- Defining a learning problem,
- Showing that specific algorithms work, and
- Showing that some problems are fundamentally hard (or unsolvable).
- The tools and analyses used in learning theory are the same tools and analyses used in the theory of computing:
- The theory of computing analyzes how algorithms use resources like time and space; specifically, algorithms may be polynomial or exponential, for example.
- In computational learning theory, algorithms additionally use samples (or data, examples, etc.) as a resource.
Defining Inductive Learning
- Probability of successful training: $1 - \delta$
- Number of examples to train on: $m$
- Complexity of hypothesis class, or:
- complexity of $c$ (the target concept)
- complexity of $H$ (the hypothesis space)
- Accuracy to which target concept is approximated: $\epsilon$
- Manner in which training examples are presented:
- batch / online
- Manner in which training examples are selected:
- Learner asks questions of teacher
- $c(x) = ?$ (learner chooses $x$, asks for its label)
- Teacher gives examples to help learner
- Teacher chooses $x$, tells $c(x)$
- Fixed distribution
- $x$ chosen from $D$ by nature
- Worst possible distribution
20 Questions Example
Consider the teaching via 20 Questions example, where:
- $H$: set of possible answers
- $X$: set of questions
- Assume the teacher knows the answer, but the learner does not.

If the teacher chooses the question, only one question is needed: the teacher already knows the answer, so it can pick the single question that pins it down.
If the learner chooses the question, the best it can do is pick the question that splits the remaining answers roughly in half each time, which takes about $\log_2 |H|$ questions.
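A minimal sketch of the learner-chooses case, assuming the answers can be ordered and the teacher truthfully answers "is the answer in this half?" questions (the function names and example values are illustrative, not from the lecture):

```python
import math

def learner_queries(answers, secret):
    """Halve the candidate set with each yes/no question (binary search).

    `answers` is the hypothesis set H (the possible answers) and `secret`
    is the answer the teacher knows. This is the best a learner can do
    when it must choose its own questions.
    """
    candidates = sorted(answers)
    questions = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        questions += 1
        if secret in candidates[:mid]:   # teacher answers truthfully
            candidates = candidates[:mid]
        else:
            candidates = candidates[mid:]
    return candidates[0], questions

H = list(range(16))                      # |H| = 16 possible answers
guess, q = learner_queries(H, secret=11)
print(guess, q, math.log2(len(H)))       # 11 4 4.0
```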
Teacher with Constrained Queries
- $x$: a $k$-bit input $x_1 x_2 \cdots x_k$ (in the examples below, $k = 5$)
- $h$: a conjunction of literals, possibly negated, for example: $h = x_1 \wedge x_3 \wedge \neg x_5$

In the following examples, the teacher presents input/output pairs generated by a hidden hypothesis of this form. It is also possible to pose the question in the other direction: given these outputs, what is the exact hypothesis?
Approaching this requires a two-stage process:
- Show which parameters are irrelevant (in the example above, $x_2$ and $x_4$ were irrelevant, because they varied across examples while the output stayed consistent).
- Show which parameters are relevant (in the example above, $x_1$, $x_3$, and $x_5$ are all relevant).
The specific hypothesis is: $h = x_1 \wedge x_3 \wedge \neg x_5$.

There are $3^k$ possible hypotheses (each of the $k$ variables can appear positively, appear negated, or be absent), yet the teacher can convey the exact hypothesis with at most $k + 2$ well-chosen examples: two positive examples to expose the irrelevant variables, plus one negative example per relevant variable.
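Below is a minimal sketch of the teaching process, assuming the hidden hypothesis $h = x_1 \wedge x_3 \wedge \neg x_5$ from the example above; the names `teach` and `learn` are illustrative helpers, not from the lecture:

```python
# Hidden hypothesis: index -> required truth value (x1 AND x3 AND NOT x5).
RELEVANT = {0: True, 2: True, 4: False}
K = 5

def h(x):
    """The teacher's hidden conjunction."""
    return all(x[i] == v for i, v in RELEVANT.items())

def teach():
    """Teacher picks at most k+2 examples that pin down h exactly."""
    base = [RELEVANT.get(i, False) for i in range(K)]           # positive
    flipped = [RELEVANT.get(i, True) for i in range(K)]         # positive,
    examples = [(base, True), (flipped, True)]                  # irrelevant bits flipped
    for i in RELEVANT:                   # one negative per relevant variable
        x = base[:]
        x[i] = not x[i]
        examples.append((x, False))
    return examples

def learn(examples):
    """A variable is relevant iff every positive example agrees on it;
    the negative examples certify each such bit is genuinely required."""
    positives = [x for x, y in examples if y]
    return {i: positives[0][i] for i in range(K)
            if all(p[i] == positives[0][i] for p in positives)}

print(learn(teach()))   # {0: True, 2: True, 4: False} -> x1 AND x3 AND NOT x5
```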
Learner with Constrained Queries
Without a teacher to choose helpful questions, the learner may have to try essentially every possible data point before finding one for which the hypothesis outputs 1. In particular, it could take as many as $2^k$ queries: if the hypothesis is a conjunction involving all $k$ variables, only one of the $2^k$ possible inputs produces a positive output, so in the worst case the learner needs exponentially many queries.
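As a toy illustration, assume the hidden hypothesis is the full conjunction $x_1 \wedge x_2 \wedge \cdots \wedge x_k$; a learner that enumerates inputs may need to query all $2^k$ of them before seeing a single positive output:

```python
from itertools import product

k = 10
hidden = lambda x: all(x)            # assumed: h = x1 AND x2 AND ... AND xk

queries = 0
for x in product([False, True], repeat=k):
    queries += 1
    if hidden(x):                    # the first (and only) positive example
        break
print(queries, 2 ** k)               # 1024 1024
```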
Learner with Mistake Bounds
The mistake bounds learning formalism involves the following loop:
- Input arrives,
- Learner guesses answer,
- If the answer is wrong, charge one mistake,
- Repeat.
The goal is to bound the total number of mistakes the learner makes, even over an infinite run.
A more precise algorithm would involve:
1. Assume each variable $x_i$ is present in the hypothesis in both its positive and negative forms (i.e., $h = x_1 \wedge \neg x_1 \wedge x_2 \wedge \neg x_2 \wedge \cdots \wedge x_k \wedge \neg x_k$),
2. Given an input, compute the output,
3. If the guess was wrong, set every positive literal whose variable was 0 to “absent,” and every negative literal whose variable was 1 to “absent,”
4. Go to step 2.
In this scheme, the learner never makes more than $k + 1$ mistakes: the first mistake eliminates half of the $2k$ literals, and each subsequent mistake eliminates at least one more.
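A minimal sketch of this elimination algorithm, with a small hidden conjunction and input stream chosen purely for illustration:

```python
# Assumed target for the demo: h = x1 AND NOT x3 over k = 3 bits.
k = 3
target = lambda x: x[0] and not x[2]

# Start with every literal present: (i, True) = xi, (i, False) = NOT xi.
literals = {(i, v) for i in range(k) for v in (True, False)}

def predict(x):
    # The conjunction of all surviving literals (vacuously False at first,
    # since xi AND NOT xi can never be satisfied).
    return all(x[i] == v for i, v in literals)

mistakes = 0
stream = [(True, True, False), (False, True, False),
          (True, False, False), (True, True, True)]
for x in stream:
    if predict(x) != target(x):
        mistakes += 1
        # Mistakes only occur on positive examples; drop every literal
        # that this example contradicts.
        literals.intersection_update({(i, x[i]) for i in range(k)})
print(mistakes, "<=", k + 1)   # 2 <= 4: never more than k+1 mistakes
```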
Definitions
Computational Complexity: how much computational effort is needed for a learner to converge to the answer.
Sample Complexity (batch): how many training examples are needed for a learner to create a successful hypothesis.
Mistake Bounds (online): how many misclassifications can a learner make over an infinite run?
Recall that there are four ways of choosing the inputs to the learner: learner chooses, teacher chooses, nature chooses, and worst possible distribution. The “nature chooses” setting will be considered next. It is unique in that it requires taking into account the space of possible distributions.
Hypothesis space: $H$
True hypothesis: $c \in H$
Training set: $S \subseteq X$, together with the label $c(x)$ for each $x \in S$
Candidate hypothesis: $h \in H$
Consistent learner: produces $h$ such that $h(x) = c(x)$ for all $x \in S$
Version space: $VS(S) = \{h \in H \mid h(x) = c(x)\ \forall x \in S\}$, the set of hypotheses consistent with the examples
Version Space Example
An example target concept is the XOR function over two binary inputs, $x_1$ and $x_2$.
The training data, $D$, is a set of labeled examples $\langle (x_1, x_2), c(x_1, x_2) \rangle$.
The hypothesis set, $H$, contains simple candidates such as $x_1$, $\neg x_1$, $x_2$, $\neg x_2$, the constants true and false, OR, AND, and XOR.
Which elements in $H$ are consistent with the training data? Exactly those hypotheses make up the version space, as computed in the sketch below.
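The following sketch filters a hypothesis set against training data to produce the version space; the particular hypotheses and the two training examples are illustrative assumptions:

```python
# Candidate hypotheses over two boolean inputs.
H = {
    "x1":     lambda x1, x2: x1,
    "not x1": lambda x1, x2: not x1,
    "x2":     lambda x1, x2: x2,
    "not x2": lambda x1, x2: not x2,
    "true":   lambda x1, x2: True,
    "false":  lambda x1, x2: False,
    "or":     lambda x1, x2: x1 or x2,
    "and":    lambda x1, x2: x1 and x2,
    "xor":    lambda x1, x2: x1 != x2,
}

target = H["xor"]                            # true concept c = XOR
D = [((False, False), target(False, False)), # two assumed labeled examples
     ((True, False), target(True, False))]

version_space = [name for name, h in H.items()
                 if all(h(*x) == y for x, y in D)]
print(version_space)                         # ['x1', 'or', 'xor']
```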
PAC Learning
Training Error: the fraction of training examples misclassified by $h$.
True Error: the fraction of examples that would be misclassified on a sample drawn from $D$.
The true error of $h$ is defined as $error_D(h) = \Pr_{x \sim D}[c(x) \neq h(x)]$. This formula captures the notion that it is acceptable to misclassify examples that will rarely be seen: the penalty is in proportion to the probability of getting an example wrong.
PAC stands for Probably Approximately Correct, where “probably” is a stand-in for the confidence $1 - \delta$, “approximately” for the error tolerance $\epsilon$, and “correct” for zero error.
PAC-learnable: a concept class $C$ is PAC-learnable by learner $L$ using hypothesis space $H$ if and only if $L$ will, with probability $1 - \delta$, output a hypothesis $h \in H$ with $error_D(h) \leq \epsilon$, using time and samples polynomial in $1/\epsilon$, $1/\delta$, and the size of the hypothesis space.
$\epsilon$-exhausted Version Space Example
A version space is $\epsilon$-exhausted if and only if every hypothesis remaining in it has true error at most $\epsilon$. Given the following training set and the associated distribution over inputs, find the smallest $\epsilon$ such that the version space is $\epsilon$-exhausted.
The hypothesis space is the one from the version space example above; the hypotheses consistent with the training data are $x_1$, OR, and XOR.
Version Space | True Error
---|---
$x_1$ | 0.5
OR | 0.0
XOR | 0.0
The input on which OR disagrees with the target, $(1, 1)$, has probability 0 under $D$, so that disagreement contributes nothing to its true error. The smallest $\epsilon$ that exhausts this version space is therefore $\epsilon = 0.5$, the largest true error among the hypotheses remaining in it. A sketch of this computation follows.
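A sketch of the true-error computation, with the input distribution assumed for illustration (chosen to match the table above):

```python
# Assumed distribution over the four inputs; (True, True) has probability 0.
D = {(False, False): 0.25, (False, True): 0.5,
     (True, False): 0.25, (True, True): 0.0}

target = lambda x1, x2: x1 != x2           # c = XOR
VS = {"x1":  lambda x1, x2: x1,
      "or":  lambda x1, x2: x1 or x2,
      "xor": lambda x1, x2: x1 != x2}

def true_error(h):
    # Probability mass of the inputs on which h disagrees with c.
    return sum(p for x, p in D.items() if h(*x) != target(*x))

errors = {name: true_error(h) for name, h in VS.items()}
print(errors)                              # {'x1': 0.5, 'or': 0.0, 'xor': 0.0}
print(max(errors.values()))                # smallest exhausting epsilon: 0.5
```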
Haussler Theorem - Bound True Error
Let $h_1, \ldots, h_k$ be the “bad” hypotheses in $H$, those with $error_D(h_i) > \epsilon$. How much data is needed to knock all of them out of the version space?
For a bad hypothesis, the probability that a single input drawn from $D$ happens to match the true concept is bounded: $\Pr_{x \sim D}[h_i(x) = c(x)] \leq 1 - \epsilon$. After $m$ independent samples, the probability that $h_i$ is still consistent with all of them is at most $(1 - \epsilon)^m$, so the probability that at least one of the (at most $|H|$) bad hypotheses survives is at most $|H|(1 - \epsilon)^m$.
Using the fact that $-\epsilon \geq \ln(1 - \epsilon)$, we have $(1 - \epsilon)^m \leq e^{-\epsilon m}$, so the bound becomes $|H| e^{-\epsilon m}$.
This is an upper bound on the probability that the version space is not $\epsilon$-exhausted after $m$ samples. Requiring it to be at most $\delta$ and solving for $m$ gives
$$m \geq \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)$$
This is called the sample complexity bound, and it is important in that it depends only polynomially on the target error (through $1/\epsilon$), the failure probability (through $\ln \frac{1}{\delta}$), and the size of the hypothesis space (through $\ln |H|$).
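The bound translates directly into a small helper; `sample_complexity` is a hypothetical name for illustration, not a library function:

```python
import math

def sample_complexity(eps, delta, h_size):
    """Haussler bound: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# e.g., |H| = 10, epsilon = 0.1, delta = 0.2 (the example in the
# next section) requires m = 40 samples.
print(sample_complexity(eps=0.1, delta=0.2, h_size=10))   # 40
```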
PAC-Learnable Example
So, the hypothesis space is the set of ten functions that take 10-bit inputs, each returning one of those bits: one returns the first bit, one returns the second, and so on, so $|H| = 10$.
We want to return a hypothesis whose true error is at most $\epsilon = 0.1$, with failure probability at most $\delta = 0.2$, where the input distribution is uniform. How many samples do we need to PAC-learn this hypothesis set?
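Plugging these values into the sample complexity bound:

$$m \geq \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right) = \frac{1}{0.1}\left(\ln 10 + \ln 5\right) \approx 10 \times (2.30 + 1.61) \approx 39.1,$$

so $m = 40$ samples suffice.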
Note that this calculation is agnostic to the nature of the distribution: the bound holds for any $D$, not just the uniform one.
Final Notes and Limitations
- If the target concept is not in the hypothesis space, the learning scenario is called “agnostic,” and the formula for the sample bound must be calculated differently. Generally speaking, in this scenario, more samples are required.
- If the size of the hypothesis space, $|H|$, is infinite, then the formula for the sample bound breaks down. This scenario is handled with the VC dimension, discussed in a separate lecture.
These notes are continued in my notes on VC Dimensions.
For Fall 2019, CS 7641 is instructed by Dr. Charles Isbell. The course content was originally created by Dr. Charles Isbell and Dr. Michael Littman.