The VC dimension.

Machine Learning Classification

A quick explanation of the VC dimension.

Author: Andrea Bonvini

Published: Feb. 13, 2021


When talking about binary classification, a hypothesis is a function that maps an input from the entire input space to a label:

$$h : \mathcal{X} \rightarrow \{-1, +1\}$$

The number of hypotheses $|\mathcal{H}|$ can be infinite.

A dichotomy, instead, is a hypothesis restricted to the sample: it maps an input from the sample points to a label:

$$h : \{x_1, x_2, \dots, x_N\} \rightarrow \{-1, +1\}$$

The number of dichotomies $|\mathcal{H}(x_1, x_2, \dots, x_N)|$ is at most $2^N$, where $N$ is the sample size.

E.g. for a sample size $N = 3$ we have at most $2^3 = 8$ possible dichotomies:

        x1 x2 x3
1       -1 -1 -1
2       -1 -1 +1
3       -1 +1 -1
4       -1 +1 +1
5       +1 -1 -1 
6       +1 -1 +1
7       +1 +1 -1
8       +1 +1 +1
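To make the difference between "all possible labelings" and "dichotomies a class can actually produce" concrete, here is a minimal Python sketch. It assumes a toy hypothesis class of 1D "positive rays", $h_a(x) = +1$ if $x > a$ else $-1$ (this class and the three sample points are purely illustrative choices, not something used elsewhere in the post), and checks which of the $2^3 = 8$ labelings that class can realize:

```python
import itertools
import numpy as np

# Three fixed sample points on the real line (arbitrary choice).
points = np.array([0.5, 1.5, 2.5])

# Toy hypothesis class: "positive rays" h_a(x) = +1 if x > a else -1,
# approximated by sweeping the threshold a over a fine grid.
thresholds = np.linspace(-1.0, 4.0, 501)
realizable = {tuple(np.where(points > a, 1, -1)) for a in thresholds}

# All 2^3 = 8 possible labelings (dichotomies) of the three points.
for labels in itertools.product([-1, 1], repeat=3):
    status = "realizable" if labels in realizable else "not realizable"
    print(labels, "->", status)

print(f"{len(realizable)} of 8 dichotomies are realizable by positive rays")
```

Positive rays can realize only 4 of the 8 possible labelings here; how many dichotomies a class can achieve, in the best case, is exactly what the growth function quantifies.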

The growth function counts the most dichotomies that can be obtained on any $N$ points:

$$m_{\mathcal{H}}(N) = \max_{x_1, \dots, x_N \in \mathcal{X}} |\mathcal{H}(x_1, \dots, x_N)|$$

This translates into choosing any $N$ points and laying them out in any fashion in the input space. Determining $m_{\mathcal{H}}(N)$ is equivalent to looking for the layout of the $N$ points that yields the most dichotomies.

The growth function satisfies:

$$m_{\mathcal{H}}(N) \leq 2^N$$
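As a sanity check on this inequality, here is a small sketch (again assuming the toy positive-ray class, with random layouts standing in for the max over all layouts) that counts dichotomies for growing $N$; for positive rays the count grows like $N + 1$, far below $2^N$:

```python
import numpy as np

rng = np.random.default_rng(0)

def num_dichotomies(points):
    """Count distinct labelings produced on `points` by positive rays
    h_a(x) = +1 if x > a else -1."""
    xs = np.sort(points)
    # One representative threshold per "gap": below all points, between
    # consecutive points, and above all points.
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    return len({tuple(np.where(points > a, 1, -1)) for a in candidates})

for N in range(1, 7):
    # Approximate the max over layouts by trying several random layouts.
    m = max(num_dichotomies(rng.uniform(-1.0, 1.0, size=N)) for _ in range(100))
    print(f"N={N}: m_H(N) = {m}   (2^N = {2 ** N})")
```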

This can be applied to the perceptron. For example, when $N = 4$, we can lay out the points so that they are easily separated. However, given a layout, we must then consider all possible configurations of labels on the points. One of them assigns one class to the top and bottom points and the other class to the left and right points: here the perceptron breaks down, because no line can separate that labeling. The same holds for the configuration with the two classes swapped (left/right points blue and top/bottom red). Since these two configurations cannot be represented, $m_{\mathcal{H}}(4) = 2^4 - 2 = 14$. For this reason, we have to expect that for perceptrons $m_{\mathcal{H}}(4)$ can't be $2^4$.
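A quick way to verify the $m_{\mathcal{H}}(4) = 14$ claim is to brute-force all 16 labelings of a diamond-shaped layout and test each one for linear separability. The sketch below does this with an LP feasibility check via `scipy.optimize.linprog` (the diamond coordinates and the margin-1 formulation are just convenient choices for illustration):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Four points laid out as a diamond: left, right, top, bottom.
X = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

def linearly_separable(X, y):
    """Check, via an LP feasibility problem, whether some line w.x + b = 0
    satisfies y_i * (w.x_i + b) >= 1 for every point."""
    n, d = X.shape
    # Variables: [w_1, ..., w_d, b]; constraints: -y_i * (w.x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

separable = sum(
    linearly_separable(X, np.array(labels))
    for labels in itertools.product([-1, 1], repeat=4)
)
print(f"{separable} of {2 ** 4} labelings are linearly separable")  # expected: 14
```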

The VC (Vapnik–Chervonenkis) dimension of a hypothesis set $\mathcal{H}$, denoted by $d_{VC}(\mathcal{H})$, is the largest value of $N$ for which $m_{\mathcal{H}}(N) = 2^N$; in other words, it is "the most points $\mathcal{H}$ can shatter".

We can say that the VC dimension is one of many measures that characterize the expressive power, or capacity, of a hypothesis class.

You can think of the VC dimension as “how many points can this model class memorize/shatter?” (a ton? BAD! not so many? GOOD!).

With respect to learning, the effect of the VC dimension is that if the VC dimension is finite, then the final hypothesis $g$ will generalize:

$$d_{VC}(\mathcal{H}) < \infty \;\Longrightarrow\; g \in \mathcal{H} \text{ will generalize}$$

The key observation here is that this statement is independent of:

- the learning algorithm,
- the input distribution,
- the target function.

The only things that factor into this are the training examples, the hypothesis set, and the final hypothesis.

The VC dimension of a linear classifier in $d$ dimensions (i.e. a line in 2D, a plane in 3D, etc.) is $d + 1$: a line can shatter at most $2 + 1 = 3$ points, a plane can shatter at most $3 + 1 = 4$ points, and so on.

Proof: here
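Without reproducing the proof, a rough empirical check is possible: randomly search for layouts of $N$ points in 2D that a linear classifier shatters. Random search can only confirm shattering, never rule it out, so the sketch below is a sanity check rather than a proof; still, it agrees with $d_{VC} = 2 + 1 = 3$:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def linearly_separable(X, y):
    """LP feasibility check: does some line satisfy y_i * (w.x_i + b) >= 1 for all i?"""
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shatters(X):
    """True if a 2D linear classifier realizes all 2^N labelings of the rows of X."""
    return all(linearly_separable(X, np.array(y))
               for y in itertools.product([-1, 1], repeat=len(X)))

for N in range(1, 5):
    # Random search for a layout of N points that is shattered.
    found = any(shatters(rng.uniform(-1, 1, size=(N, 2))) for _ in range(50))
    print(f"N={N}: shattered layout found -> {found}")
# Expected: True for N = 1, 2, 3 and False for N = 4, consistent with d_VC = 2 + 1 = 3.
```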

How many randomly drawn examples suffice to guarantee error of at most $\epsilon$ with probability at least $(1 - \delta)$?

$$N \geq \frac{1}{\epsilon}\left(4\log_2\left(\frac{2}{\delta}\right) + 8\,VC(\mathcal{H})\log_2\left(\frac{13}{\epsilon}\right)\right)$$
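Plugging illustrative numbers into this bound gives a feel for the scale (the choices $VC(\mathcal{H}) = 3$, $\epsilon = 0.1$, $\delta = 0.05$ below are arbitrary):

```python
import math

def sample_complexity(eps, delta, vc):
    """Evaluate (1/eps) * (4*log2(2/delta) + 8*vc*log2(13/eps))."""
    return (4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps)) / eps

# Illustrative numbers: a 2D perceptron (VC = 3), 10% error, 95% confidence.
print(math.ceil(sample_complexity(eps=0.1, delta=0.05, vc=3)))  # roughly 1900 examples
```

So on the order of a couple of thousand examples suffice for these settings; bounds of this kind are typically loose in practice.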

PAC bound using the VC dimension:

$$L_{\text{true}}(h) \leq L_{\text{train}}(h) + \sqrt{\frac{VC(\mathcal{H})\left(\ln\frac{2N}{VC(\mathcal{H})} + 1\right) + \ln\frac{4}{\delta}}{N}}$$
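To see how the complexity term of this bound shrinks with the sample size, here is a small sketch that evaluates it for a few values of $N$ (again with the illustrative choices $VC(\mathcal{H}) = 3$ and $\delta = 0.05$):

```python
import math

def vc_bound_gap(N, vc, delta):
    """Complexity term of the PAC bound:
    sqrt((vc * (ln(2N/vc) + 1) + ln(4/delta)) / N)."""
    return math.sqrt((vc * (math.log(2 * N / vc) + 1) + math.log(4 / delta)) / N)

# Illustrative: VC(H) = 3 (2D perceptron), delta = 0.05.
for N in [100, 1_000, 10_000, 100_000]:
    print(N, round(vc_bound_gap(N, vc=3, delta=0.05), 3))
# The gap decreases roughly like sqrt(ln(N) / N).
```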


    Citation

    For attribution, please cite this work as

    Bonvini (2021, Feb. 14). Last Week's Potatoes: The VC dimension.. Retrieved from https://lastweekspotatoes.com/posts/2021-07-22-the-vc-dimension/

    BibTeX citation

    @misc{bonvini2021the,
      author = {Bonvini, Andrea},
      title = {Last Week's Potatoes: The VC dimension.},
      url = {https://lastweekspotatoes.com/posts/2021-07-22-the-vc-dimension/},
      year = {2021}
    }