What is the kernel trick? What’s the main advantage of this technique?
Traditionally, the theory and algorithms of machine learning and statistics have been very well developed for the linear case. Real-world data analysis problems, on the other hand, often require nonlinear methods to detect the kinds of dependencies that allow successful prediction of properties of interest. By using a positive definite kernel, one can sometimes have the best of both worlds. The kernel corresponds to a dot product in a usually high-dimensional (possibly infinite-dimensional) feature space. In this space our estimation methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never explicitly have to compute anything in the high-dimensional feature space! (This is called the kernel trick.)
Suppose we have a mapping $\varphi : \mathbb{R}^d \to \mathbb{R}^m$ that brings our vectors into some feature space $\mathbb{R}^m$. Then the dot product of $x$ and $y$ in this space is $\varphi(x)^T\varphi(y)$.
A kernel is a function $k$ that corresponds to this dot product, i.e. $k(x, y) = \varphi(x)^T\varphi(y)$.
Why is this useful? Kernels give us a way to compute dot products in some feature space without even knowing what this space is or what $\varphi$ looks like.
For example, consider a simple polynomial kernel $k(x, y) = (1 + x^Ty)^2$ with $x, y \in \mathbb{R}^2$.
This doesn't seem to correspond to any mapping function $\varphi$; it's just a function that returns a real number. Assuming that $x = (x_1, x_2)$ and $y = (y_1, y_2)$, let's expand this expression:
$$k(x, y) = (1 + x^Ty)^2 = (1 + x_1y_1 + x_2y_2)^2 = 1 + x_1^2y_1^2 + x_2^2y_2^2 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2$$
Note that this is nothing else but a dot product between two vectors:
$$\varphi(x) = \varphi(x_1, x_2) = \left(1,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_1x_2\right)$$
and
$$\varphi(y) = \varphi(y_1, y_2) = \left(1,\ y_1^2,\ y_2^2,\ \sqrt{2}y_1,\ \sqrt{2}y_2,\ \sqrt{2}y_1y_2\right)$$
So the kernel $k(x, y) = (1 + x^Ty)^2 = \varphi(x)^T\varphi(y)$ computes a dot product in a 6-dimensional space without explicitly visiting this space.
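As a quick sanity check, here is a minimal NumPy sketch (the helper names `phi` and `poly_kernel` are just illustrative) verifying that the kernel value and the explicit 6-dimensional dot product coincide:

```python
import numpy as np

def phi(v):
    """Explicit 6-dimensional feature map for the polynomial kernel (1 + x^T y)^2."""
    v1, v2 = v
    return np.array([1.0, v1**2, v2**2,
                     np.sqrt(2) * v1, np.sqrt(2) * v2, np.sqrt(2) * v1 * v2])

def poly_kernel(x, y):
    """Kernel evaluated directly in the input space R^2."""
    return (1.0 + x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Both numbers coincide: the kernel never visits the 6-dimensional space.
print(poly_kernel(x, y))    # 4.0
print(phi(x) @ phi(y))      # 4.0
```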
Another example is the Gaussian kernel $k(x, y) = \exp\left(-\gamma\|x - y\|^2\right)$. If we Taylor-expand this function, we'll see that it corresponds to an infinite-dimensional codomain of $\varphi$.
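To get a feeling for why, here is a rough sketch for the one-dimensional case $x, y \in \mathbb{R}$ with $\gamma = \frac{1}{2}$ (the general case works in the same way):

$$\exp\left(-\tfrac{1}{2}(x - y)^2\right) = \exp\left(-\tfrac{x^2}{2}\right)\exp\left(-\tfrac{y^2}{2}\right)\exp(xy) = \exp\left(-\tfrac{x^2}{2}\right)\exp\left(-\tfrac{y^2}{2}\right)\sum_{j=0}^{\infty}\frac{x^jy^j}{j!}$$

so the kernel is a dot product between two infinite-dimensional feature vectors whose $j$-th components are $e^{-x^2/2}\frac{x^j}{\sqrt{j!}}$ and $e^{-y^2/2}\frac{y^j}{\sqrt{j!}}$ respectively.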
At the other extreme, the simplest kernel is the linear kernel, which corresponds to the identity mapping in feature space: $k(x, x') = \varphi(x)^T\varphi(x') = x^Tx'$
Moreover, the kernel is a symmetric function of its arguments: $k(x, x') = k(x', x)$
Many linear models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally! For example, if we consider a linear ridge regression model, we know that we obtain the best parameters by minimizing the regularized sum-of-squares error function (ridge):
$$L_w = \frac{1}{2}\sum_{n=1}^{N}\left(w^T\varphi(x_n) - t_n\right)^2 + \frac{\lambda}{2}w^Tw = \frac{1}{2}(\Phi w - t)^T(\Phi w - t) + \frac{\lambda}{2}w^Tw$$
where $\Phi$ is the design matrix whose $n$-th row is $\varphi(x_n)^T$ (remember that in $L_w$ all the vectors are column vectors) and $t = (t_1, \dots, t_N)^T$ is the target vector.
Setting the gradient of $L_w$ w.r.t. $w$ equal to $0$, we obtain the following:
$$\frac{\partial L_w}{\partial w} = 0$$
$$\frac{\partial}{\partial w}\left(\frac{1}{2}(\Phi w - t)^T(\Phi w - t) + \frac{\lambda}{2}w^Tw\right) = 0$$
$$\Phi^T(\Phi w - t) + \lambda w = 0$$
$$w = -\frac{1}{\lambda}\Phi^T(\Phi w - t)$$
$$w = \Phi^Ta$$
where $a = -\frac{1}{\lambda}(\Phi w - t)$ is an $N \times 1$ vector.
We observe that the coefficients $a_n$ are functions of $w$. So our definition of $w$ is a function of $w$ itself… which is surely weird, just wait for it…
We now define the Gram matrix $K = \Phi\Phi^T$, an $N \times N$ matrix, with elements:
$$K_{nm} = \varphi(x_n)^T\varphi(x_m) = k(x_n, x_m)$$
So, given $N$ samples, the Gram matrix is the matrix of all pairwise inner products:
$$K = \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \dots & k(x_N, x_N) \end{bmatrix}$$
This will come in handy in a few seconds…
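As an illustration, here is a minimal sketch of how such a Gram matrix could be built in NumPy (the `gram_matrix` helper and the choice of a Gaussian kernel are just for this example):

```python
import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gram_matrix(X, kernel):
    """N x N matrix of all pairwise kernel evaluations K[n, m] = k(x_n, x_m)."""
    N = X.shape[0]
    K = np.empty((N, N))
    for n in range(N):
        for m in range(N):
            K[n, m] = kernel(X[n], X[m])
    return K

X = np.random.randn(5, 2)            # N = 5 samples in R^2
K = gram_matrix(X, gaussian_kernel)
print(K.shape)                       # (5, 5)
```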
If we substitute $w = \Phi^Ta$ into $L_w$ we get
$$L_w = \frac{1}{2}(\Phi w - t)^T(\Phi w - t) + \frac{\lambda}{2}w^Tw$$
$$L_w = \frac{1}{2}\left(\Phi\Phi^Ta - t\right)^T\left(\Phi\Phi^Ta - t\right) + \frac{\lambda}{2}\left(\Phi^Ta\right)^T\left(\Phi^Ta\right)$$
$$L_a = \frac{1}{2}a^T\Phi\Phi^T\Phi\Phi^Ta - a^T\Phi\Phi^Tt + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^T\Phi\Phi^Ta$$
Guess what? We can rewrite the loss function in terms of the Gram matrix!
$$L_a = \frac{1}{2}a^TKKa - a^TKt + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^TKa$$
By combining $w = \Phi^Ta$ and $a_n = -\frac{1}{\lambda}\left(w^T\varphi(x_n) - t_n\right)$, setting the gradient w.r.t. $a$ equal to $0$ and isolating $a$, we obtain:
$$a = (K + \lambda I_N)^{-1}t$$
where $I_N$ is the identity matrix of dimension $N$. Note that $K$ is $N \times N$ and $t$ is $N \times 1$, so $a$ is $N \times 1$.
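Spelling out that combination step in vector form:

$$a = -\frac{1}{\lambda}\left(\Phi\Phi^Ta - t\right) = -\frac{1}{\lambda}\left(Ka - t\right) \;\Longrightarrow\; (K + \lambda I_N)\,a = t \;\Longrightarrow\; a = (K + \lambda I_N)^{-1}t$$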
So we can make our prediction for a new input $x$ by substituting back into our linear regression model:
$$y(x) = w^T\varphi(x) = \left(\Phi^Ta\right)^T\varphi(x) = a^T\Phi\,\varphi(x) = k(x)^T(K + \lambda I_N)^{-1}t$$
where $k(x)$ is an $N$-dimensional column vector with elements $k_n(x) = k(x_n, x)$.
The good thing is that instead of inverting an $M \times M$ matrix (where $M$ is the dimensionality of the feature space), we are inverting an $N \times N$ matrix! Since everything is expressed through kernel evaluations, this allows us to work with feature spaces of very high, or even infinite, dimensionality.
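Putting the pieces together, here is a minimal, self-contained kernel ridge regression sketch in NumPy on a toy 1-D problem (the function names and hyperparameters are illustrative; solving the linear system is used in place of an explicit matrix inverse, which is numerically preferable):

```python
import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def fit_dual(X, t, lam=0.1, kernel=gaussian_kernel):
    """Solve a = (K + lam * I_N)^{-1} t for the dual coefficients."""
    N = X.shape[0]
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    return np.linalg.solve(K + lam * np.eye(N), t)

def predict(x_new, X, a, kernel=gaussian_kernel):
    """y(x) = k(x)^T a, where k(x)_n = k(x_n, x)."""
    k_x = np.array([kernel(xn, x_new) for xn in X])
    return k_x @ a

# Toy 1-D regression problem: t = sin(x) + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

a = fit_dual(X, t, lam=0.1)
print(predict(np.array([1.0]), X, a))   # should be roughly sin(1) ≈ 0.84 on this toy problem
```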
But how can we build a valid kernel?
We have mainly two ways to do it:
By construction: we choose a feature space mapping $\varphi(x)$ and use it to find the corresponding kernel.
Directly: it is possible to test whether a function is a valid kernel without having to construct the basis function explicitly. The necessary and sufficient condition for a function $k(x, x')$ to be a valid kernel is that the Gram matrix $K$ is positive semi-definite for all possible choices of the set $\{x_n\}$. This means that $z^TKz \geq 0$ for any real vector $z$, i.e. $\sum_n\sum_m K_{nm}z_nz_m \geq 0$ for any choice of real numbers $z_1, \dots, z_N$.
$\implies$ Mercer's Theorem: any continuous, symmetric, positive semi-definite kernel function $k(x, y)$ can be expressed as a dot product in a high-dimensional space.
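In practice, a quick (necessary but not sufficient) sanity check is to sample some points, build the Gram matrix, and verify that its eigenvalues are non-negative; a sketch:

```python
import numpy as np

def is_psd_gram(kernel, X, tol=1e-10):
    """Check that the Gram matrix of `kernel` on the samples X is positive semi-definite."""
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh applies
    return np.all(eigvals >= -tol)

# The Gaussian kernel passes; a "kernel" like -x^T y does not.
X = np.random.randn(20, 3)
print(is_psd_gram(lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2)), X))  # True
print(is_psd_gram(lambda x, y: -(x @ y), X))                             # False: not a valid kernel
```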
New kernels can also be constructed from simpler kernels used as building blocks: given valid kernels $k_1(x, x')$ and $k_2(x, x')$, a number of combinations of them are guaranteed to be valid kernels as well (see the list below).
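For reference, a few of the standard constructions (this is the usual list of closure properties, as given e.g. in Bishop's Pattern Recognition and Machine Learning, Ch. 6; here $c > 0$ is a constant, $f(\cdot)$ is any function, and $q(\cdot)$ is a polynomial with non-negative coefficients):

$$
\begin{aligned}
k(x, x') &= c\,k_1(x, x') \\
k(x, x') &= f(x)\,k_1(x, x')\,f(x') \\
k(x, x') &= q\big(k_1(x, x')\big) \\
k(x, x') &= \exp\big(k_1(x, x')\big) \\
k(x, x') &= k_1(x, x') + k_2(x, x') \\
k(x, x') &= k_1(x, x')\,k_2(x, x')
\end{aligned}
$$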