What is the kernel trick? What’s the main advantage of this technique?
Traditionally, the theory and algorithms of machine learning and statistics have been very well developed for the linear case. Real-world data analysis problems, on the other hand, often require nonlinear methods to detect the kinds of dependencies that allow successful prediction of properties of interest. By using a positive definite kernel, one can sometimes have the best of both worlds. The kernel corresponds to a dot product in a usually high-dimensional (possibly infinite-dimensional) feature space. In this space our estimation methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never explicitly have to compute anything in the high-dimensional feature space! (This is called the kernel trick.)
Suppose we have a mapping $\varphi : \mathbb{R}^d \to \mathbb{R}^m$ that brings our vectors into some feature space $\mathbb{R}^m$. Then the dot product of $x$ and $y$ in this space is $\varphi(x)^T\varphi(y)$.
A kernel is a function $k$ that corresponds to this dot product, i.e. $k(x, y) = \varphi(x)^T\varphi(y)$.
Why is this useful? Kernels give us a way to compute dot products in some feature space without even knowing what this space is or what $\varphi$ looks like.
For example, consider a simple polynomial kernel $k(x, y) = (1 + x^Ty)^2$ with $x, y \in \mathbb{R}^2$.
This doesn't seem to correspond to any mapping function $\varphi$; it's just a function that returns a real number. Assuming that $x = (x_1, x_2)$ and $y = (y_1, y_2)$, let's expand this expression:
$$k(x, y) = (1 + x^Ty)^2 = (1 + x_1y_1 + x_2y_2)^2 = 1 + x_1^2y_1^2 + x_2^2y_2^2 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2$$
Note that this is nothing else but a dot product between two vectors:
$$\varphi(x) = \varphi(x_1, x_2) = \left(1,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_1x_2\right)$$
and
$$\varphi(y) = \varphi(y_1, y_2) = \left(1,\ y_1^2,\ y_2^2,\ \sqrt{2}y_1,\ \sqrt{2}y_2,\ \sqrt{2}y_1y_2\right)$$
So the kernel $k(x, y) = (1 + x^Ty)^2 = \varphi(x)^T\varphi(y)$ computes a dot product in a 6-dimensional space without explicitly visiting this space.
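As a quick sanity check, here is a minimal NumPy sketch (the helper names `phi` and `poly_kernel` are just illustrative) verifying that the kernel value and the explicit 6-dimensional dot product coincide:

```python
import numpy as np

def phi(v):
    """Explicit 6-dimensional feature map for the polynomial kernel (1 + x^T y)^2."""
    v1, v2 = v
    return np.array([1.0, v1**2, v2**2,
                     np.sqrt(2) * v1, np.sqrt(2) * v2, np.sqrt(2) * v1 * v2])

def poly_kernel(x, y):
    """Kernel evaluated directly in the input space R^2."""
    return (1.0 + x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Both numbers coincide: the kernel never visits the 6-dimensional space.
print(poly_kernel(x, y))    # 4.0
print(phi(x) @ phi(y))      # 4.0
```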
Another example is the Gaussian kernel $k(x, y) = \exp\left(-\gamma\|x - y\|^2\right)$. If we Taylor-expand this function, we'll see that it corresponds to an infinite-dimensional codomain of $\varphi$.
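To get a feeling for why, here is a rough sketch for the one-dimensional case $x, y \in \mathbb{R}$ with $\gamma = \frac{1}{2}$ (the general case works in the same way):

$$\exp\left(-\tfrac{1}{2}(x - y)^2\right) = \exp\left(-\tfrac{x^2}{2}\right)\exp\left(-\tfrac{y^2}{2}\right)\exp(xy) = \exp\left(-\tfrac{x^2}{2}\right)\exp\left(-\tfrac{y^2}{2}\right)\sum_{j=0}^{\infty}\frac{x^jy^j}{j!}$$

so the kernel is a dot product between two infinite-dimensional feature vectors whose $j$-th components are $e^{-x^2/2}\frac{x^j}{\sqrt{j!}}$ and $e^{-y^2/2}\frac{y^j}{\sqrt{j!}}$ respectively.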
At the other extreme, the simplest kernel is the linear kernel, which corresponds to the identity mapping in feature space: $k(x, x') = \varphi(x)^T\varphi(x') = x^Tx'$
Moreover, the kernel is a symmetric function of its arguments: $k(x, x') = k(x', x)$
Many linear models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally! For example, if we consider a linear ridge regression model, we know that we obtain the best parameters by minimizing the regularized sum-of-squares error function (ridge):
$$L_w = \frac{1}{2}\sum_{n=1}^{N}\left(w^T\varphi(x_n) - t_n\right)^2 + \frac{\lambda}{2}w^Tw = \frac{1}{2}(\Phi w - t)^T(\Phi w - t) + \frac{\lambda}{2}w^Tw$$
where $\Phi$ is the design matrix whose $n$-th row is $\varphi(x_n)^T$ (remember that in $L_w$ all the vectors are column vectors) and $t = (t_1, \dots, t_N)^T$ is the target vector.
Setting the gradient of $L_w$ w.r.t. $w$ equal to $0$, we obtain the following:
$$\frac{\partial L_w}{\partial w} = 0$$
$$\frac{\partial}{\partial w}\left(\frac{1}{2}(\Phi w - t)^T(\Phi w - t) + \frac{\lambda}{2}w^Tw\right) = 0$$
$$\Phi^T(\Phi w - t) + \lambda w = 0$$
$$w = -\frac{1}{\lambda}\Phi^T(\Phi w - t)$$
$$w = \Phi^Ta$$
where $a = -\frac{1}{\lambda}(\Phi w - t)$ is an $N \times 1$ vector.
We observe that the coefficients $a_n$ are functions of $w$. So our definition of $w$ is a function of $w$ itself… which is surely weird, just wait for it…
We now define the Gram matrix $K = \Phi\Phi^T$, an $N \times N$ matrix, with elements:
$$K_{nm} = \varphi(x_n)^T\varphi(x_m) = k(x_n, x_m)$$
So, given $N$ samples, the Gram matrix is the matrix of all pairwise inner products:
$$K = \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \dots & k(x_N, x_N) \end{bmatrix}$$
This will come in handy in a few seconds…
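As an illustration, here is a minimal sketch of how such a Gram matrix could be built in NumPy (the `gram_matrix` helper and the choice of a Gaussian kernel are just for this example):

```python
import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gram_matrix(X, kernel):
    """N x N matrix of all pairwise kernel evaluations K[n, m] = k(x_n, x_m)."""
    N = X.shape[0]
    K = np.empty((N, N))
    for n in range(N):
        for m in range(N):
            K[n, m] = kernel(X[n], X[m])
    return K

X = np.random.randn(5, 2)            # N = 5 samples in R^2
K = gram_matrix(X, gaussian_kernel)
print(K.shape)                       # (5, 5)
```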
If we substitute $w = \Phi^Ta$ into $L_w$ we get
$$L_w = \frac{1}{2}(\Phi w - t)^T(\Phi w - t) + \frac{\lambda}{2}w^Tw$$
$$L_w = \frac{1}{2}\left(\Phi\Phi^Ta - t\right)^T\left(\Phi\Phi^Ta - t\right) + \frac{\lambda}{2}\left(\Phi^Ta\right)^T\left(\Phi^Ta\right)$$
$$L_a = \frac{1}{2}a^T\Phi\Phi^T\Phi\Phi^Ta - a^T\Phi\Phi^Tt + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^T\Phi\Phi^Ta$$
Guess what? We can rewrite the loss function in terms of the Gram matrix!
$$L_a = \frac{1}{2}a^TKKa - a^TKt + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^TKa$$
By combining $w = \Phi^Ta$ and $a_n = -\frac{1}{\lambda}\left(w^T\varphi(x_n) - t_n\right)$, setting the gradient w.r.t. $a$ equal to $0$ and isolating $a$, we obtain:
$$a = (K + \lambda I_N)^{-1}t$$
where $I_N$ is the identity matrix of dimension $N$. Note that $K$ is $N \times N$ and $t$ is $N \times 1$, so $a$ is $N \times 1$.
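Spelling out that combination step in vector form:

$$a = -\frac{1}{\lambda}\left(\Phi\Phi^Ta - t\right) = -\frac{1}{\lambda}\left(Ka - t\right) \;\Longrightarrow\; (K + \lambda I_N)\,a = t \;\Longrightarrow\; a = (K + \lambda I_N)^{-1}t$$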
So we can make our prediction for a new input $x$ by substituting back into our linear regression model:
$$y(x) = w^T\varphi(x) = \left(\Phi^Ta\right)^T\varphi(x) = a^T\Phi\,\varphi(x) = k(x)^T(K + \lambda I_N)^{-1}t$$
where $k(x)$ is an $N$-dimensional column vector with elements $k_n(x) = k(x_n, x)$.
The good thing is that instead of inverting an $M \times M$ matrix (where $M$ is the dimensionality of the feature space), we are inverting an $N \times N$ matrix! Since everything is expressed through kernel evaluations, this allows us to work with feature spaces of very high, or even infinite, dimensionality.
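Putting the pieces together, here is a minimal, self-contained kernel ridge regression sketch in NumPy on a toy 1-D problem (the function names and hyperparameters are illustrative; solving the linear system is used in place of an explicit matrix inverse, which is numerically preferable):

```python
import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def fit_dual(X, t, lam=0.1, kernel=gaussian_kernel):
    """Solve a = (K + lam * I_N)^{-1} t for the dual coefficients."""
    N = X.shape[0]
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    return np.linalg.solve(K + lam * np.eye(N), t)

def predict(x_new, X, a, kernel=gaussian_kernel):
    """y(x) = k(x)^T a, where k(x)_n = k(x_n, x)."""
    k_x = np.array([kernel(xn, x_new) for xn in X])
    return k_x @ a

# Toy 1-D regression problem: t = sin(x) + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

a = fit_dual(X, t, lam=0.1)
print(predict(np.array([1.0]), X, a))   # should be roughly sin(1) ≈ 0.84 on this toy problem
```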
But how can we build a valid kernel?
We have mainly two ways to do it:
By construction: we choose a feature space mapping $\varphi(x)$ and use it to find the corresponding kernel.
Directly: it is possible to test whether a function is a valid kernel without having to construct the basis function explicitly. The necessary and sufficient condition for a function $k(x, x')$ to be a valid kernel is that the Gram matrix $K$ is positive semi-definite for all possible choices of the set $\{x_n\}$. This means that $z^TKz \geq 0$ for any real vector $z$, i.e. $\sum_n\sum_m K_{nm}z_nz_m \geq 0$ for any choice of real numbers $z_1, \dots, z_N$.
$\implies$ Mercer's Theorem: any continuous, symmetric, positive semi-definite kernel function $k(x, y)$ can be expressed as a dot product in a high-dimensional space.
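In practice, a quick (necessary but not sufficient) sanity check is to sample some points, build the Gram matrix, and verify that its eigenvalues are non-negative; a sketch:

```python
import numpy as np

def is_psd_gram(kernel, X, tol=1e-10):
    """Check that the Gram matrix of `kernel` on the samples X is positive semi-definite."""
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh applies
    return np.all(eigvals >= -tol)

# The Gaussian kernel passes; a "kernel" like -x^T y does not.
X = np.random.randn(20, 3)
print(is_psd_gram(lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2)), X))  # True
print(is_psd_gram(lambda x, y: -(x @ y), X))                             # False: not a valid kernel
```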
New kernels can also be constructed from simpler kernels used as building blocks: given valid kernels $k_1(x, x')$ and $k_2(x, x')$, a number of combinations of them are guaranteed to be valid kernels as well (see the list below).
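For reference, a few of the standard constructions (this is the usual list of closure properties, as given e.g. in Bishop's Pattern Recognition and Machine Learning, Ch. 6; here $c > 0$ is a constant, $f(\cdot)$ is any function, and $q(\cdot)$ is a polynomial with non-negative coefficients):

$$
\begin{aligned}
k(x, x') &= c\,k_1(x, x') \\
k(x, x') &= f(x)\,k_1(x, x')\,f(x') \\
k(x, x') &= q\big(k_1(x, x')\big) \\
k(x, x') &= \exp\big(k_1(x, x')\big) \\
k(x, x') &= k_1(x, x') + k_2(x, x') \\
k(x, x') &= k_1(x, x')\,k_2(x, x')
\end{aligned}
$$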