A Basic Note on Smoothing Splines

My pursuit of fake mathematician status is on hold for a bit as I find myself getting deeper into statistical learning and optimization. I’m in a humanities online class that talks about statistical learning, taught by Trevor Hastie and Rob Tibshirani. I don’t think I’m really the target audience but I do think it helps me read their book, Elements of Statistical Learning by Hastie, Tibshirani, and Friedman [1]. I really like the material but sometimes I find it very difficult, so their high-level non-technical explanations help a lot. In one of their lectures they said that a particular proof was fun, and so of course I had to go look into it. It’s actually an exercise in their book, and I liked it a lot so I figured I’d write it up. As for my rule about not posting solutions to exercises in books, I have fewer reservations in this case because a solution manual is available here. Actually, before we look at the exercise I should give some relevant background and definitions (for my benefit as much as anyone else’s).

We’re concerned with the problem of fitting a function to data. This is of course really vague and there are tons of approaches, but the approach we’re going to use is called polynomial splines, more specifically cubic splines. This basically involves splitting the feature space into a few different regions and fitting a polynomial (cubic) to the data in each region. For convenience we’re going to work in a one-dimensional feature space, and the x-coordinates which define the regions we are splitting the feature space into will be called knots. We also have a few other requests; we want our function to be continuous and we want its first two derivatives to be continuous. In general when working with splines using polynomials of degree {M} we require {M-1} continuous derivatives. Hastie and company note that cubic splines are the most common and that making the second derivative continuous makes the transition imperceptible to the human eye. I have absolutely no idea why this is a good criterion concerning continuity – honestly I suspect it’s not, since I see no further discussion. But moving on, here’s a more formal-ish definition of a cubic spline.

Definition 1 A cubic spline interpolant with knots {\xi_1, \xi_2, \cdots, \xi_M} on feature space {X} and target set {Y} is a function {f: X \rightarrow Y} defined as follows:

f(x) = \left\{ \begin{array}{ll} f_0(x) & -\infty < x \leq \xi_1 \\ f_1(x) & \xi_1 < x \leq \xi_2 \\ \vdots & \vdots \\ f_M(x) & \xi_M < x < \infty\end{array}\right.

where {f_i(x)} is a cubic function with {f_i(\xi_{i+1}) = f_{i+1}(\xi_{i+1})}, {f_i'(\xi_{i+1}) = f_{i+1}'(\xi_{i+1})}, and {f_i''(\xi_{i+1}) = f_{i+1}''(\xi_{i+1})} for {i = 0,1, \cdots M-1}. A natural cubic spline is a cubic spline such that {f_0(x)} and {f_M(x)} are linear.

Presumably the {f_i} are actually fit to data in some way, although I suppose in a strictly technical sense that’s not required. A natural cubic spline is sometimes preferred because polynomials are not really to be trusted when they go off to infinity. There’s still a problem here, though. How do we actually pick the knots {\xi_1, \xi_2, \cdots, \xi_M}? I suppose in some scenarios there might be a definitive divide in the data, but in general it is not at all obvious. But like everything in statistical learning (at least in my experience so far), a simple idea comes to the rescue. Just make all the {x_i} knots! This is the maximal set of useful knots since adding more cannot improve the fit. This is called the smoothing spline. It’s not actually immediately clear why this is a great idea; while we will have minimal training error, why should we expect such an approach to produce a stable hypothesis function? That’s where the exercise posed by Professors Hastie and Tibshirani comes in.

Exercise 1 Given points {\{x_i, y_i\}^N} with {N \geq 2} and {a < x_i < b} for {i = 1,2,\cdots,N} , consider the following optimization problem:

\displaystyle \displaystyle\min_{f}{\left[\displaystyle\sum_{i=1}^{N}{(y_i - f(x_i))^2} + \lambda \displaystyle\int_{a}^{b}{(f''(t))^2 dt}\right]}

Show that the minimizer over all functions {f} of class {C^2} defined on {[a,b]} is a natural cubic spline interpolant with knots at each {x_i} ({C^2} is the class of functions which have continuous first and second derivatives).

This objective function has a simple interpretation; the first term is the residual sum of squares, and the second term is a regularization or penalty term with tuning parameter lambda which penalizes large second derivatives. With {\lambda = 0} any function that interpolates the data will be a minimizer, and with {\lambda = \infty} we will be forced to use a linear function, so the problem collapses to least squares, which is a sort of degenerate natural cubic spline. It is much more clear why the minimizer of this objective would be a good model than why just making all data points knots would produce a good model, but it turns out that they are actually one and the same.

We begin with the ever-classic proof technique. Let {g(x)} be the natural cubic spline interpolant to the pairs {\{x_i, y_i\}^N} and let {\tilde{g}(x)} be another function of class {C^2} interpolating {\{x_i, y_i\}^N}. We’re going to show that if {\tilde{g}} is as good as a solution as {g}, then it is equal to {g} on {[a,b]}. Let {h(x) = \tilde{g}(x) - g(x)}. It’s not too hard to show that {g} can perfectly interpolate the data (cubic splines with {N} knots are defined by a set of basis functions of size {N+4}), but we’ll just assume it here. Consider the following calculation, where we let {a = x_0} and {b = x_{j+1}} for convenience.

{= g''(x)h'(x) \big\vert^b_a - \int_{a}^{b}{h'(x)g'''(x)dx}} integration by parts
{= -\int_{a}^{b}{h'(x)g'''(x)dx}} {g''(b) = g''(a) = 0} since {g} is linear outside the knots
{= -\sum_{j=0}^{N}{\int_{x_j}^{x_{j+1}}{h'(x)g'''(x)dx}}} splitting the integral
{= -\sum_{j=0}^{N}{(g'''(x)h(x) \big\vert^{x_{j+1}}_{x_j} - \int_{x_j}^{x_{j+1}}{h(x)g''''(x)dx})}} integration by parts again
{= -\sum_{j=0}^{N}{g'''(x)h(x) \big\vert^{x_{j+1}}_{x_j}}} {g''''(x) = 0} since {g} is a cubic
{= -\sum_{j=1}^{N-1}{(g'''(x_{j+1})h(x_{j+1}) - g'''(x_j)h(x_j))}} {g} is linear outside the knots
{= -\sum_{j=1}^{N-1}{g'''(x_j)(h(x_{j+1}) - h(x_j))}} {g'''} is constant between knots
{= 0} {h(x_j) = \tilde{g}(x_j) - g(x_j) = y_j - y_j = 0}

If we plug {h(x) = \tilde{g}(x) - g(x)} into this result we see that it implies that {\int_a^b{(g''(x))^2dx} = \int_a^b{\tilde{g}''(x)g''(x)dx}}, and we can now use Cauchy-Schwarz (the operation {\langle f,g \rangle = \int{f(x)g(x)dx}} defines an inner product assuming {f} and {g} are square-integrable).

\displaystyle \begin{array}{rcl} \int_a^b{(g''(x))^2dx} \int_a^b{(\tilde{g}''(x))^2dx} &\geq& (\int_a^b{\tilde{g}''(x)g''(x)dx})^2 \\ \int_a^b{(g''(x))^2dx} \int_a^b{(\tilde{g}''(x))^2dx} &\geq& (\int_a^b{(g''(x))^2dx})^2 \\ \int_a^b{(\tilde{g}''(x))^2dx} &\geq& \int_a^b{(g''(x))^2dx} \\ \end{array}

Equality holds if and only if {g''(x) = \tilde{g}''(x)}, i.e. {h''(x)} is identically zero on {[a,b]}. This implies that their difference {h} is linear on {[a,b]}, but since they agree on {N \geq 2} points that must actually be zero, so {h(x)} is identically zero on {[a,b]}.

Armed with this result it’s apparent that the objective function will evaluate to something greater on any {\tilde{g}} that interpolates the data, so {g(x)} is the unique minimizer over all functions that perfectly interpolate the data. What about functions not perfectly interpolating the data? Could there exist a {\lambda} where some function that’s not a cubic spline will produce a slightly greater residual sum of squares but a significantly smaller penalty term? No; for any such function {\bar{f}}, let {y'_i} be the set of points it evaluates to on the {x_i}. We can find a cubic spline {f} that perfectly interpolates these points and run through the same argument with an objective function using the points {\{x_i, y'_i\}^N} to show that the penalty term will be smaller in {f}. Since the residual sum of squares in the original objective function is the same for {f} and {g} and the penalty term is the same across the two optimization problems, our cubic spline is optimal. We can safely conclude that our cubic spline is a unique minimizer.

There are actually a ton of other really cool ideas in this book. Hopefully I’ll find something to say about them instead of just doing one of the exercises, but honestly sometimes I just like to see a finished writeup. I will try to incorporate some of that next time.

[1] Hastie, Tibshirani, and Friedman. Elements of Statistical Learning, second edition. Last updated January 2013. A copy is available for free at http://statweb.stanford.edu/~tibs/ElemStatLearn/download.html.