A Basic Note on Smoothing Splines

My pursuit of fake mathematician status is on hold for a bit as I find myself getting deeper into statistical learning and optimization. I’m in a humanities online class that talks about statistical learning, taught by Trevor Hastie and Rob Tibshirani. I don’t think I’m really the target audience but I do think it helps me read their book, Elements of Statistical Learning by Hastie, Tibshirani, and Friedman [1]. I really like the material but sometimes I find it very difficult, so their high-level non-technical explanations help a lot. In one of their lectures they said that a particular proof was fun, and so of course I had to go look into it. It’s actually an exercise in their book, and I liked it a lot so I figured I’d write it up. As for my rule about not posting solutions to exercises in books, I have fewer reservations in this case because a solution manual is available here. Actually, before we look at the exercise I should give some relevant background and definitions (for my benefit as much as anyone else’s).

We’re concerned with the problem of fitting a function to data. This is of course really vague and there are tons of approaches, but the approach we’re going to use is called polynomial splines, more specifically cubic splines. This basically involves splitting the feature space into a few different regions and fitting a polynomial (cubic) to the data in each region. For convenience we’re going to work in a one-dimensional feature space, and the x-coordinates which define the regions we are splitting the feature space into will be called knots. We also have a few other requests; we want our function to be continuous and we want its first two derivatives to be continuous. In general when working with splines using polynomials of degree {M} we require {M-1} continuous derivatives. Hastie and company note that cubic splines are the most common and that making the second derivative continuous makes the transition imperceptible to the human eye. I have absolutely no idea why this is a good criterion concerning continuity – honestly I suspect it’s not, since I see no further discussion. But moving on, here’s a more formal-ish definition of a cubic spline.

Definition 1 A cubic spline interpolant with knots {\xi_1, \xi_2, \cdots, \xi_M} on feature space {X} and target set {Y} is a function {f: X \rightarrow Y} defined as follows:

f(x) = \left\{ \begin{array}{ll} f_0(x) & -\infty < x \leq \xi_1 \\ f_1(x) & \xi_1 < x \leq \xi_2 \\ \vdots & \vdots \\ f_M(x) & \xi_M < x < \infty\end{array}\right.

where {f_i(x)} is a cubic function with {f_i(\xi_{i+1}) = f_{i+1}(\xi_{i+1})}, {f_i'(\xi_{i+1}) = f_{i+1}'(\xi_{i+1})}, and {f_i''(\xi_{i+1}) = f_{i+1}''(\xi_{i+1})} for {i = 0,1, \cdots M-1}. A natural cubic spline is a cubic spline such that {f_0(x)} and {f_M(x)} are linear.

Presumably the {f_i} are actually fit to data in some way, although I suppose in a strictly technical sense that’s not required. A natural cubic spline is sometimes preferred because polynomials are not really to be trusted when they go off to infinity. There’s still a problem here, though. How do we actually pick the knots {\xi_1, \xi_2, \cdots, \xi_M}? I suppose in some scenarios there might be a definitive divide in the data, but in general it is not at all obvious. But like everything in statistical learning (at least in my experience so far), a simple idea comes to the rescue. Just make all the {x_i} knots! This is the maximal set of useful knots since adding more cannot improve the fit. This is called the smoothing spline. It’s not actually immediately clear why this is a great idea; while we will have minimal training error, why should we expect such an approach to produce a stable hypothesis function? That’s where the exercise posed by Professors Hastie and Tibshirani comes in.

Exercise 1 Given points {\{x_i, y_i\}^N} with {N \geq 2} and {a < x_i < b} for {i = 1,2,\cdots,N} , consider the following optimization problem:

\displaystyle \displaystyle\min_{f}{\left[\displaystyle\sum_{i=1}^{N}{(y_i - f(x_i))^2} + \lambda \displaystyle\int_{a}^{b}{(f''(t))^2 dt}\right]}

Show that the minimizer over all functions {f} of class {C^2} defined on {[a,b]} is a natural cubic spline interpolant with knots at each {x_i} ({C^2} is the class of functions which have continuous first and second derivatives).

This objective function has a simple interpretation; the first term is the residual sum of squares, and the second term is a regularization or penalty term with tuning parameter lambda which penalizes large second derivatives. With {\lambda = 0} any function that interpolates the data will be a minimizer, and with {\lambda = \infty} we will be forced to use a linear function, so the problem collapses to least squares, which is a sort of degenerate natural cubic spline. It is much more clear why the minimizer of this objective would be a good model than why just making all data points knots would produce a good model, but it turns out that they are actually one and the same.

We begin with the ever-classic proof technique. Let {g(x)} be the natural cubic spline interpolant to the pairs {\{x_i, y_i\}^N} and let {\tilde{g}(x)} be another function of class {C^2} interpolating {\{x_i, y_i\}^N}. We’re going to show that if {\tilde{g}} is as good as a solution as {g}, then it is equal to {g} on {[a,b]}. Let {h(x) = \tilde{g}(x) - g(x)}. It’s not too hard to show that {g} can perfectly interpolate the data (cubic splines with {N} knots are defined by a set of basis functions of size {N+4}), but we’ll just assume it here. Consider the following calculation, where we let {a = x_0} and {b = x_{j+1}} for convenience.

{= g''(x)h'(x) \big\vert^b_a - \int_{a}^{b}{h'(x)g'''(x)dx}} integration by parts
{= -\int_{a}^{b}{h'(x)g'''(x)dx}} {g''(b) = g''(a) = 0} since {g} is linear outside the knots
{= -\sum_{j=0}^{N}{\int_{x_j}^{x_{j+1}}{h'(x)g'''(x)dx}}} splitting the integral
{= -\sum_{j=0}^{N}{(g'''(x)h(x) \big\vert^{x_{j+1}}_{x_j} - \int_{x_j}^{x_{j+1}}{h(x)g''''(x)dx})}} integration by parts again
{= -\sum_{j=0}^{N}{g'''(x)h(x) \big\vert^{x_{j+1}}_{x_j}}} {g''''(x) = 0} since {g} is a cubic
{= -\sum_{j=1}^{N-1}{(g'''(x_{j+1})h(x_{j+1}) - g'''(x_j)h(x_j))}} {g} is linear outside the knots
{= -\sum_{j=1}^{N-1}{g'''(x_j)(h(x_{j+1}) - h(x_j))}} {g'''} is constant between knots
{= 0} {h(x_j) = \tilde{g}(x_j) - g(x_j) = y_j - y_j = 0}

If we plug {h(x) = \tilde{g}(x) - g(x)} into this result we see that it implies that {\int_a^b{(g''(x))^2dx} = \int_a^b{\tilde{g}''(x)g''(x)dx}}, and we can now use Cauchy-Schwarz (the operation {\langle f,g \rangle = \int{f(x)g(x)dx}} defines an inner product assuming {f} and {g} are square-integrable).

\displaystyle \begin{array}{rcl} \int_a^b{(g''(x))^2dx} \int_a^b{(\tilde{g}''(x))^2dx} &\geq& (\int_a^b{\tilde{g}''(x)g''(x)dx})^2 \\ \int_a^b{(g''(x))^2dx} \int_a^b{(\tilde{g}''(x))^2dx} &\geq& (\int_a^b{(g''(x))^2dx})^2 \\ \int_a^b{(\tilde{g}''(x))^2dx} &\geq& \int_a^b{(g''(x))^2dx} \\ \end{array}

Equality holds if and only if {g''(x) = \tilde{g}''(x)}, i.e. {h''(x)} is identically zero on {[a,b]}. This implies that their difference {h} is linear on {[a,b]}, but since they agree on {N \geq 2} points that must actually be zero, so {h(x)} is identically zero on {[a,b]}.

Armed with this result it’s apparent that the objective function will evaluate to something greater on any {\tilde{g}} that interpolates the data, so {g(x)} is the unique minimizer over all functions that perfectly interpolate the data. What about functions not perfectly interpolating the data? Could there exist a {\lambda} where some function that’s not a cubic spline will produce a slightly greater residual sum of squares but a significantly smaller penalty term? No; for any such function {\bar{f}}, let {y'_i} be the set of points it evaluates to on the {x_i}. We can find a cubic spline {f} that perfectly interpolates these points and run through the same argument with an objective function using the points {\{x_i, y'_i\}^N} to show that the penalty term will be smaller in {f}. Since the residual sum of squares in the original objective function is the same for {f} and {g} and the penalty term is the same across the two optimization problems, our cubic spline is optimal. We can safely conclude that our cubic spline is a unique minimizer.

There are actually a ton of other really cool ideas in this book. Hopefully I’ll find something to say about them instead of just doing one of the exercises, but honestly sometimes I just like to see a finished writeup. I will try to incorporate some of that next time.

[1] Hastie, Tibshirani, and Friedman. Elements of Statistical Learning, second edition. Last updated January 2013. A copy is available for free at http://statweb.stanford.edu/~tibs/ElemStatLearn/download.html.


Complexity of Determining Convexity of Functions

This wasn’t really my week in terms of results, but I decided to post what I was up to anyway. I always try to solve my problems without help from other sources. Sometimes I make up my own problems that I think shouldn’t be too hard and try to solve those by myself. Sometimes that works, but sometimes I am very incorrect about the difficulty of the problem.

I mentioned previously that I’m in CVX 101, an online class on convex optimization. The homework, unsurprisingly, involves using their software (called CVX, based in Matlab) to solve convex optimization problems. I unfortunately did not bother to read the users manual very carefully before attempting some optimization problems. I got a little frustrated because it wouldn’t let me use something like {x^2 + xy + y^2} as an objective function, even though that’s clearly convex. If I had actually read the users guide I would have known that CVX only allows objective functions constructed from a pre-specified set of rules, and functions like the one I was trying didn’t fall into that category. For example in this problem I rewrote the function as {\tfrac{1}{2}(x+y)^2 + \tfrac{1}{2}x^2 + \tfrac{1}{2}y^2}, which is a little inconvenient but CVX accepts since the sum of squares of affine functions is always convex. I asked myself why they would make life so difficult, and I had a guess. I was pretty sure that determining the convexity of a polynomial in {n} variables was NP-hard.

Doing anything with multivariate polynomials is hard in my experience. For example, determining whether a polynomial in {n} variables has at least one real zero is NP-hard. We can reduce from {3}-SAT. One of my professors insisted on stating decision problems in the following manner and I got used to it, so bear with me. Here’s the statement for {3}-SAT:

INSTANCE: A boolean formula in {3}-CNF form with {n} variables
QUESTION: Is the formula satisfiable?

And here’s the statement of the multivariate polynomial problem:

INSTANCE: A polynomial {f: \mathbb{R}^n \rightarrow \mathbb{R}}
QUESTION: Does {f} have a zero on {\mathbb{R}^n?}

For a clause with literals {x_i}, {x_j}, {\neg x_k} map it to a term in a polynomial {[(y_i)(y_j)(1-y_k)]^2}. Map the rest of the clauses appropriately, and let’s call the polynomial we formed {f(Y)}. We see that {f} has a zero if and only if our instance of {3}-SAT is satisfiable. If {3}-SAT has a satisfying assignment and {x_i} is true in that assignment, let {y_i = 0}, otherwise let {y_i = 1}. Clearly a satisfying assignment produces a zero of {f}. If {f} has a zero then all the terms are zero because they’re all nonnegative, and a term is zero if and only if one of the components is zero, so one of the variables is zero or one. Reverse the mapping and we have a satisfying assignment. A last detail is that if we have a variable that isn’t zero or one in the root of {f}, it doesn’t matter in the assignment so we can ignore it.

I’ve never gotten any real verification that thinking like this is a good idea, but when I see a reduction like that I generally think of the problem we reduced to as being “much harder” than the first problem. In this case we could have made all kinds of restrictions to {f}; we only needed a polynomial of degree {6} or less and a very specific type of polynomial. Well, {3}-SAT is already strongly NP-hard so determining zeroes in multivariate polynomials must be really, really hard, right? Of course in this case we know that the multivariate polynomial problem is in NP and so it’s capped in difficulty, so it sort of implies that going from the restricted types of polynomials to the rest of polynomials is actually easy.

Anyway, back to the convexity problem. Here’s a statement:

INSTANCE: A polynomial {f: \mathbb{R}^n \rightarrow \mathbb{R}}
QUESTION: Is {f} convex in {\mathbb{R}^n}?

If {f} is degree {1}, {2}, or {3} this is pretty simple. If it’s degree one the answer is always YES because it’s affine, so convex. If it’s degree three then the answer is always NO. We need only consider the cubic terms. If we find a term of the form {x_i^3} then taking the derivative with respect to {x_i} twice will make it linear in {x_i}, and so a diagonal element of the Hessian can be made negative, which implies that it’s not positive semidefinite by the extended version of Sylvester’s criterion. The same arguments works for terms of the form {x_i^2x_j}. If all the cubic terms are of the form {x_ix_jx_k} then all diagonal elements are zero, and the principal minor using rows and columns {i} and {j} is negative since it’s {0} on the diagonal and {x_k} on the off-diagonal. As for degree two polynomials, we can just write it in its quadratic form {x^TAX} plus some affine stuff. Then {f} is convex iff {A} is positive semidefinite, which we can either test for with Sylvester’s Criterion or old-fashioned high school math, row reduction. I’m not actually sure which is faster but they’re both polynomial time.

At this point I was feeling pretty good. I had knocked out a few cases, and all that remained was showing this for even degree {4} or higher polynomials. In fact if you can show that deciding the convexity of a degree {4} polynomial is hard then you can just tack on a term like {x_{n+1}^{2s}} for some {s > 2} which is of course convex and you’ll show that deciding the convexity of a higher degree polynomial is also NP-hard. I tried a bunch of stuff but nothing I did really got anywhere. After a few days I decided to give in and started googling for some answers.

It turns out that this is was solved somewhat recently. My guess that determining convexity is hard was correct. It’s proven here by Ahmadi, Olshevsky, Parrilo, and Tsitsiklis [1]. The general outline is that determining non-negativity of biquadratic forms, which are expressions of the form {b(x,y) = \sum_{i \leq j, k \leq l}x_ix_jy_ky_l} where {x = (x_1, \cdots, x_n)} and {y = (y_1 \cdots, y_m)} is NP-hard, as shown in [2]. That reduction uses Clique in [3]. In general I find it kind of difficult to make the jump from discrete structures to continuous objects in reductions. The actual paper is clear and shockingly easy to read, so I won’t summarize it here, but I recommend it. The authors in [1] showed the reduction from the biquadratic form problem to convexity, as well as a bunch of other results: the hardness of determining strict and strong convexity and a polynomial time algorithm for determining quasiconvexity.

At any rate, my plans for proving that determining the convexity of quartics is NP-hard were squashed, but I figured relaxing the problem to rational functions was an acceptable concession. After all, you can easily type a rational function into Matlab, so it’s still a concern for the coders of CVX. After a little bit I realized that there was a somewhat dirty way to do this problem as a reduction from the problem of determining whether a polynomial in {n} variables has a zero. As far as Matlab is concerned, a function like {\tfrac{x^2}{x}} can’t be evaluated at {x=0}. Using this you can just take a polynomial in {n} variables and determine whether {\tfrac{f(X)}{f(X)}} is convex, since if it has no zeroes then the polynomial is {1} which is convex and if it has one or more zeroes then it’s not convex according to Matlab because the domain isn’t even convex.

I don’t really like that reduction, so I kept searching. I also relaxed the problem even more, allowing myself to determine the convexity of functions with exponentials and logarithms. Unfortunately I am still on the hunt. If I do manage to find one I’ll edit it in here, but until then that’s all I’ve got.

[1] A.A. Ahmadi, A. Olshevsky, P.A. Parrilo, and J.N. Tsitsiklis. NP-hardness of Deciding Convexity of Quartic Polynomials and Related Problems. Mathematical Programming vol. 137 Issue 1-2, pages 453-476. A copy is available here.
[2] C. Ling, J. Nie, L. Qu, and Y. Ye. Biquadratic optimization over unit spheres and semidefinite programming relaxations. SIAM Journal on Optimization, 20(3):1286-1310, 2009. [3] T.S. Motzkin and E.G. Strauss. Maxima for graphs and a new proof of a theorem of Turan. Canadian Journal of Mathematics, 17:533-540, 1965. A copy is available here.