A Simple Combo Problem

As is quite evident from the sparse content of this blog, I have a tendency to jump around from subject to subject. As of late I have been making a concerted effort to become a better coder, an effort that takes literally all of my energy. It is disappointing to let all the ideas I had for this site fall by the wayside, but perhaps new ideas can take their places. I ran across a fun problem while practicing the other day, and although it is an easy problem for the experienced coder I was happy to solve it, so here it is.

The problem is taken directly from Codeforces and can be found here, and if you click on the link to the “tutorial” on the lower right side of the page you can find their solution. I didn’t read it too carefully but it looks like more or less the same thing. But you might as well read mine, right?

If you didn’t click the link, here’s the problem statement (sans their creative background story).

Exercise 1 Given two integers {n} and {k} with {1 \leq n,k \leq 2000}. How many sequences of integers {b_1, b_2, \cdots, b_k} are there such that {1 \leq b_1 \leq b_2 \leq \cdots \leq b_k \leq n} and {b_i \mid b_{i+1}} for {i = 1,2,\cdots,k-1}?

 I stared at this problem for a while with no helpful ideas. I knew there should be a recurrence of some sort, and the idea that removing any element from a sequence of length {k} generates a sequence of length {k-1} sort of supported this idea. Ultimately though I realized that none of this was making use of the key property of the sequences, the divisor requirement. This requires that all elements of the sequence be divisors of the last element, {b_k}. This made me think of the sequence as a process of accumulating prime factors of {b_k}, and since the sequences are easily partitioned by their last element, I figured this should lead somewhere.

Slightly more formally, let {B(n,k)} be the number of valid sequences in the question and let {A(n,k)} be the number of valid sequences of length {k} with {b_k = n} (ending in {n}). Clearly {B(n,k) = \sum_{i=1}^{n}{A(i,k)}}, so if we can compute {A(i,k)} in reasonable time we should be good. I am actually terrible at estimating how long code will take to run; it is something I definitely need to learn. For now we will steer clear of the discussion of complexity with the confidence that the algorithm “looks probably good enough.”
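Before trusting any formula, it helps to have ground truth for small cases. Here is a brute-force counter (the name is mine, purely for illustration) that enumerates all candidate sequences directly; note that the divisibility condition already forces the sequence to be nondecreasing, so only the divisor requirement and the bound {n} need checking:

```python
from itertools import product

def brute_force_B(n, k):
    """Count sequences b_1, ..., b_k with 1 <= b_i <= n and b_i dividing b_{i+1}.

    Divisibility by a positive integer forces b_i <= b_{i+1}, so the
    nondecreasing condition comes for free.
    """
    count = 0
    for seq in product(range(1, n + 1), repeat=k):
        if all(seq[i + 1] % seq[i] == 0 for i in range(k - 1)):
            count += 1
    return count
```

For instance, {brute_force_B(4, 2)} finds the eight pairs {(1,1), (1,2), (2,2), (1,3), (3,3), (1,4), (2,4), (4,4)}.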

So now our challenge is to compute {A(n,k)}, and as we mentioned we are going to need the prime factorization. Actually it turns out we only need the exponents of the distinct prime factors, but I realized this far too late and didn’t really feel like modifying the code. Computing prime factorization is probably a well-studied topic, but for our purposes the simple process of dividing out all prime factors works fine. Here’s the code.

def prime_factor(n):
    pf = []
    d = 2
    while d**2 <= n:
        exp = 0
        while n % d == 0:
            n //= d
            exp += 1
        if exp > 0:    
            pf.append((d,exp)) #The prime doesn't actually matter, but whatever
        d += 1
    if n > 1:
        pf.append((n,1)) #Whatever's left over is one last prime factor
    return pf

So now we’ve got our prime factorization, but how do we use it? As I mentioned before I like to think of the sequence as the process of accumulating prime factors, so in the “gap” before each element of the sequence we throw in some new prime factors. We also create an artificial {1} at the beginning of the sequence since we can start from any divisor. Since numbers can repeat we aren’t required to put any prime factors in a given gap, and since the last number has to get to {n} we have to use all the prime factors eventually. Splitting them up by primes, say {n = p_1^{r_1}p_2^{r_2}\cdots p_s^{r_s}}, we can also just do the assignments of {p_1, p_2, \cdots, p_s} independently. There are {k} slots, one preceding each element of the sequence, we are permitted to have empty slots, and we must eventually assign all {r_i} copies of {p_i}. We recognize this as the number of nonnegative solutions to {x_1 + x_2 + \cdots + x_k = r_i}, which is just {\binom{r_i+k-1}{k-1}}. Since the distinct prime factors can be dealt with independently this gives us our expression:

Proposition 1 If the prime factorization of {n} is {p_1^{r_1}p_2^{r_2}\cdots p_s^{r_s}}, then {A(n,k) = \displaystyle\prod_{i=1}^{s}{\binom{r_i+k-1}{k-1}}}.
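The stars-and-bars step is easy to sanity-check by enumeration. This little comparison (my own, purely illustrative) counts the nonnegative solutions to {x_1 + \cdots + x_k = r} directly and compares with the closed form:

```python
from itertools import product
from math import factorial

def count_solutions(k, r):
    """Enumerate nonnegative integer solutions to x_1 + ... + x_k = r."""
    return sum(1 for xs in product(range(r + 1), repeat=k) if sum(xs) == r)

def stars_and_bars(k, r):
    """The closed form C(r + k - 1, k - 1), via factorials."""
    return factorial(r + k - 1) // (factorial(k - 1) * factorial(r))
```

Both give, for example, 15 ways to distribute {r = 4} copies of a prime among {k = 3} slots.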

The corresponding code shouldn’t be interesting, but apparently Python’s math library doesn’t feel like including a function to compute {\binom{n}{k}}. Writing one isn’t hard but I had to show off my impressive Python knowledge. Unfortunately doing so exposed the fact that I don’t know how to call the basic operators as functions, so it became slightly less than impressive.

def binom(n,k): #Because importing itertools or scipy seems overkill
    return reduce(mult, xrange(k+1,n+1), 1) / reduce(mult, xrange(1,n-k+1), 1) #Initializer handles empty ranges (k == 0 or k == n)

def mult(x,y): #To compensate for not knowing Python
    return x*y
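For what it’s worth, the operators-as-functions I was looking for live in the standard operator module, and Python 3.8 later added math.comb, which makes the whole exercise unnecessary. A Python 3 sketch:

```python
from functools import reduce  # reduce moved out of the builtins in Python 3
from operator import mul      # the built-in * as a function
from math import comb         # direct binomial coefficients, Python 3.8+

def binom_reduce(n, k):
    # The initializer 1 keeps reduce happy when a range is empty (k == 0 or k == n).
    return reduce(mul, range(k + 1, n + 1), 1) // reduce(mul, range(1, n - k + 1), 1)
```

Here binom_reduce(5, 2) and comb(5, 2) both give 10.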

The code for computing {B(n,k)} isn’t interesting at all, but I suppose it serves as a reasonable summary. Again it was hardly necessary to record the actual prime factors since we only need their exponents, but Python handled it nicely enough for me to not really care.

def B(n,k):
    s = 0
    for i in xrange(1,n+1):
        s += A(i,k)
    return s

def A(n,k): #Another function may have been unnecessary...
    pf = prime_factor(n)
    prod = 1
    for (p, exp) in pf:
        prod *= binom(exp+k-1, k-1)
    return prod
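Since the snippets above are Python 2 (xrange and all), here is the whole solution translated into Python 3 as one self-contained sketch; the names are mine and math.comb needs Python 3.8+:

```python
from math import comb  # Python 3.8+

def prime_exponents(n):
    """Exponents of the distinct primes dividing n (the primes themselves never matter)."""
    exps = []
    d = 2
    while d * d <= n:
        e = 0
        while n % d == 0:
            n //= d
            e += 1
        if e:
            exps.append(e)
        d += 1
    if n > 1:          # whatever is left over is one last prime factor
        exps.append(1)
    return exps

def A(n, k):
    """Number of valid length-k sequences ending in n."""
    prod = 1
    for r in prime_exponents(n):
        prod *= comb(r + k - 1, k - 1)
    return prod

def B(n, k):
    """Number of valid length-k sequences with all elements at most n."""
    return sum(A(i, k) for i in range(1, n + 1))
```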

All in all there are a ton of improvements I could probably make. My binom function is mostly a joke, I could probably cache prime factorizations, etc. I was pretty happy with the solution though and it ran in well under the time limit on Codeforces, so further optimization seemed a little unnecessary. I hope I can solve some more interesting problems as I improve as a coder – I say my goal is to improve coding but in reality my ability to design algorithms also lags far behind where I would like it. At the risk of being terribly mistaken once again, expect more of these types of posts in the future.


A Basic Note on Smoothing Splines

My pursuit of fake mathematician status is on hold for a bit as I find myself getting deeper into statistical learning and optimization. I’m taking an online class on statistical learning, taught by Trevor Hastie and Rob Tibshirani. I don’t think I’m really the target audience but I do think it helps me read their book, Elements of Statistical Learning by Hastie, Tibshirani, and Friedman [1]. I really like the material but sometimes I find it very difficult, so their high-level non-technical explanations help a lot. In one of their lectures they said that a particular proof was fun, and so of course I had to go look into it. It’s actually an exercise in their book, and I liked it a lot so I figured I’d write it up. As for my rule about not posting solutions to exercises in books, I have fewer reservations in this case because a solution manual is available here. Actually, before we look at the exercise I should give some relevant background and definitions (for my benefit as much as anyone else’s).

We’re concerned with the problem of fitting a function to data. This is of course really vague and there are tons of approaches, but the approach we’re going to use is called polynomial splines, more specifically cubic splines. This basically involves splitting the feature space into a few different regions and fitting a polynomial (cubic) to the data in each region. For convenience we’re going to work in a one-dimensional feature space, and the x-coordinates which define the regions we are splitting the feature space into will be called knots. We also have a few other requests; we want our function to be continuous and we want its first two derivatives to be continuous. In general when working with splines using polynomials of degree {M} we require {M-1} continuous derivatives. Hastie and company note that cubic splines are the most common and that making the second derivative continuous makes the transition imperceptible to the human eye. I have absolutely no idea why this is a good criterion concerning continuity – honestly I suspect it’s not, since I see no further discussion. But moving on, here’s a more formal-ish definition of a cubic spline.

Definition 1 A cubic spline interpolant with knots {\xi_1, \xi_2, \cdots, \xi_M} on feature space {X} and target set {Y} is a function {f: X \rightarrow Y} defined as follows:

f(x) = \left\{ \begin{array}{ll} f_0(x) & -\infty < x \leq \xi_1 \\ f_1(x) & \xi_1 < x \leq \xi_2 \\ \vdots & \vdots \\ f_M(x) & \xi_M < x < \infty\end{array}\right.

where {f_i(x)} is a cubic function with {f_i(\xi_{i+1}) = f_{i+1}(\xi_{i+1})}, {f_i'(\xi_{i+1}) = f_{i+1}'(\xi_{i+1})}, and {f_i''(\xi_{i+1}) = f_{i+1}''(\xi_{i+1})} for {i = 0,1,\cdots,M-1}. A natural cubic spline is a cubic spline such that {f_0(x)} and {f_M(x)} are linear.

Presumably the {f_i} are actually fit to data in some way, although I suppose in a strictly technical sense that’s not required. A natural cubic spline is sometimes preferred because polynomials are not really to be trusted when they go off to infinity. There’s still a problem here, though. How do we actually pick the knots {\xi_1, \xi_2, \cdots, \xi_M}? I suppose in some scenarios there might be a definitive divide in the data, but in general it is not at all obvious. But like everything in statistical learning (at least in my experience so far), a simple idea comes to the rescue. Just make all the {x_i} knots! This is the maximal set of useful knots since adding more cannot improve the fit. This is called the smoothing spline. It’s not actually immediately clear why this is a great idea; while we will have minimal training error, why should we expect such an approach to produce a stable hypothesis function? That’s where the exercise posed by Professors Hastie and Tibshirani comes in.

Exercise 1 Given points {\{x_i, y_i\}^N} with {N \geq 2} and {a < x_i < b} for {i = 1,2,\cdots,N}, consider the following optimization problem:

\displaystyle \displaystyle\min_{f}{\left[\displaystyle\sum_{i=1}^{N}{(y_i - f(x_i))^2} + \lambda \displaystyle\int_{a}^{b}{(f''(t))^2 dt}\right]}

Show that the minimizer over all functions {f} of class {C^2} defined on {[a,b]} is a natural cubic spline interpolant with knots at each {x_i} ({C^2} is the class of functions which have continuous first and second derivatives).

This objective function has a simple interpretation; the first term is the residual sum of squares, and the second term is a regularization or penalty term with tuning parameter {\lambda} which penalizes large second derivatives. With {\lambda = 0} any function that interpolates the data will be a minimizer, and with {\lambda = \infty} we will be forced to use a linear function, so the problem collapses to least squares, which is a sort of degenerate natural cubic spline. It is much more clear why the minimizer of this objective would be a good model than why just making all data points knots would produce a good model, but it turns out that they are actually one and the same.
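To make the claim concrete before the proof, here is a small numerical experiment (my own construction, assuming scipy is available): among all {C^2} interpolants of the same points, the natural cubic spline should have the smallest curvature penalty, so in particular it should beat the degree-4 polynomial through the same five points:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0])

# Natural cubic spline interpolant: second derivative zero at the boundary.
spline = CubicSpline(x, y, bc_type='natural')

# A competing C^2 interpolant: the degree-4 polynomial through the same points.
coeffs = np.polyfit(x, y, 4)

def curvature_penalty(f2, a=0.0, b=4.0, m=20001):
    """Approximate the roughness penalty, the integral of f''(t)^2 over [a, b]."""
    t = np.linspace(a, b, m)
    v = f2(t) ** 2
    return float((v[:-1] + v[1:]).sum() / 2 * (t[1] - t[0]))  # trapezoid rule

spline_penalty = curvature_penalty(spline.derivative(2))
poly_penalty = curvature_penalty(lambda t: np.polyval(np.polyder(coeffs, 2), t))
```

Both functions pass through all five points, but the spline’s penalty comes out strictly smaller, exactly as the exercise predicts.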

We begin with the ever-classic proof technique. Let {g(x)} be the natural cubic spline interpolant to the pairs {\{x_i, y_i\}^N} and let {\tilde{g}(x)} be another function of class {C^2} interpolating {\{x_i, y_i\}^N}. We’re going to show that if {\tilde{g}} is as good a solution as {g}, then it is equal to {g} on {[a,b]}. Let {h(x) = \tilde{g}(x) - g(x)}. It’s not too hard to show that {g} can perfectly interpolate the data (cubic splines with {N} knots are defined by a set of basis functions of size {N+4}), but we’ll just assume it here. Consider the following calculation, where we let {a = x_0} and {b = x_{N+1}} for convenience.

{\int_{a}^{b}{g''(x)h''(x)dx}}
{= g''(x)h'(x) \big\vert^b_a - \int_{a}^{b}{h'(x)g'''(x)dx}} integration by parts
{= -\int_{a}^{b}{h'(x)g'''(x)dx}} {g''(b) = g''(a) = 0} since {g} is linear outside the knots
{= -\sum_{j=0}^{N}{\int_{x_j}^{x_{j+1}}{h'(x)g'''(x)dx}}} splitting the integral
{= -\sum_{j=0}^{N}{(g'''(x)h(x) \big\vert^{x_{j+1}}_{x_j} - \int_{x_j}^{x_{j+1}}{h(x)g''''(x)dx})}} integration by parts again
{= -\sum_{j=0}^{N}{g'''(x)h(x) \big\vert^{x_{j+1}}_{x_j}}} {g''''(x) = 0} since {g} is a cubic
{= -\sum_{j=1}^{N-1}{(g'''(x_{j+1})h(x_{j+1}) - g'''(x_j)h(x_j))}} {g} is linear outside the knots
{= -\sum_{j=1}^{N-1}{g'''(x_j)(h(x_{j+1}) - h(x_j))}} {g'''} is constant between knots
{= 0} {h(x_j) = \tilde{g}(x_j) - g(x_j) = y_j - y_j = 0}

If we plug {h(x) = \tilde{g}(x) - g(x)} into this result we see that it implies that {\int_a^b{(g''(x))^2dx} = \int_a^b{\tilde{g}''(x)g''(x)dx}}, and we can now use Cauchy-Schwarz (the operation {\langle f,g \rangle = \int{f(x)g(x)dx}} defines an inner product assuming {f} and {g} are square-integrable).

\displaystyle \begin{array}{rcl} \int_a^b{(g''(x))^2dx} \int_a^b{(\tilde{g}''(x))^2dx} &\geq& (\int_a^b{\tilde{g}''(x)g''(x)dx})^2 \\ \int_a^b{(g''(x))^2dx} \int_a^b{(\tilde{g}''(x))^2dx} &\geq& (\int_a^b{(g''(x))^2dx})^2 \\ \int_a^b{(\tilde{g}''(x))^2dx} &\geq& \int_a^b{(g''(x))^2dx} \\ \end{array}

Equality holds if and only if {g''(x) = \tilde{g}''(x)}, i.e. {h''(x)} is identically zero on {[a,b]}. This implies that the difference {h} is linear on {[a,b]}, but since {g} and {\tilde{g}} agree at the {N \geq 2} points {x_i}, {h} vanishes at {N \geq 2} points and must actually be zero, so {h(x)} is identically zero on {[a,b]}.

Armed with this result it’s apparent that the objective function will evaluate to something greater on any other {\tilde{g}} that interpolates the data, so {g(x)} is the unique minimizer over all functions that perfectly interpolate the data. What about functions not perfectly interpolating the data? Could there exist a {\lambda} where some function that’s not a natural cubic spline produces a slightly greater residual sum of squares but a significantly smaller penalty term? No; for any such function {\bar{f}}, let {y'_i = \bar{f}(x_i)}. We can find a natural cubic spline {f} that perfectly interpolates the points {\{x_i, y'_i\}^N} and run through the same argument with these points to show that the penalty term for {f} is no larger than that for {\bar{f}}. Since {f} and {\bar{f}} agree at every {x_i} they have the same residual sum of squares in the original objective, so {f} does at least as well as {\bar{f}}. We can safely conclude that the minimizer is always a natural cubic spline.

There are actually a ton of other really cool ideas in this book. Hopefully I’ll find something to say about them instead of just doing one of the exercises, but honestly sometimes I just like to see a finished writeup. I will try to incorporate some of that next time.

[1] Hastie, Tibshirani, and Friedman. Elements of Statistical Learning, second edition. Last updated January 2013. A copy is available for free at http://statweb.stanford.edu/~tibs/ElemStatLearn/download.html.

Basic Definitions of a Topological Space

For an explanation of this type of post, go here.

In this post I’m basically just following along with General Topology by Kelley [1] aided by Wikipedia. Absolutely no thought went into this reference choice so I hope it isn’t impossibly difficult. I will also break my rule about posting solutions to exercises here because I don’t have confidence in making up my own exercises and I think this book is sufficiently old (published 1955) that no one is using it in class anymore. Elementary point-set topology, like elementary analysis, is at least fun to think about before it gets too abstract for my abilities. As always I use my reference for the statement of the theorems and the supplementary text and try to prove the theorems myself.

With topology I start from the very beginning, so I hope all readers will be able to follow along to some degree. The first question I had to ask is what the basic unit of study is. It turns out it’s a topological space. How is it defined? Apparently in a bunch of different ways, as we are about to see.

We first define a topological space in the following manner:

Definition 1 (Topology by Open Sets) A topological space {(X,\mathcal{T})} is an ordered pair of a set and a family of subsets with the following properties:

  1. Both {X} and {\emptyset} are in {\mathcal{T}}.
  2. For any {U,V \in \mathcal{T}}, {U \cap V \in \mathcal{T}}.
  3. The union of any family of members of {\mathcal{T}} is also in {\mathcal{T}}.

The set {X} is the space (universe) and {\mathcal{T}} is called the topology on {X}.

The second stipulation of course implies that the intersection of any finite number of sets in {\mathcal{T}} is still in {\mathcal{T}}. By contrast the third stipulation must hold not only for finite families but for countably and uncountably infinite families as well. Interestingly enough there are a bunch of ways to get to an equivalent definition according to Wikipedia, but we’ll stick to this one to begin with. This is the definition everyone seems to use. Kelley actually works in a marginally different way – he doesn’t have the first condition at all and just defines {X} to be the union of all sets in {\mathcal{T}}. Therefore {X} is in {\mathcal{T}} by definition. I find this a little strange because he talks about pairs {(X,\mathcal{T})} just like everyone else, but in the case of his definition {X} is already completely determined by {\mathcal{T}}, so I don’t really understand why this is done. His third condition is also worded slightly differently: he says that “the union of any subfamily of sets of {\mathcal{T}} is also in {\mathcal{T}}.” Under this phrasing he goes on to say that this implies that {\emptyset \in \mathcal{T}} since {\{ \}} is a subfamily of the family of sets and its union is {\emptyset}. I don’t want to get too technical for no reason so I’m just going to stick with the majority here. It’s just a matter of what to verify when showing something is a topology anyway.
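Since everything here comes down to checking axioms, a tiny finite example may help. This checker (entirely my own construction) verifies the three conditions for a family of subsets of a finite set, where pairwise intersections suffice and all subfamilies can be enumerated outright:

```python
from itertools import chain, combinations

def is_topology(X, T):
    """Check the open-set axioms for a family T of subsets of a finite set X."""
    X = frozenset(X)
    T = {frozenset(s) for s in T}
    # Axiom 1: both X and the empty set belong to T.
    if X not in T or frozenset() not in T:
        return False
    # Axiom 2: pairwise intersections suffice for finite families.
    if any(U & V not in T for U, V in combinations(T, 2)):
        return False
    # Axiom 3: unions of every subfamily (all finite here, so enumerate them).
    subfamilies = chain.from_iterable(combinations(T, r) for r in range(len(T) + 1))
    return all(frozenset().union(*fam) in T for fam in subfamilies)
```

The Sierpinski space {X = \{1,2\}} with {\mathcal{T} = \{\emptyset, \{1\}, X\}} passes, while {\{\emptyset, \{1\}, \{2\}, \{1,2,3\}\}} on {X = \{1,2,3\}} fails because the union {\{1\} \cup \{2\}} is missing.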

The sets in {\mathcal{T}} are called the open sets. Technically openness is a property of the topology, so they are {\mathcal{T}}-open, but I think it will be mostly clear from context. If there are two topologies in question then we’ll specify which we are talking about. A go-to example of a topological space is of course the real numbers. The usual topology for the real numbers refers to the family of sets which contain an open interval about each of their points. First we’ll make this precise and verify that it in fact forms a topology:

Proposition 2 Call a set {I \subset \mathbb{R}} open if for every {x \in I} there exist {a,b \in \mathbb{R}} such that the interval {(a,b)} contains {x} and is contained in {I}. Then the family of all such open sets is a topology on {\mathbb{R}}.

Proof: Since there is no {x \in \emptyset}, the condition holds vacuously and {\emptyset} is included in our family of sets. For {\mathbb{R}} itself, any {x} is contained in, say, {(x-1,x+1) \subset \mathbb{R}}, so {\mathbb{R}} is in our family as well.

If {I_1} and {I_2} are disjoint open sets then their intersection is the empty set, so that case is fine. If {I_1} and {I_2} are open sets where there exists an {x \in I_1 \cap I_2}, let {x \in (a_1,b_1) \subset I_1} and {x \in (a_2,b_2) \subset I_2}. Then {(\max(a_1,a_2), \min(b_1,b_2))} contains {x} and is contained in {I_1 \cap I_2}, so this is again open.

Let {U = \bigcup_{a \in A} U_a} be the union of some family of open sets where {A} is an index set. Then for any {x \in U} we know {x \in U_a} for some {a \in A}, and so {U_a} must contain an interval {(c,d)} that contains {x}, which in turn must be contained in {U}, so {U} is again open. \square

That proof is so easy Kelley didn’t even bother to include it, but I figured I’d write it up to decrease the probability I was misunderstanding everything. Of course under this definition open intervals conveniently correspond to open sets, and since a set is closed iff it is the complement of some open set, closed intervals correspond to closed sets. To my dismay (and the cost of about an hour of staring at a piece of paper), this particular naming scheme does not hold up in more general cases. The usual topology does not allow sets to be both open and closed, excepting the empty set and the entire space, but a generic topology allows this for more than just the two sets.

A neighborhood of a point {x} is a set {U} which contains an open set containing {x}. A neighborhood system of a point {x} is the family of all neighborhoods of {x}. This sets up Kelley’s first two theorems:

Theorem 3 A set {U} is open if and only if it contains a neighborhood of {x} for every {x \in U}.

Proof: If a set {U} is open then it itself is a neighborhood of any {x \in U}, so that direction is simple. If {U} contains a neighborhood of {x} for every {x \in U}, it also contains an open set which contains {x}. Consider the union of all these open sets, call it {V}, which is by definition again open. Then clearly {U \subset V} since {x \in U} implies {x \in V}. We also have {V \subset U} since {V} is just a union of subsets of {U}, so {U=V} and {U} is open. \square

Theorem 4 If {\mathcal{U}} is the neighborhood system of a point {x}, then finite intersections of members of {\mathcal{U}} belong to {\mathcal{U}}, and each set which contains a member of {\mathcal{U}} belongs to {\mathcal{U}}.

Proof: The latter statement is kind of obvious, because a set which contains a neighborhood of {x} must contain an open set containing {x}, so it is also a neighborhood of {x}.

Let {U,V} be neighborhoods of {x}. Then there exist open sets {U',V'} containing {x} such that {U' \subset U} and {V' \subset V}. We see that {U' \cap V'} is open by definition and {U' \cap V' \subset U \cap V}, so {U \cap V} contains an open set containing {x} and is therefore a neighborhood of {x}. \square

This brings us to another definition of a topological space, which we then show to be equivalent to our first definition.

Definition 5 (Topology by Neighborhoods) Let {\mathcal{U}} be a function which assigns to each {x \in X} a non-empty family of sets {\mathcal{U}_x} such that the following properties hold:

  1. If {U \in \mathcal{U}_x}, then {x \in U}.
  2. If {U} and {V} are members of {\mathcal{U}_x}, then {U \cap V \in \mathcal{U}_x}.
  3. If {U \in \mathcal{U}_x} and {U \subset V}, then {V \in \mathcal{U}_x}.
  4. If {U \in \mathcal{U}_x}, then there is a member {V} of {\mathcal{U}_x} such that {V \subset U} and {V \in \mathcal{U}_y} for each {y \in V}.

Then the family {\mathcal{T}} of all sets {U} such that {U \in \mathcal{U}_x} whenever {x \in U} is a topology on {X}.

Exercise 1 (1B(a) from Kelley) Let {(X,\mathcal{T})} be a topological space and for each {x} let {\mathcal{U}_x} be the family of all neighborhoods of {x}. Show that the four conditions in the above definition hold.

Proof: We’ll go down the list:

  1. If {U \in \mathcal{U}_x} then it’s a neighborhood of {x}, so by definition it contains an open set which contains {x}. Therefore {x \in U}.
  2. This is exactly Theorem 4.
  3. If {U \in \mathcal{U}_x}, then {U} contains an open set containing {x}. If {U \subset V} then {V} contains that same open set containing {x}, so {V \in \mathcal{U}_x}.
  4. We can choose {V} to be an open set contained in {U} which contains {x}. Clearly {V \subset U} and {V \in \mathcal{U}_x}. Since {V} is itself an open set it is a neighborhood of each of its points, so {V \in \mathcal{U}_y} for each {y \in V}. \square

So these four conditions are implied by our open sets definition. As you might expect, the second part of the exercise is to show that these four conditions imply our open sets definition.

Exercise 2 (1B(b) from Kelley) If {\mathcal{U}} is a function which assigns to each {x \in X} a non-empty family of sets {\mathcal{U}_x} satisfying the first three conditions in the previous exercise, then the family {\mathcal{T}} of all sets {U}, such that {U \in \mathcal{U}_x} whenever {x \in U}, is a topology on {X}. If the fourth condition is satisfied, then {\mathcal{U}_x} is precisely the neighborhood system of {x} relative to the topology {\mathcal{T}}.

Proof: A more explicit definition of {\mathcal{T}} is {\{U : x \in U \Rightarrow U \in \mathcal{U}_x\}}.

The empty set is vacuously in {\mathcal{T}}, and the entire space is as well: for any {x \in X} the family {\mathcal{U}_x} is non-empty, and {X} contains each of its members, so {X \in \mathcal{U}_x} by the third condition. Let {U,V} be sets in {\mathcal{T}} and let {x} be a point in {U \cap V}, so {x \in U} and {x \in V}. Since {U,V \in \mathcal{T}} we have {U,V \in \mathcal{U}_x}. Then by the second condition we have {U \cap V \in \mathcal{U}_x}, so {x \in U \cap V \Rightarrow U \cap V \in \mathcal{U}_x} and {U \cap V \in \mathcal{T}}. Now let {K = \bigcup_{a \in A}{W_a}} with {W_a \in \mathcal{T}} for all {a \in A} (an index set). If {x \in K} then {x \in W_a} for some {a}, so {W_a \in \mathcal{U}_x} since {W_a \in \mathcal{T}}. But {W_a} is a subset of {K}, so by the third condition {K \in \mathcal{U}_x}; thus {x \in K} implies {K \in \mathcal{U}_x} and {K \in \mathcal{T}}.

Finally, if we also have the fourth condition, we can show that {\mathcal{U}_x} is precisely the neighborhood system of {x} relative to {\mathcal{T}}.

Suppose {U \in \mathcal{U}_x}. By the fourth condition there is a {V \in \mathcal{U}_x} with {V \subset U} and {V \in \mathcal{U}_y} for each {y \in V}; the latter says exactly that {V \in \mathcal{T}}, so {V} is open. Since {x \in V} by the first condition, {U} contains an open set containing {x} and is therefore a neighborhood of {x}. Conversely, suppose {V} is a neighborhood of {x}, so it contains an open set {Y} which contains {x}. Since {Y \in \mathcal{T}} and {x \in Y} we have {Y \in \mathcal{U}_x}, and then {Y \subset V} gives {V \in \mathcal{U}_x} by the third condition. We conclude that {\mathcal{U}_x} is the neighborhood system of {x}. \square

We’re going to run through a few definitions and theorems now to prevent this from getting too long. An accumulation point {x} of a subset {A \subset X} is a point for which every neighborhood of {x} intersects {A} at a point other than {x}. This is just a step toward the way the neighborhoods definition of a topology expresses closed sets. A subset of a topological space is closed if and only if it contains all its accumulation points. These are of course all slightly generalized versions of what people learn in the first few weeks of analysis. The closure {\bar{A}} of a set {A} is the intersection of all the closed sets containing {A}. This leads to a theorem which I will write up the proof for, having skipped enough of them already. Unfortunately I got tripped up on this one so I had to go use Kelley’s proof, which starts with this preliminary theorem:

Theorem 6 The union of a set and the set of its accumulation points is closed.

Proof: For some {x \notin A \cup acc(A)} let {U} be an open set containing {x} and not intersecting {A}, which exists because {x \notin acc(A)}. In fact, no points of {U} are in {A} by definition and no points are in {acc(A)} since {U} is a neighborhood of all its points. Taking the union of all such {U} over all {x \notin A \cup acc(A)} we get an open set which is the complement of {A \cup acc(A)}, so {A \cup acc(A)} must be closed. \square

This makes the second half of the next theorem, the part I got stuck on, obvious.

Theorem 7 The closure of any set is the union of the set and the set of its accumulation points.

Proof: Let {acc(A)} denote the set of accumulation points of {A}. If {x \in A} then it’s obviously in the closure of {A} since closed (or any) sets containing {A} must contain {x}. Similarly if {x \in acc(A)} then {x} is an accumulation point of any set containing {A}, and since closed sets must contain all their accumulation points it must be in any closed set containing {A}. So if {x \in A \cup acc(A)} then {x \in \bar{A}}.

If {x \in \bar{A}} then it’s in every closed set containing {A}. Since {A \cup acc(A)} is closed by the previous theorem and obviously contains {A}, {x \in A \cup acc(A)}. \square

This gives us yet another way to define a topological space. A closure operator on {X} is an operator which assigns to each subset {A} a set {A^c} (a superscript not to be confused with the complement) such that the following axioms are satisfied:

Definition 8 (Kuratowski Closure Axioms) Let {c} be a closure operator on {X}.

  1. We have {\emptyset^c = \emptyset}.
  2. For all {A}, {A \subset A^c}.
  3. For all {A}, {(A^c)^c = A^c}.
  4. For all {A,B} we have {(A \cup B)^c = A^c \cup B^c}.

This gives us our definition:

Theorem 9 (Topology by Closure Operator) Let {c} be a closure operator on {X}, let {\mathcal{F}} be the family of all subsets {A} of {X} such that {A^c = A}, and let {\mathcal{T}} be the family of complements of members of {\mathcal{F}}. Then {\mathcal{T}} is a topology on {X} and {A^c = \bar{A}} (with respect to {\mathcal{T}}) for all {A \subset X}.

Proof: We know that {\emptyset \in \mathcal{F}} by the first axiom, so {X \in \mathcal{T}}. Similarly {X^c = X}, since {X \subset X^c} by the second axiom and {X^c \subset X} trivially, so {X \in \mathcal{F}} and {\emptyset \in \mathcal{T}}. Since we’re dealing with closed sets directly it’s probably easier to apply De Morgan’s laws to the open sets definition of a topology. We then need to show that the union of two sets in {\mathcal{F}} is in {\mathcal{F}} and the intersection of an arbitrary number of sets in {\mathcal{F}} is in {\mathcal{F}}. For {F_1, F_2 \in \mathcal{F}}, we have {(F_1 \cup F_2)^c = F_1^c \cup F_2^c = F_1 \cup F_2} using the fourth axiom, so {F_1 \cup F_2 \in \mathcal{F}}. If {K = \bigcap_{a \in A}{F_a}} where {A} is an index set and {F_a \in \mathcal{F}} for all {a}, we have the following chain: {K^c = (\bigcap{F_a})^c \subset \bigcap{(F_a^c)} = \bigcap{F_a} = K}, where the inclusion holds because the fourth axiom makes {c} monotone ({A \subset B} implies {A^c \subset B^c}), so {K^c \subset K}. By the second axiom {K \subset K^c}, so {K = K^c} and {K \in \mathcal{F}}.

Finally we want to show that {A^c} is actually just {\bar{A}}. I took the massive implied hint here and used the third axiom since it hasn’t come up yet. By that axiom we see that {A^c} is actually in {\mathcal{F}}. But {\bar{A}} is the intersection of the closed sets containing {A}, i.e. the sets in {\mathcal{F}} containing {A}, and since {A \subset A^c} we see that {\bar{A} \subset A^c}. For the other direction we note that {\bar{A} = A \cup B} for some (possibly empty) {B}. So {\bar{A}^c = (A \cup B)^c = A^c \cup B^c \supset A^c}, and since {\bar{A}} is actually in {\mathcal{F}} because it’s an intersection of elements of {\mathcal{F}}, we have {\bar{A}^c = \bar{A}}. Putting the two together yields {A^c \subset \bar{A}}, completing the proof. \square
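On a finite space the closure-as-intersection definition is easy to compute directly, which makes a nice sanity check of all this; the helper below is my own construction:

```python
def closure(A, X, T):
    """Closure of A in the finite space (X, T): intersect all closed supersets of A."""
    A, X = frozenset(A), frozenset(X)
    closed_sets = [X - U for U in map(frozenset, T)]  # complements of the open sets
    result = X  # X itself is always closed and contains A
    for C in closed_sets:
        if A <= C:
            result &= C
    return result
```

In the Sierpinski space {X = \{1,2\}} with {\mathcal{T} = \{\emptyset, \{1\}, X\}}, the closure of {\{1\}} is all of {X}, while {\{2\}} is its own closure; one can also check idempotence, the third Kuratowski axiom, directly.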

To conclude we’ll look at one more way to specify a topological space after the relevant definitions. The interior of a set {A} is the set of {x \in A} such that {A} is a neighborhood of {x}. It’s denoted {A^i}. I tend to think of the interior operator as the opposite of the closure operator, and indeed it can be shown that the interior is the largest open subset of {A}, as opposed to the smallest closed set containing {A}. This kind of duality holds up pretty well. The interior operator, as you might imagine, takes {A} to {A^i}. Now we’re going to define a topology one last time, based on this exercise:

Exercise 3 (1C in Kelley) If {i} is an operator which carries subsets of {X} into subsets of {X}, and {\mathcal{T}} is the family of all subsets such that {A^i = A}, under what conditions will {\mathcal{T}} be a topology for {X} and {i} the interior operator relative to this topology?

Definition 10 (Topology by Interior Operator) Let {i} be an interior operator on {X} where {\mathcal{T}} is the family of subsets such that {A^i = A}, and let the following conditions hold:

  1. We have {X^i = X}.
  2. For all {A}, {A \supset A^i}.
  3. For all {A}, {(A^i)^i = A^i}.
  4. For all {A,B}, {(A \cap B)^i = A^i \cap B^i}.

Then {\mathcal{T}} is a topology on {X} and {i} is its interior operator.

It did not take a lot of creativity to come up with those conditions, of course. Since this post has more than its fair share of uninteresting proofs and this one is extremely similar to the previous one, I will omit it.

That admittedly got really long and dry. I wanted to do all the really mundane and boring exercises for practice and confirmation of understanding. I wouldn’t recommend a reader go through the proofs too carefully to learn the material (if that’s your goal, a real resource will serve you better). I am, however, hopeful that in coming posts we’ll get to some more tangible results. In those perhaps there will be more value in cursory reading.

Source: [1] Kelley, John. General Topology. Published 1955 by D. Van Nostrand Company, Inc.
Used for the statement of definitions, theorems, comments, and theorem proofs where specified.

The Isomorphism Theorems

For a short explanation of this type of post, go here.

In this post I’m using Dummit and Foote [1] and Wikipedia for reference. As a minor disclaimer, some of Dummit and Foote’s proofs are more general. I made a few extra assumptions that, at least in my understanding, still give us results that are sufficiently powerful for our purposes. As with all these theorems the proofs are simple. I have omitted large chunks, mostly the parts I consider verification, but if you’re following along I would recommend at least mentally checking off properties. Dummit and Foote actually introduce quotient groups using homomorphisms rather than defining normal subgroups and then quotient groups, but while I think that’s a reasonable approach I don’t think it’s how most people first learned the concept. I went with the traditional approach of beginning with normal subgroups.

Anyway, let’s get on to the material. First a brief and common lemma will make my life a bit easier:

Lemma 1 When {G} is a group and {K \leq G}, we have {g_1K = g_2K \Longleftrightarrow g_2^{-1}g_1 \in K}.

Proof: If {g_2^{-1}g_1 \in K} then {g_2^{-1}g_1 = k} for some {k \in K}, and we have {g_1 = g_2k}, so {g_1K = g_2kK = g_2K}. Conversely if {g_1K = g_2K} then for some {k_1,k_2 \in K} we have {g_1k_1 = g_2k_2} so {g_1 = g_2k_2k_1^{-1} \Longrightarrow g_2^{-1}g_1 = k_2k_1^{-1} \in K}. \square
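Since the lemma is so mechanical, it’s also easy to check by brute force. Here’s a Python sketch on the additive group {\mathbb{Z}_{12}} with the subgroup {\{0,3,6,9\}} (my own toy example; written additively, {g_2^{-1}g_1} becomes {g_1 - g_2}):

```python
# exhaustive check of lemma 1 in Z_12 (additive notation)
n = 12
K = {0, 3, 6, 9}  # a subgroup of Z_12

def coset(g):
    return frozenset((g + k) % n for k in K)

for g1 in range(n):
    for g2 in range(n):
        assert (coset(g1) == coset(g2)) == ((g1 - g2) % n in K)
```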

As a last comment, in case this notation isn’t standard, Dummit and Foote use {H \leq G} to say {H} is a subgroup of {G}, {N \trianglelefteq G} to say {N} is a normal subgroup of {G}, and {K \cong G} to say {K} is isomorphic to {G}.

Theorem 2 (The First Isomorphism Theorem) Let {\varphi: G \rightarrow H} be a surjective homomorphism. Then {\ker(\varphi) \trianglelefteq G} and {G / \ker(\varphi) \cong H}. For ease of writing we’ll let {K} be {\ker(\varphi)}.

Proof: First we’ll verify that {\ker(\varphi) \trianglelefteq G}.

  • Closure: For {k_1,k_2 \in K} we have {\varphi(k_1k_2) = \varphi(k_1)\varphi(k_2) = e_He_H = e_H}, so {k_1k_2 \in K}.
  • Identity: Since for any {g \in G} we have {\varphi(g) = \varphi(e_Gg) = \varphi(e_G)\varphi(g)}, {\varphi(e_G) = e_H} and so {e_H \in K}.
  • Inverses: For any {k \in K} we have {\varphi(k)\varphi(k^{-1}) = \varphi(kk^{-1}) = \varphi(e_G) = e_H}, but since {\varphi(k) = e_H} we also have that {\varphi(k^{-1}) = e_H} and so {k^{-1} \in K}.
  • Normality: For any {g \in G} and {k \in K} we have {\varphi(gkg^{-1}) = \varphi(g)\varphi(k)\varphi(g)^{-1} = \varphi(g)e_H\varphi(g)^{-1} = e_H}, so {gkg^{-1} \in K} and {K} is normal in {G}.

To show that {G/K \cong H} we just need to show that {\phi: G/K \rightarrow H} where {\phi(gK) = \varphi(g)} is well-defined, a homomorphism, injective, and surjective.

  • Well-defined: If {g_1K = g_2K}, then {g_2^{-1}g_1 \in K} by lemma 1 and so {\varphi(g_2^{-1}g_1) = e_H}, which gives {\varphi(g_2)^{-1}\varphi(g_1) = e_H} and finally {\varphi(g_1) = \varphi(g_2)}, which means that {\phi} is well defined.
  • Homomorphism: Since {\phi(g_1K)\phi(g_2K) = \varphi(g_1)\varphi(g_2) = \varphi(g_1g_2) = \phi(g_1g_2K) = \phi((g_1K)(g_2K))}, {\phi} is a homomorphism of groups.
  • Injective: This amounts to reversing the proof that {\phi} is well defined: We have {\phi(g_1K) = \phi(g_2K) \Longrightarrow \varphi(g_1) = \varphi(g_2) \Longrightarrow \varphi(g_2)^{-1}\varphi(g_1) = e_H \Longrightarrow \varphi(g_2^{-1}g_1) = e_H \Longrightarrow g_2^{-1}g_1 \in K}, so we can conclude {g_1K = g_2K} by lemma 1 and so {\phi} is injective.
  • Surjective: For any {h \in H} there exists a {g \in G} such that {\varphi(g) = h} since {\varphi} is surjective, and so {\phi(gK) = \varphi(g) = h}, so {\phi} is surjective. \square
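The whole argument can be watched in miniature on a toy example of my choosing, the reduction map {\mathbb{Z}_{12} \rightarrow \mathbb{Z}_4}. A quick Python sketch:

```python
# phi: Z_12 -> Z_4, reduction mod 4, is a surjective homomorphism
n, m = 12, 4
phi = lambda g: g % m
K = {g for g in range(n) if phi(g) == 0}      # the kernel, here {0, 4, 8}
cosets = {frozenset((g + k) % n for k in K) for g in range(n)}

for c in cosets:
    assert len({phi(g) for g in c}) == 1      # phi is constant on cosets: well-defined
images = {phi(next(iter(c))) for c in cosets}
assert len(cosets) == m and images == set(range(m))  # the induced map is a bijection
```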

Dummit and Foote also point out an immediate consequence of this theorem that sort of sheds some light on its utility:

Corollary 3 If {\varphi: G \rightarrow H} is a group homomorphism then {\varphi} is injective iff {\ker(\varphi) = \{e_G\}}.

Proof: If {\varphi} isn’t surjective then we need only consider its image, so we can just assume that it is. If {\varphi} is injective then {e_G} is the only element that can map to {e_H}, so {\ker(\varphi) = \{e_G\}}. If {\ker(\varphi) = \{e_G\}} then the cosets of the kernel are singletons, so the isomorphism {G/\ker(\varphi) \cong H} given by {\phi} says exactly that {\varphi} is injective (indeed an isomorphism). \square

A picture I like to maintain in my head looks like this: [group lattice diagram]

Where {K} is {\ker(\varphi)}, {\varphi: G \rightarrow H}, {\phi: G/K \rightarrow H} is defined as {\phi(gK) = \varphi(g)}, and {\pi: G \rightarrow G/K} is defined as {\pi(g) = gK}. We can think of {\varphi} as {\phi \circ \pi}, and while this doesn’t immediately appear helpful, in the case where {\ker(\varphi) = 1} we can see that this is actually a lot of information about {\varphi}, since it fixes {G/K}, in this case {G}. By relating {\varphi} to this composition we preclude the idea that it might be mapping elements in an unequal fashion, since all the cosets of {K} have the same size. In my mind I think of this as saying that the set of homomorphisms on {G} is in some sense equivalent to the set of normal subgroups, although this is mostly speculation.

The next two theorems heavily utilize the first, so hopefully the result makes sense. We’ll see some applications of it in their proofs.

Theorem 4 (The Second Isomorphism Theorem) Let {G} be a group and let {S,N \leq G} with {N} normal. Then {SN \leq G}, {S \cap N \trianglelefteq S}, and {(SN)/N \cong S/(S \cap N)}.

Proof: The first two claims are a matter of verification, so I’ll skip those. As a result of those claims we know that {S \cap N} is a normal subgroup of {S}, so {S/(S \cap N)} is well defined. For the other quotient group, {N} is normal in {G} by assumption, and since {SN \leq G} it’s also normal in {SN} (note that this is not guaranteed for some {F} where {G \leq F}).

Here’s an outline of what we’re going to do. We want to show that {S/(S \cap N) \cong (SN)/N}, and we know that {G/\ker(\varphi) \cong \varphi(G)}. So it would be ideal if we could define a {\varphi: S \rightarrow (SN)/N} such that {\ker(\varphi) = S \cap N}. If we do this then by the first isomorphism we’ll be done.

An obvious guess at how to define {\varphi} is to let {\varphi(s) = sN}. Since all our quotient groups are legitimate this is well-defined and everything, and it’s easy to show that it’s a surjective homomorphism. Really all that remains to be shown is that {\ker(\varphi) = S \cap N}. The identity of {(SN)/N} is {N}, so the kernel of {\varphi} is the set of elements {x \in S} where {xN = N}. By lemma 1 this implies that {x \in N}, and so the kernel of {\varphi} is the set of elements in {N} but also in {S} by definition, so {S \cap N}. So we have a {\varphi} such that {\varphi: S \rightarrow (SN)/N} with {\ker(\varphi) = S \cap N}, so by the first isomorphism theorem we’re done. \square
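Here’s a quick brute-force check of the theorem on a toy example of mine in {\mathbb{Z}_{12}}, written additively:

```python
# second isomorphism theorem in Z_12 with S = {0,4,8}, N = {0,6}
n = 12
S = {0, 4, 8}
N = {0, 6}
SN = {(s + t) % n for s in S for t in N}                  # SN = {0,2,4,6,8,10}
coset = lambda g, H: frozenset((g + h) % n for h in H)

kernel = {s for s in S if coset(s, N) == frozenset(N)}    # kernel of s -> sN
assert kernel == S & N                                    # = S ∩ N = {0}
# the two quotient groups have the same order, as the theorem predicts
assert len({coset(s, S & N) for s in S}) == len({coset(x, N) for x in SN})
```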

I was actively trying to apply the first isomorphism theorem in that proof. I’m sure there are more direct ways to prove it, but I don’t think this one is too bad. It basically illustrates a situation where if you define a homomorphism you can make some non-obvious statements about the structure of {G}. In this case the second isomorphism theorem was here to tell me how to fit the pieces together to make the result seem nice, but the idea is more general. Here’s the relevant part of the lattice diagram of {G}, where quotient groups formed by subgroups joined by the red lines are isomorphic.group_lattice_3_im

Dummit and Foote also call this the diamond isomorphism theorem. The third isomorphism theorem, stated next, is another example of showing isomorphism by defining the appropriate homomorphism.

Theorem 5 (The Third Isomorphism Theorem) Let {G} be a group with {H,K \trianglelefteq G} and {H \leq K}. Then {(K/H) \trianglelefteq (G/H)} and {(G/H)/(K/H) \cong G/K}.

Proof: Again I’ll leave it to the reader to verify the first part. The last part is more fitting pieces into the first isomorphism theorem. Let {\varphi: G/H \rightarrow G/K} be defined by {\varphi(gH) = gK}, which can be verified to be a well-defined surjective homomorphism. Then {\ker(\varphi)} is the set of cosets of the form {gH} such that {gK = K}, and again by lemma 1 this implies that {g \in K}, so the kernel is the set of cosets of the form {kH}, better known as the elements of {K / H}, and so by the first isomorphism theorem we’re done. \square
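And a quick check of the third theorem on a toy example of mine, again in {\mathbb{Z}_{12}}:

```python
# third isomorphism theorem with G = Z_12, H = {0,6}, K = {0,3,6,9}
n = 12
H = {0, 6}
K = {0, 3, 6, 9}        # H <= K <= G, and everything is normal (G is abelian)
coset = lambda g, S: frozenset((g + s) % n for s in S)

G_mod_H = {coset(g, H) for g in range(n)}
G_mod_K = {coset(g, K) for g in range(n)}
for c in G_mod_H:
    assert len({coset(g, K) for g in c}) == 1       # gH -> gK is well defined
K_mod_H = {coset(k, H) for k in K}                  # the kernel of that map
assert len(G_mod_H) // len(K_mod_H) == len(G_mod_K) # |(G/H)/(K/H)| = |G/K|
```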

Putting aside the obvious jokes about canceling the {H}, after the first isomorphism theorem we pretty much went back to our symbol-manipulating ways. We could work through some examples of groups but I think there is only limited utility in that. A real application of the theorems is more desirable, and these do come up in solvable groups if I recall correctly. This gives me an excuse to get sidetracked in the Galois theory proof that quintics are not solvable in radicals. Personally this is one of my favorite results in all of math, which is why I was looking for an excuse to go learn about it.

Some authors, Dummit and Foote included, have a fourth isomorphism theorem that for a group {G} and normal subgroup {N} shows a bijection between the subgroups of {G} which contain {N} and the subgroups of {G/N}. I don’t think this is any less useful than the other theorems, but I think enough algebra has been covered for our purposes. Next time I will write a little bit about definitions in topology, a subject I have no more than a week of formal education in, so despite the elementary level it will be a challenge. After that I will try to talk a little about Hilbert’s Nullstellensatz and basis theorem (which I am only marginally more educated in), and then we should be ready to at least attempt a foray into our end goal of algebraic geometry.

[1] Dummit and Foote. Abstract Algebra, third edition, pages 97-99. Published 2004.
Used for the statement of isomorphism theorems.

Studying Algebraic Geometry

I am not much of a mathematician, and I think it is unlikely I will ever be. I know some basic facts about analysis and algebra but certainly nothing beyond that. This makes me a little sad. Recently I’ve been trying to read Hartshorne’s book on algebraic geometry. Unfortunately given the extremely suspect state of my intuition combined with some shoddy recall, this has been a struggle to say the least. I have a tendency to bounce around from subject to subject a bit much though, so this time I will try to make a concerted effort to keep with it, even if that means I have to go back and re-learn basic math. To that end I’m going to write a few posts about algebra and how I attempt to understand the structure. There is absolutely nothing novel about this and it is more for me than for anyone else, but if I am able to pursue algebraic geometry in more depth I also hope they will serve as a small reference for any readers wishing to follow along. Also in this vein it should be noted that any general comments I make should be taken with heaps of salt as they are mostly just guesses about what I think may be important.

I won’t start entirely from the beginning even though that wouldn’t be entirely useless to me. Here are the Wikipedia pages for groups, rings, ideals, and fields. In the introduction of the page on fields you can see the chain of algebraic structures. These names sometimes come up, but since I personally don’t do a good job of keeping them straight I’ll try to mention definitions when they do. We also will skip over the definitions of normal subgroups, quotient groups, and homomorphisms. Algebraic geometry (at least in my limited understanding) has commutative algebra as a prerequisite so the article on commutative rings is also relevant, but I’ll actually write a bit about that myself. To clarify one possible issue, I will assume that rings always have a one.

A very fair question is what applications algebraic geometry has or what it even is about. To this I have no good answer. The Wikipedia page lists some areas it is useful in, but I must be clear that I have zero understanding of how it is applied in those fields. Hartshorne’s book itself makes a very brief statement but then returns to the question after the first chapter. I will attempt to do the same. At the very least I have been informed that it finds some application in biology, and despite my abhorrence of the subject, it’s hard not to find that a little interesting. With our minimal assumptions of knowledge I will write up a little bit about basic algebra and the most basic of definitions in topology before attempting to dive into Hartshorne. If you’re interested in a real resource, in school I worked out of Algebra by Michael Artin and Abstract Algebra by Dummit and Foote, both of which I thought were perfectly reasonable.

Even when I get to algebraic geometry my commentary will of course be less insightful than Hartshorne or any other author (or most math students). I hope there will be some value in watching someone struggle to learn the topic rather than reading the perspective of someone who knows it inside and out. As always, any comments are more than appreciated.

Complexity of Determining Convexity of Functions

This wasn’t really my week in terms of results, but I decided to post what I was up to anyway. I always try to solve my problems without help from other sources. Sometimes I make up my own problems that I think shouldn’t be too hard and try to solve those by myself. Sometimes that works, but sometimes I am very incorrect about the difficulty of the problem.

I mentioned previously that I’m in CVX 101, an online class on convex optimization. The homework, unsurprisingly, involves using their software (called CVX, based in Matlab) to solve convex optimization problems. I unfortunately did not bother to read the user’s manual very carefully before attempting some optimization problems. I got a little frustrated because it wouldn’t let me use something like {x^2 + xy + y^2} as an objective function, even though that’s clearly convex. If I had actually read the user’s guide I would have known that CVX only allows objective functions constructed from a pre-specified set of rules, and functions like the one I was trying didn’t fall into that category. For example in this problem I rewrote the function as {\tfrac{1}{2}(x+y)^2 + \tfrac{1}{2}x^2 + \tfrac{1}{2}y^2}, which is a little inconvenient but CVX accepts since the sum of squares of affine functions is always convex. I asked myself why they would make life so difficult, and I had a guess. I was pretty sure that determining the convexity of a polynomial in {n} variables was NP-hard.

Doing anything with multivariate polynomials is hard in my experience. For example, determining whether a polynomial in {n} variables has at least one real zero is NP-hard. We can reduce from {3}-SAT. One of my professors insisted on stating decision problems in the following manner and I got used to it, so bear with me. Here’s the statement for {3}-SAT:

INSTANCE: A boolean formula in {3}-CNF form with {n} variables
QUESTION: Is the formula satisfiable?

And here’s the statement of the multivariate polynomial problem:

INSTANCE: A polynomial {f: \mathbb{R}^n \rightarrow \mathbb{R}}
QUESTION: Does {f} have a zero on {\mathbb{R}^n}?

For a clause with literals {x_i}, {x_j}, {\neg x_k} map it to a term in a polynomial {[(y_i)(y_j)(1-y_k)]^2}. Map the rest of the clauses appropriately, and let’s call the polynomial we formed {f(Y)}. We see that {f} has a zero if and only if our instance of {3}-SAT is satisfiable. If {3}-SAT has a satisfying assignment and {x_i} is true in that assignment, let {y_i = 0}, otherwise let {y_i = 1}. Clearly a satisfying assignment produces a zero of {f}. If {f} has a zero then all the terms are zero because they’re all nonnegative, and a term is zero if and only if one of the components is zero, so one of the variables is zero or one. Reverse the mapping and we have a satisfying assignment. A last detail is that if we have a variable that isn’t zero or one in the root of {f}, it doesn’t matter in the assignment so we can ignore it.
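The reduction is simple enough to check exhaustively on a small instance. Here’s a Python sketch (the tiny formula and the clause encoding are my own):

```python
from itertools import product

# hypothetical tiny 3-CNF: integers are literals, a negative sign means negation,
# so (1, 2, -3) encodes (x1 or x2 or not x3)
clauses = [(1, 2, -3), (-1, 2, 3), (1, -2, 3)]
nvars = 3

def f(y):
    # each clause becomes a squared term; x_i -> y_i and (not x_i) -> (1 - y_i)
    total = 0
    for cl in clauses:
        term = 1
        for lit in cl:
            v = y[abs(lit) - 1]
            term *= v if lit > 0 else 1 - v
        total += term ** 2
    return total

def satisfies(assign):   # assign[i] is the truth value of x_{i+1}
    return all(any(assign[abs(l) - 1] == (l > 0) for l in cl) for cl in clauses)

# over 0/1 points, f vanishes exactly at encodings of satisfying assignments
# (y_i = 0 encodes x_i true, matching the mapping described above)
for bits in product([True, False], repeat=nvars):
    y = [0 if b else 1 for b in bits]
    assert (f(y) == 0) == satisfies(bits)
```

This only checks the 0/1 points, of course; the argument in the text is what lets you extend from arbitrary real zeroes back to assignments.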

I’ve never gotten any real verification that thinking like this is a good idea, but when I see a reduction like that I generally think of the problem we reduced to as being “much harder” than the first problem. In this case we could have made all kinds of restrictions to {f}; we only needed a polynomial of degree {6} or less and a very specific type of polynomial. Well, {3}-SAT is already strongly NP-hard so determining zeroes in multivariate polynomials must be really, really hard, right? Of course in this case the problem is still decidable (it sits inside the existential theory of the reals, which is contained in PSPACE), so its difficulty is capped, and that sort of implies that going from the restricted types of polynomials to general polynomials doesn’t actually buy all that much extra hardness.

Anyway, back to the convexity problem. Here’s a statement:

INSTANCE: A polynomial {f: \mathbb{R}^n \rightarrow \mathbb{R}}
QUESTION: Is {f} convex in {\mathbb{R}^n}?

If {f} is degree {1}, {2}, or {3} this is pretty simple. If it’s degree one the answer is always YES because it’s affine, so convex. If it’s degree three then the answer is always NO. We need only consider the cubic terms. If we find a term of the form {x_i^3} then taking the derivative with respect to {x_i} twice will make it linear in {x_i}, and so a diagonal element of the Hessian can be made negative, which implies that it’s not positive semidefinite by the extended version of Sylvester’s criterion. The same argument works for terms of the form {x_i^2x_j}. If all the cubic terms are of the form {x_ix_jx_k} then all diagonal elements are zero, and the principal minor using rows and columns {i} and {j} can be made negative since it’s {0} on the diagonal and {x_k} on the off-diagonal. As for degree two polynomials, we can just write it in its quadratic form {x^TAx} plus some affine stuff. Then {f} is convex iff {A} is positive semidefinite, which we can either test for with Sylvester’s Criterion or old-fashioned high school math, row reduction. I’m not actually sure which is faster but they’re both polynomial time.
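For the degree two case, here’s a minimal numpy sketch of the convexity test (I’m checking positive semidefiniteness via eigenvalues rather than Sylvester’s Criterion or row reduction; the example polynomials are mine):

```python
import numpy as np

# f(x, y) = x^2 + xy + y^2 as a quadratic form v^T A v with A symmetric
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
assert np.linalg.eigvalsh(A).min() >= 0        # PSD, so f is convex

# for contrast: f(x, y) = x^2 - 4xy + y^2 is not convex
B = np.array([[1.0, -2.0],
              [-2.0, 1.0]])
assert np.linalg.eigvalsh(B).min() < 0         # eigenvalues are -1 and 3
```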

At this point I was feeling pretty good. I had knocked out a few cases, and all that remained was showing this for polynomials of even degree {4} or higher. In fact if you can show that deciding the convexity of a degree {4} polynomial is hard then you can just tack on a term like {x_{n+1}^{2s}} for some {s > 2}, which is of course convex, and you’ll show that deciding the convexity of a higher degree polynomial is also NP-hard. I tried a bunch of stuff but nothing I did really got anywhere. After a few days I decided to give in and started googling for some answers.

It turns out that this was solved somewhat recently. My guess that determining convexity is hard was correct. It’s proven here by Ahmadi, Olshevsky, Parrilo, and Tsitsiklis [1]. The general outline is that determining non-negativity of biquadratic forms, which are expressions of the form {b(x,y) = \sum_{i \leq j, k \leq l}\alpha_{ijkl}x_ix_jy_ky_l} where {x = (x_1, \cdots, x_n)} and {y = (y_1, \cdots, y_m)}, is NP-hard, as shown in [2]. That reduction in turn reduces from Clique, using a result of Motzkin and Straus [3]. In general I find it kind of difficult to make the jump from discrete structures to continuous objects in reductions. The actual paper is clear and shockingly easy to read, so I won’t summarize it here, but I recommend it. The authors in [1] showed the reduction from the biquadratic form problem to convexity, as well as a bunch of other results: the hardness of determining strict and strong convexity and a polynomial time algorithm for determining quasiconvexity.

At any rate, my plans for proving that determining the convexity of quartics is NP-hard were squashed, but I figured relaxing the problem to rational functions was an acceptable concession. After all, you can easily type a rational function into Matlab, so it’s still a concern for the coders of CVX. After a little bit I realized that there was a somewhat dirty way to do this problem as a reduction from the problem of determining whether a polynomial in {n} variables has a zero. As far as Matlab is concerned, a function like {\tfrac{x^2}{x}} can’t be evaluated at {x=0}. Using this you can just take a polynomial in {n} variables and determine whether {\tfrac{f(X)}{f(X)}} is convex, since if it has no zeroes then the polynomial is {1} which is convex and if it has one or more zeroes then it’s not convex according to Matlab because the domain isn’t even convex.

I don’t really like that reduction, so I kept searching. I also relaxed the problem even more, allowing myself to determine the convexity of functions with exponentials and logarithms. Unfortunately I am still on the hunt. If I do manage to find one I’ll edit it in here, but until then that’s all I’ve got.

[1] A.A. Ahmadi, A. Olshevsky, P.A. Parrilo, and J.N. Tsitsiklis. NP-hardness of Deciding Convexity of Quartic Polynomials and Related Problems. Mathematical Programming vol. 137 Issue 1-2, pages 453-476. A copy is available here.
[2] C. Ling, J. Nie, L. Qi, and Y. Ye. Biquadratic optimization over unit spheres and semidefinite programming relaxations. SIAM Journal on Optimization, 20(3):1286-1310, 2009.
[3] T.S. Motzkin and E.G. Straus. Maxima for graphs and a new proof of a theorem of Turán. Canadian Journal of Mathematics, 17:533-540, 1965. A copy is available here.

Basic Facts About Graph Laplacians

Lately I’ve come across a lot of linear algebra. Apparently it’s pretty important. One topic I always thought was pretty fantastic was spectral graph theory, or analyzing graphs from their adjacency or similar matrices. I never really got a chance to look into it at school, but I started trying to understand some very basic results which I think are pretty fascinating. Of course, after I proved these results I looked at more organized approaches to spectral graph theory only to find that I had missed the main point entirely. That’s why I wouldn’t call the content of this post spectral graph theory. It’s more just facts about the Laplacian matrix.

For a simple graph {G} on {n} vertices we’ll define its Laplacian {L(G)} (I will drop the {G} from now on) as the {n \times n} matrix with entries as follows:

{L_{ij} = \left\{ \begin{array}{ll} \deg(v_i) & \text{if } i=j \\ -1 & \text{if } v_i,v_j \text{ are adjacent} \\ 0 & \text{otherwise}\end{array}\right.}

This matrix has a lot of interesting properties. It’s obviously symmetric. It’s also singular, since it maps {\mathbf{1}} to {\mathbf{0}} because the degree is the only positive term and is exactly the number of {-1}s in each row. We can actually say more interesting things about its eigenvalues. First we’ll show that {L} is positive semidefinite, then we’ll show that the multiplicity of the eigenvalue {0} is exactly the number of components in {G}.
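Before the proofs, here’s a quick numerical sanity check of these properties on a small example of my own choosing, using numpy:

```python
import numpy as np

# a simple graph on 4 vertices: the path 0-1-2-3 plus the chord 0-2
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A          # same matrix as the entrywise definition

assert np.allclose(L, L.T)              # symmetric
assert np.allclose(L @ np.ones(n), 0)   # L maps 1 to 0, so L is singular
assert np.linalg.eigvalsh(L).min() > -1e-10   # eigenvalues are all nonnegative
```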

The fact that {L} is positive semidefinite is proven in a short and elegant fashion on Wikipedia. Naturally I disregarded it and set out to prove it on my own. This resulted in a proof that I am not really pleased with since it unnecessarily makes use of some pretty high-powered ideas to prove what is probably a very simple result.

At any rate, there are two facts we’re going to have to use that I won’t prove here. The first is Sylvester’s Criterion, which states that a symmetric matrix is positive definite if and only if its leading principal minors are positive. To review in case you forgot (because I did), a minor is the determinant of the matrix you get when you cut out a set of rows and a set of columns. A principal minor is the determinant of the matrix you get when the indices of those rows and columns are the same. A leading principal minor means you cut out the rows and columns from {k+1} to {n} (i.e. keep the upper left submatrix of size {k}).

The second fact we’re going to use is the extension of this to positive semidefinite matrices. It says that a symmetric matrix is positive semidefinite if and only if all its principal minors are nonnegative. I came across this fact when my friend linked me to his convex optimization homework here [1], where it is stated without proof. Unfortunately there doesn’t seem to be an elegant justification even assuming Sylvester’s criterion. Overall though it is a bit useless. Sylvester’s criterion at least provides a sort of computationally efficient way to test for positive definiteness. The extension clearly does no such thing, since it requires checking {O(2^n)} minors.

With those out of the way here’s how we’re going to proceed. We’re going to only deal with Laplacians of connected graphs for now and the extension will be simple enough. We will encounter matrices which are almost Laplacian but have at least one diagonal element that is too large by an integer amount. Phrased another way, this is a matrix of the form {L + D} where {L} is a Laplacian and {D} is a nonzero positive semidefinite diagonal matrix with integer entries. I’m going to call this a super-Laplacian. Be warned that this is probably horribly non-standard, although it shouldn’t be.

Our first claim is that a super-Laplacian matrix is positive definite. This can be shown by induction on {n} (the size of the matrix). The case where {n=1} is trivial. The inductive step needs a simple fact. If we take a {k \times k} super-Laplacian (or even a normal Laplacian) and throw out the same subset of rows and columns we’re going to get a super-Laplacian matrix. The way I understand this is throwing out rows and columns is like deleting a set of vertices from the graph, except we’re not reducing the degrees of the vertices that had edges going into the set we just threw out. Furthermore, since we’re only working with connected graphs for the time being, a vertex in the set of vertices we threw out had to be adjacent to some remaining vertex (and so that diagonal entry will be too large), so our modified matrix is super-Laplacian.

Getting back to the proof, we see that all the principal minors of a {k \times k} super-Laplacian are positive by our assumption. All that remains to be shown to satisfy Sylvester’s Criterion is that the determinant of the matrix itself is positive. Any {k \times k} super-Laplacian can be constructed by adding to its diagonal one entry at a time, starting from a Laplacian matrix, which is singular and so has determinant {0}. When we increase one diagonal entry from the Laplacian, what happens to its determinant? Consider the traditional expansion along a row or column containing that entry. We increase the determinant (increase since we’re on the diagonal so the associated sign is positive) by the minor resulting from throwing out that vertex, i.e. the determinant of a {(k-1) \times (k-1)} super-Laplacian. By our assumption this is positive and so we added a positive quantity to the determinant. Further additions do not cause problems since we are similarly increasing by the determinant of a super-Laplacian. So our {k \times k} matrix has positive determinant and by Sylvester’s Criterion it is positive definite.

We’re basically done now by the extended version of Sylvester’s Criterion. A principal minor of a Laplacian is the determinant of a super-Laplacian matrix which is nonnegative, so the Laplacian of a connected graph is positive semidefinite. Extension to graphs with {m>1} components follows by relabeling vertices (or switching rows and columns, depending on how you want to think about it) to get a block matrix of the form

{\left( \begin{array}{ccc} \mathbf{C_1} & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \mathbf{C_m} \end{array} \right)}

where {C_1, \cdots, C_m} are positive semidefinite since they’re Laplacians of connected components. The block matrix is positive semidefinite, which can be seen straight from the definition. So all Laplacians are positive semidefinite.

The other result we’re going to show is a relation between the eigenvalues of the Laplacian and the number of components in the graph. What we just did already suggests as much, but we’ll show it a bit more explicitly. Let {\lambda_1, \lambda_2, \cdots, \lambda_n} be the eigenvalues of {L} ordered from least to greatest. Clearly they’re all nonnegative and there are {n} of them (counting multiplicity) since {L} is diagonalizable. We’re going to show that the multiplicity of the eigenvalue {0} is exactly the number of components in {G} (recall that every {L} has eigenvector {\mathbf{1}} and so {\lambda_1} is always {0}).

If {G} has {m} components, then we can find {m} linearly independent vectors in its null space. For ease of visualization we can relabel to form a block diagonal matrix as we did above, and then the independent eigenvectors are just {(\mathbf{1}, \mathbf{0}, \cdots, \mathbf{0}), (\mathbf{0}, \mathbf{1}, \cdots, \mathbf{0}), \cdots, (\mathbf{0}, \mathbf{0}, \cdots, \mathbf{1})} with the appropriate sizes of blocks. So {\lambda_m = 0}.

If {G} has fewer than {m} components, say {m-1} components, then we can find {m-1} columns (and rows) to throw out and we will have a super-Laplacian matrix, since we can throw out one vertex from each component giving us a block diagonal matrix of super-Laplacians. This is full rank, so our original matrix {L} is of rank at least {n-(m-1)}, so the dimension of the null space of {L} is at most {m-1}, and we can conclude that {\lambda_m > 0}.
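A quick numerical check of the multiplicity claim, on a small disconnected graph of my own choosing:

```python
import numpy as np

def laplacian(n, edges):
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1
    return np.diag(A.sum(axis=1)) - A

# two components: a triangle on {0,1,2} and a single edge on {3,4}
L = laplacian(5, [(0, 1), (1, 2), (0, 2), (3, 4)])
eigs = np.linalg.eigvalsh(L)
assert np.sum(np.abs(eigs) < 1e-9) == 2   # eigenvalue 0 has multiplicity 2
```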

Like I said, this is far from the heart of spectral graph theory, but it’s a fun thing to play around with since the concepts are really basic and easy to understand. Another thing to note that I haven’t proven: the second smallest eigenvalue of {L} is called the algebraic connectivity of {G}. We saw a little bit of how that might work since {\lambda_2 > 0} if and only if {G} is connected. Apparently a larger {\lambda_2} means it is more connected, although not quite in the sense of {k}-vertex or {k}-edge connectivity. That more or less sums up the basic facts on Wikipedia, but I’m sure there are many more simple observations to be made.

[1] The material for this course helped me a lot in trying to approach this problem, and of course provided the extended Sylvester’s Criterion. I’m in the Stanford Online version and I really enjoy it.

Balls and Bins with G(n,m)

It’s been a while since I’ve posted so I figured I’d post something short to get back in the swing of things. I spent a week chasing some results in spectral graph theory without knowing anything about it. I still have a hope to find something there but first I need to fill in some gaps in my knowledge. Anyway, recently I came across this problem set. I really like all the problems but the question about {G(n,m)} caught my attention. I figured I would look into it.

Apparently this is also a standard model, but I hadn’t heard of it. A graph {G(n,m)} is generated by taking {n} vertices and randomly placing {m} edges in the graph. The most notable difference is that vertex pairs are selected with replacement, so there is a possibility of parallel edges. The problem from that assignment allows for self-loops, which is just insanity. I also don’t want to post solutions to other people’s problems. But mostly it is insanity, so we’ll stick to the more standard model.

Actually, some natural questions about this model are well-known results. We can ask for what value of {m} do we get a double edge with high probability; this is just the birthday paradox. We can ask for what value of {m} do we have a subgraph that is complete; this is coupon collector. Another question we can ask is if {m=n}, what is a bound on the maximum number of parallel edges?

That last one probably isn’t as common but it was actually a homework problem for me once [1]. It was phrased as follows: If we toss {n} balls into {n} bins by selecting the bin for each ball uniformly at random, show that no bin contains more than {O(\tfrac{\ln(n)}{\ln(\ln(n))})} balls with high probability, more specifically {1-\tfrac{1}{n}}.

This is an exercise in Chernoff bounds, which are essential in basically every randomized algorithm I’ve ever seen. It places bounds on the probability that a random variable with a binomial distribution is really far from its mean. Actually it can be the sum of negatively associated random variables, but that’s not really important here. The really nice thing is that the bound is exponentially small. There are a bunch of different forms but they’re all derived in a similar manner. The version we’re going to use is best when we want to bound the probability that a random variable is really far above its mean. Moving past this extremely technical terminology, the bound states that {Pr(X \geq k\mu) \leq (\tfrac{e^{k-1}}{k^k})^{\mu}}.

The overall plan, as per usual, is to bound the probability that any single bin has more than {O(\tfrac{\ln(n)}{\ln(\ln(n))})} balls and then complete the problem with a union bound. For this to work we aim to bound the probability that a given bin exceeds this by {\tfrac{1}{n^2}}. We let {X} be the number of balls in a given bin, with {X_i} equal to {1} if the {i}th ball is thrown in the bin and {0} otherwise. Clearly {X = \sum{X_i}} and the {X_i} are independent, so we can apply Chernoff. I’m also going to be really lazy with constants; it’s definitely possible to be more precise. We know that {Pr(X \geq k\mu) \leq (\tfrac{e^{k-1}}{k^k})^{\mu}}, so if {\mu = 1} and {k \in O(\tfrac{\ln(n)}{\ln(\ln(n))})} then

{Pr(X \geq k) \leq \tfrac{e^{k-1}}{k^k} \leq \tfrac{e^k}{k^k} \approx \tfrac{1}{k^k} \leq \tfrac{1}{n^2}}

That last step wasn’t obvious at all to me, so here’s a short justification. Let {k = \tfrac{c\ln(n)}{\ln(\ln(n))}} and

{k^{-k} = (\tfrac{c\ln(n)}{\ln(\ln(n))})^{-\tfrac{c\ln(n)}{\ln(\ln(n))}} \leq (\sqrt{\ln(n)})^{-\tfrac{c\ln(n)}{\ln(\ln(n))}} = (\ln(n))^{-\tfrac{c\ln(n)}{2\ln(\ln(n))}} = (\exp(\ln(\ln(n))))^{-\tfrac{c\ln(n)}{2\ln(\ln(n))}} = \exp(-c\tfrac{\ln(n)}{2}) \leq n^{-2}}

At this point we’re done, since the union bound implies that the probability that any bin has more than {O(\tfrac{\ln(n)}{\ln(\ln(n))})} balls is less than {n \cdot \tfrac{1}{n^2} = \tfrac{1}{n}}, and so the probability that no bin exceeds {O(\tfrac{\ln(n)}{\ln(\ln(n))})} balls is at least {1-\tfrac{1}{n} \rightarrow 1}.
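
To see the bound in action, here’s a minimal balls-in-bins simulation sketch; the constant {c = 4} is an arbitrary choice of mine, in the spirit of being lazy with constants:

```python
import math
import random

def max_load(n, rng):
    """Throw n balls into n bins uniformly at random; return the fullest bin's count."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return max(counts)

rng = random.Random(0)
n = 100_000
bound = 4 * math.log(n) / math.log(math.log(n))   # c = 4 is a generous constant
trials = [max_load(n, rng) for _ in range(5)]
assert all(t <= bound for t in trials)   # max load stays under c*ln(n)/ln(ln(n))
```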

That didn’t actually have anything to do with the random graph model specifically, I suppose, but the balls in bins idea has obvious applications elsewhere, like hashing and job assignments. Returning to {G(n,m)}, we’re just going to look at the number of isolated vertices in a very brief and incomplete fashion.

A vertex is isolated with probability {(1 - \tfrac{n-1}{\binom{n}{2}})^m = (1 - \tfrac{2}{n})^m}, since each of the {m} edges is an independent attempt that must miss all {n-1} potential edges at that vertex out of {\binom{n}{2}} possible. Thus the expected number of isolated vertices is {n(1-\tfrac{2}{n})^m}. If we let {m = cn\ln(n)} and take the limit we see that

{\displaystyle\lim_{n \rightarrow \infty}{n(1-\tfrac{2}{n})^{cn\ln(n)}} = \displaystyle\lim_{n \rightarrow \infty}{n\left((1-\tfrac{2}{n})^{n}\right)^{c\ln(n)}} = \displaystyle\lim_{n \rightarrow \infty}{ne^{-2c\ln(n)}} = \displaystyle\lim_{n \rightarrow \infty}{n^{1-2c}}}

and so if {c > \tfrac{1}{2}} there is almost surely no isolated vertex. If this looks familiar, it’s because it’s basically the same behavior as in {G(n,p)}, which is kind of obvious in hindsight: at least from the perspective of isolated vertices, the possibility of multiple edges makes very little difference. The only change is that the probability of getting an edge adjacent to a given vertex is now fixed and we’re varying the number of attempts. The variance computation also appears to be identical, so I won’t put it here.
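
A quick simulation sketch of {G(n,m)} (self-loops excluded, parallel edges allowed, as discussed above) agrees with the {n^{1-2c}} prediction; the parameters here are just ones I picked:

```python
import math
import random

def isolated_count(n, m, rng):
    """Build G(n, m): m edges drawn with replacement; count degree-0 vertices."""
    touched = [False] * n
    for _ in range(m):
        u = rng.randrange(n)
        v = rng.randrange(n)
        while v == u:               # re-draw to forbid self-loops
            v = rng.randrange(n)
        touched[u] = touched[v] = True
    return touched.count(False)

rng = random.Random(1)
n, c = 2000, 0.25
m = int(c * n * math.log(n))
avg = sum(isolated_count(n, m, rng) for _ in range(20)) / 20
predicted = n ** (1 - 2 * c)        # = n(1 - 2/n)^m up to lower-order terms
assert abs(avg - predicted) / predicted < 0.3
```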

When I was trying to make my own modifications to {G(n,p)} for weighted graphs I was trying all kinds of strange things. I tried generating an edge with probability {p} and then selecting a weight from {[n]} with different distributions. Strange things happened. This is also a pretty nice and easy to understand model. I’ll come back to it once I have learned something about spectral graph theory.

[1] I looked around for the website of the class I took the balls and bins problem from, but I can’t seem to find it. The class was Probability and Computing, 15-359, taught by professors Harchol-Balter and Sutner in the spring of 2012.

Loose Bounds on Connectivity in G(n,p)

One of the popular models for random graphs is Erdos-Renyi. We generate {G(n,p)} by taking {n} nodes and, independently for every pair, generating an edge with probability {p}. This is nice because it makes computing properties of the graph really easy. The downside is that very few real structures actually follow this model – for instance in social networks, if {u} and {v} are adjacent and {v} and {w} are adjacent, then {u} and {w} are more likely to be adjacent, which is clearly not the case in {G(n,p)}. It’s a shame, but at least it’s fun to play with.

A monotone property of a graph is one that is preserved by adding more edges. For a monotone property {P}, it is natural to ask for what values of {p} does {G(n,p)} have {P} with probability approaching one as {n} goes to infinity? Since the property is monotone, there is some threshold function {f(n)} such that for {p} above it, {G(n,p)} almost surely has property {P}. In fact many properties have sharp thresholds (as always, for a more complete discussion of this check out [1]). We’ll focus on connectivity, clearly a monotone property.

We’re going to take a shot at bounding the threshold value from below. Our first approach is an extremely loose bound. Let {X} be the random variable representing the number of edges in {G(n,p)} (we’ll just call it {G} from now on). If {E[X]} goes to zero then with high probability {G} has no edges at all. This is evident from Markov’s inequality (Hopcroft and Kannan call it the First Moment Method): since {P(X \geq a) \leq \tfrac{E[X]}{a}}, if {E[X]} drops to zero then the probability there are any edges at all drops to {0}. Since the expected number of edges in the graph is {E[X] = \binom{n}{2}p = \tfrac{n(n-1)p}{2}}, we see that if {p = \tfrac{c}{n^{2+\epsilon}}} where {\epsilon > 0} then {\lim_{n \rightarrow \infty}{E[X]} = \lim_{n \rightarrow \infty}{\tfrac{n(n-1)}{2}\tfrac{c}{n^{2+\epsilon}}} \leq \lim_{n \rightarrow \infty}{\tfrac{c}{2n^{\epsilon}}} = 0}. So if {p = \tfrac{c}{n^{2+\epsilon}}} with {\epsilon > 0}, {G} almost surely has no edges and is therefore disconnected with high probability.

Actually we can do a little better than this even with our super-naive approach. We don’t need {E[X]} to drop to zero. A graph is disconnected if it has fewer than {n-1} edges, so if {Pr(X \geq n-1)} goes to zero, {G} is almost surely disconnected. Therefore all we need is that {E[X] \in o(n)}. This revision suggests that {p = \tfrac{c}{n^{1+\epsilon}}} will do the trick since {\lim_{n \rightarrow \infty}{E[X]} = \lim_{n \rightarrow \infty}{\tfrac{n(n-1)p}{2}} \leq \lim_{n \rightarrow \infty}{\tfrac{n^2p}{2}} = \lim_{n \rightarrow \infty}{\tfrac{cn^2}{2n^{1+\epsilon}}} = \lim_{n \rightarrow \infty}{\tfrac{cn^{1-\epsilon}}{2}}}, and so {\lim_{n \rightarrow \infty}{\tfrac{E[X]}{n-1}} = \lim_{n \rightarrow \infty}{n^{-\epsilon}}}. For {\epsilon > 0} this drops to {0} so {G} is almost surely disconnected.

It should come as no surprise that this is still a terrible, terrible bound. There’s a long way between having {n-1} edges and being connected, since a graph with {n-1} randomly placed edges is very unlikely to be connected. We can achieve a marginally better bound without doing too much work, though. A graph is connected if and only if it has a spanning tree, so if a graph almost surely has no spanning tree then it is disconnected with high probability. Let {X} be the number of spanning trees in {G}. If {E[X]} drops to {0} then {G} almost surely has no spanning tree and so is almost surely disconnected. By Cayley’s formula there are {n^{n-2}} possible spanning trees, each present with probability {p^{n-1}}, so {E[X] = n^{n-2}p^{n-1} = \tfrac{(np)^{n-1}}{n}}. If {p = \tfrac{c}{n}} then {E[X] = \tfrac{c^{n-1}}{n}}, which goes to {0} for any {c \leq 1}. So our new lower bound for {p} is {\tfrac{1}{n}}, better by literally an infinitesimal amount.
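
Since {n = 4} is small enough to enumerate every graph, we can verify {E[X] = n^{n-2}p^{n-1}} exactly; this brute-force check is a sketch of my own:

```python
from itertools import combinations
from fractions import Fraction

# All 6 possible edges on 4 labelled vertices.
VERTS = range(4)
EDGES = list(combinations(VERTS, 2))

def is_spanning_tree(edges):
    """A 3-edge set on 4 vertices is a spanning tree iff it is acyclic (union-find)."""
    if len(edges) != 3:
        return False
    parent = list(VERTS)
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False            # adding this edge would close a cycle
        parent[ru] = rv
    return True

def expected_trees(p):
    """Exact E[# spanning trees of G(4, p)], summing over all 2^6 graphs."""
    total = Fraction(0)
    for size in range(7):
        for graph in combinations(EDGES, size):
            trees = sum(1 for t in combinations(graph, 3) if is_spanning_tree(t))
            total += trees * p**size * (1 - p) ** (6 - size)
    return total

p = Fraction(1, 2)
assert expected_trees(p) == 4**2 * p**3    # Cayley: n^(n-2) p^(n-1) = 16 p^3
```

Exact rational arithmetic via `Fraction` lets the check be an equality rather than a floating-point tolerance.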

The actual optimal lower bound for {p} is {\tfrac{c \ln(n)}{n}} where {c < 1}. In fact, connectivity experiences what is called a sharp threshold. That is, there exists a {p = f(n)} such that {G} almost surely is disconnected for {p = cf(n)} where {c < 1} and is almost surely connected for {c > 1}. The proof is a little messy for my tastes, so we’ll just finish the process of bounding {p} from below. The following proof is just taken from Hopcroft and Kannan.

We will now investigate the expected number of isolated vertices (vertices with degree {0}). Clearly if {G} almost surely has an isolated vertex then {G} is almost surely disconnected. Let {X} be the number of isolated vertices, so {E[X] = n(1-p)^{n-1}}. If {p = \tfrac{c \ln(n)}{n}}, then {E[X] = n(1-p)^{n-1} = n(1-\tfrac{c\ln(n)}{n})^{n-1}}, so {\lim_{n \rightarrow \infty}{E[X]} = \lim_{n \rightarrow \infty}{n(1-\tfrac{c \ln(n)}{n})^n} = \lim_{n \rightarrow \infty}{ne^{-c\ln(n)}} = \lim_{n \rightarrow \infty}{n^{1-c}}}. So if {c > 1} this goes to {0} and {G} almost surely has no isolated vertices.
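
Here’s a small Monte Carlo sketch of this expectation (the parameters are arbitrary choices of mine):

```python
import math
import random

def gnp_isolated(n, p, rng):
    """Sample G(n, p) and count degree-0 vertices; each potential edge is
    flipped exactly once and shared by both endpoints."""
    touched = [False] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                touched[i] = touched[j] = True
    return touched.count(False)

rng = random.Random(0)
n, c = 300, 0.5
p = c * math.log(n) / n
avg = sum(gnp_isolated(n, p, rng) for _ in range(40)) / 40
# E[X] = n(1-p)^(n-1), roughly n^(1-c) = sqrt(300) here
assert abs(avg - n * (1 - p) ** (n - 1)) < 5.0
```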

Unfortunately this is not actually enough to tell us that for {c < 1} there are almost surely isolated vertices: the expectation could be large because a few atypical graphs have many isolated vertices while most graphs have none. To fix this we need what Hopcroft and Kannan call the second moment method, which is basically using Chebyshev’s inequality to show that if {\tfrac{Var(X)}{E^2[X]}} goes to {0} then {X} is almost surely nonzero, since {Pr(X=0) \leq Pr(|X - E[X]| \geq E[X]) \leq \tfrac{Var(X)}{E^2[X]}}. An immediate consequence, by the definition of variance, is that if {E[X^2] \leq E^2[X](1+o(1))} then {X} is almost surely greater than zero.

This is generally a little messier since variance computations are not as nice as computing the mean, but in this case it isn’t too bad. We want {E[X^2]}. Splitting {X} into the usual indicator random variables gives {E[X^2] = E[(x_1 + \cdots + x_n)^2] = \sum_{i=1}^{n}{E[x_i^2]} + 2\sum_{i < j}{E[x_ix_j]}}. Since {x_i^2 = x_i}, the first sum is just {E[X]}. There are {n(n-1)/2} terms in the second sum and each term is {E[x_ix_j] = (1-p)^{2(n-1)-1} = (1-p)^{2n-3}}, where the {-1} avoids double-counting the potential edge between vertices {i} and {j}. Now we can just compute {\tfrac{E[X^2]}{E^2[X]}}, substitute in {p = \tfrac{c\ln(n)}{n}} where {c < 1}, and send {n} to infinity.

\begin{array}{ccc}  \frac{E[X^2]}{E^2[X]} &=& \frac{E[X] + n(n-1)(1-p)^{2n-3}}{E^2[X]} \\  &=& \frac{1}{E[X]} + \frac{n(n-1)(1-p)^{2n-3}}{n^2(1-p)^{2n-2}}\\  &=& \frac{1}{E[X]} + (1-\frac{1}{n})\frac{1}{1-p}\\  &=& 1+o(1)\\  \end{array}

So {G} almost surely has an isolated vertex if {c < 1}.
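
The ratio {\tfrac{E[X^2]}{E^2[X]} = \tfrac{1}{E[X]} + (1-\tfrac{1}{n})\tfrac{1}{1-p}} is easy to evaluate numerically to watch it approach {1}; a quick sketch:

```python
import math

def ratio(n, c):
    """E[X^2] / E[X]^2 for the isolated-vertex count X in G(n, p), p = c ln(n)/n."""
    p = c * math.log(n) / n
    ex = n * (1 - p) ** (n - 1)                       # E[X]
    ex2 = ex + n * (n - 1) * (1 - p) ** (2 * n - 3)   # E[X^2]
    return ex2 / ex**2

c = 0.5
ratios = [ratio(10**k, c) for k in range(3, 8)]
assert all(r > 1 for r in ratios)     # the ratio is always slightly above 1...
assert ratios[-1] - 1 < 1e-3          # ...and it tends to 1 as n grows
```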

As mentioned before, if {p = c\tfrac{\ln(n)}{n}} and {c > 1}, it is possible to show that {G} is almost surely connected (not just that it almost surely has no isolated vertices). This is shown by Hopcroft and Kannan by considering the “giant component”: as {p} increases past {\tfrac{1}{n}}, a single component emerges that contains a constant fraction of the vertices, and by the time {p} approaches {\tfrac{\ln(n)}{n}} everything outside it is an isolated vertex. Once {p = \tfrac{c\ln(n)}{n}} with {c > 1}, the isolated vertices disappear because the giant component has swallowed the whole graph.

Erdos-Renyi is an interesting model to play around with, mostly because it’s the only model I can easily follow most of the math for. For a while I was trying to apply some basic spectral graph theory techniques to it, but I couldn’t make it hold up except in the most basic of facts. For instance the expected Laplacian of {G(n,p)} is obvious, and if you apply Kirchhoff’s Theorem to it then you get the expected number of spanning trees. What I couldn’t do was figure out the expected second smallest eigenvalue of the Laplacian, which is nonzero if and only if {G} is connected. Perhaps a little more creativity was needed on that front. Hopefully I get a chance to review it.

[1] Foundations of Data Science by Hopcroft and Kannan. A draft can be found at http://www.cs.cornell.edu/jeh/book112013.pdf
Used for the proof of the sharp threshold for the existence of isolated vertices.

The Irwin-Hall Distribution

A while ago I came across this problem from Stanford’s Putnam seminar:

Two points are independently selected uniformly at random from a line of length {b}. What is the probability that they are at least a distance {a} apart, where {b > a}? (Putnam 1961 B2)

This problem has a very neat solution which exploits what I find to be a rather clever technique. We put the line on a coordinate system and say it runs from {0} to {b}. Let {X_1} be the coordinate of the first point and {X_2} the coordinate of the second, where {X_1} and {X_2} are clearly uniformly distributed random variables. Now draw the square {[0,b] \times [0,b]}, as in the following diagram [1]:


The shaded region represents the area in which the points are at least {a} apart. It has area {2(\tfrac{1}{2}(b-a)^2) = (b-a)^2}, so the probability the two points are at least {a} apart is simply {\tfrac{(b-a)^2}{b^2}}.
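
A quick Monte Carlo sketch (the parameters {a = 1}, {b = 3} are arbitrary choices of mine) agrees with {\tfrac{(b-a)^2}{b^2}}:

```python
import random

def far_apart_prob(a, b, samples, rng):
    """Monte Carlo estimate of Pr(|X1 - X2| >= a) for X1, X2 ~ U(0, b)."""
    hits = sum(abs(rng.uniform(0, b) - rng.uniform(0, b)) >= a
               for _ in range(samples))
    return hits / samples

rng = random.Random(0)
a, b = 1.0, 3.0
est = far_apart_prob(a, b, 200_000, rng)
exact = (b - a) ** 2 / b**2     # = 4/9 for these parameters
assert abs(est - exact) < 0.01
```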

This trick can be applied to a whole host of problems. As far as I can tell the only requirement is that we are sampling from uniform distributions. One somewhat classic problem is to find the probability density function of the random variable {Y = X_1 + X_2} where {X_1, X_2 \sim U(0,1)}. By the same trick we can compute the CDF of {Y}; we just need to split it into cases:


The CDF is therefore {F_Y(s) = \tfrac{s^2}{2}} when {0 < s < 1} and {F_Y(s) = 1 - \tfrac{(2-s)^2}{2}} when {1 < s < 2}. Differentiating gives us the triangular distribution {f_Y(s) = s} when {0 < s < 1} and {f_Y(s) = 2 - s} when {1 < s < 2}.
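
Here’s a short simulation sketch of mine checking the piecewise CDF {F_Y(s) = \tfrac{s^2}{2}} on {(0,1)} and {F_Y(s) = 1 - \tfrac{(2-s)^2}{2}} on {(1,2)}:

```python
import random

rng = random.Random(0)
samples = [rng.random() + rng.random() for _ in range(200_000)]

def cdf(s):
    """CDF of the sum of two independent U(0,1) variables."""
    return s * s / 2 if s <= 1 else 1 - (2 - s) ** 2 / 2

for s in (0.5, 1.0, 1.5):
    empirical = sum(x <= s for x in samples) / len(samples)
    assert abs(empirical - cdf(s)) < 0.01   # empirical CDF matches the formula
```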

Looking at the last post I thought an interesting idea would be to generalize this to {n} uniform random variables. This is of course a solved problem: the density function is known as the Irwin-Hall distribution. I wanted to take a run at it from our high-dimensional geometry perspective, since we can apply the same idea as above to the unit cube in {n} dimensions to compute the CDF of Irwin-Hall.

Before we attack this we need to know how to compute the volume of the standard {n}-simplex, the set {\{\vec{x} \mid \sum{x_i} < 1\}} where the {x_i} are nonnegative. We’ll follow the explanation given in [2]. First consider the simplex in {\mathbb{R}^n} with vertices {(1,0,\cdots,0), (1,1,0,\cdots,0), \cdots, (1,1,\cdots,1)} and the origin. This is just the set {\{\vec{x} \mid 1 \geq x_1 \geq x_2 \geq \cdots \geq x_n \geq 0\}}. Exactly {n!} copies of it tile the unit cube, since there are {n!} orderings of the coordinates and almost every point of the cube has its coordinates in exactly one such ordering (the boundaries, where ties occur, have measure zero). Thus this simplex has volume {\tfrac{1}{n!}}. We can map the standard simplex onto it by the transformation {x_n = y_n;\ x_{n-1} = y_{n-1} + y_n;\ \cdots;\ x_1 = y_1 + y_2 + \cdots + y_n}. This transformation has determinant one, so volume is unchanged and the standard simplex also has volume {\tfrac{1}{n!}}.
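
The {\tfrac{1}{n!}} volume is easy to check by Monte Carlo, since it’s exactly the probability that a uniform point of the unit cube lands in the simplex; a quick sketch:

```python
import math
import random

def simplex_volume_estimate(n, samples, rng):
    """Fraction of uniform points in [0,1]^n landing in the standard simplex."""
    inside = sum(sum(rng.random() for _ in range(n)) < 1 for _ in range(samples))
    return inside / samples

rng = random.Random(0)
for n in (2, 3, 4):
    est = simplex_volume_estimate(n, 200_000, rng)
    assert abs(est - 1 / math.factorial(n)) < 0.01   # volume is 1/n!
```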

I drew some pictures of the three dimensional case since some interesting things happen there. We’re keeping track of the volume underneath the plane {x_1 + x_2 + x_3 = s} that’s contained within the unit cube. Similar to the two-dimensional case, our function will be piecewise since it changes when it hits vertices of the cube. First our simplex increases at an obvious rate, maintaining a volume of {\tfrac{s^3}{3!}}. It looks kind of like this:


Then the function changes when {s} passes {1}. The way we compute the new volume is by taking the volume of the red simplex and subtracting out the volume of the blue simplices (outlined in the following figure), which gives us {\tfrac{s^3}{3!} - \binom{3}{1}\tfrac{(s-1)^3}{3!}}.


Finally the function changes for a last time when {s} passes {2}. We could just compute the volume of the missing simplex and subtract that from one, but we’ll try a method that works a little more generally. Note that the blue simplices have overlapped to form new simplices which are outlined in green below. Our method of computation is to take the red simplex, subtract out the blue simplices, and then add back in the green simplices. It’s basically inclusion-exclusion. This gives us {\tfrac{s^3}{3!} - \binom{3}{1}\tfrac{(s-1)^3}{3!} + \binom{3}{2}\tfrac{(s-2)^3}{3!}}.


This appears to generalize in a fairly simple manner. Note that our function changes based on the integer part of {s}. The first piece of our function is naturally just {\tfrac{s^n}{n!}}. When {s} hits {1}, the plane hits exactly {\binom{n}{1}} vertices (more precisely it hits {(1,0,\cdots,0) \cdots (0,\cdots,1)}). Our volume computation then requires that we subtract out the newly formed simplices. These have side length {s-1} and there are exactly {\binom{n}{1}} of them, since one forms at each vertex.

If {\lfloor s \rfloor = i} then we see that we will be computing {\tfrac{1}{n!}\sum_{k=0}^{i}{(-1)^k \binom{n}{k} (s-k)^n}}. Each time {s} crosses an integer {i} the plane hits {\binom{n}{i}} vertices (choose which {i} of the coordinates are {1}) and exactly that many new simplices are formed to be added or subtracted out, depending on parity. Thus our CDF is exactly

{P(X < s) = \tfrac{1}{n!} \displaystyle\sum_{k=0}^{\lfloor s \rfloor}{(-1)^k \binom{n}{k} (s-k)^n}}

And differentiating with respect to {s} gives us our PDF

{f_X(s) = \tfrac{1}{(n-1)!} \displaystyle\sum_{k=0}^{\lfloor s \rfloor}{(-1)^k \binom{n}{k} (s-k)^{n-1}}}
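
As a sanity check, here’s a short sketch of my own comparing this closed-form CDF against simulated sums of uniforms:

```python
import math
import random

def irwin_hall_cdf(s, n):
    """Closed-form CDF of the sum of n independent U(0,1) variables."""
    return sum((-1) ** k * math.comb(n, k) * (s - k) ** n
               for k in range(math.floor(s) + 1)) / math.factorial(n)

rng = random.Random(0)
n = 5
samples = [sum(rng.random() for _ in range(n)) for _ in range(200_000)]
for s in (1.0, 2.5, 4.0):
    empirical = sum(x <= s for x in samples) / len(samples)
    assert abs(empirical - irwin_hall_cdf(s, n)) < 0.01
```

Note that {irwin\_hall\_cdf(2.5, 5) = \tfrac{1}{2}} by symmetry, which makes for a nice spot check.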

Admittedly there are easier ways to go about proving the expressions for the PDF or CDF are valid, but I like this one since it minimizes the amount of “clairvoyance” involved. As a whole I find this trick with the unit cube to be fairly useful whenever solving problems with uniform distributions. I’m not sure of legitimate practical applications but it seems to come up in contest math from time to time. Hopefully someday I will spot an opportunity to use it in a real-world setting.

[1] I drew the figures but I basically stole and modified the code found here: http://tex.stackexchange.com/a/75974
[2] Foundations of Data Science by Hopcroft and Kannan.
A draft can be found here: http://www.cs.cornell.edu/jeh/book112013.pdf
Used for the computation of volume of a standard simplex