Balls and Bins with G(n,m)

It’s been a while since I’ve posted, so I figured I’d post something short to get back in the swing of things. I spent a week chasing some results in spectral graph theory without knowing anything about it. I still hope to find something there, but first I need to fill in some gaps in my knowledge. Anyway, recently I came across this problem set. I really like all the problems, but the question about {G(n,m)} caught my attention, so I figured I would look into it.

Apparently this is also a standard model, but I hadn’t heard of it. A graph {G(n,m)} is generated by taking {n} vertices and randomly placing {m} edges in the graph. The most notable difference from {G(n,p)} is that vertex pairs are selected with replacement, so parallel edges are possible. The problem from that assignment also allows self-loops, which is just insanity. I also don’t want to post solutions to other people’s problems. But mostly it is insanity, so we’ll stick to the more standard model (parallel edges allowed, no self-loops).
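
Just to make the model concrete, here’s a minimal sketch of how I’d sample this variant (the function name and parameters are mine, not from the problem set): draw {m} vertex pairs uniformly with replacement, disallowing self-loops.

```python
import random
from collections import Counter

def gnm_multigraph(n, m, seed=None):
    """Sample the G(n, m) variant discussed above: m edges drawn uniformly
    with replacement from the C(n, 2) vertex pairs, so parallel edges can
    occur but self-loops cannot."""
    rng = random.Random(seed)
    edges = []
    for _ in range(m):
        u, v = rng.sample(range(n), 2)   # two distinct vertices
        edges.append((min(u, v), max(u, v)))
    return edges

# Tiny example: 10 vertices, 15 edges, then count the extra parallel copies.
edges = gnm_multigraph(10, 15, seed=0)
print(sum(c - 1 for c in Counter(edges).values() if c > 1), "parallel edges")
```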

Actually, some natural questions about this model are well-known results. We can ask for what value of {m} we get a double edge with high probability; this is just the birthday paradox. We can ask for what value of {m} every one of the {\binom{n}{2}} possible edges has appeared at least once, so the underlying simple graph is complete; this is coupon collector. Another question we can ask is: if {m=n}, what is a bound on the maximum number of parallel edges?
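
To make those first two thresholds a bit more tangible, here’s a quick simulation sketch (all names are mine): it draws pairs with replacement and records when the first parallel edge appears and when every pair has been seen, which should land roughly around the birthday-paradox scale {\sqrt{\pi N/2}} and the coupon-collector scale {N\ln(N)} respectively, where {N = \binom{n}{2}}.

```python
import math
import random
from itertools import combinations

def first_repeat_and_cover(n, seed=None):
    """Draw vertex pairs with replacement until every pair has appeared.
    Return (draws until the first repeated pair, draws until full coverage)."""
    rng = random.Random(seed)
    pairs = list(combinations(range(n), 2))
    seen, first_repeat, draws = set(), None, 0
    while len(seen) < len(pairs):
        e = rng.choice(pairs)
        draws += 1
        if e in seen and first_repeat is None:
            first_repeat = draws
        seen.add(e)
    return first_repeat, draws

n = 30
N = math.comb(n, 2)
rep, cover = first_repeat_and_cover(n, seed=1)
print(rep, "vs birthday scale", round(math.sqrt(math.pi * N / 2)))
print(cover, "vs coupon collector scale", round(N * math.log(N)))
```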

That last question probably isn’t as common, but it was actually a homework problem for me once [1]. It was phrased as follows: if we toss {n} balls into {n} bins by selecting the bin for each ball uniformly at random, show that no bin contains more than {O(\tfrac{\ln(n)}{\ln(\ln(n))})} balls with high probability, more specifically with probability at least {1-\tfrac{1}{n}}.

This is an exercise in Chernoff bounds, which are essential in basically every randomized algorithm I’ve ever seen. They bound the probability that a random variable with a binomial distribution is really far from its mean. Actually it can be the sum of negatively associated random variables, but that’s not really important here. The really nice thing is that the bound is exponentially small. There are a bunch of different forms but they’re all derived in a similar manner. The version we’re going to use is best when we want to bound the probability that the variable is many multiples of its mean above it. Moving past this extremely technical terminology, the bound states that {Pr(X \geq k\mu) \leq (\tfrac{e^{k-1}}{k^k})^{\mu}} for {k > 1}.
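
For completeness, here’s the standard derivation of that form, nothing specific to this problem: Markov’s inequality applied to {e^{tX}} with {t > 0}, the bound {E[e^{tX_i}] = 1 + p_i(e^t - 1) \leq e^{p_i(e^t - 1)}} for independent {0}/{1} variables {X_i} with {\sum_i p_i = \mu}, and finally the choice {t = \ln(k)}.

\begin{array}{ccc}  Pr(X \geq k\mu) = Pr(e^{tX} \geq e^{tk\mu}) &\leq& \frac{E[e^{tX}]}{e^{tk\mu}} \\  &=& \frac{\prod_i{E[e^{tX_i}]}}{e^{tk\mu}} \leq \frac{e^{\mu(e^t - 1)}}{e^{tk\mu}} \\  &=& \frac{e^{\mu(k-1)}}{k^{k\mu}} = (\tfrac{e^{k-1}}{k^k})^{\mu} \\  \end{array}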

The overall plan, as per usual, is to bound the probability that a single bin has more than {O(\tfrac{\ln(n)}{\ln(\ln(n))})} balls and then complete the problem with a union bound over bins. For this to work we want to bound the probability for a given bin by {\tfrac{1}{n^2}}. Let {X} be the number of balls in a given bin, where {X_i} is {1} if the {i}th ball lands in that bin and {0} otherwise. Clearly {X = \sum_i{X_i}} and the {X_i} are independent, so we can apply Chernoff; also {\mu = E[X] = n \cdot \tfrac{1}{n} = 1}. I’m going to be really lazy with constants; it’s definitely possible to be more precise. We see that {Pr(X \geq k\mu) \leq (\tfrac{e^{k-1}}{k^k})^{\mu}}, so with {\mu = 1} and {k = \tfrac{c\ln(n)}{\ln(\ln(n))}} for a suitable constant {c} we get

{Pr(X \geq k) \leq \tfrac{e^{k-1}}{k^k} \leq \tfrac{e^k}{k^k} \leq \tfrac{1}{n^2}}

That last step wasn’t obvious at all to me, so here’s a short justification. The {e^k} factor is harmless: {e^k = \exp(\tfrac{c\ln(n)}{\ln(\ln(n))}) = n^{c/\ln(\ln(n))} = n^{o(1)}}, so it only costs us a slightly larger constant {c}. For the {k^{-k}} part, with {k = \tfrac{c\ln(n)}{\ln(\ln(n))}},

{k^{-k} = (\tfrac{c\ln(n)}{\ln(\ln(n))})^{-\tfrac{c\ln(n)}{\ln(\ln(n))}} \leq (\sqrt{\ln(n)})^{-\tfrac{c\ln(n)}{\ln(\ln(n))}} = (\ln(n))^{-\tfrac{c\ln(n)}{2\ln(\ln(n))}} = (\exp(\ln(\ln(n))))^{-\tfrac{c\ln(n)}{2\ln(\ln(n))}} = \exp(-\tfrac{c\ln(n)}{2}) = n^{-c/2}}

where the first inequality uses {\tfrac{c\ln(n)}{\ln(\ln(n))} \geq \sqrt{\ln(n)}} for sufficiently large {n}. Together with the {n^{o(1)}} factor from {e^k}, this is at most {n^{-2}} once {c} is a large enough constant.

At this point we’re done, since the union bound implies that the probability that any bin has more than {O(\tfrac{\ln(n)}{\ln(\ln(n))})} balls is at most {n \cdot \tfrac{1}{n^2} = \tfrac{1}{n}}, and so the probability that no bin exceeds {O(\tfrac{\ln(n)}{\ln(\ln(n))})} balls is at least {1-\tfrac{1}{n} \rightarrow 1}.
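
If you want to see the {\tfrac{\ln(n)}{\ln(\ln(n))}} growth for yourself, here’s a quick simulation sketch (names are mine): it throws {n} balls into {n} bins a few times and prints the heaviest bin next to {\tfrac{\ln(n)}{\ln(\ln(n))}}. The hidden constant means the two columns won’t match exactly, but they grow at the same sluggish rate.

```python
import math
import random
from collections import Counter

def max_load(n, rng):
    """Throw n balls into n bins uniformly at random; return the heaviest bin."""
    return max(Counter(rng.randrange(n) for _ in range(n)).values())

rng = random.Random(0)
for n in (10**3, 10**4, 10**5):
    worst = max(max_load(n, rng) for _ in range(20))
    print(n, worst, round(math.log(n) / math.log(math.log(n)), 2))
```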

That didn’t actually have anything to do with the random graph model specifically, I suppose, but the balls in bins idea has obvious applications elsewhere, like hashing and job assignments. Returning to {G(n,m)}, we’re just going to look at the number of isolated vertices in a very brief and incomplete fashion.

A vertex is isolated with probability {(1 - \tfrac{n-1}{\binom{n}{2}})^m = (1 - \tfrac{2}{n})^m}, since each of the {m} edges independently has to miss the {n-1} pairs incident to that vertex out of the {\binom{n}{2}} possible. Thus the expected number of isolated vertices is {n(1-\tfrac{2}{n})^m}. If we let {m = cn\ln(n)} and take the limit we see that

{\displaystyle\lim_{n \rightarrow \infty}{n(1-\tfrac{2}{n})^{cn\ln(n)}} = \displaystyle\lim_{n \rightarrow \infty}{n\left((1-\tfrac{2}{n})^{n}\right)^{c\ln(n)}} = \displaystyle\lim_{n \rightarrow \infty}{ne^{-2c\ln(n)}} = \displaystyle\lim_{n \rightarrow \infty}{n^{1-2c}}}

and so if {c > \tfrac{1}{2}} there is almost surely no isolated vertex. If this looks familiar it’s because it’s basically the same computation as in {G(n,p)}. This is kind of obvious anyway: at least from the perspective of isolated vertices, the fact that you can have multiple edges makes very little difference. The only change is that the probability of getting an edge adjacent to a given vertex is now fixed at {\tfrac{2}{n}} and we’re varying the number of attempts instead. The variance computation also appears to be identical, so I won’t put it here.
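
Here’s a small sketch checking the {c = \tfrac{1}{2}} cutoff empirically (function name and parameters are mine): it samples edges with replacement at {m = cn\ln(n)} and counts untouched vertices, which should track the expected {n^{1-2c}}.

```python
import math
import random

def isolated_in_gnm(n, m, rng):
    """Sample m edges with replacement (no self-loops) and count the
    vertices that never appear as an endpoint."""
    touched = set()
    for _ in range(m):
        u, v = rng.sample(range(n), 2)
        touched.add(u)
        touched.add(v)
    return n - len(touched)

rng = random.Random(0)
n = 10000
for c in (0.3, 0.5, 0.7):
    m = int(c * n * math.log(n))
    counts = [isolated_in_gnm(n, m, rng) for _ in range(3)]
    print(c, counts, round(n ** (1 - 2 * c), 3))  # compare with n^{1-2c}
```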

When I was trying to make my own modifications to {G(n,p)} for weighted graphs I tried all kinds of strange things. I tried generating an edge with probability {p} and then selecting a weight from {[n]} under different distributions. Strange things happened. {G(n,m)} is also a pretty nice, easy-to-understand model. I’ll come back to it once I have learned something about spectral graph theory.

Sources:
[1] I looked around for the website of the class I took the balls and bins problem from, but I can’t seem to find it. The class was Probability and Computing, 15-359, taught by professors Harchol-Balter and Sutner in the spring of 2012.

Loose Bounds on Connectivity in G(n,p)

One of the popular models for random graphs is Erdos-Renyi. We generate {G(n,p)} by taking {n} nodes and, between every pair, independently generating an edge with probability {p}. This is nice because it makes computing properties of the graph really easy. The downside is that very few real structures actually follow this model – for instance, in social networks, if {u} and {v} are adjacent and {v} and {w} are adjacent, then {u} and {w} are more likely to be adjacent. This is clearly not the case in {G(n,p)}, where every edge is independent. It’s a shame, but at least it’s fun to play with.
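
Since everything below is about {G(n,p)}, here’s a minimal sketch of sampling it and testing connectivity (the helper names are mine); it’s handy for sanity-checking the thresholds discussed in the rest of the post.

```python
import random
from itertools import combinations

def gnp(n, p, rng):
    """Sample G(n, p): keep each of the C(n, 2) possible edges independently."""
    return [e for e in combinations(range(n), 2) if rng.random() < p]

def is_connected(n, edges):
    """Depth-first search from vertex 0 over an adjacency list."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

rng = random.Random(0)
print(is_connected(200, gnp(200, 0.05, rng)))  # well above ln(n)/n, so likely True
```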

A monotone property of a graph is one that is always preserved by adding more edges. For a monotone property {P} it is natural to ask: for what values of {p} does {G(n,p)} have {P} with probability tending to one as {n} goes to infinity? Since the property is monotone there will be some threshold function {f(n)} such that once {p} is above it, {G(n,p)} almost surely has property {P}. There are in fact sharp thresholds for many properties (as always, for a more complete discussion of this, check out [1]). We’ll focus on connectivity, clearly a monotone property.

We’re going to take a shot at bounding the threshold value from below. Our first approach is an extremely loose bound. Let {X} be the random variable representing the number of edges in {G(n,p)} (we’ll just call it {G} from now on). If {E[X]} goes to zero then with high probability {G} has no edges at all. This follows from Markov’s inequality (Hopcroft and Kannan call it the First Moment Method): since {P(X \geq a) \leq \tfrac{E[X]}{a}}, if {E[X]} drops to zero then {P(X \geq 1)}, the probability there are any edges at all, drops to {0}. Since the expected number of edges in the graph is {E[X] = \binom{n}{2}p = \tfrac{n(n-1)p}{2}}, we see that if {p = \tfrac{c}{n^{2+\epsilon}}} where {\epsilon > 0} then {\lim_{n \rightarrow \infty}{E[X]} = \lim_{n \rightarrow \infty}{\tfrac{n(n-1)}{2}\tfrac{c}{n^{2+\epsilon}}} \leq \lim_{n \rightarrow \infty}{\tfrac{c}{2n^{\epsilon}}} = 0}. So if {p = \tfrac{c}{n^{2+\epsilon}}} with {\epsilon > 0}, {G} almost surely has no edges and is therefore disconnected with high probability.

Actually we can do a little better than this even with our super-naive approach. We don’t need {E[X]} to drop to zero: a graph with fewer than {n-1} edges is necessarily disconnected, so if {Pr(X \geq n-1)} goes to zero, {G} is almost surely disconnected. Therefore all we need is that {E[X] \in o(n)}. This revision suggests that {p = \tfrac{c}{n^{1+\epsilon}}} will do the trick, since {\lim_{n \rightarrow \infty}{E[X]} = \lim_{n \rightarrow \infty}{\tfrac{n(n-1)p}{2}} \leq \lim_{n \rightarrow \infty}{\tfrac{n^2p}{2}} = \lim_{n \rightarrow \infty}{\tfrac{cn^2}{2n^{1+\epsilon}}} = \lim_{n \rightarrow \infty}{\tfrac{cn^{1-\epsilon}}{2}}}, and so by Markov’s inequality {Pr(X \geq n-1) \leq \tfrac{E[X]}{n-1}}, where {\lim_{n \rightarrow \infty}{\tfrac{E[X]}{n-1}} \leq \lim_{n \rightarrow \infty}{\tfrac{c}{2}n^{-\epsilon}}}. For {\epsilon > 0} this drops to {0}, so {G} is almost surely disconnected.

It should come as no surprise to find that this is still a terrible, terrible bound. There’s a long way between having {n-1} edges and being connected, since a graph with {n-1} randomly placed edges is very unlikely to be connected. We can achieve a marginally better bound without doing too much work though. A graph is connected if and only if it has a spanning tree, so if a graph almost surely has no spanning tree then it is disconnected with high probability. Let {X} be the number of spanning trees in {G}. If {E[X]} drops to {0} then {G} almost surely does not have a spanning tree and so is almost surely disconnected. By Cayley’s formula there are {n^{n-2}} labeled trees on {n} vertices and each shows up with probability {p^{n-1}}, so {E[X] = n^{n-2}p^{n-1}} (see the computation below). If {p = \tfrac{c}{n}} then {E[X] = \tfrac{(np)^{n-1}}{n} = \tfrac{c^{n-1}}{n}}, which goes to {0} for any {c \leq 1}. So our new lower bound for {p} is {\tfrac{1}{n}}, better by literally an infinitesimal amount.
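
Writing out that expectation explicitly, with the sum running over the {n^{n-2}} labeled trees from Cayley’s formula and each tree needing its {n-1} edges present:

\begin{array}{ccc}  E[X] &=& \displaystyle\sum_{T}{Pr(T \subseteq G)} = n^{n-2}p^{n-1} \\  p = \tfrac{c}{n}: \quad E[X] &=& \frac{(np)^{n-1}}{n} = \frac{c^{n-1}}{n} \rightarrow 0 \quad (c \leq 1) \\  \end{array}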

The actual optimal lower bound for {p} is {\tfrac{c \ln(n)}{n}} where {c < 1}. In fact, connectivity exhibits what is called a sharp threshold: there is a function {f(n)} (here {f(n) = \tfrac{\ln(n)}{n}}) such that {G} is almost surely disconnected for {p = cf(n)} with {c < 1} and almost surely connected for {c > 1}. The proof of the connected side is a little messy for my tastes, so we’ll just finish the process of bounding {p} from below. The following proof is taken from Hopcroft and Kannan.

We will now investigate the expected number of isolated vertices (vertices with degree {0}). Clearly if {G} almost surely has an isolated vertex then {G} is almost surely disconnected. Let {X} be the number of isolated vertices, so {E[X] = n(1-p)^{n-1}}. If {p = \tfrac{c \ln(n)}{n}}, then {E[X] = n(1-\tfrac{c\ln(n)}{n})^{n-1}}, so (dropping the {-1} in the exponent, which doesn’t affect the limit) {\lim_{n \rightarrow \infty}{E[X]} = \lim_{n \rightarrow \infty}{n(1-\tfrac{c \ln(n)}{n})^n} = \lim_{n \rightarrow \infty}{ne^{-c\ln(n)}} = \lim_{n \rightarrow \infty}{n^{1-c}}}. So if {c > 1} then {E[X] \rightarrow 0} and {G} almost surely has no isolated vertices.

Unfortunately this is not actually enough to tell us that for {c < 1} there are almost surely isolated vertices: {E[X]} could be large because a small set of graphs has lots of isolated vertices while the rest have none. To fix this we need what Hopcroft and Kannan call the second moment method, which is basically using Chebyshev’s inequality to show that if {\tfrac{Var(X)}{E^2[X]}} goes to {0} then {X} is almost surely not zero, since {Pr(X=0) \leq Pr(|X - E[X]| \geq E[X]) \leq \tfrac{Var(X)}{E^2[X]}}. An immediate consequence of this, by the definition of variance, is that if {E[X^2] \leq E^2[X](1+o(1))} then {X} is almost surely greater than zero.

This is generally a little messier since variance computations are not quite as nice as computing the mean, but in this case it isn’t too bad. We want {E[X^2]}. Splitting {X} into the usual indicator random variables {x_i} (where {x_i = 1} if vertex {i} is isolated) gives us {E[X^2] = E[(x_1 + \cdots + x_n)^2] = \sum_{i=1}^{n}{E[x_i^2]} + 2\sum_{i < j}{E[x_ix_j]}}. Since {x_i^2 = x_i}, the first sum is just {E[X]}. There are {n(n-1)/2} terms in the second sum and each is {E[x_ix_j] = (1-p)^{2(n-1)-1} = (1-p)^{2n-3}}, where the {-1} avoids double-counting the potential edge between vertices {i} and {j}. Now we can just compute {\tfrac{E[X^2]}{E^2[X]}}, substitute in {p = \tfrac{c\ln(n)}{n}} with {c < 1}, and send {n} to infinity.

\begin{array}{ccc}  \frac{E[X^2]}{E^2[X]} &=& \frac{E[X] + n(n-1)(1-p)^{2n-3}}{E^2[X]} \\  &=& \frac{1}{E[X]} + \frac{n(n-1)(1-p)^{2n-3}}{n^2(1-p)^{2n-2}}\\  &=& \frac{1}{E[X]} + (1-\frac{1}{n})\frac{1}{1-p}\\  &=& 1+o(1)\\  \end{array}

So, by the second moment method, {G} almost surely has an isolated vertex when {c < 1}, and is therefore almost surely disconnected.
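
As a quick numerical sanity check of the isolated-vertex threshold (the parameters are mine and the sizes are small, so expect noise), the sketch below samples {G(n,p)} at {p = \tfrac{c\ln(n)}{n}} for a few values of {c} and counts degree-zero vertices, to compare against {E[X] \approx n^{1-c}}.

```python
import math
import random

def isolated_in_gnp(n, p, rng):
    """Sample G(n, p) edge by edge and count vertices of degree zero."""
    degree = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                degree[u] += 1
                degree[v] += 1
    return sum(1 for d in degree if d == 0)

rng = random.Random(0)
n = 1000
for c in (0.5, 0.9, 1.1, 1.5):
    p = c * math.log(n) / n
    counts = [isolated_in_gnp(n, p, rng) for _ in range(3)]
    print(c, counts, round(n ** (1 - c), 2))  # compare with n^{1-c}
```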

As mentioned before, if {p = c\frac{\ln(n)}{n}} and {c > 1}, it is possible to show that {G} is almost surely connected (instead of just that it almost surely has no isolated vertices). This is shown by Hopcroft and Kannan by considering the “giant component.” In {G(n,p)}, as {p} grows toward the connectivity threshold, a single giant component emerges, and eventually the only vertices outside it are isolated ones. Once {p} is {\tfrac{c\ln(n)}{n}} where {c > 1} the isolated vertices disappear because the giant component has swallowed the whole graph.

Erdos-Renyi is an interesting model to play around with, mostly because it’s the only model I can easily follow most of the math for. For a while I was trying to apply some basic spectral graph theory techniques to it, but I couldn’t make it hold up except for the most basic facts. For instance the expected Laplacian of {G(n,p)} is obvious, and if you apply Kirchhoff’s Theorem to it then you get the expected number of spanning trees. What I couldn’t do was figure out the expected second smallest eigenvalue of the Laplacian, which is nonzero if and only if {G} is connected. Perhaps a little more creativity was needed on that front. Hopefully I get a chance to revisit it.

Sources:
[1] Foundations of Data Science by Hopcroft and Kannan. A draft can be found at http://www.cs.cornell.edu/jeh/book112013.pdf
Used for the proof of the sharp threshold for the existence of isolated vertices.