
Probability and Algorithms, Caltech CS150, Fall 2018

Leonard J. Schulman

Syllabus
The most important questions of life are, for the most part, really only problems of probability.
Strictly speaking one may even say that nearly all our knowledge is problematical; and in the
small number of things which we are able to know with certainty, even in the mathematical
sciences themselves, induction and analogy, the principal means for discovering truth, are based
on probabilities, so that the entire system of human knowledge is connected with this theory.
Pierre-Simon Laplace
Introduction to Théorie Analytique des Probabilités. Oeuvres, t. 7. Paris, 1886, p. 5.

Class time: MWF 10:00-11:00 in Annenberg 314. Office hours will be in my office, Annenberg 317.
My office hours are by appointment at the beginning of the term, after that I’ll fix a regular time
(but appointments will still be fine). Starting Oct 12: my OH are on Fridays at 11:00. TA: Jenish
Mehta, jenishc@gmail.com. TA Office Hours: weeks that an assignment is due: M 7:00pm Ann 121;
off weeks: W 7:00pm Ann 107. (There is a calendar on the class web page that includes all this
information.) Problem sets are due to Jenish’s mailbox by W 6:00pm.
There will be problem sets, due on Wednesdays; there will not be an exam. You may collaborate
with other students on the sets; just make a note of the extent of collaboration and with whom
(“truly joint work with Alice,” “Bob helped me on this problem,” etc.). This is assuming it’s a
collaboration and doesn’t regularly become one-way; if you feel that happening, (a) focus on doing
a decent fraction of the problems on your own or in consultation with me or the TA, (b) don’t collaborate
until after you’ve already spent some time thinking about the problem yourself.
Lecture notes will be posted after the fact.
The topics covered during the quarter are listed in the table of contents.
Some other topics not reached that I’ll try to cover when I add a second quarter to the course: Ran-
domized vs. Distributional Complexity. Game tree evaluation: upper and lower bounds. Karger’s
min-cut algorithm. Hashing, AKS dictionary hashing, cuckoo hashing. Power of two choices.
Talagrand concentration inequality. Linial-Saks graph partitioning. A. Kalai’s sampling random
factored numbers. Feige leader election. Approximation of the permanent and self-reducibility.
Equivalence of approximate counting and approximate sampling. ε-biased k-wise independent
spaces. #DNF-approximation. Shamir secret sharing. An interactive proof for a problem only
known to be in coNP: graph non-isomorphism. Searching for the first spot where two sequences
disagree. Weighted sampling (e.g., Karger network reliability). Markov Chain Monte Carlo.
Notes. (1) This course can be only an exposure to probability and its role in the theory of algorithms.
We will stay focused on key ideas and examples; we will not be overconcerned with best bounds.
(2) I assume this is not your first exposure to probability. Likewise I’ll assume you have some
familiarity with algorithms. However, the first lecture will start out with some basic examples and
definitions.
Books. There will be no assigned book, but I recommend the following references:
On reserve at SFL:

• Mitzenmacher & Upfal, Probability and Computing, Cambridge 2005


• Motwani & Raghavan, Randomized Algorithms, Cambridge 1995
• Williams, Probability with Martingales, Cambridge 1991
• Alon & Spencer, The Probabilistic Method, 4th ed., Wiley 2016

Not on reserve:


• Adams & Guillemin, Measure Theory and Probability, Birkhäuser 1996

• Billingsley, Probability and Measure, 3rd ed., Wiley 1995

Contents

1 Some basic probability theory
1.1 Lecture 1 (3/Oct): Appetizers
1.2 Lecture 2 (5/Oct) Some basics
1.2.1 Measure
1.2.2 Measurable functions, random variables and events
1.3 Lecture 3 (8/Oct): Linearity of expectation, union bound, existence theorems
1.3.1 Countable additivity
1.3.2 Coupon collector
1.3.3 Application: the probabilistic method
1.3.4 Union bound
1.3.5 Using the union bound in the probabilistic method: Ramsey theory
1.4 Lecture 4 (10/Oct): Upper and lower bounds
1.4.1 Bonferroni inequalities
1.4.2 Tail events: Borel-Cantelli
1.4.3 B-C II: a partial converse to B-C I
1.5 Lecture 5 (12/Oct): More on tail events: Kolmogorov 0-1, random walk
1.6 Lecture 6 (15/Oct): More probabilistic method
1.6.1 Markov inequality (the simplest tail bound)
1.6.2 Variance and the Chebyshev inequality: a second tail bound
1.6.3 Power mean inequality
1.6.4 Large girth and large chromatic number; the deletion method
1.7 Lecture 7 (17/Oct): FKG inequality
1.8 Lecture 8 (19/Oct) Part I: Achieving expectation in MAX-3SAT
1.8.1 Another appetizer
1.8.2 MAX-3SAT
1.8.3 Derandomization by the method of conditional expectations

2 Algebraic Fingerprinting
2.1 Lecture 8 (19/Oct) Part II: Fingerprinting with Linear Algebra
2.1.1 Polytime Complexity Classes Allowing Randomization
2.1.2 Verifying Matrix Multiplication
2.2 Lecture 9 (22/Oct): Fingerprinting with Linear Algebra
2.2.1 Verifying Associativity
2.3 Lecture 10 (24/Oct): Perfect matchings, polynomial identity testing
2.3.1 Matchings
2.3.2 Bipartite perfect matching: deciding existence
2.3.3 Polynomial identity testing
2.4 Lecture 11 (26/Oct): Perfect matchings in general graphs. Parallel computation. Isolating lemma.
2.4.1 Deciding existence of a perfect matching in a graph
2.4.2 Parallel computation
2.4.3 Sequential and parallel linear algebra
2.4.4 Finding perfect matchings in general graphs. The isolating lemma
2.5 Lecture 12 (29/Oct): Isolating lemma, finding a perfect matching in parallel
2.5.1 Proof of the isolating lemma
2.5.2 Finding a perfect matching, in RNC

3 Concentration of Measure
3.1 Lecture 13 (31/Oct): Independent rvs, Chernoff bound, applications
3.1.1 Independent rvs
3.1.2 Chernoff bound for uniform Bernoulli rvs (symmetric random walk)
3.1.3 Application: set discrepancy
3.1.4 Entropy and Kullback-Leibler divergence
3.2 Lecture 14 (2/Nov): Stronger Chernoff bound, applications
3.2.1 Chernoff bound using divergence; robustness of BPP
3.2.2 Balls and bins
3.2.3 Preview of Shannon’s coding theorem
3.3 Lecture 15 (5/Nov): Application of large deviation bounds: Shannon’s coding theorem. Central limit theorem
3.3.1 Shannon’s block coding theorem. A probabilistic existence argument.
3.3.2 Central limit theorem
3.4 Lecture 16 (7/Nov): Application of CLT to Gale-Berlekamp. Khintchine-Kahane. Moment generating functions
3.4.1 Gale-Berlekamp game
3.4.2 Moment generating functions, Chernoff bound for general distributions
3.5 Lecture 17 (9/Nov): Johnson-Lindenstrauss embedding ℓ2 → ℓ2
3.5.1 Normed spaces
3.5.2 JL: the original method
3.5.3 JL: a similar, and easier to analyze, method
3.6 Lecture 18 (12/Nov): cont. JL embedding; Bourgain embedding
3.6.1 cont. JL
3.6.2 Bourgain embedding X → Lp, p ≥ 1
3.6.3 Embedding into L1
3.7 Lecture 19 (14/Nov): cont. Bourgain embedding
3.7.1 cont. Bourgain embedding: L1
3.7.2 Embedding into any Lp, p ≥ 1
3.7.3 Aside: Hölder’s inequality

4 Limited independence
4.1 Lecture 20 (16/Nov): Pairwise independence, Shannon coding theorem again, second moment inequality
4.1.1 Improved proof of Shannon’s coding theorem using linear codes
4.1.2 Pairwise independence and the second-moment inequality
4.2 Lecture 21 (19/Nov): G(n, p) thresholds
4.2.1 Threshold for H as a subgraph in G(n, p)
4.2.2 Most pairs independent: threshold for K4 in G(n, p)
4.3 Lecture 22 (21/Nov): Concentration of the number of prime factors; begin Khintchine-Kahane for 4-wise independence
4.3.1 4-wise independent random walk
4.4 Lecture 23 (26/Nov): Cont. Khintchine-Kahane for 4-wise independence; begin MIS in NC
4.4.1 Paley-Zygmund: solution through an in-probability bound
4.4.2 Berger: a direct expectation bound
4.4.3 cont. proof of Theorem 73
4.4.4 Maximal Independent Set in NC
4.5 Lecture 24 (28/Nov): Cont. MIS, begin derandomization from small sample spaces
4.5.1 Cont. MIS
4.5.2 Descent Processes
4.5.3 Cont. MIS
4.5.4 Begin derandomization from small sample spaces
4.6 Lecture 25 (30/Nov): Limited linear independence, limited statistical independence, error correcting codes
4.6.1 Generator matrix and parity check matrix
4.6.2 Constructing C from M
4.6.3 Proof of Thm (87) Part (1): Upper bound on the size of k-wise independent sample spaces
4.6.4 Back to Gale-Berlekamp
4.6.5 Back to MIS

5 Lovász local lemma
5.1 Lecture 26 (3/Dec): The Lovász local lemma
5.2 Lecture 27 (5/Dec): Applications and further versions of the local lemma
5.2.1 Graph Ramsey lower bound
5.2.2 van der Waerden lower bound
5.2.3 Heterogeneous events and infinite dependency graphs
5.3 Lecture 28 (7/Dec): Moser-Tardos branching process algorithm for the local lemma

Bibliography

Chapter 1

Some basic probability theory

1.1 Lecture 1 (3/Oct): Appetizers


1. Measure the length of a long string coiled under a glass tabletop. You have an ordinary rigid
ruler (longer than the sides of the table).
2. N gentlemen check their hats in the lobby of the opera, but after the performance the hats are
handed back at random. How many men, on average, get their own hat back?
3. The coins-on-dots problem: On the table before us are 10 dots, and in our pocket are 10
nickels. Prove the coins can be placed on the table (no two overlapping) in such a way that all
the dots are covered.

4. Birthday Paradox. I just remind you of this: a class of 23 students has better than even odds
of some common birthday. (Supposing birthdates are uniform on 365 possibilities.) The exact
calculation is

    Pr(some common birthday) = 1 − (365 · 364 · · · 343)/365^23 ≈ 0.507297

but a better way to understand this is that the number of ways this can happen is (k choose 2) for
k students; so long as these events don’t start heavily overlapping, we can almost add their
probabilities (which are each just 1/365). We’ll be more formal about the upper and lower
bounds soon.
5. The envelope swap paradox: You’re on a TV game show and the host offers you two identical-
looking envelopes, each of which contains a check in your name from the TV network. You
pick whichever envelope you like and take it, still unopened.
Then the host explains: one of the checks is written for a sum of $N (N > 0), and the other is
for $10N. Now, he says, it’s 50-50 whether you selected the small check or the big one. He’ll
give you a chance, if you like, to swap envelopes. It’s a good idea for you to swap, he explains,
because your expected net gain is (with $m representing the sum currently in hand):

E(gain) = (1/2)(10m − m) + (1/2)(m/10 − m) = (81/20)m

How can this be?

6. Consider a certain society in which parents prefer female offspring. Can a couple increase
their expected fraction of daughters by halting reproduction after the first daughter?


Let’s just make explicit here that we are not using advanced medical technologies. That is
to say, the couple can control whether they create a pregnancy, but no other property of the
fetus.
Before moving on we note that this is closely related to a famous problem which we will
return to: the gambler’s ruin problem. A gambler starts with $1 in his pocket and repeatedly
risks $1 on a fair coin toss, until he goes broke. He is very likely to go broke, right? Indeed if
he sticks around indefinitely, he will go broke with probability 1. But that is exactly equivalent
(when boys and girls are equiprobable) to the event of a sufficiently large family having an
excess of girls over boys.
Why are our intuitions so opposite in these two cases? It has to do with the fact that we
clearly internalize the finiteness of the family size, whereas we can easily imagine the gambler
addictively playing for an extraordinarily long time. So his high-probability doom impresses
us. If we decide in advance that we will stop him after one million plays, whether or not
he has stopped himself by that time, then his expected wealth at that time is equal to $1,
even though he has almost certainly lost his $1 and gone broke; there’s a small chance he has
earned a lot of money.
7. Unbalancing lights: You’re given an n × n grid of lightbulbs. For each bulb, at position (i, j),
there is a switch b_ij; there is also a switch r_i on each row and a switch c_j on each column. The
(i, j) bulb is lit if b_ij + r_i + c_j is odd.
What is the greatest f(n) such that for any setting of the b_ij’s, you can set the row and column
switches to light at least n²/2 + f(n) bulbs?

Now, we haven’t yet defined either random variables or expectations, but I think you likely already
have a feel for these concepts, so let’s see how linearity of expectation already resolves several of
our appetizers. If you’re not sure how to be rigorous about this, no worries, we’ll proceed more
methodically in the next lecture.

(1): Let the tabletop be the rectangle [−a, a] × [−b, b]. Set r = √(a² + b²). Choose θ uniformly
in [0, π ) and z uniformly in [−r, r ]. Lay the ruler along the affine line of points ( x, y) satisfying
x cos θ + y sin θ = z. Count the number of times the ruler crosses the string.
Since we have in mind a physical string, mathematically we can model it as differentiable, and
therefore the number of intersections is equal to the limit in which we decompose the string into
short straight segments. A ruler can intersect such a segment only 0 or 1 times (apart from a
probability 0 event of aligning perfectly). Observe that our process with the ruler is such that
no matter where a straight segment lies on the table, the probability of the ruler intersecting it is
proportional to its length.
Applying linearity of expectation, we conclude that the total length is proportional to the expected
number of intersections. We skip calculating the constant of proportionality.
(2): The probability that each gentleman gets his hat back is 1/N. So the expected number of his
hats that he receives (this can be only 0 or 1) is 1/N. By linearity of expectation, the expected
number of hats restored to their proper owners overall is ∑_{1}^{N} 1/N = 1. (Note, if you know about
independence of random events, the events corresponding to success of the various gentlemen are
not independent! But that doesn’t matter, since we are only adding expectations.)
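
A quick numerical sanity check of this appetizer (a minimal Python sketch, not part of the original notes; the values of N are arbitrary): hand the hats back according to a uniformly random permutation and average the number of fixed points over many trials. The empirical mean is close to 1 for every N.

    import random

    def average_own_hats(N, trials=100000):
        # One trial: a uniformly random permutation of the N hats;
        # gentleman i gets his own hat back exactly when the permutation fixes i.
        total = 0
        for _ in range(trials):
            hats = list(range(N))
            random.shuffle(hats)
            total += sum(1 for i in range(N) if hats[i] == i)
        return total / trials

    for N in (2, 10, 100):
        print(N, average_own_hats(N))   # each average is close to 1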


1.2 Lecture 2 (5/Oct) Some basics

1.2.1 Measure
Frequently one can “get by” with a naı̈ve treatment of probability theory: you can treat random
variables quite intuitively so long as you maintain Bayes’ law for conditional probabilities of events:

    Pr(A_1 | A_2) = Pr(A_1 ∩ A_2) / Pr(A_2)

However, that’s not good enough for all situations, so we’re going to be more careful, and me-
thodically answer the question, “what is a random variable?” (For a philosophical and historical
discussion of this question see Mumford in [70].)
First we need measure spaces. Let’s start with some standard examples.

1. Z with the counting measure.


2. R with the Lebesgue measure, i.e., the measure (general definition momentarily) in which
intervals have measure proportional to their length: µ([ a, b]) = b − a for b ≥ a.
3. [0, 1] with the Lebesgue measure.

4. A finite set with the uniform probability measure.

As we see, a measure µ assigns a real number to (some) subsets of a universe; if, as in the last two
examples, we also have
µ(universe) = 1 (1.1)
then we say the measure space is a probability space or a sample space.
Let’s see what are the formal properties we want from these examples.
As we just hinted, we don’t necessarily assign a measure to all subsets of the universe; only to the
measurable sets. In order to make sense of this, we need to define the notion of a σ-algebra (also
known as a σ-field).
A σ-algebra ( M, M̃ ) is a set M along with a collection M̃ of subsets of M (called the measurable sets)
which satisfy: (1) ∅ ∈ M̃, and (2) M̃ is closed under complement and countable intersection.
It follows also that M ∈ M̃ and M̃ is closed under countable union (de Morgan). By induction this
gives a stability property: we can take any finite sequence of the form, a countable union of countable
intersections of . . . of countable unions of measurable sets, and the result will be a measurable set.
A measure space is a σ-algebra ( M, M̃ ) together with a measure µ, which is a function

µ : M̃ → [0, ∞] (1.2)

that is countably additive, that is, for any pairwise disjoint S_1, S_2, . . . ∈ M̃,

    µ(⋃ S_i) = ∑ µ(S_i).    (1.3)

So, (1.2) and (1.3) give us a measure space, and if we also assume (1.1) then we have a probability
space.
Let us see some properties of measure spaces:
I. µ(∅) = 0 since µ(∅) + µ(∅) = µ(∅ ∪ ∅) = µ(∅).


II. The modular identity µ(S) + µ( T ) = µ(S ∩ T ) + µ(S ∪ T ) holds because necessarily S − T, T − S
and S ∩ T are measurable, and both sides of the equation may be decomposed into the same linear
combination of the measures of these sets. (The set S − T is S ∩ (¬ T ).) This identity is sometimes
also called the lattice or valuation property.
III. From the modular identity and nonnegativity, S ⊆ T ⇒ µ(S) ≤ µ( T ).

1.2.2 Measurable functions, random variables and events


A measurable function is a mapping X from one measure space, say ( M1 , M̃1 , µ1 ), into another, say
( M2 , M̃2 , µ2 ), such that pre-images of measurable sets are measurable, that is to say, if T ∈ M̃2 , then
X −1 ( T ) ∈ M̃1 .
If M1 is a probability space we call X a random variable.
The range of the random variable, M2 , can be many things, for example:

• M2 = R, with the σ-field consisting of rays ( a, ∞), rays [ a, ∞), and any set formed out of these
by closing under the operations of complement and countable union. (In CS language, any
other measurable set is formed by a finite-depth formula whose leaves are rays of the afore-
mentioned type, and each internal node is either a complementation or a countable union.)
Sometimes it is convenient to use the “extended real line,” the real line with ∞ and −∞
adjoined, as the base set.
• M2 = names of people eligible for a draft which is going to be implemented by a lottery. The
σ-field here is 2 M2 , namely the power set of M2 .

• M2 = deterministic algorithms for a certain computational problem, with the counting mea-
sure. On a countably infinite set M2 , just as on a finite set, we can use the power set as the
σ-field. The counting measure assigns to S ⊆ M2 its cardinality |S|.

Events
With any measurable subset T of M2 we associate the event X ∈ T; if X is understood, we simply
call this the event T. This event has the probability Pr( X ∈ T ) (or if X is understood, Pr( T )) dictated
by

Pr( X ∈ T ) = µ1 ( X −1 ( T )). (1.4)

The indicator of this event is the function ⟦T⟧ or I_T,

    I_T : M_1 → {0, 1} ⊆ R
    I_T(y) = 1 if y ∈ X^{−1}(T), and 0 otherwise.

The basic but key property is that

    Pr(X ∈ T) = ∫ I_T dµ = E(I_T).    (1.5)

It follows that probabilities of events satisfy:

1. Pr(∅) = 0 (“the experiment has an outcome”)


2. Pr( M2 ) = 1 (“the experiment has only one outcome”)


3. Pr( A) ≥ 0
4. Pr( A) + Pr( B) = Pr( A ∩ B) + Pr( A ∪ B)

Note that events can themselves be thought of as random variables taking values in {0, 1}; indeed
we will sometimes define an event directly, rather than creating it out of some other random variable
X and a subset T of the image of X.
For the most part we will sidestep measure theory—one needs it to cure pathologies but we will be
studying healthy patients. However I recommend Adams and Guillemin [3] or Billingsley [16].
Often when studying probability one may suppress any mention of the sample space in favor of
abstract axioms of probability. For us the situation will be quite different. While starting out as a
formality, explicit sample spaces will soon play a significant role.
Joint distributions
Given two random variables X1 : M → M1 , X2 : M → M2 (where each Mi has associated with it a
σ-field ( Mi , M̃i )), we can form the “product” random variable ( X1 , X2 ) : M → M1 × M2 . The same
goes for any countable collection of rvs on M, and it is important that we can do this for countable
collections; for instance we want to be able to discuss unbounded sequences of coin tosses. Given a
product rv
( X1 , X2 , . . .) : M → M1 × M2 × . . . ,
its marginals are probability distributions on each of the measure spaces Mi . These distributions are
defined by, for A ∈ M̃i ,

    Pr(X_i ∈ A) = Pr((X_1, X_2, . . .) ∈ M_1 × M_2 × . . . × M_{i−1} × A × M_{i+1} × . . .).

That is, you simply ignore what happens to the other rvs, and assign to set A ∈ M̃i the probability
µ( Xi−1 ( A)).
X1 , X2 , . . . are independent if for any finite S = {s1 , . . . , sn } and any As1 ∈ M̃s1 , . . . , Asn ∈ M̃sn , we
have Pr(( Xs1 , . . . , Xsn ) ∈ As1 × · · · × Asn ) = Pr( Xs1 ∈ As1 ) · · · Pr( Xsn ∈ Asn ).
(Note that Pr(( X1 , X2 ) ∈ A1 × A2 ) is just another way of writing Pr(( X1 ∈ A1 ) ∧ ( X2 ∈ A2 )).)
Example: a pair of fair dice.
Let M be the set of 36 ways in which two dice can roll, each outcome having probability 1/36. On
this sample space we can define various useful functions: e.g., Xi = the value of die i (i = 1, 2);
Y = X1 + X2 . X1 and X2 are independent; X1 and Y are not independent.
X1 , . . . : M → T are independent and identically distributed (iid) if they are independent and all
marginals are identical. If T is finite and the marginals are the uniform distribution, we say that
the rv’s are uniform iid. We use the same terminology in case T is infinite but of finite measure
(e.g., Lebesgue measure on a compact set), and the marginal on T is the probability distribution
proportional to this measure on T.
Conditional Probabilities are defined by

    Pr(X ∈ A | X ∈ B) = Pr(X ∈ A ∩ B) / Pr(X ∈ B)    (1.6)
provided the denominator is positive.
An old example. You meet Mr. Smith and find out that he has exactly two children, at least one of
which is a girl. What is the probability that both are girls? Answer1 : 1/3.
1 As usual in such examples we suppose that the sexes of the children are uniform iid. Some facts from general knowledge

should be enough for you to doubt both uniformity and independence.
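
A tiny enumeration (a Python sketch, not in the original notes) of the Mr. Smith example above: under the uniform iid model, condition on “at least one girl” and compute the fraction of remaining outcomes with two girls.

    from fractions import Fraction
    from itertools import product

    families = list(product("BG", repeat=2))                  # BB, BG, GB, GG, equally likely
    at_least_one_girl = [f for f in families if "G" in f]
    both_girls = [f for f in at_least_one_girl if f == ("G", "G")]
    print(Fraction(len(both_girls), len(at_least_one_girl)))  # prints 1/3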


Taking (1.6) and applying induction, we have that if Pr(A_1 ∩ · · · ∩ A_n) > 0, then:
Chain rule for conditional probabilities

    Pr(A_1 ∩ · · · ∩ A_n) = Pr(A_n | A_1 ∩ · · · ∩ A_{n−1}) · Pr(A_{n−1} | A_1 ∩ · · · ∩ A_{n−2}) · · · Pr(A_2 | A_1) · Pr(A_1).

(If Pr(A_1 ∩ · · · ∩ A_n) = 0 then because of the denominators, some of the conditional probabilities in
the chain might not be well defined, but you can say that either Pr(A_1) = 0 or there is some i s.t.
Pr(A_i | A_1 ∩ · · · ∩ A_{i−1}) = 0.)
Real-valued random variables; expectations
If X is a real-valued rv on a sample space with measure µ, its expectation (aka average, mean or
first moment) is given by the following integral
    E(X) = ∫ X dµ

which is defined in the Lebesgue manner by²

    ∫ X dµ = lim_{h→0} ∑_{integer j, −∞<j<∞} jh · Pr(jh ≤ X < (j+1)h)    (1.7)

provided this limit converges absolutely, which means that

    sup_{h>0} ∑_{integer j, −∞<j<∞} |jh| · Pr(jh ≤ X < (j+1)h) < ∞.    (1.8)

It is not hard to innocently encounter cases where the integral is not defined. Stand a meter from an
infinite wall, holding a laser pointer. Spin so you’re pointing at a uniformly random orientation. If
the laser pointer is not shining at a point on the wall (which happens with probability 1/2), repeat
until it does. The displacement of the point you’re pointing at, relative to the point closest to you
on the wall, is tan α meters for α uniformly distributed in (−π/2, π/2). You could be forgiven for
thinking the average displacement “ought” to be 0, but the integral does not converge absolutely,
because ∫_0^{π/2} tan α dα = −∫_{cos(0)}^{cos(π/2)} (1/x) dx = −[log x]_{cos(0)}^{cos(π/2)} = +∞, using the substitution x = cos α.
To see the kind of problem that this can create, consider that for an integration definition to make
sense, we ought to have the property that if lim a_m = a and lim b_m = b, then lim ∫_{a_m}^{b_m} f(α) dα =
∫_a^b f(α) dα. But in the present circumstance we can, for instance, take a_m = −arccos(1/m), b_m =
arccos(1/(2m)), and then ∫_{a_m}^{b_m} tan α dα = −log m + log 2m = log 2 (rather than 0).
Nonnegative integrands. As we see in (1.8), it is essential to be able to characterize whether an
integral of a nonnegative function converges. (That equation is a discretization of ∫ |X| dµ.)
It is worth pointing out that for a probability measure µ supported on the nonnegative integers,
∫ x dµ(x) = ∑_{n≥1} n·µ({n}) = ∑_{n≥1} µ({n, n+1, . . .}). So the integral converges iff the sequence µ({n, n+1, . . .})
has a finite sum. Exercise: State and verify the analogous statement when µ is supported on the
nonnegative reals.
2 One can be more scrupulous about the measure theory; see the suggested references. But he knew little out of his way, and

was not a pleasing companion; as, like most great mathematicians I have met with, he expected universal precision in everything said, or
was for ever denying or distinguishing upon trifles, to the disturbance of all conversation. He soon left us.
The Autobiography of Benjamin Franklin, chapter 5


1.3 Lecture 3 (8/Oct): Linearity of expectation, union bound, existence theorems
Let’s return to one of our appetizers, the coins-on-dots problem (3): I don’t want to give this away
entirely, but here’s a hint: what is the fraction of the plane covered by unit disks packed in a
hexagonal pattern?

1.3.1 Countable additivity


Back to the theory.
If we have two real-valued rvs X, Y on the same sample space, we can form their sum rv X + Y. No
matter the joint distribution of X and Y, we have, providing their expectations are well defined:

E ( X + Y ) = E ( X ) + E (Y ) linearity of expectation

for the simple reason that expectation is a first moment. You have only to verify:
Exercise: Absolute convergence of ∫ X dµ and ∫ Y dµ implies absolute convergence of ∫ (X + Y) dµ.
Because ∫ |X + Y| dµ ≤ ∫ (|X| + |Y|) dµ < ∞.

In the nonnegative case we have also countable additivity:


Exercise: Let X1 , . . . be nonnegative real-valued with expectations E( Xi ). Then

E ( ∑ Xi ) = ∑ E ( Xi ) .

1.3.2 Coupon collector


There are n distinct types of coupons and you want to own the whole set. Each draw is uniformly
distributed, no matter what has happened earlier. What is the expected time to elapse until you
own the set?
Think of the coupons being sampled at times 1, 2, . . .. Let Yi = the first time at which we are in state
Si , which is when we have seen exactly i different kinds of coupons (i = 0, . . . , n). So Y0 = 0, Y1 = 1.
Let Xi = Yi − Yi−1 . In state Si−1 , in each round there is probability (n − i + 1)/n that we see a
new kind of coupon, until that finally happens. That is to say, Xi is geometrically distributed with
parameter pi = (n − i + 1)/n. We can work out E( Xi ) from the geometric sum, but there’s a slicker
way.
If we’re in state Si−1 , then with probability (n − i + 1)/n we’re in Si in one more time step, else
we’re back in the same situation.

[Diagram (1.9): the chain S_0 → S_1 → S_2 → · · · → S_{n−1} → S_n, with forward transition probabilities 1, 1−1/n, 1−2/n, . . . , 2/n, 1/n and self-loop probabilities 1/n, 2/n, . . . , 1−1/n at the intermediate states.]

So

    E(X_i) = 1 + ((n−i+1)/n) · 0 + ((i−1)/n) · E(X_i)    (1.10)
    ((n−i+1)/n) · E(X_i) = 1    (1.11)
    E(X_i) = n/(n−i+1)    (1.12)


Now we have:

    E(Y_n) = ∑_{i=1}^n E(X_i)
           = ∑_{i=1}^n n/(n−i+1)
           = n ∑_{i=1}^n 1/i
           = n·H_n
           = n(log n + O(1))
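
A short simulation sketch in Python (an illustration I am adding, with arbitrary parameters): draw uniform coupons until all n types appear and compare the empirical mean of Y_n with n·H_n.

    import random

    def coupon_time(n):
        # Number of uniform draws needed until all n coupon types have been seen.
        seen = set()
        draws = 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            draws += 1
        return draws

    n, trials = 50, 2000
    empirical = sum(coupon_time(n) for _ in range(trials)) / trials
    n_H_n = n * sum(1 / i for i in range(1, n + 1))
    print(empirical, n_H_n)   # the two values agree to within a few percent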

1.3.3 Application: the probabilistic method


A tournament of size n is a directed complete graph. We may think of a tournament T equivalently
as a skew-symmetric mapping T : [n] × [n] → {1, 0, −1} that is 0 only on the diagonal.3
A Hamilton path in a tournament (or a digraph more generally) is a directed simple path through
all the vertices.

Lemma 1 There exists a tournament with at least n! 2^{−n+1} Hamilton paths.

This certainly isn’t true for all tournaments—as an extreme case, the totally ordered tournament
has only one H-path.
Proof: This is an opportunity to consider a nice random variable: the random tournament. You
simply fix n vertices, and direct each edge between them uniformly iid.
Any particular permutation of the vertices has probability 2^{−n+1} of being a H-path, so the expectation
of the indicator rv for this event is 2^{−n+1}. The indicator rvs are far from independent, but
anyway, by linearity of expectation, the expected number of H-paths is n! 2^{−n+1}. So some tournament
has at least this many H-paths. 2
Exercise: explicit construction.
Describe a specific tournament with n! (2 + o(1))^{−n} Hamilton paths.
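
For small n one can check Lemma 1’s expectation by brute force. The following Python sketch (not from the notes; n = 7 and the number of trials are arbitrary) samples random tournaments, counts directed Hamilton paths by enumerating permutations, and compares the average with n!·2^{−n+1}.

    import random
    from itertools import permutations
    from math import factorial

    def random_tournament(n):
        # beats[i][j] is True iff the edge between i and j is directed i -> j.
        beats = [[False] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if random.random() < 0.5:
                    beats[i][j] = True
                else:
                    beats[j][i] = True
        return beats

    def count_hamilton_paths(beats):
        n = len(beats)
        return sum(all(beats[p[i]][p[i + 1]] for i in range(n - 1))
                   for p in permutations(range(n)))

    n, trials = 7, 200
    avg = sum(count_hamilton_paths(random_tournament(n)) for _ in range(trials)) / trials
    print(avg, factorial(n) * 2 ** (-n + 1))   # expectation is 5040/64 = 78.75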

1.3.4 Union bound


Pr( A ∪ B) = Pr( A) + Pr( B) − Pr( A ∩ B) ≤ Pr( A) + Pr( B)
The bound applies also to countable unions:
Lemma 2 Pr(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ Pr(A_i).

Proof: First note that by induction the bound applies to any finite union. Now, if the right-hand
side is at least 1, the result is immediate. If not, consider any counterexample; since the sequences
Pr(⋃_{i=1}^k A_i) and ∑_{i=1}^k Pr(A_i) each monotonically converge to their respective limits, there is a finite
k for which Pr(⋃_{i=1}^k A_i) > ∑_{i=1}^k Pr(A_i). Contradiction. 2

Later in the lecture we’ll use the following which, while trivial, has the whiff of assigning a value
to ∞/∞:
3 We frequently use the notation [n] = {1, . . . , n}.


Corollary 3 If a countable list of events A_1, . . . all satisfy Pr(A_i) = 0, then Pr(⋃ A_i) = 0. Likewise if for
all i, Pr(A_i) = 1, then Pr(⋂ A_i) = 1.

Now let’s revisit the birthday paradox (4). For a year of n days and a class of r students, let the rv
B = the number of pairs of students who share a birthday.

    E(B) = (r choose 2) · (1/n)

which suggests that the probability of some joint birthday may be a constant once r is large enough
that r ∼ √n. We can easily verify the correctness of one side of this claim. The event of there
being some common birthday is [B > 0]. With B_ij being the event that students i, j share a birthday,
[B > 0] = ⋃_{i<j} B_ij, so by the union bound,

    Pr(B > 0) ≤ (r choose 2) · (1/n).

This shows that r ∈ o(√n) ⇒ Pr(B > 0) ∈ o(1).
The converse holds, too; fundamentally this is because there is not much overlap in the sample
space between the (r choose 2) different events. We postpone this for now but will show below how to carry
out this argument.
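
A few lines of Python (an added illustration; the class sizes are arbitrary) comparing the exact probability with the union-bound estimate (r choose 2)/n:

    from math import comb

    def birthday_exact(r, n=365):
        # Pr(some shared birthday) = 1 - (n/n) * ((n-1)/n) * ... * ((n-r+1)/n)
        p_distinct = 1.0
        for i in range(r):
            p_distinct *= (n - i) / n
        return 1 - p_distinct

    for r in (10, 23, 40):
        print(r, round(birthday_exact(r), 4), round(comb(r, 2) / 365, 4))
    # r = 23 gives the exact value 0.5073 from the appetizer; the union bound
    # (r choose 2)/n is an upper bound, and is reasonably tight while r << sqrt(n).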

1.3.5 Using the union bound in the probabilistic method: Ramsey theory
Theorem 4 (Ramsey [74]) Fix any nonnegative integers k, ℓ. There is a finite “Ramsey number” R(k, ℓ)
such that every graph on R(k, ℓ) vertices contains either a clique of size k or an independent set of size ℓ.
Specifically, R(k, ℓ) ≤ (k+ℓ−2 choose ℓ−1).

(The finiteness is due to Ramsey [74] and the bound to Erdös and Szekeres [31].)
Numerous generalizations of Ramsey’s argument have since been developed—see the book [44].
Proof: (of Theorem (4)) This is outside our main line of development but we include it for completeness.
First, R(k, 1) = R(1, k) = 1 = (k−1 choose 0). Now if k, ℓ > 1, consider a graph with R(k, ℓ−1) +
R(k−1, ℓ) vertices and pick any vertex v. Let V_Y denote the vertices connected to v by an edge,
and let V_N denote the remaining vertices. Either |V_N| ≥ R(k, ℓ−1) or |V_Y| ≥ R(k−1, ℓ).
If |V_N| ≥ R(k, ℓ−1) then either the graph spanned by V_N contains a k-clique or the graph spanned
by V_N ∪ {v} contains an independent set of size ℓ.
On the other hand if |V_Y| ≥ R(k−1, ℓ) then either the graph spanned by V_Y ∪ {v} contains a
k-clique or the graph spanned by V_Y contains an independent set of size ℓ.
Applying this argument and induction on k + ℓ, we have: R(k, ℓ) ≤ R(k, ℓ−1) + R(k−1, ℓ) ≤
(k+ℓ−3 choose ℓ−2) + (k+ℓ−3 choose ℓ−1) = (k+ℓ−2 choose ℓ−1). (The final equality counts subsets of [k+ℓ−2] of size ℓ−1 according
to whether the first item is selected.)
If you apply Stirling’s approximation, this gives the bound R(k, k) ≤ (2k−2 choose k−1) ∈ O(4^k/√k).
In the intervening nearly-a-century there have been some improvements on this bound, first by
Rödl [43], then by Thomason [85], and most recently by Conlon [20] to R(k, k) ≤ 4^k k^{−Ω(log k / log log k)}.
What we use the union bound for is to show a converse:
Theorem 5 (Erdös [28]) If (n choose k) < 2^{(k choose 2)−1} then R(k, k) > n. Thus R(k, k) ≥ (1 − o(1)) (k/(e√2)) 2^{k/2}.


This leaves an exponential gap. Actually this gap is small by the standards of Ramsey theory. The
gap has been slightly tightened since Erdös’s work, as we will show later in the course, but remains
exponential, and is a major open problem in combinatorics.
Proof: (of Theorem (5)) This is an opportunity to introduce one of the most-studied random vari-
ables in combinatorics, the random graph G (n, p), in which each edge is present, independently,
with probability p. Among other things, people use this model to study threshold phenomena for
many properties such as connectivity, appearance of a Hamilton cycle, etc.
For the lower bound on R(k, k) we use G(n, 1/2). Any particular subset of k vertices has probability
2^{1−(k choose 2)} of forming either a clique or an independent set. Take a union bound over all subgraphs. 2
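
To see what Theorem 5 gives numerically, here is a small Python sketch (illustrative; the values of k are arbitrary) that finds the largest n with (n choose k)·2^{1−(k choose 2)} < 1; every such n satisfies R(k, k) > n.

    from math import comb

    def erdos_ramsey_lower(k):
        # Largest n with comb(n, k) * 2^(1 - k(k-1)/2) < 1; the union bound then
        # leaves positive probability that G(n, 1/2) has no k-clique and no
        # independent set of size k, so R(k, k) > n.
        c = 2.0 ** (1 - k * (k - 1) // 2)
        n = k
        while comb(n + 1, k) * c < 1:
            n += 1
        return n

    for k in (5, 10, 15, 20):
        print(k, erdos_ramsey_lower(k))
    # The outputs grow roughly like 2^(k/2), in line with the theorem.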


1.4 Lecture 4 (10/Oct): Upper and lower bounds

1.4.1 Bonferroni inequalities


The union bound is a special case of the Bonferroni inequalities:
Let A_1, . . . , A_n be events in some probability space, and A_i^c their complements. For S ⊆ [n] let
A_S = ⋂_{i∈S} A_i.
For 0 ≤ j ≤ n let ([n] choose j) denote the set of subsets of [n] of cardinality j.

Lemma 6 For j ≥ 1 let (see Fig. 1.1):

    m_j = ∑_{S ∈ ([n] choose j)} Pr(A_S)

    M_k = ∑_{j=1}^k (−1)^{j+1} m_j = ∑_{j=1}^k (−1)^{j+1} ∑_{J ⊆ [n], |J| = j} Pr(A_J)

[Figure 1.1: m_2 (left), M_2 (right).]

Then:

    M_2, M_4, . . . ≤ Pr(⋃ A_i) ≤ M_1, M_3, . . .

Moreover, Pr(⋃ A_i) = M_n; this is known as the inclusion-exclusion principle.

Comment: Often, but not always, larger values of k give improved bounds. See the problem set.
Proof: The sample space is partitioned into 2^n measurable sets

    B_S = (⋂_{i∈S} A_i) ∩ (⋂_{i∉S} A_i^c).

Note that A_S = ⋃_{S⊆T} B_T, which, since the B_T’s are disjoint, gives Pr(A_S) = ∑_{S⊆T} Pr(B_T).

    m_j = ∑_{S ∈ ([n] choose j)} Pr(A_S) = ∑_{S ∈ ([n] choose j)} ∑_{T ⊇ S} Pr(B_T) = ∑_T Pr(B_T) (|T| choose j)


    M_k = ∑_{j=1}^k (−1)^{j+1} m_j
        = ∑_T Pr(B_T) ∑_{j=1}^k (−1)^{j+1} (|T| choose j)
        = ∑_{T≠∅} Pr(B_T) ∑_{j=1}^k (−1)^{j+1} (|T| choose j)        because (0 choose j) = 0 for j ≥ 1.

Observe Pr(⋃ A_i) = 1 − Pr(B_∅) = ∑_{T≠∅} Pr(B_T). So

    M_k − Pr(⋃ A_i) = ∑_{T≠∅} Pr(B_T) ∑_{j=0}^k (−1)^{j+1} (|T| choose j)

where we have inserted the needed −Pr(B_T) for T ≠ ∅ by starting the internal summation from
j = 0.
The inequalities now follow from the claim that for t ≥ 1,

    ∑_{j=0}^k (−1)^{j+1} (t choose j)   is  = 0 for k ≥ t,  ≥ 0 for k odd,  ≤ 0 for k even.    (1.13)

(For the inclusion-exclusion principle, note that once k ≥ n, all t fall into the first category.)
The first line follows by expanding (1 − 1)^t (and noting that all terms t < j ≤ k have (t choose j) = 0).
For the remaining two lines we use the identity

    (t choose j) − (t choose j−1) = (t−1 choose j) − (t−1 choose j−2)    (1.14)

(which holds for t, j ≥ 1 with the interpretation (a choose b) = 0 for a ≥ 0, b < 0).
Therefore when we group adjacent pairs j in the summation on the LHS of (1.13) (that is, {k, k −
1}, {k − 2, k − 3}, etc., with 0 unpaired for k even), we obtain a telescoping sum, and so we have

    For k odd:  ∑_{j=0}^k (−1)^{j+1} (t choose j) = (t−1 choose k) − (t−1 choose −1) = (t−1 choose k) ≥ 0

    For k even: ∑_{j=0}^k (−1)^{j+1} (t choose j) = −(t−1 choose k) + (t−1 choose 0) − (t choose 0) = −(t−1 choose k) ≤ 0

2
Comment: inclusion-exclusion is a special case of what is known in order theory as Möbius inver-
sion.
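
Here is a small numerical check of Lemma 6 (a Python sketch I am adding; the sample space, the number of events and their sizes are arbitrary): it computes each M_k directly from the definition for a few random events over a uniform finite sample space and compares with the exact Pr(⋃ A_i).

    import random
    from itertools import combinations

    random.seed(1)
    universe = list(range(30))                  # uniform sample space of 30 points
    n = 4
    events = [set(random.sample(universe, 12)) for _ in range(n)]   # A_1, ..., A_n

    def prob(subset):
        return len(subset) / len(universe)

    exact = prob(set.union(*events))            # Pr(A_1 u ... u A_n)

    for k in range(1, n + 1):
        # M_k = sum_{j=1}^{k} (-1)^(j+1) sum_{|S|=j} Pr(A_S), with A_S the intersection over S.
        M_k = sum((-1) ** (j + 1)
                  * sum(prob(set.intersection(*(events[i] for i in S)))
                        for S in combinations(range(n), j))
                  for j in range(1, k + 1))
        print(k, round(M_k, 4), round(exact, 4))
    # Odd k give M_k >= exact, even k give M_k <= exact, and k = n gives equality.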

1.4.2 Tail events: Borel-Cantelli


Here is a very fundamental application of the union bound.

Definition 7 Let B = { B1 , . . .} be a countable collection of events. lim sup B is the event that infinitely
many of the events Bi occur.


Lemma 8 (Borel Cantelli I) Let ∑i≥1 Pr( Bi ) < ∞. Then Pr(lim sup B) = 0.

lim sup B is what is called a tail event: a function of infinitely many other events (in this case the
B1 , . . .) that is unaffected by the outcomes of any finite subset of them.
Proof: It is helpful to write lim sup B as

    lim sup B = ⋂_{i≥0} ⋃_{j≥i} B_j.

For every i, lim sup B ⊆ ⋃_{j≥i} B_j, so Pr(lim sup B) ≤ inf_i Pr(⋃_{j≥i} B_j). By the union bound, the latter
is ≤ inf_i ∑_{j≥i} Pr(B_j) = 0. 2

1.4.3 B-C II: a partial converse to B-C I


Lemma 8 does not have a “full” converse.
To show a counterexample, we need to come up with events Bi for which ∑i≥1 Pr( Bi ) = ∞ but
Pr(lim sup B) = 0. Here is an example. Pick a point x uniformly from the unit interval. Let Bi be
the event x < 1/i.
You will notice that in this example the events are not independent. That is crucial, for B-C I does
have the partial converse:

Lemma 9 (Borel Cantelli II) Suppose that B1 , . . . are independent events and that ∑i≥1 Pr( Bi ) = ∞.
Then Pr(lim sup B) = 1.

Proof: We’ll show that (lim sup B)^c, the event that only finitely many B_i occur, occurs with probability 0.
Write (lim sup B)^c = ⋃_{i≥0} ⋂_{j≥i} B_j^c.
By the union bound (Cor. 3), it is enough to show that Pr(⋂_{j≥i} B_j^c) = 0 for all i. Of course, for any
I ≥ i, Pr(⋂_{j≥i} B_j^c) ≤ Pr(⋂_{i≤j≤I} B_j^c).
By independence, Pr(⋂_{i≤j≤I} B_j^c) = ∏_{j=i}^{I} Pr(B_j^c), so what remains to show is that

    For any i,  lim_{I→∞} ∏_{j=i}^{I} Pr(B_j^c) = 0.    (1.15)

(Note the LHS is decreasing in I.)


There’s a classic inequality we often use:

    1 + x ≤ e^x    (1.16)

which follows because the RHS is convex and therefore lies above its tangent line at x = 0, and that
tangent line is exactly the LHS (the two sides agree in value and first derivative there).
Consequently if a finite sequence x_i satisfies ∑ x_i ≥ 1 then ∏(1 − x_i) ≤ 1/e.
Supposing (1.15) is false, fix i for which it fails, let q_i > 0 be the limit of the LHS, and let I be sufficient
that ∏_{j=i}^{I} Pr(B_j^c) ≤ 2q_i. Let I′ be sufficient that ∑_{j=I+1}^{I′} Pr(B_j) ≥ 1. Then ∏_{j=i}^{I′} Pr(B_j^c) ≤ 2q_i/e < q_i.
Contradiction. 2


1.5 Lecture 5 (12/Oct): More on tail events: Kolmogorov 0-1, random walk
A beautiful fact about tail events is Kolmogorov’s famous 0-1 law.

Theorem 10 (Kolmogorov) If Bi is a sequence of independent events and C is a tail event of the sequence,
then Pr(C ) ∈ {0, 1}.

We won’t be using this theorem, and its usual proof requires some measure theory, so I’ll merely
offer a few examples of its application.

Bond percolation

Fix a parameter 0 ≤ p ≤ 1. Start with a fixed infinite, connected, locally finite graph H, for instance
the grid graph Z² (nodes (i, j) and (i′, j′) are connected if |i − i′| + |j − j′| = 1), and form the graph
G by including each edge of the grid in G independently with probability p. “Locally finite” means
the degree of every vertex is finite. The graph is said to “percolate” if there is an infinite connected
component.
Percolation is a tail event (with respect to the events indicating whether each edge is present):
consider the effect of adding or removing just one edge. Now induct on the number of edges added
or removed.
It is easy to see by a coupling argument that Pr(percolation) is monotone nondecreasing in p, as
follows: Instead of choosing just a single bit at each edge e, choose a real number X_e ∈ [0, 1]
uniformly. Include the edge if X_e < p. Now, if p < p′, we can define two random graphs G_p, G_{p′},
each a percolation process with the respective parameter value, and G_p ⊆ G_{p′}.
Due to the 0-1 law, there exists a “critical” p_H such that Pr(percolation) = 0 for p < p_H and
Pr(percolation) = 1 for p > p_H. (See Fig. 1.2.) A lot of work in probability theory has gone into
determining values of p_H for various graphs, and also into figuring out whether Pr(percolation) is
0 or 1 at p_H.

[Figure 1.2: Bond percolation in the 2D square grid: plot of Pr(percolate) (vertical axis, 0 to 1) against p (horizontal axis, 0 to 1).]
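
Finite simulations already show the sharp transition pictured above. The sketch below (Python; the grid size and the list of p values are arbitrary choices of mine) does bond percolation on an n × n grid with union-find and prints the fraction of vertices in the largest cluster; it jumps from near 0 to a substantial fraction as p crosses 1/2, the critical probability for the square grid.

    import random

    def largest_cluster_fraction(n, p):
        # Bond percolation on the n x n grid: keep each edge independently with
        # probability p, then measure the largest component via union-find.
        parent = list(range(n * n))

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]   # path halving
                a = parent[a]
            return a

        def union(a, b):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

        for i in range(n):
            for j in range(n):
                v = i * n + j
                if j + 1 < n and random.random() < p:   # horizontal edge
                    union(v, v + 1)
                if i + 1 < n and random.random() < p:   # vertical edge
                    union(v, v + n)

        counts = {}
        for v in range(n * n):
            r = find(v)
            counts[r] = counts.get(r, 0) + 1
        return max(counts.values()) / (n * n)

    for p in (0.3, 0.45, 0.5, 0.55, 0.7):
        print(p, round(largest_cluster_fraction(200, p), 3))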


Another example of a tail event for bond percolation, this one not monotone, is the event that there
are infinitely many infinite components.
No matter what the underlying graph is, the probability of this event is 0 at p ∈ {0, 1}.

Site percolation

A closely related process is that starting from a fixed infinite, connected, locally finite graph H, we
retain vertices independently with probability p. (And of course we retain an edge if both its vertices
are retained.)
Let N be the random variable representing the number of infinite components of the random
retained graph. Here “number” can be any nonnegative integer or ∞.
It is known under fairly general conditions (particularly, H should be vertex-transitive) that for any
p, exactly one of the following three events has probability 1: N = 0; N = 1; N = ∞. See [71]
for the beginning of this story, and [10] for a survey. (It happens that in the square grid, in any
dimension and for any p, Pr(N = ∞) = 0; as p increases, we go from Pr(N = 0) = 1 to Pr(N = 1) = 1 [6],
and stay there. However, in more “expanding” graphs such as d-regular trees, d > 1, and also other
non-amenable graphs, there can be a phase in the middle with Pr(N = ∞) = 1. See [65, 47].)

Random Walk on Z

Here is another example of a tail event, but this one we can work out without relying on the 0-1
law, and also see which of 0, 1 is the value:
Consider rw on Z that starts at 0 and in every step with probability p goes left, and with probability
1 − p goes right. Let L = the event that the walk visits every x ≤ 0. Let R = the event that the
walk visits every x ≥ 0. Each of L and R is a tail event. So by Theorem 10, for any p, Pr( L) and
Pr( R) lie in {0, 1}. In fact, we will show—without relying on Theorem 10, but relying on Lemma 8
(Borel-Cantelli I)—that:

Theorem 11

• For p < 1/2, Pr( L) = 0 and Pr( R) = 1.

• For p > 1/2, Pr( L) = 1 and Pr( R) = 0. (Obviously this is symmetric to the preceding.)
• For p = 1/2, Pr( L) = Pr( R) = 1.
(Note that if L ∩ R occurs, then the walk must actually visit every point infinitely often. (Suppose not,
and let t be the last time that some site y was visited. Then on one side of y, the point t + 1 steps
away cannot have been visited yet, and will never be visited.) Thus in this case of the theorem, since
Pr( L ∩ R) = 1 by union bound, Pr(every point in Z is visited infinitely often) = 1. The term for this
is that unbiased rw on the line is recurrent.)

Proof: First, no matter what p is, let qy be the probability that the walk ever visits the point y.
Let’s start with the cases p 6= 1/2. The first step of the argument doesn’t depend on the sign of
p − 1/2: Consider any y and let By,t = the event that the walk is at y at time t. The following


calculation shows that for any y, ∑_t Pr(B_{y,t}) < ∞: For t s.t. t = y mod 2, we have

    Pr(B_{y,t}) = (t choose (t−|y|)/2) p^{(t−y)/2} (1 − p)^{(t+y)/2}
               = (t choose (t−|y|)/2) (p(1−p))^{t/2} ((1−p)/p)^{y/2}
               ≤ 2^t (p(1−p))^{t/2} ((1−p)/p)^{y/2}
               = (4p(1−p))^{t/2} ((1−p)/p)^{y/2}

Therefore

    ∑_t Pr(B_{y,t}) ≤ ((1−p)/p)^{y/2} · 1/(1 − √(4p(1−p)))

which is < ∞ for p ≠ 1/2. So by Borel-Cantelli-I (Lemma 8), with probability 1, y is visited only
finitely many times. Then by the union bound, with probability 1 every y is visited only finitely
many times.
Now let’s suppose further that p > 1/2 (i.e., the walk drifts left). Then for any x ∈ Z,

    ∑_{y≥x} ∑_t Pr(B_{y,t}) ≤ ((1−p)/p)^{x/2} · 1/(1 − √((1−p)/p)) · 1/(1 − √(4p(1−p))) < ∞

So we get the even stronger conclusion, again by BC-I, that with probability 1 the walk spends only
finite time in the interval [x, ∞). Since this holds for all x, we get Pr(L) = 1. Plugging in x = 0 gives
Pr(R) = 0.
Applying symmetry, we’ve covered the first two cases of the theorem.
For p = 1/2, the claims Pr( L) = 1 and Pr( R) = 1 are equivalent so let’s focus on the first.
The claim Pr( L) = 1 is equivalent to saying that for any x ≥ 0, with probability 1 the walk reaches
the point − x. This is the same as saying that in the gambler’s ruin problem, no matter what the
initial stake x of the gambler, he will with probability 1 go broke.
For x ≥ 0 let’s write q x = the probability the gambler goes broke from initial stake x. We claim that
q x is harmonic on the nonnegative axis with boundary condition q0 = 1. The harmonic condition
means that on all interior points of the nonnegative axis, which means all x > 0, the function value
is the average of its neighbors:
q x = (q x−1 + q x+1 )/2
That this is so is obvious from the description of the gambler’s ruin process. But this equation
indicates that q x is affine linear on x ≥ 0, because for x ≥ 1 the “discrete second derivative” is 0:

(q x+1 − q x ) − (q x − q x−1 ) = q x+1 − 2q x + q x−1 = 0

However, the function q x is also bounded in [0, 1]. So it can only be a constant function, agreeing
with its boundary value q0 = 1. 2
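
A simulation sketch of the dichotomy in Theorem 11 (Python; the parameters x, T and the trial counts are arbitrary): estimate the probability that the walk reaches −x within T steps. For p = 1/2 the estimate keeps climbing toward 1 as T grows (recurrence); for p < 1/2 (drift to the right) it levels off below 1.

    import random

    def hits_minus_x(p_left, x, T):
        # Walk started at 0: step -1 with probability p_left, else +1, for at most T steps.
        pos = 0
        for _ in range(T):
            pos += -1 if random.random() < p_left else 1
            if pos == -x:
                return True
        return False

    def estimate(p_left, x, T, trials=1000):
        return sum(hits_minus_x(p_left, x, T) for _ in range(trials)) / trials

    x = 5
    for T in (100, 1000, 5000):
        print(T, estimate(0.5, x, T), estimate(0.45, x, T))
    # p = 1/2: the estimates increase toward 1 as T grows.
    # p = 0.45: they level off near (p/(1-p))^x = (0.45/0.55)^5, about 0.37.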


1.6 Lecture 6 (15/Oct): More probabilistic method

1.6.1 Markov inequality (the simplest tail bound)

Lemma 12 Let A be a non-negative random variable with finite4 expectation µ1 . Then for any λ ≥ 1,
Pr( A > λµ1 ) < 1/λ. In particular, for µ1 = 0, Pr( A > µ1 ) = 0.

(Of course the lemma holds trivially also for 0 < λ < 1.)
Proof:
If Pr( A ≥ λµ1 ) > 1/λ then E( A) > µ1 , a contradiction. So Pr( A ≥ λµ1 ) ≤ 1/λ and therefore,
if the lemma fails, it must be that Pr( A > λµ1 ) = 1/λ. In particular for some ε > 0 there is a
δ > 0 s.t. Pr( A ≥ λµ1 + ε) ≥ δ. Then E( A) ≥ δ · (λµ1 + ε) + (1/λ − δ) · λµ1 = µ1 + δε > µ1 , a
contradiction. 2
For a more visual argument (but proving the slightly weaker Pr(A ≥ λµ_1) ≤ 1/λ), note that the
step function ⟦x ≥ λµ_1⟧ satisfies the inequality ⟦x ≥ λµ_1⟧ ≤ x/(λµ_1) for all nonnegative x. If µ is the
probability distribution of the rv A, then Pr(A ≥ λµ_1) = ∫ ⟦x ≥ λµ_1⟧ dµ ≤ ∫ x/(λµ_1) dµ = µ_1/(λµ_1) = 1/λ.

1.6.2 Variance and the Chebyshev inequality: a second tail bound

Let X be a real-valued rv. If E(X) and E(X²) are both well-defined and finite, let Var(X) = E(X²) −
E(X)². We can also see that E((X − E(X))²) = Var(X) by expanding the LHS and applying linearity
of expectation. In particular, the variance is nonnegative.
If c ∈ R then since the variance is homogeneous and quadratic, Var(cX) = c² Var(X).
Lemma 13 (Chebyshev) If E(X) = θ, then Pr(|X − θ| > λ √Var(X)) < 1/λ².

Proof:

    Pr(|X − θ| > λ √Var(X)) = Pr((X − θ)² > λ² Var(X)) < 1/λ²

by the Markov inequality (Lemma 12). 2


The Chebyshev inequality is the most elementary and weakest kind of concentration bound. We will
talk about this more in the context of sums of random variables.
A frequently useful corollary of the Chebyshev inequality (Lemma 13) is:

Corollary 14 Suppose X is a nonnegative rv. Then Pr(X = 0) ≤ Var(X)/(E(X))².
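
A quick numeric illustration of Corollary 14 (a Python sketch with arbitrary parameters): for X ~ Binomial(m, q) the exact value is Pr(X = 0) = (1 − q)^m, while the corollary’s bound is Var(X)/(E(X))² = (1 − q)/(mq).

    def second_moment_bound(m, q):
        exact = (1 - q) ** m              # Pr(X = 0) for X ~ Binomial(m, q)
        bound = (1 - q) / (m * q)         # Var(X)/E(X)^2 = m q (1-q) / (m q)^2
        return exact, bound

    for m in (10, 100, 1000):
        exact, bound = second_moment_bound(m, 0.01)
        print(m, round(exact, 5), round(bound, 5))
    # The bound always dominates the exact probability, and both go to 0 as m q grows.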

1.6.3 Power mean inequality


Nonnegativity of the variance is merely a special case of monotonicity of the power means. (In
this context, though, we will assume the random variable X is positive-valued. For the variance we
don’t need this constraint.)
4 For a nonnegative rv there can be no problems with absolute convergence of the expectation; however, it may be infinite.


Lemma 15 (Power means inequality) For a positive-real-valued rv X, and for reals s < t,

    (E(X^s))^{1/s} ≤ (E(X^t))^{1/t}.

Proof: Let µ be the probability measure. Recall that for r ≥ 1, the function x^r is convex (“cup”) in x.
For a convex function f, f(∫ x dµ(x)) ≤ ∫ f(x) dµ(x). (This is sometimes called Jensen’s inequality.)
Applying this with r = t/s, we have

    ∫ X^s dµ ≤ (∫ X^t dµ)^{s/t}.

2
Using the concave function f(x) = log(x) gives us

    exp(∫ log x dµ) ≤ ∫ x dµ    (1.17)

which is the arithmetic-geometric mean inequality: in the case of a uniform distribution on n
positive values of X, it reads (∏ X_i)^{1/n} ≤ (1/n) ∑ X_i. That (1.17) is a special case of the power means
inequality can be seen by fixing t = 1 and taking the limit s → 0 (approximating x^s by 1 + s log x).
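
A numeric check of the monotonicity (a Python sketch; the sample values are arbitrary): for the uniform distribution on a few positive numbers, (E(X^s))^{1/s} is nondecreasing in s, and the s → 0 limit is the geometric mean of (1.17).

    from math import prod

    xs = [0.5, 1.0, 2.0, 7.0]              # uniform distribution on these positive values

    def power_mean(s):
        return (sum(x ** s for x in xs) / len(xs)) ** (1 / s)

    for s in (-2, -1, -0.5, 0.5, 1, 2, 4):
        print(s, round(power_mean(s), 4))
    print("geometric mean:", round(prod(xs) ** (1 / len(xs)), 4))
    # The power means increase with s and straddle the geometric mean near s = 0.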

1.6.4 Large girth and large chromatic number; the deletion method
Earlier we saw our first example of the probabilistic method, the proof of the existence of graphs with
no small clique or independent set. In that case, just picking an element of a set at random was
already enough in order to produce an object that is hard to construct “explicitly”.
However, the probabilistic method in that form can construct only an object with properties that
are shared by a large fraction of objects. Now we will see an example that enables the probabilistic
method to construct something that is quite rare—indeed, it is maybe a bit surprising that this kind
of object even exists.
We consider graphs here to be undirected and without loops or multiple edges.
The chromatic number χ of a graph is the least number of colors with which the vertices can be
colored, so that no two neighbors share a color. Clearly, as you add edges to a graph, its chromatic
number goes up.
The girth γ of a graph is the length of a shortest simple cycle. (“Simple” = no edges repeat.) Clearly,
as you add edges to a graph, its girth goes down.
These numbers are both monotone in the inclusion partial order on graphs. Chromatic number
is monotone increasing, while girth is monotone decreasing. An important theorem we’ll cover
shortly is the FKG Inequality, which implies in this setting that for any k, g > 0, if you pick a graph
u.a.r., and condition on the event that its chromatic number is above k, that reduces the probability
that its girth will be above g. In symbols, for the G (n, p) ensemble,

Pr((χ( G ) > k) ∩ (γ( G ) > g)) < Pr(χ( G ) > k) Pr(γ( G ) > g).

So in this precise sense, chromatic number and girth are anticorrelated. Indeed, having large girth
means that the graph is a tree in large neighborhoods around each vertex. A tree has chromatic
number 2. If you just allow yourself 3 colors, you gain huge flexibility in how to color a tree. Surely,


with large girth, you might be able to color the local trees so that when they finally meet up in
cycles, you can meet the coloring requirement?
No!
Here is a remarkable theorem.

Theorem 16 (Erdös [29]) For any k, g there is a graph with chromatic number χ ≥ k and girth γ ≥ g.

Proof: Pick a graph G from G(n, p), where p = n^{−1+1/g}. This is likely to be a fairly sparse graph;
the expected degree is n^{1/g} (minus 1).
Let the rv X be the number of cycles in G of length < g. E(X) = ∑_{m=3}^{g−1} p^m n·(n−1)···(n−m+1)/(2m).
(Pick the cycle sequentially and forget the starting point and orientation.) Then

    E(X) < ∑_{m=3}^{g−1} p^m n^m/(2m) = ∑_{m=3}^{g−1} n^{m/g}/(2m) ≤ ∑_{m=3}^{g−1} n^{m/g}/6.

For sufficiently large n, specifically n > 2^g, the successive terms in this sum at least double, so
E(X) ≤ n^{1−1/g}/3. By Markov’s inequality, Pr(X > n^{1−1/g}) < 1/3.
For the chromatic number we use a simple lower bound. Let I be the size of a largest independent
set in G. Since every color class of a coloring must be an independent set,

I · χ ≥ n. (1.18)

i
Now Pr( I ≥ i ) ≤ (ni)(1 − p)(2) , and recalling (1.16), the simple inequality for the exponential
i i −1+1/g
function, we have Pr( I ≥ i ) ≤ (ni)e− p(2) = (ni)e−(2)n . Using the wasteful bound (ni) ≤ ni we
have Pr( I ≥ i ) ≤ e i log n−(2i )n−1+1/g =e i log n+(i/2−i2 /2)n−1+1/g .
Finally we apply this at i = 3n1−1/g log n.
Pr(I ≥ i) ≤ e^{3n^{1−1/g} log² n + (1/2)(3n^{1−1/g} log n)·n^{−1+1/g} − (1/2)(3n^{1−1/g} log n)²·n^{−1+1/g}}
          = e^{(3/2)(log n − n^{1−1/g} log² n)}

which for sufficiently large n is < 1/3.


Thus, for sufficiently large n, there is probability at least 1/3 that G has both I < 3n^{1−1/g} log n and
at most n^{1−1/g} ≤ n/2 cycles of length strictly less than g.
Removing vertices from G can only reduce I, because any set that is independent after the removal,
was also independent before. (By contrast, removing edges can only increase I.) So, by removing
one vertex from each such cycle, we obtain a graph with ≥ n/2 vertices, girth ≥ g, and I ≤ 3n^{1−1/g} log n.
Applying (1.18) (to the graph now of size ≥ n/2), we have χ ≥ n^{1/g}/(6 log n), which for sufficiently
large n is ≥ k. 2


1.7 Lecture 7 (17/Oct): FKG inequality


Consider again the random graph model G (n, p). Suppose someone peeks at the graph and tells
you that it has a Hamilton cycle. How does that affect the probability that the graph is planar? Or
that its girth is less than 10?
Or consider the percolation process on the n × n square grid. Suppose you check and find that
there is a path from (0, 0) to (1, 0). What does that tell you about the chance that the graph has an
isolated vertex?
These questions fall into the general framework of correlation inequalities. History: Harris (1960) [45],
Kleitman (1966) [57], Fortuin, Kasteleyn and Ginibre (1971) [34], Holley (1974) [49], Ahlswede and
Daykin “Four Functions Theorem” (1978) [5].
We are concerned here with the probability space Ω of n independent random bits b1 , . . . , bn . It
doesn’t matter whether they are identically distributed. Let pi = Pr(bi = 1).
We consider the boolean lattice B on these bits: b ≥ b′ if for all i, bi ≥ b′i. So, Ω is the distribution
on B for which Pr(b) = (∏_{i: bi=1} pi) · (∏_{i: bi=0} (1 − pi)).

Definition 17 A real-valued function f on Ω is increasing if b ≥ b′ ⇒ f(b) ≥ f(b′). It is decreasing if −f
is increasing. Likewise, an event on Ω (or in other words a subset of B) is increasing if its indicator function
is increasing, and decreasing if its indicator function is decreasing.

Theorem 18 (FKG [34]) If f and g are increasing functions on Ω then

E( f g) ≥ E( f ) E( g)

Corollary 19 1. If A and B are increasing events on Ω then Pr( A ∩ B) ≥ Pr( A) Pr( B).
2. If f is an increasing function and g is a decreasing function, then E( f g) ≤ E( f ) E( g).

3. If A is an increasing event and B is a decreasing event, then Pr( A ∩ B) ≤ Pr( A) Pr( B).

Before we begin the proof we should introduce an important concept:


Conditional expectation
Suppose X and Y are random variables. Let T be some subset of the range of Y with Pr(Y ∈ T) > 0
(actually this restrictive assumption is not necessary, but for most purposes in this course we can
settle for finite sample spaces, so we won’t worry about doing the measure theory. It all works out
as you’d expect.) Then

E(X | Y ∈ T) = (1 / Pr(Y ∈ T)) ∫_{Y^{−1}(T)} X dµ

The conditional expectation E( X |Y ) is a random variable, and specifically, it is a function of the


random variable Y.
This has the following natural consequence, which is called the tower property of conditional ex-
pectations:
E( X ) = E( E( X |Y )) (1.19)
Notice that on the RHS, the outer expectation is over the distribution of Y; on the inside we have
the rv which is a real number that is, as we have said, a function of Y. To see (1.19) in the case of
discrete rvs, one need only note that both sides equal E( X ) = ∑y Pr(Y = y) E( X |Y = y).
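As a quick sanity check of (1.19) for finite sample spaces, here is a small Python computation (the particular joint distribution is an arbitrary illustration, not anything from the text):

# Numeric check of the tower property E(X) = E(E(X|Y)) for a small
# finite joint distribution (chosen arbitrarily for illustration).
from collections import defaultdict

joint = {(1, 'a'): 0.1, (2, 'a'): 0.3, (1, 'b'): 0.4, (5, 'b'): 0.2}  # Pr(X=x, Y=y)

EX = sum(p * x for (x, y), p in joint.items())

py, exy = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    py[y] += p            # marginal Pr(Y=y)
    exy[y] += p * x       # sum over x of x*Pr(X=x, Y=y)
E_of_EXgivenY = sum(py[y] * (exy[y] / py[y]) for y in py)

print(EX, E_of_EXgivenY)  # both equal 2.1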


Of course, (1.19) still makes sense under any conditioning on a third rv Z:

E( X | Z ) = E( E( X |Y )| Z ) (1.20)

Now let us reinterpret the theorem. Suppose g is the indicator function of some increasing event.
Then

E( f g) = Pr( g = 1) E( f | g = 1) + Pr( g = 0) E(0| g = 0) = Pr( g = 1) E( f | g = 1) = E( g) E( f | g = 1)

so
E(f | g = 1) = E(fg)/E(g) ≥ E(f)E(g)/E(g) = E(f).
The interpretation is that conditioning on an increasing event, only increases the expectation of any
increasing function.
Proof: By induction on n.
Case n = 1 (with p = Pr(b = 1)):

E(fg) − E(f)E(g) = p f(1)g(1) + (1 − p) f(0)g(0) − (p f(1) + (1 − p) f(0))(p g(1) + (1 − p) g(0))
                = p(1 − p)(f(1)g(1) + f(0)g(0) − f(1)g(0) − f(0)g(1))
                = p(1 − p)(f(1) − f(0))(g(1) − g(0))
                ≥ 0 by the monotonicity of both functions

Now for the induction. Observe that for any assignment (b2 . . . bn ) ∈ {0, 1}n−1 , f becomes a mono-
tone function of the single bit b1 .
For convenience, in the expectations to follow the subscript indicates explicitly which subset of bits
the expectation is taken with respect to. So for instance in the second line, f g has the role of X,
above, and (b2 . . . bn ) has the role of Y. These subscripts are extraneous and I’m just including them
for clarity.

E( f g) = E1...n ( f g)
= E2...n ( E1 ( f g|b2 . . . bn ))
≥ E2...n ( E1 ( f |b2 . . . bn ) · E1 ( g|b2 . . . bn )) applying the base-case

Observe again that E1 ( f |b2 . . . bn ) is a function of b2 . . . bn . By monotonicity of f , it is an increasing


function. Likewise for E1 ( g|b2 . . . bn ). Since by induction we may assume the theorem for the case
n − 1, we have

... ≥ E2...n ( E1 ( f |b2 . . . bn )) · E2...n ( E1 ( g|b2 . . . bn ))


= E1...n ( f ) · E1...n ( g) by (1.19)
= E( f ) E( g)

2
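Since Ω is finite, Theorem 18 can be checked exactly by brute force for small n. The following sketch does so for one arbitrarily chosen product measure and pair of increasing functions; it is an illustration, not a proof.

# Exact verification of E(fg) >= E(f)E(g) on a 3-bit product distribution,
# for two increasing functions (all choices below are arbitrary examples).
from itertools import product

p = [0.3, 0.6, 0.8]                     # p_i = Pr(b_i = 1)
f = lambda b: b[0] + b[1] + b[2]        # increasing in each bit
g = lambda b: max(b[1], b[2])           # also increasing

def pr(b):
    out = 1.0
    for bi, pi in zip(b, p):
        out *= pi if bi == 1 else 1 - pi
    return out

cube = list(product([0, 1], repeat=3))
Efg = sum(pr(b) * f(b) * g(b) for b in cube)
Ef = sum(pr(b) * f(b) for b in cube)
Eg = sum(pr(b) * g(b) for b in cube)
print(Efg >= Ef * Eg)                   # prints True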
Easy Application:
In the random graph G (2k, 1/2), there is probability at least 2−2k that all degrees are ≤ k − 1.
(Call this event A. One can also ask for an upper bound on Pr( A). As was noted in class, A is
disjoint from the event that all degrees are at least k, which has by symmetry the same probability,
so we can conclude that Pr( A) ≤ 1/2. Here is a simple improvement, showing Pr( A) tends toward
0. Fix a set L of the vertices, of size `. For v ∈ L, if it has at most k − 1 neighbors, then it has at most
k − 1 neighbors in Lc . So we’ll just upper bound the probability that every vertex in L has at most


k − 1 neighbors in Lc. These events (ranging over v ∈ L) are independent. So we can use the upper
bound [2^{−(2k−ℓ)} · (2k−ℓ choose ≤ k−1)]^ℓ, where (2k−ℓ choose ≤ k−1) denotes ∑_{j ≤ k−1} (2k−ℓ choose j).
Fixing ℓ proportional to √k sets the base of this exponential to a constant
< 1 (a deviation bound at a constant number of standard deviations from the mean), and therefore
yields a bound of the form Pr(A) ≤ exp(−Ω(√k)).)
Application:
The FKG inequality provides a very efficient proof of an inequality of Daykin and Lovász [25]:

Theorem 20 Let H be a family of subsets of [n] such that for all A, B ∈ H, ∅ ⊂ A ∩ B and A ∪ B ⊂ [n]
(strict containments). Then |H| ≤ 2^{n−2}.

Proof: Let F be the “upward order ideal” generated by H: F = {S : ∃ T ∈ H, T ⊆ S}. Let G be the
“downward order ideal” generated by H: G = {S : ∃ T ∈ H, S ⊆ T }. Then H ⊆ F ∩ G.
|F| ≤ 2^{n−1} because F satisfies the property that ∅ ⊂ A ∩ B for all A, B ∈ F, and therefore F cannot
contain any set and its complement.
Likewise, |G| ≤ 2^{n−1} because G satisfies the property that A ∪ B ⊂ [n] for all A, B ∈ G, and therefore
G cannot contain any set and its complement.
Interpreting this in terms of the bits being distributed uniformly iid, we have that Pr( F ) ≤ 1/2 and
Pr( G ) ≤ 1/2. Since F is an increasing event and G a decreasing event, Pr( F ∩ G ) ≤ 1/4. 2
Application:
We won’t show the argument here, but the FKG inequality was used in a very clever way by Shepp
to prove the “XYZ inequality” conjectured by Rival and Sands. Let Γ be a finite poset. A linear
extension of Γ is any total order on its elements that is consistent with Γ. Consider the uniform
distribution on linear extensions of Γ. The XYZ inequality says:

Theorem 21 (Shepp [81]) For any three elements x, y, z of Γ,

Pr(( x ≤ y) ∧ ( x ≤ z)) ≥ Pr( x ≤ y) · Pr( x ≤ z).


1.8 Lecture 8 (19/Oct) Part I: Achieving expectation in MAX-3SAT.


Logistics: Jenish’s OH will be in Annenberg 107 or 121 according to the day; see class webpage and
Google calendar.

1.8.1 Another appetizer


Consider unbiased random walk on the n-cycle. Index the vertices clockwise 0, . . . , n − 1, and start
the walk at 0. What is the probability distribution on the last vertex to be reached?

1.8.2 MAX-3SAT
Let’s start looking at some computational problems. A 3CNF formula on variables x1 , . . . , xn is the
conjunction of clauses, each of which is a disjunction of at most three literals. (A literal is an xi or
xic , where xic is the negation of xi .)
You will recall that it is NP-complete to decide whether a 3CNF formula is satisfiable, that is,
whether there is an assignment to the xi ’s s.t. all clauses are satisfied. Let’s take a little different
focus: think about the maximization problem of satisfying as many clauses as possible. Of course
this is NP-hard, since it includes satisfiability as a special case. But, being an optimization problem,
we can still ask how well we can do.

Theorem 22 For any 3CNF formula there is an assignment satisfying ≥ 7/8 of the clauses. Moreover such
an assignment can be found in randomized time O(m2 ), where m is the number of clauses (and we suppose
that every variable occurs in some clause).

Proof: The existence assertion is due to linearity of expectation: a uniformly random assignment
satisfies each clause (of three distinct literals) with probability 7/8, so the expected number of satisfied
clauses is 7m/8, and some assignment must do at least this well. The algorithm might be
attributed to the English educator Hickson [48]: ’Tis a lesson you should heed: / Try, try, try again.
/ If at first you don’t succeed, / Try, try, try again. That is: try independent uniformly random
assignments until one of them satisfies ≥ 7m/8 clauses. Now that we’ve been suitably educated, let’s ask,
how long does this process take? In a single trial we check one assignment, which takes time O(m).
How many trials do we need to succeed?
Let the rv M be the number of satisfied clauses of a random assignment. m − M is a nonnegative
rv with expectation m/8, and Markov’s inequality tells us that Pr( M ≤ (7/8 − ε)m) = Pr(m − M ≥
(1 + 8ε)m/8) ≤ 1/(1 + 8ε).
This says we have a good chance of getting close to the desired number of satisfied clauses; however,
we asked to achieve 7/8, not 7/8 − ε. We can get this by noting that M is integer-valued, so for
ε < 1/m, an assignment satisfying 7/8 − ε of the clauses, satisfies 7/8 of them.
With the choice ε = 1/(2m), then, the probability that a trial succeeds is at least

1 − 1/(1 + 8ε) = 8ε/(1 + 8ε) = 4/(m + 4) ∈ Ω(1/m)
Trials succeed or fail independently so the expected number of trials to success is the expectation
of a geometric random variable with parameter Ω(1/m), which is O(m). 2
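A minimal sketch of this try-again algorithm in Python (the clause representation — a list of lists of signed variable indices — is my own convention for the example):

import random
from math import ceil

def satisfied(formula, assignment):
    # number of clauses with at least one true literal;
    # assignment[i] is the boolean value of variable i
    count = 0
    for clause in formula:
        if any((assignment[abs(l)] if l > 0 else not assignment[abs(l)]) for l in clause):
            count += 1
    return count

def max3sat_random(formula, nvars):
    m = len(formula)
    target = ceil(7 * m / 8)
    while True:                              # expected O(m) trials, as argued above
        assignment = {i: random.random() < 0.5 for i in range(1, nvars + 1)}
        if satisfied(formula, assignment) >= target:
            return assignment

# tiny example: (x1 or x2 or x3) and (~x1 or x2 or ~x3)
print(max3sat_random([[1, 2, 3], [-1, 2, -3]], 3))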

1.8.3 Derandomization by the method of conditional expectations


How can we improve on this simple-minded method? We do not have a way forward on increasing
the fraction of satisfied clauses, because of:


Theorem 23 (Håstad [46]) For all ε > 0 it is NP-hard to approximate Max-3SAT within factor 7/8 + ε.

But we might hope to reduce the runtime, and also perhaps the dependence on random bits. As it
turns out we can accomplish both of these objectives.

Theorem 24 There is an O(m)-time deterministic algorithm to find an assignment satisfying 7/8 of the
clauses of any 3CNF formula on m clauses.

Proof: This algorithm illustrates the method of conditional expectations. The point is that we can de-
randomize the randomized algorithm by not picking all the variables at once. Instead, we consider
the alternative choices to just one of the variables, and choose the branch on which the conditional
expected number of satisfying clauses is greater.
This very general method works in situations in which one can actually quickly calculate (or at least
approximate) said conditional expectations.
We use the tower property of conditional expectations, (1.19): letting Y = the number of satisfied
clauses for a uniformly random setting of the rvs,

E(Y ) = E( E(Y | x1 ))

or explicitly
E(Y) = (1/2) E(Y | x1 = 0) + (1/2) E(Y | x1 = 1)
and the strategy is to pursue the value of x1 which does better.
In the present example computing the conditional expectations is easy. The probability that a clause
of size i is satisfied is 1 − 2^{−i}. If a formula has mi clauses of size i, the expected number of satisfied
clauses is ∑ mi (1 − 2^{−i}). Now, partition the clauses of size i into m_i^1 that contain the literal x1, m_i^0
that contain the literal x1^c, and those that contain neither.
The expected number of satisfied clauses conditional on setting x1 = 1 is

∑ m_i^1 + ∑ m_i^0 (1 − 2^{−i+1}) + ∑ (m_i − m_i^1 − m_i^0)(1 − 2^{−i}).    (1.21)

Similarly the expected number of satisfied clauses conditional on setting x1 = 0 is

∑ m_i^1 (1 − 2^{−i+1}) + ∑ m_i^0 + ∑ (m_i − m_i^1 − m_i^0)(1 − 2^{−i}).    (1.22)

A simple way to do this: we can compute each of these quantities in time O(m). (Actually, since
these quantities average to the current expectation, which we already know, we only have to cal-
culate one of them.) This simple process runs in time O(m2 ). However, we can actually do the

[Figure 1.3: m clauses of size ≤ 3, n variables; a bipartite incidence graph between clauses C1, . . . , Cm and variables x1, . . . , xn.]

process in time O(m). We don’t even really need to calculate the quantities (1.21),(1.22). We start
with variable x1 and scan all the clauses it participates in (see Fig. 1.3). For each clause Ci (which
say has currently |Ci| literals), the effect of setting x1 = 1 is to change the contribution of the clause
to the expectation from 1 − 2^{−|Ci|} to either 1 (if the variable satisfies the clause) or to 1 − 2^{−(|Ci|−1)}
(otherwise); i.e., the expectation either increases or decreases by 2^{−|Ci|}, while the effect of setting
x1 = 0 is exactly the negative of this. We add these contributions of ±2^{−|Ci|}, conditional on x1 = 1,
as we scan the clauses containing x1; if the total is nonnegative we fix x1 = 1, otherwise we fix x1 = 0.
Having done that, we edit the relevant clauses to eliminate x1 from them. Then we continue with
x2, etc. The work spent per variable is proportional to its degree in this bipartite graph (the number
of clauses containing it), and the sum of these degrees is ≤ 3m. So the total time spent is O(m). 2
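Here is a sketch of the derandomized algorithm, in the same (illustrative) clause representation as the earlier sketch; it performs exactly the per-variable scan of ±2^{−|Ci|} contributions described above.

from collections import defaultdict

def derandomized_max3sat(formula, nvars):
    # Method of conditional expectations: fix variables one at a time,
    # always choosing the value with the larger conditional expectation.
    clauses = [set(c) for c in formula]
    done = [False] * len(clauses)            # clause already satisfied?
    occurs = defaultdict(list)               # variable -> indices of clauses containing it
    for idx, c in enumerate(clauses):
        for lit in c:
            occurs[abs(lit)].append(idx)

    assignment = {}
    for v in range(1, nvars + 1):
        delta = 0.0                          # E(#sat | x_v=1) - E(#sat | x_v=0) over live clauses
        for idx in occurs[v]:
            if done[idx] or not clauses[idx]:
                continue
            w = 2.0 ** (-len(clauses[idx]))
            if v in clauses[idx]:
                delta += w
            if -v in clauses[idx]:
                delta -= w
        val = delta >= 0
        assignment[v] = val
        lit_true, lit_false = (v, -v) if val else (-v, v)
        for idx in occurs[v]:
            if done[idx]:
                continue
            if lit_true in clauses[idx]:
                done[idx] = True
            clauses[idx].discard(lit_false)  # the falsified literal disappears
    return assignment

print(derandomized_max3sat([[1, 2, 3], [-1, 2, -3], [-2, 3, 1]], 3))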

Chapter 2

Algebraic Fingerprinting

There are several key ways in which randomness is used in algorithms. One is to “push apart”
things that are different even if they are similar. We’ll study a few examples of this phenomenon.

2.1 Lecture 8 (19/Oct) Part II: Fingerprinting with Linear Algebra

2.1.1 Polytime Complexity Classes Allowing Randomization


Some definitions of one-sided and two-sided error in randomized computation are useful.

Definition 25 BPP, RP, co-RP, ZPP: These are the four main classes of randomized polynomial-time compu-
tation. All are decision classes. A language L is:

• In BPP if there is a poly-time randomized algorithm deciding membership in L that errs (on every
input) with probability ≤ 1/3.

• In RP if there is such an algorithm that for x ∈ L errs with probability ≤ 1/3, and for x ∉ L errs with
probability 0.

(note, RP is like NP in that it provides short proofs of membership), while the subsidiary definitions are:

• L ∈ co-RP if L^c ∈ RP, that is to say, if for x ∈ L the algorithm errs with probability 0, and if for x ∉ L
the algorithm errs with probability ≤ 1/3.
• ZPP = RP ∩ co-RP.

It is a routine exercise that none of these constants matter: the error bound 1/3 can be relaxed to anything
bounded away from 1/2 by 1/poly (for BPP), or away from 1 by 1/poly (for RP and co-RP), without changing
the class, although completing that exercise relies on the Chernoff bound which we’ll see in a later lecture.

Exercise: Show that the following are two equivalent characterizations of ZPP: (a) there is a poly-
time randomized algorithm that with probability ≥ 1/3 outputs the correct answer, and with the
remaining probability halts and outputs “don’t know”; (b) there is an expected-poly-time algorithm
that always outputs the correct answer.
We have the following obvious inclusions:

P ⊆ ZPP ⊆ RP, co-RP ⊆ BPP


What is the difference between ZPP and BPP? In BPP, we can never get a definitive answer, no
matter how many independent runs of the algorithm we execute. In ZPP, we can, and the expected
time until we get a definitive answer is polynomial; but we cannot be sure of getting the definitive
answer in any fixed time bound.
Here are the possible outcomes for any single run of each of the basic types of algorithm:

            x ∈ L       x ∉ L
RP          ∈, ∉        ∉
co-RP       ∈           ∈, ∉
BPP         ∈, ∉        ∈, ∉

If L ∈ ZPP, then we can run simultaneously an RP algorithm A and a co-RP algorithm B
for L. Ideally, this will soon give us a definitive answer: if both algorithms say “x ∈ L”, then A
cannot have been wrong, and we are sure that x ∈ L; if both algorithms say “x ∉ L”, then B cannot
have been wrong, and we are sure that x ∉ L. The expected number of iterations until one of these
events happens (whichever is viable) is constant. But, in any particular iteration, we can (whether
x ∈ L or x ∉ L) get the indefinite outcome that A says “x ∉ L” and B says “x ∈ L”. This might
continue for arbitrarily many rounds, which is why we can’t make any guarantee about what we’ll
be able to prove in bounded time.
An algorithm with “BPP”-style two-sided error is often referred to as “Monte Carlo”, while an algorithm
with “ZPP”-style behavior (always correct, but with random running time) is often referred to as “Las Vegas”.

2.1.2 Verifying Matrix Multiplication


It is a familiar theme that verifying a fact may be easier than computing it. Most famously, it is widely
conjectured that P6=NP. Now we shall see a more down-to-earth example of this phenomenon.
In what follows, all matrices are n × n. In order to eliminate some technical issues (mainly numerical
precision, also the design of a substitute for uniform sampling), we suppose that the entries of the
matrices lie in Z/p, p prime; and that scalar arithmetic can be performed in unit time.
(The same method will work for any finite field and a similar method will work if the entries are
integers less than poly(n) in absolute value, so that we can again reasonably sweep the runtime for
scalar arithmetic under the rug.)
Here are two closely related questions:

1. Given matrices A, B, compute A · B.

2. Given matrices A, B, C, verify whether C = A · B.

The best known algorithm, as of 2014, for the first of these problems runs in time O(n^{2.3728639}) [39].
Resolving the correct lim inf exponent (usually called ω) is a major question in computer science.
Clearly the second problem is no harder, and a lower bound of Ω(n²) even for that is obvious since
one must read the whole input.
Randomness is not known to help with problem (1), but the situation for problem (2) is quite
different.

Theorem 26 (Freivalds [36]) There is a co-RP-type algorithm for the language “C = A · B”, running in
time O(n²).


I wrote “co-RP-type” rather than co-RP because the issue at stake is not the polynomiality of the
runtime (since n^{ω+o(1)} is an upper bound and the gain from randomization is that we’re achieving
n²), but only the error model.
Proof: Note that the obvious procedure for matrix-vector multiplication runs in time O(n2 ).
The verification algorithm is simple. Select uniformly a vector x ∈ (Z/p)^n. Check whether ABx =
Cx without ever multiplying AB: applying associativity, (AB)x = A(Bx), this can be done in just
three matrix-vector multiplications. Output “Yes” if the equality holds; output “No” if it fails.
Clearly if AB = C, the output will be correct. In order to get a co-RP-type result, it remains to show
that
Pr(ABx = Cx | AB ≠ C) ≤ 1/2.
The event ABx = Cx is equivalently stated as the event that x lies in the right kernel of AB − C.
Given that AB ≠ C, that kernel is a strict subspace of (Z/p)^n and therefore of at most half the
cardinality of the larger space. Since we select x uniformly, the probability that it is in the kernel is
at most 1/2. 2
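A sketch of the verification procedure using numpy, with a (hypothetically chosen) prime p = 65537 kept small enough that 64-bit integer arithmetic suffices:

import numpy as np

def freivalds(A, B, C, p, trials=10):
    # Randomized check of C == A*B over Z/p; when the claim is false,
    # each trial catches it except with probability at most 1/p.
    n = A.shape[0]
    for _ in range(trials):
        x = np.random.randint(0, p, size=n, dtype=np.int64)
        lhs = A.dot(B.dot(x) % p) % p          # A(Bx): two matrix-vector products
        rhs = C.dot(x) % p                     # Cx
        if not np.array_equal(lhs, rhs):
            return False                       # definitely C != AB
    return True                                # probably C == AB

p, n = 65537, 50
A = np.random.randint(0, p, (n, n), dtype=np.int64)
B = np.random.randint(0, p, (n, n), dtype=np.int64)
C = A.dot(B) % p
print(freivalds(A, B, C, p))                   # True
C[0, 0] = (C[0, 0] + 1) % p
print(freivalds(A, B, C, p))                   # almost surely False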


2.2 Lecture 9 (22/Oct): Fingerprinting with Linear Algebra

2.2.1 Verifying Associativity


Let a set S of size n be given, along with a binary operation ◦ : S × S → S. Thus the input is a
table of size n2 ; we call the input (S, ◦). The problem we consider is testing whether the operation
is associative, that is, whether for all a, b, c ∈ S,

( a ◦ b) ◦ c = a ◦ (b ◦ c) (2.1)
A triple for which (2.1) fails is said to be a nonassociative triple.
No sub-cubic-time deterministic algorithm is known for this problem. However,

Theorem 27 (Rajagopalan & Schulman [73]) There is an O(n²)-time co-RP type algorithm for associa-
tivity.

Proof: An obvious idea is to replace the O(n³)-time exhaustive search for a nonassociative triple, by
randomly sampling triples and checking them. The runtime required is inverse to the fraction
of nonassociative triples, so this method would improve on exhaustive search if we were guaranteed
that a nonassociative operation had a super-constant number of nonassociative triples. However,
for every n ≥ 3 there exist nonassociative operations with only a single nonassociative triple.
So we’ll have to do something more interesting.
Let’s define a binary operation (S̄, ◦) on a much bigger set S̄. Define S̄ to be the vector space with
basis S over the field Z/2, that is to say, an element x ∈ S̄ is a formal sum

x = ∑_{a∈S} a·x_a    for x_a ∈ Z/2

The product of two such elements x, y is

x ◦ y = ∑_{a∈S} ∑_{b∈S} (a ◦ b) x_a y_b = ∑_{c∈S} c · ( ⊕_{a,b: a◦b=c} x_a y_b )

where of course ⊕ denotes sum mod 2.
On (S̄, ◦) we have an operation that we do not have on (S, ◦), namely, addition:

x + y = ∑_{a∈S} a(x_a + y_a)

(Those who have seen such constructions before will recognize (S̄, ◦) as an “algebra” of (S, ◦) over
Z/2.)
The algorithm is now simple: check the associative identity for three random elements of S̄. That is, select
x, y, z u.a.r. in S̄. If (x ◦ y) ◦ z = x ◦ (y ◦ z), report that (S, ◦) is associative, otherwise report that it
is not associative. The runtime for this process is clearly O(n²).
If (S, ◦) is associative then so is (S̄, ◦), because then (x ◦ y) ◦ z and x ◦ (y ◦ z) have identical ex-
pansions as sums. Also, nonassociativity of (S, ◦) implies nonassociativity of (S̄, ◦) by simply
considering “singleton” vectors within the latter.
But this equivalence is not enough. The crux of the argument is the following:


Lemma 28 If (S, ◦) is nonassociative then at least one eighth of the triples (x, y, z) in S̄³ are nonassociative
triples.

Proof: The proof relies on a variation on the inclusion-exclusion principle.


For any triple a, b, c ∈ S, let
g(a, b, c) = (a ◦ b) ◦ c − a ◦ (b ◦ c).
Note that g is a mapping g : S³ → S̄. Now extend g to ḡ : S̄³ → S̄ by:

ḡ(x, y, z) = ∑_{a,b,c} g(a, b, c) x_a y_b z_c

If you imagine the n × n × n cube indexed by S³, with each position (a, b, c) filled with the entry
g(a, b, c), then ḡ(x, y, z) is the sum of the entries in the combinatorial subcube of positions where
x_a = 1, y_b = 1, z_c = 1. (We say “combinatorial” only to emphasize that unlike a physical cube, here
the slices that participate in the subcube are not in any sense adjacent.)
Fix (a0, b0, c0) to be any nonassociative triple of S.
Partition S̄³ into blocks of eight triples apiece, as follows. Each of these blocks is indexed by a triple
x, y, z s.t. x_{a0} = 0, y_{b0} = 0, z_{c0} = 0. The eight triples are (x + ε1 a0, y + ε2 b0, z + ε3 c0) where εi ∈ {0, 1}.
Now observe that

∑_{ε1,ε2,ε3} ḡ(x + ε1 a0, y + ε2 b0, z + ε3 c0) = g(a0, b0, c0)

To see this, note that each of the eight terms on the LHS is, as described above, a sum of the entries
in a “subcube” of the “S³ cube”. These subcubes are closely related: there is a core subcube whose
indicator function is x × y × z, and all entries of this subcube are summed within all eight terms.
Then there are additional width-1 pieces: the entries in the region a0 × y × z occur in four terms,
as do the regions x × b0 × z and x × y × c0. The entries in the regions a0 × b0 × z, a0 × y × c0 and
x × b0 × c0 occur in two terms, and the entry in the region a0 × b0 × c0 occurs in one term. Since we are
working over Z/2, every entry occurring in an even number of the terms cancels, leaving exactly g(a0, b0, c0).
Since the RHS is nonzero, so is at least one of the eight terms on the LHS. 2 2
Corollary: in time O(n²) we can sample x, y, z u.a.r. in S̄ and determine whether (x ◦ y) ◦ z =
x ◦ (y ◦ z). If (S, ◦) is associative, then we get =; if (S, ◦) is nonassociative, we get ≠ with probability
≥ 1/8.
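A sketch of the test: an element of S̄ is represented by its support (a Python set of elements of S), the product is computed coordinate-by-coordinate mod 2 straight from the definition, and the one-sided guarantee is the ≥ 1/8 detection probability of Lemma 28. The small magma in the demo is an arbitrary example.

import random
from collections import Counter

def bar_product(x, y, op):
    # product in S-bar: the c-coordinate is the mod-2 count of pairs (a,b)
    # with a in x, b in y, and a∘b = c (x, y are supports of Z/2 vectors)
    counts = Counter(op[a][b] for a in x for b in y)
    return {c for c, k in counts.items() if k % 2 == 1}

def probably_associative(op, trials=40):
    # one-sided test: always True if (S,∘) is associative;
    # otherwise each trial detects nonassociativity with probability >= 1/8
    S = range(len(op))
    for _ in range(trials):
        x, y, z = ({a for a in S if random.random() < 0.5} for _ in range(3))
        if bar_product(bar_product(x, y, op), z, op) != bar_product(x, bar_product(y, z, op), op):
            return False
    return True

# demo: op[a][b] = a∘b.  Addition mod 3 is associative; then we corrupt one entry.
add3 = [[(a + b) % 3 for b in range(3)] for a in range(3)]
print(probably_associative(add3))      # True
add3[2][2] = 0                         # now e.g. (1∘2)∘2 = 2 but 1∘(2∘2) = 1
print(probably_associative(add3))      # False, except with probability <= (7/8)^40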


2.3 Lecture 10 (24/Oct): Perfect matchings, polynomial identity testing

2.3.1 Matchings
A matching in a graph G = (V, E) is a set of vertex disjoint edges; the size of the matching is the
number of edges.
Let n = |V | and m = | E|. A perfect matching is one of size n/2. A maximal matching is one to which
no edges can be added. A maximum matching is one of largest size.
How hard are the problems of finding such objects?
It is of course easy to find a maximal matching—sequentially. On the other hand, finding one on a
parallel computer is a much more interesting problem, which I hope to return to later in the course.
Returning to sequential computation: Finding a maximum matching, or deciding whether a perfect
matching exists, are interesting problems. In bipartite graphs, Hall’s theorem and the augment-
ing path method give very nice and accessible deterministic algorithms for maximum matching.
In general graphs the problem is harder but there are deterministic algorithms running in time
O(√n · m) [66, 38].

2.3.2 Bipartite perfect matching: deciding existence


The first problem we focus on here is to decide whether a bipartite graph has a perfect matching.
As noted there are nice deterministic algorithms for this problem but the randomized one is even
simpler. Write G = (V1 , V2 , E) with E ⊆ V1 × V2 . Form the V1 × V2 “variable” matrix A which has
Aij = xij if {i, j} ∈ E, and otherwise Aij = 0.
Let q be some prime power and consider the xij as variables in GF (q). The determinant of A, then,
is a polynomial in the variables xij .
Before launching into this, a word on a subtle point: what does it mean for a (multivariate) polyno-
mial p to be nonzero? Consider a polynomial over any field κ, which is to say, the coefficients of all
the monomials in the polynomial lie in κ.

Definition 29 We consider a polynomial nonzero if any monomial has a nonzero coefficient.

A stronger condition, which is not the definition we adopt, is that p(x) ≠ 0 for some assignment x of
values in κ to the variables. Of course this implies the condition in the definition; but it is strictly stronger,
as we can see from the example of the polynomial x² + x over the field Z/2.
However, the conditions are equivalent in the following two cases:

1. Over infinite fields such as R or C.


This will follow from Lemma 35.
2. For multilinear polynomials. (This applies in particular to Det( A) which we are considering
now, so for Lemma 31, it wouldn’t have mattered which definition we used.) Specifically we
have:

Lemma 30 Let p(x1, . . . , xn) be a nonzero multilinear polynomial over field κ. Then there is a setting of the xi
to values ci in κ s.t. p(c1, . . . , cn) ≠ 0.


Proof: Every monomial is associated with a set of variables; choose a minimal such set. (E.g.,
if there is a constant term, then the empty set.) Assign the value 1 to all variables in this set,
and 0 to all variables outside this set. Only the chosen monomial can be nonzero, so p ≠ 0 for
this assignment. 2

Lemma 31 G has a perfect matching iff Det(A) ≠ 0.

(This is the “baby” version of a result of Tutte that we will see later in Theorem 36.)
Proof: Every nonzero monomial in the expansion of the determinant corresponds to a permutation. A
permutation contributing a nonzero monomial is simply a pattern of nonzero entries hitting each row and
column exactly once, namely, a perfect matching in the bipartite graph. Conversely, if some perfect matching
is present, it puts a monomial in the determinant with coefficient either 1 or −1 (and distinct matchings
contribute distinct monomials, so nothing cancels). 2

Corollary 32 Fix any field κ. G has a perfect matching iff there is an assignment of the variables in A such
that the determinant is nonzero.

Proof: Apply Lemma 30 to Lemma 31. 2


This suggests the following exceptionally simple algorithm: compute the polynomial and see if it is
nonzero.
There’s a problem with this idea! The determinant has exponentially many monomials. This is not
a problem for computing determinants over a ring such as the integers, because even the sum of
exponentially many integers only has polynomially more bits than the largest of those integers has.
However, in this ring of multivariate polynomials (i.e., the ring κ [{ xij }] where κ = Q or κ = GF (q),
for the moment it doesn’t matter), there are exponentially many distinct terms to keep track of if
you want to write the polynomial out as a sum of monomials. Of course the determinant has a
more concise representation (namely, as “Det( A)”), but we do not know how to efficiently convert
that to any representation that displays transparently whether the polynomial is the 0 polynomial.
So we modify the original suggestion. Since we do know how to efficiently compute determinants
of scalar matrices, let’s substitute scalar values for the xij ’s. What values should we use? Random
ones.
Revised Algorithm: Sample the xij’s u.a.r. in GF(q); call the sampled matrix A_R. Compute Det(A_R);
report “G has/hasn’t a perfect matching” according to whether Det(A_R) ≠ 0 or = 0.

[Figure 2.1: A commutative square: Det(variables) —substitute→ Det(scalars); Det(variables) —expand→ monomials(variables) —substitute→ value of Det; Det(scalars) —evaluate→ value of Det. This diagram commutes, but for a fast commute, go right and then down.]

Clearly the algorithm answers correctly if there is no perfect matching, and it is fast (see Fig. 2.1).
What needs to be shown is that the probability of error is small if there is a perfect matching (and
q is large enough). So this is an RP-type algorithm for “G has a perfect matching”.
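A sketch of the Revised Algorithm. Floating-point determinants are unsuitable here, so the sketch carries its own Gaussian elimination over Z/p (p a prime, here 10^9 + 7, an arbitrary choice); the decision rule is exactly Det(A_R) ≠ 0 versus = 0.

import random

def det_mod_p(M, p):
    # Gaussian elimination over Z/p (p prime); returns Det(M) mod p.
    M = [row[:] for row in M]
    n, det = len(M), 1
    for c in range(n):
        piv = next((r for r in range(c, n) if M[r][c] != 0), None)
        if piv is None:
            return 0
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            det = -det
        det = (det * M[c][c]) % p
        inv = pow(M[c][c], p - 2, p)         # inverse via Fermat's little theorem
        for r in range(c + 1, n):
            f = (M[r][c] * inv) % p
            for k in range(c, n):
                M[r][k] = (M[r][k] - f * M[c][k]) % p
    return det % p

def bipartite_has_pm(n, edges, p=10**9 + 7):
    # edges: pairs (i, j) with 0 <= i, j < n (left/right vertex indices).
    # One-sided error: a "False" answer is wrong (when a matching exists)
    # with probability at most n/p.
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = random.randrange(p)        # random substitution for x_ij
    return det_mod_p(A, p) != 0

print(bipartite_has_pm(2, [(0, 0), (0, 1), (1, 1)]))   # True (match 0-0 and 1-1)
print(bipartite_has_pm(2, [(0, 0), (1, 0)]))           # False (always correct here)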

Theorem 33 The algorithm is error-free on bipartite graphs lacking a perfect matching, and the probability
of error of the algorithm on bipartite graphs which have a perfect matching is at most n/q. The runtime of
the algorithm is n^{ω+o(1)}, where ω is the matrix multiplication exponent.


All we have to do, then, is use a prime power q ≥ 2n in order to have error probability ≤ 1/2.
Incidentally, there is always a prime 2n ≤ q < 4n; this is called “Bertrand’s postulate”. This fact
alone isn’t quite strong enough if we want to find a prime in the right size range efficiently, but
that too can be done, in a slightly larger range. (The density of primes of this size is about 1/ log n
so we don’t have to try many values before we should get lucky; and note, primality testing has
efficient algorithms in ZPP and even somewhat less efficient algorithms in P [4].) However, we don’t
have to work this hard, since we’re satisfied with prime powers rather than primes. We can simply
use the first power of 2 after 2n.
We will prove Theorem 33 after introducing a general useful tool.

2.3.3 Polynomial identity testing


In the previous section we saw that testing for existence of a perfect matching in a bipartite graph
can be cast as a special case of the following problem. We are given a polynomial p( x ), of total
degree n, in variables x = ( x1 , . . . , xm ), m ≥ 1. (The total degree of a monomial is the sum of the
degrees of the variables in it; the total degree of a polynomial is the greatest total degree of its
monomials.) We are agnostic as to how we are “given” the polynomial, and demand only that we
be able to quickly evaluate it at any scalar assignment to the variables. We wish to test whether the
polynomial is identically 0, and our procedure for doing so is to evaluate it at a random point and
report “yes” if the value there is 0. We rely on the following lemma. Let z( p) be the set of roots
(zeros) of a polynomial p.

Lemma 34 Let p be a nonzero polynomial over GF(q), of total degree n in m variables. Then |z(p)| ≤
n·q^{m−1}.

As a fraction, this is saying that |z(p)|/q^m ≤ n/q, and in this form the lemma immediately implies
Theorem 33.
The univariate case of the lemma is probably familiar to you.
The lemma is a special case of the following more general statement which holds for any, even
infinite, field κ.

Lemma 35 Let p be a nonzero polynomial over a field κ, of total degree n in variables x1, . . . , xm. Let
S1, . . . , Sm be subsets of κ with |Si| ≤ s for all i. Then |z(p) ∩ (S1 × . . . × Sm)| ≤ s^{m−1}·n.

This is usually known as the Schwartz-Zippel lemma [77, 90], although the results in these two
publications were not precisely equivalent, and there were at least two other discoveries of versions
of the result, by Ore [62] and by DeMillo and Lipton [26]. A generalization beyond polynomials is
due to Gonnet [41].
Recalling the two candidate definitions of what it means for a polynomial to be nonzero, since in
Defn 29 we chose the weaker condition, Lemma 35 is stronger than it would be otherwise.
Proof: of Lemma 35: The lemma is trivial if n ≥ s, so suppose n < s.
First consider the univariate case, m = 1. (In this case the two lemmas are identical since any set
S1 is a product set.) This follows by induction on n because if n ≥ 1 and p(α) = 0, then p can be
factored as p( x ) = ( x − α) · q( x ) for some q of degree n − 1. (Because, make the change of variables
to x − α. After this change the polynomial cannot have any constant term. So we can factor out
( x − α).)


Next we handle m > 1 by induction. If x1 does not appear in p then the conclusion follows from
the case m − 1. Otherwise, write p in the form p(x) = ∑_{i=0}^{n} x1^i pi(x2, . . . , xm), and let 0 < i ≤ n be
largest such that pi ≠ 0. The degree of pi is at most n − i, so by induction,

|z(pi) ∩ (S2 × . . . × Sm)| / s^{m−1} ≤ (n − i)/s

Let r be the LHS, i.e., the fraction of S2 × . . . × Sm that are roots of pi.
For (x2, . . . , xm) ∈ z(pi) we allow as a worst case that all choices of x1 ∈ S1 yield a zero of p.
For (x2, . . . , xm) ∉ z(pi), p restricts to a nonzero polynomial of degree i in the variable x1, so by the
case m = 1,

|z(p) ∩ (S1 × {x2} × . . . × {xm})| / s ≤ i/s

Since i/s ≤ n/s < 1, the weighted average of our two bounds (on the fraction of roots in sets of the
form S1 × {x2} × . . . × {xm}) is worst when r is as large as possible. Thus

|z(p) ∩ (S1 × . . . × Sm)| / s^m ≤ r · 1 + (1 − r) · i/s
                                ≤ ((n − i)/s) · 1 + (1 − (n − i)/s) · i/s
                                = n/s − i(n − i)/s²
                                ≤ n/s
2
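A sketch of the resulting identity test, treating the polynomial purely as a black box that we can evaluate at points of Z/q (the example polynomials below are arbitrary illustrations):

import random

def pit(poly, nvars, degree, q, trials=20):
    # Black-box identity test over Z/q: report "identically zero" only if
    # every random evaluation vanishes.  For a nonzero polynomial of total
    # degree <= degree, each trial misses with probability <= degree/q.
    assert q > degree
    for _ in range(trials):
        point = [random.randrange(q) for _ in range(nvars)]
        if poly(point, q) != 0:
            return False                       # certainly not the zero polynomial
    return True                                # zero, or we were extremely unlucky

q = 10**9 + 7
zero_poly = lambda v, q: ((v[0] + v[1])**2 - v[0]**2 - 2*v[0]*v[1] - v[1]**2) % q   # identically 0
nonzero_poly = lambda v, q: ((v[0] + v[1])**2 - v[0]**2 - v[1]**2) % q              # equals 2xy

print(pit(zero_poly, 2, 2, q))        # True
print(pit(nonzero_poly, 2, 2, q))     # False with overwhelming probability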
Comment: This lemma gives us an efficient randomized way of testing whether a polynomial is
identically zero, and naturally, people have wondered whether there might be an efficient deter-
ministic algorithm for the same task. So far, no such algorithm has been found, and it is known
that any such algorithm would have hardness implications in complexity theory that are currently
out of reach [52]1 .

1 Specifically: If one can test in polynomial time whether a given arithmetic circuit over the integers computes the zero

polynomial, then either (i) NEXP 6⊆ P/poly or (ii) the Permanent is not computable by polynomial-size arithmetic circuits.


2.4 Lecture 11 (26/Oct): Perfect matchings in general graphs. Parallel computation. Isolating lemma.

2.4.1 Deciding existence of a perfect matching in a graph


Bipartite graphs were handled last time. Now we consider general graphs.
Deterministically, deciding the existence of a perfect matching in a general graph is harder than
the same problem in a bipartite graph. (As noted, we have poly-time algorithms, but not nearly so
simple ones.) With randomization, however, we can adapt the same approach to work with almost
equal efficiency.
We must define the Tutte matrix of a graph G = (V, E). Order the vertices arbitrarily from 1, . . . , n
and set

Tij = 0       if {i, j} ∉ E
Tij = xij     if {i, j} ∈ E and i < j
Tij = −xji    if {i, j} ∈ E and i > j

Theorem 36 (Tutte [86]) Det(T) ≠ 0 iff G has a perfect matching.

This determinant is not multilinear in the variables, so the lemma from last time does not apply.
Proof: If G has a perfect matching, assign xij = 1 for edges in the matching, and 0 otherwise. Each
matching edge {i, j} describes a transposition of the vertices i, j. With this assignment every row
and column has a single nonzero entry corresponding to the matching edge it is part of, so the
matrix is the permutation matrix (with some signs) of the involution that transposes the vertices on
each edge. Since a transposition has sign −1 and there is a single −1 in each pair of nonzero entries,
the contribution of each transposition to the determinant is 1, and overall we have Det( T ) = 1.
Conversely suppose Det(T) ≠ 0 as a polynomial. Consider the determinant as a signed sum over
permutations. The net contribution to the determinant from all permutations having an odd cycle
is 0, for the following reason. In each such permutation identify the “least” odd cycle by some criterion,
e.g., ordering the cycles by their least-indexed vertex. Then flip the direction of that least odd cycle.
This map is an involution on the set of permutations. It carries the permutation to another, which
contributes the opposite sign to the determinant, since the signs of all the (oddly many) entries in the cycle flipped.
(Figure 2.2.)

   
[Figure 2.2: Flipping a cycle among vertices 3, 4, 5 (eqn (2.2)): the entries x34, x45, −x35 of one permutation become x35, −x34, −x45 in the other. Preserves permutation sign; reverses signs of cycle variables.]

Therefore there are permutations of the vertices, supported by T (i.e., each vertex is mapped to a
destination along one of the edges incident to it, that is, π(i) = j ⇒ Tij ≠ 0), having only even
cycles. The even cycles of length 2 are matching edges, and in any even cycle of length greater
than 2, we can use every alternate edge; altogether we obtain a perfect matching. See Figure 2.3
for a graph having perfect matchings, and two permutations from which we can read off perfect
matchings. 2


 
[Figure 2.3: A graph on vertices {1, 2, 3, 4} with edges {1,2}, {1,3}, {1,4}, {2,3}, {3,4}, its Tutte matrix (2.3), and two different support patterns (2.4) of permutations π from which we can read off perfect matchings: the involution (12)(34) and the 4-cycle (1234).]

In exactly the same way as for the bipartite case, this yields:

Theorem 37 The algorithm to determine existence of a perfect matching in a graph on n vertices, is error-
free on graphs lacking a perfect matching, and the probability of error of the algorithm on graphs which
have a perfect matching, is at most n/q. The runtime of the algorithm is n^{ω+o(1)}, where ω is the matrix
multiplication exponent.

By self-reducibility this immediately yields an Õ(n^{ω+2})-time algorithm for finding a perfect match-
ing. (Remove an edge, see if there is still a perfect matching without it, . . . )
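A sketch of that self-reduction, written against a hypothetical tester has_pm(n, edges) (for instance, the Tutte-matrix test just described, repeated a few times per call to drive down its one-sided error):

def find_pm_by_self_reduction(n, edges, has_pm):
    # has_pm(n, edges) is assumed to be a (randomized) tester for
    # "this graph has a perfect matching", e.g. the Tutte-matrix test.
    if not has_pm(n, edges):
        return None
    edges = list(edges)
    for e in list(edges):
        rest = [f for f in edges if f != e]
        if has_pm(n, rest):        # e is not needed by every matching; discard it
            edges = rest
    # every surviving edge is needed by every remaining perfect matching,
    # so the surviving edge set is itself a perfect matching
    return edges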

2.4.2 Parallel computation


There are two major processes at work in the above algorithm: determinant computations, and
sequential branching used in the self-reducibility argument. As we discuss in a moment, the linear
algebra can be parallelized. But the branching is inherently sequential.
Nevertheless, we will shortly see a completely different algorithm, that avoids this sequential
branching. In this way we’ll put the problem of finding a perfect matching, in RNC.
NC^i = problems solvable deterministically by poly(n) processors in log^i n time. (Equivalently, by
poly(n)-size, log^i n-depth circuits.)
NC = ∪_i NC^i.
RNC^i, RNC = same but the processors (gates) may use random bits.
(Note, we are glossing over the “uniformity” conditions of the complexity classes.)

2.4.3 Sequential and parallel linear algebra


In sequential computation, there are reductions showing that matrix inversion and multiplication
have the same time complexity (up to a factor of log n), and that determinant is no harder than
these.


In parallel computation, the picture is actually a little simpler. Matrix multiplication is in NC^1
(right from the product definition, since we can use a tree of depth log n to sum the n terms of a
row-column inner product). Matrix inversion and determinant are in NC^2, due to Csanky [24] (over
fields of characteristic 0) (and using fewer processors in RNC^2 by Pan [72]); the problem is also in
NC over general fields [12, 19].
Csanky’s algorithm builds on the result of Valiant, Skyum, Berkowitz and Rackoff [87] that any
deterministic sequential algorithm computing a polynomial of degree d in time m can be converted
to a deterministic parallel algorithm computing the same polynomial in time O((log d)(log d +
log m)) using O(d^6 m^3) processors.
For a good explanation of Csanky’s algorithm see [59] §31, and for more on parallel algorithms see
[61].

2.4.4 Finding perfect matchings in general graphs. The isolating lemma


We now develop a randomized method of Mulmuley, U. Vazirani and V. Vazirani [69] to find a
perfect matching if one exists. A polynomial time algorithm is implied by the previous testing
method along with self-reducibility of the perfect matching decision problem. However, with the
following method we can solve the same problem in parallel, that is to say, in polylog depth on
polynomially many processors. This is not actually the first RNC algorithm for this task—that is an
RNC^3 method due to [54]—but it is the “most parallel” since it solves the problem in RNC^2.
A slight variant of the method yields a minimum weight perfect matching in a weighted graph that
has “small” weights, that is, integer weights represented in unary; and there is a fairly standard
reduction from the problem of finding a maximum matching to finding a minimum weight perfect
matching in a graph with weights in {0, 1}. So through this method we can actually find a maximum
matching in a general graph, with a similar total amount of work.
There are really two key ingredients to this algorithm. The first, which we have already noted, is
that all basic linear algebra problems can be solved in NC^2.
The second ingredient, which will be our focus, is the following lemma. First some notation. Let
A = { a1 , a2 , . . . , am } be a finite set. If a1 , . . . , am are assigned weights w1 , . . . , wm , the weight of set
S ⊆ A is defined to be w(S) = ∑ ai ∈S wi . Let S = {S1 , . . . , Sk } be a collection of subsets of A. Let
min(S : w) ⊆ S be the collection of those S of least weight in S . We are interested in the event that
the least weight is uniquely attained, i.e., the event that | min(S : w)| = 1.

Lemma 38 ([69] Isolating Lemma, based on improved version in [84]) Let the weights w1 , . . . , wm be
independent random variables, each wi being sampled uniformly in a set Ri ⊆ R, | Ri | ≥ r ≥ 2. Then

Pr(|min(S : w)| = 1) ≥ (1 − 1/r)^m.    (2.5)

This lemma is remarkable because of the absence of a dependence on k, the size of the family, in
the conclusion.
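A quick Monte Carlo illustration (not a proof) of the lemma: even for a large random family S of subsets of A, the empirical frequency of a unique minimum stays above the (1 − 1/r)^m bound. All the parameters below are arbitrary choices.

import random

def isolation_frequency(m=12, r=24, num_sets=500, trials=2000):
    # S is a fixed random family of subsets of A = {0,...,m-1};
    # estimate Pr over the weights that the min-weight member is unique.
    universe = range(m)
    family = [frozenset(a for a in universe if random.random() < 0.5)
              for _ in range(num_sets)]
    family = [S for S in set(family) if S]            # distinct, nonempty
    unique = 0
    for _ in range(trials):
        w = [random.randint(1, r) for _ in range(m)]  # weights uniform in {1,...,r}
        weights = [sum(w[a] for a in S) for S in family]
        unique += weights.count(min(weights)) == 1
    return unique / trials

print(isolation_frequency(), (1 - 1/24)**12)          # empirical vs. the lower bound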


2.5 Lecture 12 (29/Oct.): Isolating lemma, finding a perfect matching in parallel

2.5.1 Proof of the isolating lemma


Proof: Write Ri = {ui(1), . . . , ui(|Ri|)} with ui(j) < ui(j + 1) for all 1 ≤ j ≤ |Ri| − 1.
Think of u as the mapping ∏ ui, where ui(j) is the evaluation of the function ui at 1 ≤ j ≤ |Ri|.
Let V = ∏_1^m {1, . . . , |Ri|} and V′ = ∏_1^m {2, . . . , |Ri|}. Any composition u ◦ v is a weight function on
A, and if v ∈ V′ then this weight function avoids using the weights ui(1).
Note |V′|/|V| = ∏_1^m (1 − 1/|Ri|) ≥ (1 − 1/r)^m.
Given v ∈ V′, fix a set T ∈ min(S : u ◦ v) of largest cardinality. Define φ : V′ → V by

φv(i) = v(i) − 1 if i ∈ T, and φv(i) = v(i) otherwise.

We claim that min(S : u ◦ φv ) = { T } and that φ is a bijection. Observe that for any S ∈ S ,

(u ◦ v)(S) − (u ◦ φv)(S) = ∑_{i ∈ S∩T} (ui(v(i)) − ui(v(i) − 1)),

with every summand on the RHS being positive. In particular (u ◦ v)(T) − (u ◦ φv)(T) is the largest
change in weight possible for any S, and is achieved by S only if T ⊆ S.
Because T has largest cardinality among sets in min(S : u ◦ v), no other set of min(S : u ◦ v) can
contain T, and therefore T decreases its weight by strictly more than any other set of min(S : u ◦ v).
Other sets of S might have their weight decrease by the same amount as T, but not more. So,
min(S : u ◦ φv ) = { T } as desired.
Consequently also T can be identified as the unique element of min(S : u ◦ φv). So φ
can be inverted. (Keep in mind, at different v in the domain of φ, different sets T get used, so in
order to invert φ we need to be able to identify T just from seeing the mapping φv. And of course
u, which is fixed.) Thus |φ(V′)| = |V′|. So, with v sampled uniformly in V,

Pr(|min(S : u ◦ v)| = 1) ≥ Pr(v ∈ φ(V′)) = Pr(v ∈ V′) ≥ (1 − 1/r)^m.    2

2.5.2 Finding a perfect matching, in RNC


Now we describe the algorithm to find a perfect matching (or report that probably none exists) in a
graph G = (V, E) with n = |V |, m = | E|.
For every {i, j} ∈ E pick an integer weight wij iid uniformly distributed in {1, . . . , 2m}. By the
isolating lemma, if G has any perfect matchings, then with probability at least (1 − 1/(2m))^m ≥ 1/2 it
obtains a unique minimum weight perfect matching.
Define the matrix T by:

Tij = 0          if {i, j} ∉ E
Tij = 2^{wij}    if {i, j} ∈ E, i < j      (2.6)
Tij = −2^{wji}   if {i, j} ∈ E, i > j

This is an instantiation of the Tutte matrix, with xij = 2^{wij}.


Claim 39 If G has a unique minimum weight perfect matching (call it M, of weight w(M)) then
Det(T) ≠ 0 and moreover, Det(T) = 2^{2w(M)} × [an odd number].

Proof: of Claim: As before we look at the contributions to Det( T ) of all the permutations π that are
supported by edges of the graph. The contributions from permutations having odd cycles cancel
out—that is just because this is a special case of a Tutte matrix.
It remains to consider permutations π that have only even cycles.

• If π consists of transpositions along the edges of M then it contributes 2^{2w(M)}.

• If π has only even cycles, but does not correspond to M, then:

  – If π is some other matching M′ of weight w(M′) > w(M) then it contributes 2^{2w(M′)}.
  – If π has only even cycles and at least one of them is of length ≥ 4, then by separating
    each cycle into a pair of matchings on the vertices of that cycle, π is decomposed into
    two matchings M1 ≠ M2 of weights w(M1), w(M2), so π contributes ±2^{w(M1)+w(M2)}.
    Because of the uniqueness of M not both of M1 and M2 can achieve weight w(M), so
    w(M1) + w(M2) > 2w(M). 2

Now let T̂ij be the (i, j)-deleted minor of T (the matrix obtained by removing the i’th row and j’th
column from T), and set

mij = ∑_{π: π(i)=j} sign(π) ∏_{k=1}^{n} T_{k,π(k)} = ±2^{wij} Det(T̂ij)    (2.7)

Claim 40 For every {i, j} ∈ E:

1. The total contribution to mij of permutations π having odd cycles is 0.

2. If there is a unique minimum weight perfect matching M, then:

   (a) If {i, j} ∈ M then mij/2^{2w(M)} is odd.
   (b) If {i, j} ∉ M then mij/2^{2w(M)} is even.

Proof: of Claim: This is much like our argument for Det( T ) but localized.

1. If π has an odd cycle then it has an even number of odd cycles and hence an odd cycle not
containing point i. Pick the “first” odd cycle that does not contain point i and flip it to obtain
a permutation π r . Note that (π r )r = π. The contribution of π r to mij is the negation of the
contribution of π to mij , because we have replaced an odd number of terms from the Tutte
matrix by the same entry with a flipped sign.

2. By the preceding argument, whether or not {i, j} ∈ M, we need only consider permutations
containing solely even cycles. Just as argued for Claim 39, the contribution of every such
permutation π can be written as ±2^{w(M1)+w(M2)}, where M1 and M2 are two perfect matchings
obtained as follows: each transposition (i′, j′) in π puts the edge {i′, j′} into both of the
matchings; each even cycle of length ≥ 4 can be broken alternatingly into two matchings, one
of which (arbitrarily) is put into M1 and one into M2.


The only case in which there is a term for which w( M1 ) + w( M2 ) = 2w( M ) is the single case
that {i, j} ∈ M and π consists entirely of transpositions along the edges of M. In every other
case, at least one of M1 or M2 is distinct from M, and therefore w( M1 ) + w( M2 ) > 2w( M).
The claim follows. 2

Finally we collect all the elements necessary to describe the algorithm:

1. Generate the weights wij independently and uniformly in {1, . . . , 2m}.

2. Define T as in Eqn (2.6), compute its determinant and if it is nonsingular invert it. (Otherwise,
   start over.) This determinant computation and the inversion can be done (deterministically)
   in depth O(log² n) as discussed earlier.

3. Determine w(M) by factoring the greatest power of 2 out of Det(T).

4. Obtain the values ±mij from the equations mij = ±2^{wij} Det(T̂ij) and

   Det(T̂ij) = (−1)^{i+j} (T^{−1})_{ji} Det(T)    (Cramer’s rule)

   If mij/2^{2w(M)} is odd then place {i, j} in the matching.


5. Check whether this defines a perfect matching. This is guaranteed if the minimum weight
perfect matching is unique. If a perfect matching was not obtained (which will occur for
sure if there is no perfect matching, but with probability ≤ 1/2 if there is one), generate new
weights and repeat the process.

Of course, if the graph has a perfect matching, the probability of incurring k repetitions without
success is bounded by 2^{−k}, and the expected number of repetitions until success is at most 2.
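Here is a sequential sketch of steps 1–5 (it does not capture the parallelism; exact integer determinants are obtained via rational Gaussian elimination, and the demo graph is an arbitrary example):

import random
from fractions import Fraction

def det_exact(M):
    # exact determinant by Gaussian elimination over the rationals
    A = [[Fraction(x) for x in row] for row in M]
    n, det = len(A), Fraction(1)
    for c in range(n):
        piv = next((r for r in range(c, n) if A[r][c] != 0), None)
        if piv is None:
            return 0
        if piv != c:
            A[c], A[piv] = A[piv], A[c]
            det = -det
        det *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return int(det)

def v2(x):
    # 2-adic valuation of a nonzero integer
    x, k = abs(x), 0
    while x % 2 == 0:
        x, k = x // 2, k + 1
    return k

def mvv_perfect_matching(n, edges):
    m = len(edges)
    for _ in range(50):                                    # expected <= 2 rounds suffice
        w = {e: random.randint(1, 2 * m) for e in edges}   # step 1
        T = [[0] * n for _ in range(n)]
        for (i, j) in edges:                               # step 2: the matrix of Eqn (2.6)
            T[i][j], T[j][i] = 2 ** w[(i, j)], -(2 ** w[(i, j)])
        d = det_exact(T)
        if d == 0:
            continue
        wM = v2(d) // 2                                    # step 3
        M = []
        for (i, j) in edges:                               # step 4: parity of m_ij / 2^{2w(M)}
            minor = [[T[r][c] for c in range(n) if c != j] for r in range(n) if r != i]
            mij = (2 ** w[(i, j)]) * det_exact(minor)
            if mij != 0 and v2(mij) == 2 * wM:
                M.append((i, j))
        covered = {v for e in M for v in e}
        if len(M) == n // 2 and len(covered) == n:         # step 5: check, else retry
            return M
    return None

cycle6 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]  # a 6-cycle
print(mvv_perfect_matching(6, cycle6))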
The simultaneous computation of all the mij’s via the single matrix inversion in step 2 is key to the
efficiency of this procedure.
The numbers in the matrix T are integers bounded by ±2^{2m}. Pan’s RNC^2 matrix inversion algorithm
will compute T^{−1} using O(n^{3.5} m) processors.
For the maximum matching problem, we use a simple reduction: use weights for each of the non-
edges too, but sample those weights uniformly from 2mn + 1, . . . , 2mn + 2m (rather than 1, . . . , 2m
like the graph edges). Then a minimum weight perfect matching uses as few of the non-edges as possible,
i.e., it corresponds to a maximum matching of G.
The cost of this reduction is that the integers in the matrix now use O(mn) rather than O(m) bits,
so the number of processors used by the maximum matching algorithm is O(n^{4.5} m).

Chapter 3

Concentration of Measure

3.1 Lecture 13 (31/Oct): Independent rvs, Chernoff bound, applications

3.1.1 Independent rvs


Lemma 41 If X1 , . . . , Xn are independent real rvs with finite expectations (recall this assumption requires
that the integrals converge absolutely), then

E ( ∏ Xi ) = ∏ E ( Xi ) .
This is a consequence of the fact that the probabilities of independent events multiply; one only has
to be careful about the measure theory. It is enough to consider the case n = 2 and proceed by
induction. Recall the definition of expectation from Eqn 1.7:

E(X) = lim_{h→0} ∑_{integer j, −∞ < j < ∞} j·h · Pr(jh ≤ X < (j + 1)h)

and apply

Pr((jh ≤ X < (j + 1)h) ∧ (j′h ≤ Y < (j′ + 1)h))
= Pr(jh ≤ X < (j + 1)h) · Pr(j′h ≤ Y < (j′ + 1)h)

for independent X, Y.
If you want to do the measure theory carefully, this boils down to the Fubini Theorem.

3.1.2 Chernoff bound for uniform Bernoulli rvs (symmetric random walk)

The Chernoff bound1 will be one of two ways in which we’ll display the concentration of measure
phenomenon, the other being the central limit theorem. In the types of problems we’ll be looking
at the Chernoff bound is the more frequently useful of the two but they’re closely related.
Let’s begin with the special case of iid fair coins, aka iid uniform Bernoulli rvs: P( Xi = 1) =
1/2, P( Xi = 0) = 1/2. Put another way, we have n independent events, each of which occurs with
1 First due to Bernstein [14, 15, 13] but we follow the standard naming convention in Computer Science.


probability 1/2. We want an exponential tail bound on the probability that significantly more than
half the events occur. This very short argument is the seed of more general or stronger bounds that
we will see later.
It will be convenient to use the rvs Yi = 2Xi − 1, where Xi is the indicator rv of the ith event. This
shift lets us work with mean-0 rvs. This leaves the Yi independent; that is a special case of the
following lemma, which is an immediate consequence of the definitions in Sec. 1.2:
Lemma 42 If f 1 , . . . are measurable functions and X1 , . . . are independent rvs then f 1 ( X1 ), . . . are indepen-
dent rvs.
(Proof omitted.)

Theorem 43 Let Y1, . . . , Yn be iid rvs, with Pr(Yi = −1) = Pr(Yi = 1) = 1/2. Let Y = ∑_1^n Yi. Then
Pr(Y > λ√n) < e^{−λ²/2} for any λ > 0.
The significance of √n here is that it is the standard deviation of Y (i.e., √Var(Y)), because (a)
E(Yi²) = 1 (easy), and (b):
Exercise:
If Z1 , . . . , Zn are independent (actually pairwise independent is enough) real rvs with well defined
first and second moments, then
Var(∑ Zi ) = ∑ Var( Zi ). (3.1)

Consequently, we already know from the Chebyshev bound, Lemma 13, to “expect” √n to be the
regime where we start to get a meaningful deviation bound.
Proof: Fix any α > 0. Exercise:²

E(e^{αYi}) = cosh α ≤ e^{α²/2}.

By independence of the rvs e^{αYi},

E(e^{αY}) = ∏ E(e^{αYi}) ≤ e^{nα²/2}.

Pr(Y > λ√n) = Pr(e^{αY} > e^{αλ√n})
            ≤ E(e^{αY})/e^{αλ√n}        (Markov ineq.)
            ≤ e^{nα²/2 − αλ√n}

We now optimize this bound by making the choice α = λ/√n, and obtain:

Pr(Y > λ√n) ≤ e^{−λ²/2}.
2
Here’s another way to think about this calculation: Let s_x(y) be the step function s_x(y) = 1 for
y > x, s_x(y) = 0 for y ≤ x. Note, for any α > 0, s_x(y) ≤ exp(α(y − x)), which is to say, the
exponential kernel of integration is greater than the threshold kernel of integration. (See Fig. 3.1.)

Pr(Y > λ√n) = E(s_{λ√n}(Y))
            ≤ E(exp(α(Y − λ√n)))
            = ∏_{i=1}^{n} E(exp(α(Yi − λ/√n)))
            = (e^{−αλ/√n} cosh α)^n

² For k ≥ 0, (2k)! = ∏_{i=1}^{k} i(k + i) ≥ 2^k k!, so for any real x, e^{x²/2} = ∑_{k≥0} x^{2k}/(2^k k!) ≥ ∑_{k≥0} x^{2k}/(2k)! = cosh x.


[Figure 3.1: Integrating a probability mass against two different nonnegative kernels: the threshold kernel and the (dominating) exponential kernel.]


We get the best upper bound by minimizing the base b of this exponent. If we pick α = λ/√n,
which doesn’t exactly optimize the bound but comes close, we get b = e^{−λ²/n} cosh(λ/√n) ≤
e^{−λ²/n} e^{λ²/2n} = e^{−λ²/2n}. Then substituting back we get

Pr(Y > λ√n) ≤ e^{−λ²/2}.
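A quick empirical comparison (illustration only) of the actual tail of the symmetric random walk against the bound e^{−λ²/2}:

import random
from math import exp, sqrt

n, trials, lam = 200, 5000, 2.0
threshold = lam * sqrt(n)
hits = sum(sum(random.choice((-1, 1)) for _ in range(n)) > threshold for _ in range(trials))
print("empirical tail:", hits / trials)          # roughly 0.02 for lambda = 2
print("Chernoff bound:", exp(-lam * lam / 2))    # about 0.135, a valid but loose bound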

3.1.3 Application: set discrepancy


For a function χ : {1, . . . , n} → {1, −1} and a subset S of {1, . . . , n}, let χ(S) = ∑i∈S χ(i ). Define
the discrepancy of χ on S to be Disc(S, χ) = |χ(S)|, and the discrepancy of χ on a collection of sets
S = {S1 , . . . , Sn } to be Disc(S, χ) = max j |χ(S j )|.


Theorem 44 (Spencer) With the definitions above, ∃χ with Disc(S, χ) ∈ O(√n).

We won’t provide Spencer’s argument, but the starting point for it is the proof of the following
weaker statement.

Theorem 45 With the definitions above, a function χ selected u.a.r. has Disc(S, χ) ∈ O(√(n log n)) with
positive probability.

Proof: By Theorem (43), for any particular set Sj (noting that |Sj| ≤ n),

Pr(|χ(Sj)| > c√(n log n)) = Pr(|χ(Sj)| > (c√(n log n)/√|Sj|) · √|Sj|)
                          ≤ 2 e^{−c² n log n / (2|Sj|)}
                          ≤ 2 e^{−(c²/2) log n}
                          = 2 n^{−c²/2}.


Now take a union bound over the sets.

Pr(∃ j : |χ(Sj)| > c√(n log n)) ≤ n Pr(|χ(Sj)| > c√(n log n))
                               < 2 n^{1−c²/2}.

Plug in any c > √2 to show the theorem for sufficiently large values of n. 2
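A small illustration of Theorem 45: sample one random coloring χ and one random set system, and compare the resulting discrepancy with the √(n log n) scale (the set system and the constant below are arbitrary illustrative choices):

import random
from math import log, sqrt

n = 500
sets = [{i for i in range(n) if random.random() < 0.5} for _ in range(n)]  # n random sets
chi = [random.choice((-1, 1)) for _ in range(n)]                           # random coloring
disc = max(abs(sum(chi[i] for i in S)) for S in sets)
print("Disc =", disc, " vs  sqrt(2*n*log(n)) =", round(sqrt(2 * n * log(n)), 1))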

3.1.4 Entropy and Kullback-Leibler divergence


When we introduced BPP we specified that at the end of the poly-time computation, strings in the
language should be accepted with probability ≥ 2/3, and strings not in the language should be
accepted with probability ≤ 1/3. We also noted that these values were immaterial and did not even
need to be constants—we need only that they be separated by some 1/poly. We’ll shortly see why.
We start by defining two important functions.

Definition 46 The entropy (base 2) of a probability distribution {p1, . . . , pn} is h2(p) = ∑ pi lg(1/pi).
In natural units we use h(p) = ∑ pi log(1/pi).

Definition 47 Let r = (r1, . . . , rn) and s = (s1, . . . , sn) be two probability distributions and suppose
si > 0 ∀i. The (base 2) Kullback–Leibler divergence D2(r‖s) “from s to r,” or “of r w.r.t. s,” is defined
by

D2(r‖s) = ∑_i ri lg(ri/si)

This is also known as information divergence, directed divergence or relative entropy³. In natural
log units the divergence is D(r‖s) = ∑_i ri log(ri/si), and we also use this notation when the base doesn’t
matter. D(r‖s) is not a metric (it isn’t symmetric and doesn’t satisfy the triangle inequality) but it
is nonnegative, and zero only if the distributions are the same.
Exercise:

(a) D(r‖s) ≥ 0 ∀r, s

(b) D(r‖s) = 0 ⇒ r = s

(c) D(s + ε‖s) = ∑_i ( ε_i²/(2s_i) + O(ε_i³) )

(d) for n = 2, D((s_1 + ε, 1 − s_1 − ε)‖(s_1, 1 − s_1)) is increasing in |ε|

The “‖” notation is strange but is the convention.


From (c) and (d) we have that for n = 2, D((s_1 + ε, 1 − s_1 − ε)‖(s_1, 1 − s_1)) ∈ Ω(ε²) (with the constant depending on s_1).
When s is the uniform distribution, we have:

D(r‖uniform) = ∑ r_i log(n r_i) = log n + ∑ r_i log r_i = log n − h(r)


So D(r‖uniform) can be thought of as the entropy deficit of r, compared to the uniform distribution.
In the case n = 2 we will write p rather than (p, 1 − p), thus: h_2(p) = p lg(1/p) + (1 − p) lg(1/(1 − p)), D_2(p‖q) = p lg(p/q) + (1 − p) lg((1 − p)/(1 − q)).
3 D is useful throughout information theory and statistics (and is closely related to “Fisher information”). See [23].
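For experimentation, the two definitions translate directly into code. A minimal sketch assuming Python with numpy (the helper names h2 and D2 are just illustrative):

import numpy as np

def h2(p):
    # base-2 entropy of a distribution given as an array of probabilities
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(np.sum(nz * np.log2(1.0 / nz)))

def D2(r, s):
    # base-2 divergence D_2(r || s); assumes s_i > 0 wherever r_i > 0
    r, s = np.asarray(r, dtype=float), np.asarray(s, dtype=float)
    nz = r > 0
    return float(np.sum(r[nz] * np.log2(r[nz] / s[nz])))

print(h2([0.5, 0.5]))                  # 1.0
print(D2([0.9, 0.1], [0.5, 0.5]))      # = 1 - h2([0.9, 0.1]): the entropy deficit w.r.t. uniform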


[Figure 3.2: Comparing the two Chernoff bounds at q = 1/2: the curves x²/2 and D((1 + x)/2 ‖ 1/2), plotted for x ∈ [−1, 1].]

3.2 Lecture 14 (2/Nov): Stronger Chernoff bound, applications

3.2.1 Chernoff bound using divergence; robustness of BPP


Let’s extend and improve the previous large deviation bound for symmetric random walk. The new bound is almost the same for relatively mild deviations (just a few standard deviations) but is much stronger at many (especially, Ω(√n)) standard deviations. It also does not depend on the coins being fair.

Theorem 48 If X_1, . . . , X_n are iid coins each with probability q of being heads, the probability that the number of heads, X = ∑ X_i, is > pn (for p ≥ q) or < pn (for p ≤ q), is < 2^{−nD_2(p‖q)} = exp(−nD(p‖q)).

Exercise: Derive from the above one side of Stirling’s approximation for the binomial coefficient (n choose pn).
Note 1: this improves on Thm 43 even at q = 1/2 because the inequality cosh α ≤ exp(α²/2) that we used before, though convenient, was wasteful. (But the two bounds converge for p in the neighborhood of q.) Specifically we have (see Figure 3.2):

D(p‖1/2) ≥ (2p − 1)²/2                                            (3.2)

Note 2: The divergence is the correct constant in the above inequality; and this remains the case even
when we “reasonably” extend this inequality to alphabets larger than 2—that is, dice rather than
coins; see Sanov’s Theorem [23, 75]. There are of course lower-order terms that are not captured by
the inequality.
Note 3: Let’s see what we mean by “concentration of measure”. Clearly, the Chernoff bound is
telling us that something, namely the rv X, is very tightly concentrated about a particular value.
On the other hand, if you look at the full underlying rv, namely the vector ( X1 , . . . , Xn ), that is not
concentrated at all; if say q = 1/2, then it is actually as smoothly distributed as it could be, being
uniform on the hypercube! The concentration of measure phenomenon, then, is a statement about
low-dimensional representations of high-dimensional objects. In fact the “representation” does not have to be a nice linear function like X = ∑ X_i. It is sufficient that f(X_1, . . . , X_n) be a Lipschitz function, namely that there be some constant bound c s.t. flipping any one of the X_i’s changes the function value
by no more than c. From this simple information you can already get a large deviation bound on f
for independent inputs Xi .


Proof: Consider the case p ≥ q; the other case is similar. Set Y_i = X_i − q and Y = ∑ Y_i. Now for α > 0,

Pr(Y > n(p − q)) = Pr(e^{αY} > e^{αn(p−q)})
                 < E(e^{αY}) / e^{αn(p−q)}                                        (Markov)
                 = ( ((1 − q)e^{−αq} + q e^{α(1−q)}) / e^{α(p−q)} )^n             (Independence)

Set α = log[ p(1 − q) / ((1 − p)q) ]. Continuing,

                 = ( (q/p)^p ((1 − q)/(1 − p))^{1−p} )^n
                 = e^{−nD(p‖q)}

This is saying that the probability of a coin of bias q empirically “masquerading” as one of bias at
least p > q, drops off exponentially, with the coefficient in the exponent being the divergence.

Back to BPP

Suppose we start with a randomized polynomial-time decision algorithm for a language L which for x ∈ L reports “Yes” with probability at least p, and for x ∉ L reports “Yes” with probability at most q, for p = q + 1/f(n) for some f(n) ∈ n^{O(1)}.
Also, D(q + ε‖q) is monotone in each of the regions ε > 0, ε < 0.
So if we perform O(n f²(n)) repetitions of the original BPP algorithm, and accept x iff the fraction of “Yes” votes is above (p + q)/2, then the probability of error on any input is bounded by exp(−n).
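A small numerical illustration of this amplification (a sketch assuming Python; the acceptance probabilities, the gap 1/f(n), and the repetition counts are made-up values):

import math

def D(a, b):
    # divergence (natural log) between Bernoulli(a) and Bernoulli(b)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

q, gap = 0.50, 0.01                      # "Yes" prob: at most q off-language, at least p = q + gap on-language
p, thr = q + gap, q + gap / 2            # accept iff the empirical "Yes" fraction exceeds thr = (p+q)/2
for m in (10_000, 100_000, 1_000_000):   # number of repetitions, of order n*f(n)^2
    err = math.exp(-m * min(D(thr, q), D(thr, p)))   # Theorem 48, worst case over the two kinds of input
    print(f"m = {m:>9}: error bound {err:.3e}")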

3.2.2 Balls and bins


Suppose you throw n balls, uniformly iid, into n bins. What is the highest bin occupancy?
Let Ai = # balls in bin i. Claim: ∀c > 1, Pr(max Ai > c log n/ log log n) ∈ o (1).
To avoid a morass of iterated logarithms, write L = log n, L2 = log log n, L3 = log log log n. So we
wish to show Pr(max Ai > cL/L2 ) ∈ o (1).
Proof: by the union bound,

Pr(max A_i > cL/L2) ≤ n Pr(A_1 > cL/L2)
                    ≤ n exp( −n D( cL/(nL2) ‖ 1/n ) )
                    = n ( (L2/(cL))^{cL/(nL2)} ( (1 − 1/n)/(1 − cL/(nL2)) )^{1 − cL/(nL2)} )^n
                    ≤ n ( (L2/(cL))^{cL/(nL2)} ( 1/(1 − cL/(nL2)) )^{1 − cL/(nL2)} )^n


[Figure 3.3: (1/(1 − p))^{1−p} vs. 1 + p, plotted for p ∈ [0, 1].]

Expand the first term and apply the inequality⁴ (1/(1 − p))^{1−p} ≤ e^p (0 ≤ p < 1) to the second term:

. . . ≤ exp( L + (cL/L2)(L3 − L2 − log c) + cL/L2 )
      = exp( (1 − c)L + cL·(L3 − log c + 1)/L2 )
      ≤ exp( (1 − c + o(1)) L ) = n^{1−c+o(1)},

which is o(1) for any c > 1.

Omitted: Show that a constant fraction of bins are unoccupied.
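Both the max-load claim and the omitted remark are easy to check empirically (a minimal sketch assuming Python with numpy; the values of n are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
for n in (10**3, 10**5, 10**7):
    loads = np.bincount(rng.integers(0, n, size=n), minlength=n)   # throw n balls into n bins u.a.r.
    print(f"n={n:>9}: max load {loads.max()},  log n / log log n = {np.log(n)/np.log(np.log(n)):.2f},"
          f"  empty fraction {np.mean(loads == 0):.3f}")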


A much more precise analysis of this balls-in-bins process is available, due to G. Gonnet [40].

3.2.3 Preview of Shannon’s coding theorem


This is an exceptionally important application of large deviation bounds. Consider one party (Alice)
who can send a bit per second to another party (Bob). She wants to send him a k-bit message. How-
ever, the channel between them is noisy, and each transmitted bit may be flipped, independently,
with probability p < 1/2. What can Alice and Bob do? You can’t expect them to communicate
reliably at 1 bit/second anymore, but can they achieve reliable communication at all? If so, how
many bits/second can they achieve? This question turns out to have a beautiful answer that is the
starting point of modern communication theory.
Before Shannon came along, the only answer to this question was, basically, the following naïve strategy: Alice repeats each bit some ℓ times. Bob takes the majority of his ℓ receptions as his best guess for the value of the bit.
We’ve already learned how to evaluate the quality of this method: Bob’s error probability on each bit is bounded above by, and roughly equal to, exp(−ℓ D(1/2 ‖ p)). In order for all bits to arrive correctly, then, Alice must use ℓ proportional to log k. This means the rate of the communication, the number of message bits divided by elapsed time, is tending to 0 in the length of the message (scaling as 1/log k). And if Alice and Bob want to have exponentially small probability of error exp(−k), she would have to employ ℓ ∼ k, so the rate would be even worse, scaling as 1/k.
Shannon showed that in actual fact one does not need to sacrifice rate for reliability. This was a
great insight, and we will see next time how he did it. Roughly speaking—but not exactly—his
4 In fact we have the stronger (1/(1 − p))^{1−p} ≤ 1 + p (see Fig. 3.3) although we don’t need this. Let α = log(1/(1 − p)), so α ≥ 0. Then p = 1 − e^{−α} and we are to show that 2 ≥ e^{−α} + e^{αe^{−α}} =: f(α). f(0) = 2 and f′ = e^{−α}((1 − α)e^{αe^{−α}} − 1), so (since (1 − α)(1 + α) ≤ 1) it suffices to show for α ≥ 0 that g(α) := e^{αe^{−α}} ≤ 1 + α. At α = 0 this is satisfied with equality so it suffices to show that 1 ≥ g′ = (1 − α)e^{−α} g. Since 1 − α ≤ e^{−α}, it suffices to show that 1 ≥ e^{−2α} g = e^{α(e^{−α} − 2)}, which holds (with room to spare) because e^{−α} ≤ 1.


argument uses a randomly chosen code. He achieves error probability exp(−Ω(k)) at a constant
communication rate. What is more, the rate he achieves is arbitrarily close to the theoretical limit.


3.3 Lecture 15 (5/Nov): Application of large deviation bounds: Shannon’s coding theorem. Central limit theorem
3.3.1 Shannon’s block coding theorem. A probabilistic existence argument.
In order to communicate reliably, Alice and Bob are going to agree in advance on a codebook, a set of
codewords that are fairly distant from each other (in Hamming distance), with the idea that when
a corrupted codeword is received, it will still be closer to the correct codeword than to all others. In
this discussion we completely ignore a key computational issue: how are the encoding and decoding
maps computed efficiently? In fact it will be enough for us, for a positive result, to demonstrate
existence of an encoding map E : {0, 1}k → {0, 1}n and a decoding map D : {0, 1}n → {0, 1}k
(we’ll call this an (n, k) code) with the desired properties; we won’t even explicitly describe what
the maps are, let alone specify how to efficiently compute them. We will call k/n the rate of such a
code. Shannon’s achievement was to realize (and show) that you can simultaneously have positive
rate and error probability tending to 0—in fact, exponentially fast.

Theorem 49 (Shannon [78]) Let p < 1/2. For any ε > 0, for all k sufficiently large, there is an (n, k) code with rate ≥ D_2(p‖1/2) − ε and error probability e^{−Ω(k)} on every message. (The constant in the Ω depends on p and ε.)

In this theorem statement, “Error” means that Bob decodes to anything different from X, and error
probabilities are taken only with respect to the random bit-flips introduced by the channel.
Proof: Let

n = k / (D_2(p‖1/2) − ε)                                          (3.3)
(ignoring rounding). Let R ∈ {0, 1}n denote the error string. So, with Y denoting the received
message,
Y = E (X) + R
with X uniform in {0, 1}k , and R consisting of iid Bernoulli rvs which are 1 with probability p. The
error event is that D(E ( X ) + R) 6= X.
As a first try, let’s design E by simply mapping each X ∈ {0, 1}k to a uniformly, independently
chosen string in {0, 1}n . (This won’t be good enough for the theorem.)
So (for now) when we speak of error probability, we have two sources of randomness: channel noise
R, and code design E .
To describe the decoding procedure we start with the notion of Hamming distance H. The Ham-
ming distance H ( x, y) between two same-length strings over a common alphabet Σ, is the number
of indices in which the strings disagree: H ( x, y) = |{i : xi 6= yi }| for x, y, ∈ Σn . Define the decoding
D to map Y to a closest codeword in Hamming distance.
For most of the remainder of the proof (in particular until after the lemma), we fix a particular
message X, and analyze the probability that it is decoded incorrectly.
In order to speak separately about the two sources of error, we define MX to be the rv (which is a
function of E ) MX = PrR (Error on X |E ). So for any E , 0 ≤ MX ≤ 1.
In order to analyze how well this works, we pick δ sufficiently small that
p + δ < 1/2 (3.4)
and
D_2(p + δ‖1/2) > D_2(p‖1/2) − ε/2.                                (3.5)
Note that if both


1. H(E(X) + R, E(X)) < (p + δ)n (“channel noise is low”), and

2. ∀ X′ ≠ X : H(E(X) + R, E(X′)) > (p + δ)n (“code design is good for X, R”)

then Bob will decode correctly.

The contrapositive is that if Bob decodes X incorrectly then at least one of the following events has to have occurred:

Bad1 : H(E(X) + R, E(X)) ≥ (p + δ)n
Bad2 : ∃ X′ ≠ X : H(E(X) + R, E(X′)) ≤ (p + δ)n

Lemma 50 ∃c > 0 s.t. E_E(M_X) < 2^{1−cn}

Proof: Specifically we show this for c = min{D_2(p + δ‖p), ε/2}. In what follows when we write a bound on Pr_W(. . .) we mean that “conditional on any setting to the other random variables, the randomness in W is enough to ensure the bound”.

E_E(M_X) ≤ Pr_R(Bad1) + ∑_{X′≠X} Pr_E(Bad2)
         ≤ Pr_R( H(~0, R) ≥ (p + δ)n ) + ∑_{X′≠X} Pr_{E(X),E(X′)}( H(R, E(X′) − E(X)) ≤ (p + δ)n )
         ≤ 2^{−nD_2(p+δ‖p)} + 2^{k − nD_2(p+δ‖1/2)}
         = 2^{−nD_2(p+δ‖p)} + 2^{n(D_2(p‖1/2) − ε − D_2(p+δ‖1/2))}        (substituting value of k)
         ≤ 2^{−nD_2(p+δ‖p)} + 2^{−εn/2}                                   (using inequality (3.5))
         ≤ 2^{1−cn}                                                        (using value of c)

2
All of the above analysis treated an arbitrary but fixed message X. We showed that, picking the
code at random, the expected value of MX = PrR (Error on X |E ) is small.
Let Z be the rv which is the fraction of X’s for which MX ≤ 2E( MX ). By the Markov inequality, ∃E
s.t. Z ≥ 1/2. Let E ∗ be a specific such code.
E ∗ works well for most messages X, but this isn’t quite what we want—we want MX to be small
for all messages X.
There is a simple solution. Choose a code E ∗ as above for k + 1 bits, then map the k-bit messages
to the good half of the messages. Note that removal of some codewords from E ∗ can only decrease
any MX . (Assuming we still use closest-codeword decoding.)
So now the bound Pr_R(Error on X) ≤ 2E(M_X) ≤ 2^{2−cn} applies to all X.
The asymptotic rate is unaffected by this trick; the error exponent is also unaffected.
To be explicit, using E∗ designed for k + 1 bits and with n = (k + 1)/(D_2(p‖1/2) − ε), we have for all X ∈ {0, 1}^k

Pr_R(Error on X) ≤ 2^{2−cn}

Thus no matter what message Alice sends, Bob’s probability of error is exponentially small. 2
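The existence argument can be mimicked experimentally with a tiny random codebook and closest-codeword decoding (a brute-force sketch assuming Python with numpy; the parameters k, n, p and the tie-breaking convention are arbitrary choices, and nothing here is computationally efficient):

import numpy as np

rng = np.random.default_rng(3)
k, n, p, trials = 6, 40, 0.05, 2000
code = rng.integers(0, 2, size=(2**k, n))          # the encoding map E: a uniformly random codebook
X = 0                                              # fix one message (its index in the codebook)
errors = 0
for _ in range(trials):
    R = (rng.random(n) < p).astype(int)            # iid Bernoulli(p) channel noise
    Y = code[X] ^ R                                # received word
    dists = (code ^ Y).sum(axis=1)                 # Hamming distance to every codeword
    if dists.argmin() != X:                        # decode to a closest codeword (ties broken by argmin)
        errors += 1
print("empirical Pr(error on X):", errors / trials)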


3.3.2 Central limit theorem


As I mentioned earlier in the course, there are two basic ways in which we express concentration
of measure: large deviation bounds, and the central limit theorem. Roughly speaking the former
is a weaker conclusion (only upper tail bounds) from weaker assumptions (we don’t need full
independence—we’ll talk about this soon).
The proof of the basic CLT is not hard but relies on a little Fourier analysis and would take us too
far out of our way this lecture, so I will just quote it. Let µ be a probability distribution on R, i.e., for
X distributed as µ, measurable S ⊆ R, Pr(X ∈ S) = µ(S). For X_1, . . . , X_n sampled independently from µ set X = (1/n) ∑_{i=1}^{n} X_i.

Theorem 51 Suppose that µ possesses both first and second moments:

θ = E[X] = ∫ x dµ                          (mean)
σ² = E[(X − θ)²] = ∫ (x − θ)² dµ           (variance)

Then for all a < b,

lim_n Pr( aσ/√n < X − θ < bσ/√n ) = (1/√(2π)) ∫_a^b e^{−t²/2} dt.          (3.6)

The form of convergence to the normal distribution in 3.6 is called convergence in distribution or
convergence in law. For a proof of the CLT see [16] Sec. 27 or for a more accessible proof for the case
that the Xi are bounded, see [3] Sec. 3.8.


3.4 Lecture 16 (7/Nov): Application of CLT to Gale-Berlekamp. Khintchine-Kahane. Moment generating functions

3.4.1 Gale-Berlekamp game


Let’s remember a problem we saw in the first lecture (slightly retold):

• You are given an n × n grid of lightbulbs. For each bulb, at position (i, j), there is a switch bij ;
there is also a switch ri on each row and a switch c j on each column. The (i, j) bulb is lit if
bij + ri + c j is even. For a setting b, r, c of the switches, let F (b, r, c) be the number of lit bulbs
minus the number of unlit bulbs. Then F (b, r, c) = ∑ij (−1)bij +ri +c j .
Let F (b) = maxr,c F (b, r, c).
What is the greatest f (n) such that for all b, F (b) ≥ f (n)?

This is called the Gale-Berlekamp game after David Gale and Elwyn Berlekamp, who viewed it as
a competitive game: the first player chooses b and then the second chooses r and c to maximize the
number of lit bulbs. So f (n) is the outcome of the game for perfect players. In the 1960s, at Bell
Labs, Berlekamp even built a physical 10 × 10 grid of lightbulbs with bij , ri and c j switches. People
have labored to determine the exact value of f (n) for small n—see [33]. But the key issue is the
asymptotics.

Theorem 52 f (n) ∈ Θ(n3/2 ).

Proof:
First, the upper bound f(n) ∈ O(n^{3/2}): We have to find a setting b that is favorable for the “minimizing f” player, who goes first. That is, we have to find a b with small F(b).
Fix any r, c. Then for b selected u.a.r.,

Pr( F(b, r, c) > kn^{3/2} ) ≤ 2^{−n² D_2( 1/2 + k/(2√n) ‖ 1/2 )}        (we’ll choose a value for k shortly)
                           ≤ 2^{−k²n/(2 log 2)}                         (using D(p‖1/2) ≥ (2p − 1)²/2)

Now take a union bound over all r, c.

Pr( F(b) > kn^{3/2} ) ≤ 2^{2n − k²n/(2 log 2)}

For k > 2√(log 2) this is < 1. So ∃b s.t. ∀r, c, F(b, r, c) ≤ 2√(log 2) · n^{3/2}.

Next we show the lower bound. Here we must consider any setting b and show how to choose r, c favorably. Initially, set all r_i = 0 and pick c_j u.a.r. Then for any fixed i, the row sum

∑_j (−1)^{b_{ij} + c_j} =: X_i

is binomially distributed, being an unbiased random walk of length n.


Now, unlike the Chernoff bound, we’d like to see not an upper but a lower tail bound on random
walk. Let’s derive this from the CLT:

Corollary 53 For X the sum of m uniform iid ±1 rvs, E(|X|) = (1 + o(1)) √(2m/π).


(Proof sketch: for X distributed as the unit-variance Gaussian N (0, 1), this value is exact; see [89].
The CLT shows this is a good enough approximation to our rv.)
Comment: Instead of using Corollary 53, we could alternatively have used the following result:

Theorem 54 (Khintchine-Kahane) Let a = (a_1, . . . , a_n), a_i ∈ R. Let s_i ∈_U ±1 and set S = |∑ s_i a_i|. Then (1/√2) ‖a‖_2 ≤ E(S) ≤ ‖a‖_2.

The original result of this form is [56]; the above constant and generality are found in [60]; for an
elegant one-page proof see [32]. Not coincidentally, both this result and the CLT are proven through
Fourier analysis.
Comment: Since we haven’t provided proofs of either of these, and we are about to use them, let
me mention that later in the course (Sec. 4.3.1) we’ll come back and finish the proof (with a weaker
constant) through a more elementary argument, and with the added benefit that we will be able to
give the player a deterministic poly-time strategy for choosing the row and column bits. (Here we
gave the player only a randomized poly-time strategy.)
In any case we now continue, using the conclusion (with the largest constant, coming from the CLT): for every i, E(|X_i|) = (1 + o(1)) √(2n/π).
Now for each row, flip r_i if the row sum is negative. So E(∑_i (−1)^{r_i} X_i) = E(∑_i |X_i|) = ∑_i E(|X_i|) = (1 + o(1)) √(2/π) · n^{3/2}.

This shows (assuming the CLT) that for any b, E_c max_r F(b, r, c) is (1 + o(1)) √(2/π) · n^{3/2}. Consequently, for all b, F(b) ≥ (1 + o(1)) √(2/π) · n^{3/2}, which proves the theorem. 2
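The randomized strategy in the lower-bound proof is easy to run (a sketch assuming Python with numpy; the choice of b and of n is arbitrary):

import numpy as np

rng = np.random.default_rng(4)
n = 400
b = rng.integers(0, 2, size=(n, n))        # the adversary's switch settings (any b will do)
c = rng.choice([-1, 1], size=n)            # step 1: pick the column switches u.a.r.
row_sums = ((-1) ** b) @ c                 # X_i = sum_j (-1)^{b_ij + c_j}
r = np.where(row_sums >= 0, 1, -1)         # step 2: flip r_i whenever its row sum is negative
F = float(r @ row_sums)                    # = sum_i |X_i| = (#lit) - (#unlit)
print("F / n^{3/2} =", F / n**1.5, "   sqrt(2/pi) =", float(np.sqrt(2 / np.pi)))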
Comment: It was convenient in this problem that the system of switches at your disposal was
“bipartite”, that is, there are no interactions amongst the effects of the row switches, and likewise
amongst the effects of the column switches. However, even when such effects are present it is
possible to attain similar theorems. See [53].

3.4.2 Moment generating functions, Chernoff bound for general distributions


Now for a version of the Chernoff bound which we can apply to sums of independent real rvs with
very general probability distributions.
After presenting the bound we’ll see an application of it, with broad computational applications, in
the theory of metric spaces.
Let X be a real-valued random variable with distribution µ: for measurable S ⊆ R, Pr( X ∈ S) =
µ ( S ).

Definition 55 The moment generating function (mgf) of X (or, more precisely, of µ) is defined for β ∈ R by

g_µ(β) = E[e^{βX}]        provided this converges in an open neighborhood of 0
       = ∑_{k≥0} (β^k/k!) E(X^k)

Incidentally note that (a) if instead of taking β to be real we take it to be imaginary, this gives the
Fourier transform, (b) both are “slices” of the Laplace transform.
For any probability measure µ,
gµ (0) = E [1] = 1. (3.7)


We are interested in large deviation bounds for random walk with steps from µ. That is, if we
sample X_1, . . . , X_n iid from µ and take X = (1/n) ∑_{i=1}^{n} X_i, we want to know if the distribution of X is
concentrated around E [ X ]. It will be convenient to re-center µ, if necessary, so that E [ X ] = 0; clearly
this just subtracts a known constant off each step of the rw, so it does not affect any probabilistic
calculations. So without loss of generality we now take E [ X ] = 0.
Perhaps not surprisingly, the quality of the large deviation bound that is possible, depends on how
heavy the tails of µ are. What is interesting is that this is nicely measured by the smoothness of gµ at
the origin. Specifically, a moment-generating function that is differentiable at the origin guarantees
exponential tails.
One way to think about this intuitively is to examine the Fourier transform (i.e., along the imaginary axis), rather than the mgf along the real axis, near the origin. If µ has light tails—as an extreme case
suppose µ has bounded support—then near the origin, the Fourier coefficients are picking up only
very long-wavelength information, and seeing almost no “cancellations”—negative contributions
can come only from very far away and therefore be very small. So the Fourier coefficients near 0
are vanishingly different from the Fourier coefficient at 0, and so gµ is differentiable at 0. This goes
both ways—if µ has heavy tails, then even at very long wavelengths, the Fourier integral picks up
substantial cancellation, and so the Fourier coefficients change a lot moving away from 0.

Theorem 56 (Chernoff) If the mgf g_µ(β) is differentiable at 0, then ∀ε ≠ 0 ∃c_ε < 1 such that

Pr(X/ε > 1) < c_ε^n.

Specifically

c_ε ≤ inf_β e^{−βε} g_µ(β) < 1.                                   (3.8)

Proof: Let N be a neighborhood of 0 in which the mgf converges. Start with the case ε > 0.

Pr(X > ε) = Pr(e^{β ∑_i X_i} > e^{βnε})        for any β > 0                  (3.9)
          < e^{−βnε} E[e^{β ∑_i X_i}]          (Markov bound, for β ∈ N)
          = e^{−βnε} ( E[e^{βX_1}] )^n         (X_i are independent)
          = ( e^{−βε} g_µ(β) )^n                                              (3.10)

We now need to show that there is a β > 0 such that e^{−βε} g_µ(β) < 1. At β = 0, e^0 g_µ(0) = 1, so let’s find the derivative of e^{−βε} g_µ(β) at 0. Since g_µ is differentiable at 0 we have:

∂g_µ(β)/∂β |_0 = ∂E[e^{βX}]/∂β |_0
               = E[ ∂e^{βX}/∂β |_0 ]
               = E[ X e^{βX} ] |_0
               = E[X] = 0                                                     (3.11)

So, because we have shifted the mean to 0, the moment-generating function is flat at 0.


Now we can differentiate the whole function:

∂( e^{−βε} g_µ(β) )/∂β |_0 = e^{−ε·0} g_µ′(0) − ε e^{−ε·0} g_µ(0)        (product rule, at β = 0)
                           = 1·0 − ε·1·1
                           = −ε                                           (3.12)

We have determined that ∃β > 0 such that e^{−βε} g_µ(β) < 1, and thus there is a c_ε < 1 as stated in the theorem.
The case ε < 0 is similar. All that changes is that for line 3.9 we substitute

Pr(X < ε) = Pr(e^{β ∑_i X_i} > e^{βnε})        for any β < 0                  (3.13)

The rest of the derivation is identical up to and including line 3.12, which in this case shows that ∃β < 0 such that e^{−βε} g_µ(β) < 1, and thus there is a c_ε < 1 as stated in the theorem. 2
This method also allows us, in some cases, to find the value of cε which gives the tightest Chernoff
bound. (For general µ and ε this can be a complicated task and we may have to settle for bounds
on the best cε .)
Exercise: What is the mgf of the uniform distribution on ±1? What is the best cε ?


3.5 Lecture 17 (9/Nov): Johnson-Lindenstrauss embedding ℓ_2 → ℓ_2


By a small sample we may judge of the whole piece.
Cervantes, Don Quixote de la Mancha §I-1-4

Today we’ll see a geometric application of the Chernoff bound. At first glance the question we
solve, which originates in analysis, appears to have nothing to do with probability. But actually it
illustrates a shared geometric core between analysis and probability.

Definition 57 A metric space ( M, d M ) is a set M and a function d M : M × M → (R ∪ ∞) that is


symmetric; 0 on the diagonal; and obeys the triangle inequality, d M ( x, y) ≤ d M ( x, z) + d M (z, y).

Examples:

1. A Euclidean space is a vector space R^n equipped with the metric d(x, y) = √( ∑_1^n (x_i − y_i)² ).

2. The same vector space can be equipped with a different metric, for instance the `∞ metric,
maxi | xi − yi |, or the `1 metric, ∑i | xi − yi |.
Actually in real vector spaces the metrics we use, like these, are usually derived from norms
(see Sec. 3.5.1).
3. Sometimes we get important metrics as restrictions of another metric. For instance let ∆n
denote the probability simplex, ∆n = { x ∈ Rn : ∑i xi = 1, xi ≥ 0}. In this space (half of) the
`1 distance is referred to as “total variation distance”, dTV . It has another characterization,
dTV ( p, q) = max A⊆[n] p( A) − q( A).
Exercise: Usually a metric arises through a “min” definition (shortest path from one place to
another), and in Example 5 we will see that dTV does have that kind of definition. Why does
it coincide with a “max” definition?
4. Many metric spaces have nothing to do with vector spaces. An important class of metrics are
the shortest path metrics, derived from undirected graphs: If G = (V, E) is a graph and x, y ∈ V,
let d( x, y) denote the length of (number of edges on) a shortest path between them.
5. If you start with a metric d on a measurable space M you can “lift” it to the transportation metric
dtrans . This is much bigger: the “points” of this new metric space are probability distributions
on M, and the transportation distance is how far you have to shift probability mass in order
to transform one distribution to the other. Here is the formal definition for the case of a finite
space M. Let µ, ν be the two distributions. π will range over probability distributions on the
direct product space M2 .

d_trans(µ, ν) = min{ ∑_{x,y} d(x, y) π(x, y) : ∀x ∑_y π(x, y) = µ(x), ∀y ∑_x π(x, y) = ν(y), ∀x, y π(x, y) ≥ 0 }

Sometimes this is called “earthmover distance” (imagine bulldozers pushing the probability
mass around).
For example, if M is the graph metric on a clique of size k (as in Example 4) then dtrans = dTV =
variation distance among probability distributions on the vertices (i.e., the metric space of
Example 3).

Definition 58 An embedding f : M → M′ is a mapping of a metric space (M, d_M) into another metric space (M′, d_{M′}). The distortion of the embedding is sup_{a,b,c,d∈M} [ d_{M′}(f(a), f(b)) / d_M(a, b) ] · [ d_M(c, d) / d_{M′}(f(c), f(d)) ]. The mapping is called isometric if it has distortion 1.


A finite metric space is one in which the underlying set is finite. A finite `2 space is one that can be
embedded isometrically into a Euclidean space of any dimension.
Exercise: The dimension need not be greater than n − 1. (n points span only at most an (n − 1)-
dimensional affine subspace.)
Exercise: Generically, the dimension must be n − 1. (Show the distances between points in Euclidean
space determine their coordinates up to a rotation, reflection and translation. Then consider the
volume of the simplex.)
What we’ll see today is a method of embedding an n-point `2 metric into a very low-dimensional
Euclidean space with only slight distortion. This is useful in the theory of computation because
many algorithms for geometric problems have complexity that scales exponentially in the dimension
of the input space. We’ll have to skip giving example applications, but there are quite a few by now,
and because of these, a variety of improvements and extensions of the embedding method have also
been developed.
Our goal is to prove the following claim:

Theorem 59 (Johnson and Lindenstrauss [51]) Given a set A of n points in a Euclidean space, there exists a map f : A → (R^k, ℓ_2) with k = O(ε^{−2} log n) that is of distortion e^ε on the metric restricted to A. Moreover, the map f can be taken to be linear and can be found with a simple randomized algorithm in expected time polynomial in n.

Although the points of A generically span an (n − 1)-dimensional affine space, and the map is
linear, nonetheless observe that we are not embedding all of Rn−1 with low distortion—that is
impossible, as the map is many-one—we care only about the distances among our n input points.

3.5.1 Normed spaces


A real normed space is a vector space V equipped with a nonnegative real-valued “norm” ‖ · ‖ satisfying ‖cv‖ = c‖v‖ for c ≥ 0, ‖v‖ ≠ 0 for v ≠ 0, and ‖v + w‖ ≤ ‖v‖ + ‖w‖. Norms automatically define metrics, as in examples 1, 2, by taking the distance between v and w to be ‖v − w‖.
Let S = (S, µ) be any measure space. For p ≥ 1, the L p normed space w.r.t. the measure µ, L p (S),
is defined to be the vector space of functions

f :S→R

of finite “L_p norm,” defined by

‖f‖_p = ( ∫_S |f(x)|^p dµ(x) )^{1/p}

Exercise:
‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p

So (like any normed space), L p (S) is also automatically a metric space.


This framework allows us to discuss the collection of all L2 (Euclidean) spaces, all L1 spaces, etc.
The most commonly encountered cases are indeed L1 , L2 and L∞ , which is defined to be the sup
norm (so µ doesn’t matter). Today we discuss embeddings L2 → L2 . Time permitting we may also
discuss embeddings of general metrics into L1 .
We will use the shorthand L_p^k to refer to an L_p space on a set S of cardinality k, with the counting measure.


3.5.2 JL: the original method


Returning to the statement of the Johnson-Lindenstrauss Theorem (59), how do we find such a
map f ? Here is the original construction: pick an orthogonal projection, W̃, onto Rk uniformly at
random, and let f ( x ) = W̃x for x ∈ A.
For k as specified, this is satisfactory with high (constant) probability (which depends on the con-
stant in k = O(ε−2 log n)).
An equivalent description of picking a projection W̃ at random is as follows: choose U uniformly
(i.e., using the Haar measure) from O n (the orthogonal group). Let Q̃ be the n × n matrix which is
the projection map onto the first k basis vectors:

Q̃ = diag(1, . . . , 1, 0, . . . , 0)        (k ones on the diagonal, followed by n − k zeros).

Then set W = U^{−1} Q̃ U. I.e., a point x ∈ A is mapped to U^{−1} Q̃ U x.


Let’s start simplifying this. The final multiplication by U −1 doesn’t change the length of any vector
so it is equivalent to use the mapping
x → Q̃Ux
and ask what this does to the lengths of vectors between points of A.
Having simplified the mapping in this way, we can now discard the all-0 rows of Q̃, and use just Q:
 
Q = the k × n matrix with Q_{ii} = 1 for i = 1, . . . , k and all other entries 0.

So JL’s final mapping is

f(x) = QUx.

In order to analyze this map, we will consider a vector v, the difference between two points in A,
i.e. v = x − y for some x, y ∈ A.
Since the question of distortion of the length of v is scale invariant, we can simplify by supposing
that kvk = 1.
Moreover, the process described above has the same distribution for all rotations of v. That is to say,
for any v, w ∈ Rn and any orthogonal matrix A,

Pr( QUv = w) = Pr( QU ( Av) = w). (prob. densities)


U U

So we may as well consider that v is the vector v = (1, 0, 0, . . . , 0)∗ . (Where ∗ denotes transpose.)
In that case, ‖QUv‖ equals ‖(QU)_{∗1}‖ where (QU)_{∗1} is the first column of QU. But (QU)_{∗1} = (U_{1,1}, U_{2,1}, . . . , U_{k,1})∗, i.e., the top few entries of the first column of U.
Since U is a random orthogonal matrix, the distribution of its first column (or indeed of any other
single column) is simply that of a random unit vector in Rn .


So the whole question boils down to showing concentration for the length of the projection of a
random unit vector onto the subspace spanned by the first k standard basis vectors.
This distribution is somewhat deceptive in low dimensions. For n = 2, k = 1 the density looks like
Figure (3.4).
[Figure 3.4: Density of the projection of a unit vector in 2D onto a random unit vector, supported on [−1, 1].]

However, in higher dimensions, this density looks more like Figure (3.5). The phenomenon we are
encountering is truly a feature of high dimension.


Figure 3.5: Density of projection of a unit vector in high dimension onto a random unit vector

Remarks:

1. In the one-dimensional projection density (Fig. 3.5) some constant fraction of the probability is contained in the interval [−1/√n, 1/√n].

2. The squares of the projection-lengths onto each of the k dimensions are “nearly independent”
random variables, so long as k is small relative to n.

Johnson and Lindenstrauss pushed this argument through but there is an easier way to get there,
by just slightly changing the construction.


3.5.3 JL: a similar, and easier to analyze, method


Pick k vectors w_1, w_2, . . . , w_k independently from the spherically symmetric Gaussian density with standard deviation 1, i.e., from the probability density

η(x_1, . . . , x_n) = (2π)^{−n/2} exp( −(1/2) ∑_{i=1}^{n} x_i² )

Note 1: the projection of this density on any line through the origin is the 1D Gaussian with standard deviation 1, i.e., the density

(1/√(2π)) exp(−x²/2)
(Follows immediately from the formula, by factoring out the one dimension against an entire “con-
ditioned” Gaussian on the remaining n − 1 dimensions.)
Note 2: The distribution is invariant under the orthogonal group. (Follows immediately from the
formula.)
Note 3: The coordinates x1 , x2 etc. are independent rvs. (Follows immediately from the formula.)
Set

W = ( ··· w_1 ··· ; ··· w_2 ··· ; . . . ; ··· w_k ··· )        (a k × n matrix)
(The rows of W are the vectors wi .) Then, for v ∈ Rn set f (v) = Wv.
By Notes 1 & 3, each entry of W is an i.i.d. random variable with density (1/√(2π)) exp(−x²/2).

Informally, this process is very similar to that of JL, although it is certainly not identical. Individual
entries of W can (rarely) be very large, and rows are not going to be exactly orthogonal, although
they will usually be quite close to orthogonal.
Because of Note 2, analysis of this method boils down, just as for the original JL construction, to
showing a concentration result for the length of the first column of W, which we denote w1 .
Because of Note 3, the expression ‖w1‖² = ∑_{i=1}^{k} w_{i1}² gives the LHS as the sum of independent, and by Note 1 iid, rvs. This will enable us to show concentration through a Chernoff bound.


3.6 Lecture 18 (12/Nov): cont. JL embedding; Bourgain embedding


3.6.1 cont. JL
Recall that our projection of (any particular) unit vector in the original space, is a vector whose coordinates w_{11}, . . . , w_{k1} are iid normally distributed with E(w_{i1}²) = 1. So E(∑ w_{i1}²) = k. We want a deviation bound on ∑ w_{i1}².

There is a name for these rvs: each w_{i1}² is a “χ²” rv with parameter 1, and their sum is a χ² rv with parameter k.

[Figure 3.6: Probability density of (1/k) ∑ w_{i1}², for k = 1, 2, 3, 4, 10.]

Set random variables y_i = w_{i1}² − 1 so that E(y_i) = 0. With this change of variables we now want a bound on the deviation from 0 of the rv ȳ = (1/k) ∑_{i=1}^{k} y_i.
To get a Chernoff bound, we need the mgf, g(β), for y_i, in order to use Eqn. 3.8 to write:

P(ȳ/ε > 1) < [ inf_β e^{−εβ} g(β) ]^k        for ε ≠ 0.                       (3.14)
So what is g(β)?

g(β) = E(e^{βy}) = E( e^{β(w²−1)} )
     = ∫_{−∞}^{∞} e^{−β} (1/√(2π)) e^{w²(β−1/2)} dw
     = ( e^{−β}/√(1 − 2β) ) ∫_{−∞}^{∞} ( √(1 − 2β)/√(2π) ) e^{−w²(1−2β)/2} dw
     = e^{−β}/√(1 − 2β)

The last equality follows as the integrand is the density of a normal random variable with standard deviation 1/√(1 − 2β).

Thus, g(β) is well defined and differentiable in (−∞, 1/2), with (necessarily) g(0) = 1 (which recall from (3.7) holds for the mgf of any probability measure), and g′(0) = 0 (because g′(0) = the
first moment of the probability measure, that’s why it’s called the moment generating function,
recall (3.11); and we have centered the distribution at 0).
For a given ε what β should be used in the Chernoff bound (Eqn. 3.14)? After some calculus, we find that β = ε/(2(1 + ε)) is the best value (for both signs of ε). Figure (3.7) shows the dependence of β on ε.
[Figure 3.7: Best choice of β as a function of ε for the χ² distribution: β = ε/(2(1 + ε)).]

Plugging this value of β above into the bound, we get

P(ȳ/ε > 1) < ( (1 + ε)^{1/2} e^{−ε/2} )^k                                     (3.15)

which we incidentally note is (1 − ε²/4 + O(ε³))^k. The function (1 + ε)^{1/2} e^{−ε/2} is shown in Fig. 3.8.

[Figure 3.8: Base of the Chernoff bound for the χ² distribution: c_ε = (1 + ε)^{1/2} e^{−ε/2}, plotted for ε ≥ 0.]

Now let’s apply this bound to the modified JL construction. We will ensure distortion (Defn. 58) e^δ (with positive probability) by showing that for each of our (n choose 2) vectors v, with probability > 1 − 1/n²,

‖v‖ e^{−δ/2} ≤ (1/√k) ‖Wv‖ ≤ ‖v‖ e^{δ/2}.


We already argued, by the invariance of our construction under the orthogonal group, that for any v this has the same probability as the event

e^{−δ/2} ≤ √( (1/k) ∑ w_{i1}² ) ≤ e^{δ/2}

i.e.,

e^{−δ} ≤ (1/k) ∑ w_{i1}² ≤ e^{δ}

or equivalently

e^{−δ} − 1 ≤ ȳ ≤ e^{δ} − 1.                                                   (3.16)

Applying the Chernoff bound (3.15) first on the right of (3.16), we have

Pr(ȳ > e^δ − 1) < e^{k(δ/2 − (e^δ − 1)/2)} = e^{(k/2)(1 + δ − e^δ)} < e^{−kδ²/4}

Next applying the Chernoff bound (3.15) on the left of (3.16), we have

Pr(ȳ < e^{−δ} − 1) < e^{k(−δ/2 − (e^{−δ} − 1)/2)} = e^{(k/2)(1 − δ − e^{−δ})} < e^{−k(δ²/4 + O(δ³))}

In all, taking k = 8(1 + O(δ)) δ^{−2} log n suffices so that Pr( (1/k) ∑ w_{i1}² ∉ [e^{−δ}, e^{δ}] ) < 1/n², and therefore the mapping with probability at least 1/2 has distortion bounded by e^δ.
Finally, for the computational aspect: to get a randomized “Las Vegas” algorithm simply try matri-
ces W at random and examine each to test whether the distortion is satisfactory.
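A minimal sketch of this randomized construction, assuming Python with numpy (the point set, the constant 8 in the choice of k, and δ = 0.5 are arbitrary; a Las Vegas algorithm would simply resample W until the final check passes):

import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n_pts, dim, delta = 300, 1000, 0.5
A = rng.normal(size=(n_pts, dim))                    # any n-point set in Euclidean space
k = int(np.ceil(8 * np.log(n_pts) / delta**2))       # k = O(delta^-2 log n)
W = rng.normal(size=(k, dim))                        # iid N(0,1) entries
fA = A @ W.T / np.sqrt(k)                            # f(x) = Wx / sqrt(k)

ratios = [np.linalg.norm(fA[i] - fA[j]) / np.linalg.norm(A[i] - A[j])
          for i, j in combinations(range(n_pts), 2)]
print("min ratio", min(ratios), " max ratio", max(ratios))
print("target band:", np.exp(-delta / 2), "to", np.exp(delta / 2))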
Note: About another embedding question: Finite l2 metric spaces can be embedded in l1 isomet-
rically. There’s also an algorithm—deterministic, in fact—to find such an embedding, but it takes
exponential time in the number of points in the space.
Comment: There are deterministic poly-time algorithms producing an embedding up to the stan-
dards of the Johnson-Lindenstrauss theorem, see Engebretsen, Indyk and O’Donnell [27], Sivaku-
mar [82].

3.6.2 Bourgain embedding X → L p , p ≥ 1


In the previous result, we saw how an already “rigid” metric, namely an L2 metric, could be
embedded in reduced dimension. Now we will see how a relatively “loose” object, just a metric
space, can be embedded in a more rigid object, namely a vector space with an L p norm. There will
be a price in distortion to pay for this.

Theorem 60 (Bourgain [18]) Any metric (X, d) with n = |X| can be embedded in L_p^{O(log² n)}, p ≥ 1, with distortion O(log n). There is a randomized poly-time algorithm to find such an embedding.

Some comments are in order.


Dimension: The dimension bound here is actually due not to Bourgain but to Linial, London and
Rabinovich [63]. Also, Bourgain showed embedding into L2 ; after we prove the L1 result we’ll show
how it also implies all p ≥ 1. A later variation of the Bourgain proof that achieves dimension
O(log n) is due to Abraham, Bartal and Neiman [1].
Derandomization: It will follow from ideas we see soon, that there is a deterministic algorithm to
construct a Bourgain embedding into dimension poly(n). This will be on a problem set. It is
also possible, by the method of conditional probabilities, to reduce the dimension to O(log2 n); we
probably won’t have time to discuss this.


Distortion: The distortion in the theorem is best possible: expander graphs require it. However,
there are open questions for restricted classes of metrics: for example whether the distortion can be
improved, possibly to a constant, for shortest path metrics in planar graphs. See [50] for a survey
on metric embeddings from 2004.
Method: Weighted Fréchet embeddings.

3.6.3 Embedding into L1


Proof: Since the domain of our mapping is merely a metric space rather than a normed space, we
cannot apply anything like the JL technique, and something quite different is called for.
Bourgain’s proof employs a type of embedding introduced much earlier by Fréchet [35].
The one-dimensional Fréchet embedding imposed by a set T ⊂ X is the mapping

τ : X → R+

τ ( x ) = d( x, T ) := min d( x, t)
t∈ T

Observe that by the triangle inequality for d, |τ ( x ) − τ (y)| ≤ d( x, y). So τ is a contraction.


We can also combine several such Ti ’s in separate coordinates. If we label the respective mappings
τi and give each a nonnegative weight wi , with the weights summing to 1—that is to say, the weights
form a probability distribution:

τ ( x ) = (w1 τ1 ( x ), . . . , wk τk ( x ))

then we can consider the composite mapping τ as an embedding into L_1^k and it too is contractive, namely,

‖τ(x) − τ(y)‖_1 ≤ d(x, y).

So the key part of the proof is the lower bound.


Let s = ⌈lg n⌉. For 1 ≤ t ≤ s and 1 ≤ j ≤ s′ ∈ Θ(s), choose set T_tj by selecting each point of X independently with probability 2^{−t}. Let all the weights be uniform, i.e., 1/ss′. This defines an embedding τ = (. . . , τ_tj, . . .)/ss′ of the desired dimension.
We need to show that with positive probability

∀ x, y ∈ X : ‖τ(x) − τ(y)‖_1 ≥ Ω(d(x, y)/s).

Just as in JL, the proof proceeds by considering just a single pair x 6= y and showing that with prob-
ability greater than 1 − 1/n2 (enabling a union bound) it is embedded with the desired distortion
(in this case O(log n) = O(s)).


3.7 Lecture 19 (14/Nov): cont. Bourgain embedding

3.7.1 cont. Bourgain embedding: L1


We use this notation for open balls:

Br ( x ) = {z : d( x, z) < r }

and for closed balls, B̄r ( x ) = {z : d( x, z) ≤ r }.


Recall that we are now analyzing the embedding for any single pair of points x, y.
Let ρ0 = 0 and, for t > 0 define

ρ_t = sup{ r : |B_r(x)| < 2^t or |B_r(y)| < 2^t }                            (3.17)


up to t̂ = max{t : RHS < d( x, y)/2}.
It is possible to have t̂ = 0 (for instance if no other points are near x and y).
Observe that for the closed balls B̄ we have that for all t ≤ t̂, | B̄ρt ( x )| ≥ 2t and | B̄ρt (y)| ≥ 2t . This
means in particular that (due to the radius cap at d( x, y)/2, which means that y is excluded from
these balls around x and vice versa), t̂ < s.
Set ρt̂+1 = d( x, y)/2, which means that it still holds for t = t̂ + 1 that | Bρt ( x )| < 2t or | Bρt (y)| < 2t ,
although (in contrast to t ≤ t̂), ρt̂+1 is not the largest radius for which this holds.
Note t̂ + 1 ≥ 1. Also, ρt̂+1 > ρt̂ (because the latter was defined to be less than d( x, y)/2). But for
t ≤ t̂ it is possible to have ρt = ρt−1 .
t̂ + 1 will be the number of scales used in the analysis of the lower bound for the pair x, y. I.e., we
use the sets Ttj for 0 ≤ t ≤ t̂ + 1. Any contribution from higher-t (smaller expected cardinality) sets
is “bonus.”
Consider any 1 ≤ t ≤ t̂ + 1.

Lemma 61 With positive probability (specifically at least (1 − 1/√e)/4), |τ_t1(x) − τ_t1(y)| > ρ_t − ρ_{t−1}.

Proof: Suppose wlog that |B_{ρ_t}(x)| < 2^t. By Eqn (3.17) (with t − 1), |B̄_{ρ_{t−1}}(y)| ≥ 2^{t−1} (and the same for x but we don’t need that). If

T_t1 ∩ B_{ρ_t}(x) = ∅                                                         (3.18)

and

T_t1 ∩ B̄_{ρ_{t−1}}(y) ≠ ∅                                                     (3.19)

then

|τ_t1(x) − τ_t1(y)| > ρ_t − ρ_{t−1}.
We wish to show that this conjunction happens with constant probability. (See Fig. 3.9.)
The two events (3.18), (3.19) are independent because Tt1 is generated by independent sampling,
and because, due to the radius cap at d( x, y)/2 (and because ρt̂ < ρt̂+1 ), Bρt ( x ) ∩ B̄ρt−1 (y) = ∅.
First, the x-ball event (3.18):

Pr(T_t1 ∩ B_{ρ_t}(x) = ∅) = (1 − 2^{−t})^{|B_{ρ_t}(x)|}
                          ≥ (1 − 2^{−t})^{2^t}
                          ≥ 1/4        for t ≥ 1


Figure 3.9: Balls Bρt−1 ( x ), Bρt ( x ), Bρt−1 (y) depicted. Events 3.18 and 3.19 have occurred, because no
point has been selected for Tt1 in the larger-radius (ρt ) region around x, while some point (marked
in red) has been selected for Tt1 in the smaller-radius (ρt−1 ) region around y.

(For large t this is actually about 1/e.)


Second, the y-ball event (3.19):

Pr(T_t1 ∩ B̄_{ρ_{t−1}}(y) ≠ ∅) = 1 − (1 − 2^{−t})^{|B̄_{ρ_{t−1}}(y)|}
                               ≥ 1 − (1 − 2^{−t})^{2^{t−1}}
                               ≥ 1 − e^{−1/2},

recalling 1 + x ≤ e^x for all real x.

Consequently, |τ_t1(x) − τ_t1(y)| > ρ_t − ρ_{t−1} with probability at least (1 − 1/√e)/4. 2

Now, let G_{x,y,t} be the “good” event that at least a (1 − 1/√e)/8 fraction of the coordinates at level t, namely {τ_tj}_{j=1}^{s′}, have

|τ_tj(x) − τ_tj(y)| > ρ_t − ρ_{t−1}.

If the good event occurs for all t, then for all x, y,

‖τ(x) − τ(y)‖_1 ≥ (1/s) · ((1 − 1/√e)/8) · (d(x, y)/2).

Here the first factor is from the normalization, the second from the definition of good events, and the third from the cap on the ρ_t’s.
We can upper bound the probability that a good event G_{x,y,t} fails to happen using Chernoff:

Pr(¬G_{x,y,t}) ≤ e^{−Ω(s′)}.

Now taking a union bound over all x, y, t,

Pr(∪_{x,y,t} ¬G_{x,y,t}) ≤ e^{−Ω(s′)} n² lg n < 1/2

for a suitable s′ ∈ Θ(log n).
To be specific we can use the following version of the Chernoff bound (see problem set 4):

Lemma 62 Let F_1, . . . , F_{s′} be independent Bernoulli rvs, each with expectation ≥ µ. Then Pr(∑ F_i < (1 − ε)µs′) ≤ e^{−ε²µs′/2}.

which permits us (plugging in ε = 1/2) to take s′ = (32√e/(√e − 1)) log(n² lg n). 2
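A minimal sketch of the random Fréchet-set construction, assuming Python with numpy (the input metric — Euclidean distances on random points — the constant in s′, and the handling of an empty T_tj are arbitrary choices made only so the toy code runs):

import numpy as np
from math import ceil, log2

rng = np.random.default_rng(6)
n = 64
pts = rng.random((n, 2))
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)    # the input metric (any finite metric works)

s = ceil(log2(n))
s_prime = 4 * s                                                  # s' in Theta(s); the constant 4 is arbitrary
coords = []
for t in range(1, s + 1):
    for _ in range(s_prime):
        T = np.flatnonzero(rng.random(n) < 2.0 ** (-t))          # include each point with prob 2^-t
        if T.size == 0:
            T = np.array([rng.integers(n)])                      # toy convention: avoid an empty set
        coords.append(d[:, T].min(axis=1))                       # tau_tj(x) = d(x, T)
tau = np.stack(coords, axis=1) / (s * s_prime)                   # uniform weights 1/(s s')

emb = np.abs(tau[:, None, :] - tau[None, :, :]).sum(axis=2)      # L1 distances between embedded points
ratio = emb[d > 0] / d[d > 0]
print("contraction range:", ratio.min(), "to", ratio.max(), " (theory: at least Omega(1/log n), at most 1)")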
Exercise: Form a Fréchet embedding X → Rn by using as Ti ’s all singleton sets. Argue that this is
an isometry of X into (Rn , L∞ ). Consequently L∞ is universal for finite metrics. (This, I believe, was
Fréchet’s original result [35].)


3.7.2 Embedding into any L p , p ≥ 1


As a matter of fact the above embedding method has distortion just as good into L_p, for any p ≥ 1. Start by expanding:

‖τ(x) − τ(y)‖_p = ( (1/ss′) ∑_{ij} (τ_ij(x) − τ_ij(y))^p )^{1/p}              (3.20)

We begin with the upper bound, which is unexciting:

(3.20) ≤ ( (1/ss′) ∑_{ij} (d(x, y))^p )^{1/p}
       = d(x, y).

For the lower bound, we use the power-means inequality. Note that (3.20) is a p’th mean of the quantities (τ_ij(x) − τ_ij(y)), ranging over i, j. So from Lemma 15,

(3.20) ≥ (1/ss′) ∑_{ij} |τ_ij(x) − τ_ij(y)| = ‖τ(x) − τ(y)‖_1

so for any τ and any p > 1, the L_p distortion of τ is no more than its L_1 distortion. This demonstrates the generalization of Theorem (60) with L_p^{O(log² n)} (p ≥ 1) replacing L_1^{O(log² n)}.

3.7.3 Aside: Hölder’s inequality


Although we already proved the power means inequality directly in Lemma 15, it is worth seeing
how it fits into a framework of inequalities. The power means inequality is a comparison between
two integrals over a measure space that is also a probability space (i.e., the total measure of the
space is 1). Power means follows immediately from an important inequality that holds for any
measure space (and indeed generalizes the Cauchy-Schwarz inequality), Hölder’s inequality:

Lemma 63 (Hölder) For norms with respect to any fixed measure space, and for 1/p + 1/q = 1 (p and q are “conjugate exponents”), ‖f‖_p · ‖g‖_q ≥ ‖f g‖_1.

To see the power means inequality, note that over a probability space, ‖f‖_p is simply a p’th mean. Now plug in the function g = 1 and Hölder gives you power means.

Chapter 4

Limited independence

4.1 Lecture 20 (16/Nov): Pairwise independence, Shannon coding theorem again, second moment inequality

4.1.1 Improved proof of Shannon’s coding theorem using linear codes


Very commonly, in Algorithms, we have a tradeoff between how much randomness we use, and
efficiency.
But sometimes we can actually improve our efficiency by carefully eliminating some of the ran-
domness we’re using. Roughly, the intuition is that some of the randomness is going not toward
circumventing a barrier (especially, leaving the adversary in the dark about what we are going to
do), but just into noise.1
A case in point is the proof of Shannon’s Coding Theorem. In a previous lecture we proved the
theorem as follows: we first built an encoding map E : {0, 1}k → {0, 1}n by sampling a uniformly
random function; then, we had to delete up to half the codewords to eliminate all kinds of fluctua-
tions in which codewords fell too close to one another.
It turns out that this messy solution can be avoided. The key observation is that our analysis
depended only on pairwise data about the code—basically, pairwise distances between codewords.
“Higher level” structure (mutual distances among triples, etc.) didn’t feature in the analysis. So the
argument will still go through with a pairwise-independently constructed code. So we’ll do this
now, and in the process we’ll see how this helps.
Sample E from the following pairwise independent family of functions {0, 1}^k → {0, 1}^n. Select k vectors v_1, . . . , v_k iid ∈_U {0, 1}^n. Now map the vector (x_1, . . . , x_k) to ∑_{i=1}^{k} x_i v_i. This is, of course, a linear map, consisting of multiplication by the generator matrix G whose rows are the v_i:

(message x) · ( — v_1 — ; — v_2 — ; . . . ; — v_k — ) = (codeword)

The message 0̄ ∈ {0, 1}k is always mapped to the codeword 0̄ ∈ {0, 1}n , and every other codeword
is uniformly distributed in {0, 1}n . It is not hard to see that the images of messages are pairwise
independent. (Including even the image of the 0̄ message.)
1 People who pack a tent, wind up spending the night on the mountain – a climbing instructor of mine

74
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Let’s see why: say the two messages are x ≠ x′. W.l.o.g. x′ ≠ 0. Now, we want to show that ∀y, y′, Pr(x′G = y′ | xG = y) = 2^{−n}. Since x′ ≠ x (and we are over GF2), this means that x′ ∉ span(x). Consequently, there exists a G s.t. xG = y, x′G = y′. But since there exists such a G, call it G₀, the number of G’s satisfying this pair of equations does not depend upon y′; the set of such G’s is simply equal to all G₀ + G′ where x, x′ ∈ ker G′. The number of such G′ depends only upon dim span(x, x′). (If you want a more concrete argument, you can change basis to where x, if nonzero, is a singleton vector, and x′ − x is another singleton vector. Then the row of G corresponding to x is y, the row corresponding to x′ − x is y′ − y, and the rest of the matrix can be anything.)
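A minimal sketch of the construction, assuming Python with numpy (the parameters k, n are arbitrary; the check at the end just confirms linearity of the map E):

import numpy as np

rng = np.random.default_rng(7)
k, n = 8, 32
G = rng.integers(0, 2, size=(k, n))              # rows are v_1, ..., v_k, each uniform in {0,1}^n

def encode(x, G):
    # E(x) = sum_i x_i v_i over GF(2), i.e. the product x G reduced mod 2
    return (np.asarray(x) @ G) % 2

x1, x2 = rng.integers(0, 2, size=k), rng.integers(0, 2, size=k)
assert np.array_equal(encode((x1 + x2) % 2, G),
                      (encode(x1, G) + encode(x2, G)) % 2)       # linearity of E
print("codeword of x1:", encode(x1, G))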
Now let’s remember some of the settings we used for this theorem in Section (3.3.1):
(1) The block length is set by (3.3): n = k/(D_2(p‖1/2) − ε);
(2) First upper bound on δ is (3.4): p + δ < 1/2;
(3) Second upper bound on δ is (3.5): D_2(p + δ‖1/2) > D_2(p‖1/2) − ε/2;
And finally we make δ as large as we can subject to these constraints, and set c = min{D_2(p + δ‖p), ε/2} > 0.
Looking back at the analysis of the error probability on message X in Section (3.3.1), it had two
parts, in each of which we bounded the probability of one of the following two sources of error:

Bad1 : H(E(X) + R, E(X)) ≥ (p + δ)n. That is to say, the error vector R has weight (number of 1’s) at least (p + δ)n. This analysis is of course unchanged, and doesn’t depend at all on choice of the code. As before, the bound is

Pr_R(Bad1) = Pr_R( H(~0, R) ≥ (p + δ)n ) ≤ 2^{−D_2(p+δ‖p)n} ≤ 2^{−cn}.

Bad2 : ∃ X′ ≠ X : H(E(X) + R, E(X′)) ≤ (p + δ)n. For this, pairwise independence is enough to obtain an analysis similar to before. Specifically, for any pair X ≠ X′ and any R, the rv (which now depends only on the choice of code) R + E(X) − E(X′) is uniformly distributed in {0, 1}^n (because X − X′ is not the zero string, so E(X − X′) is uniform); so the choice of R does not affect Pr_E(Bad2), and we can bound it as

Pr_E(Bad2) ≤ 2^{k − nD_2(p+δ‖1/2)}
           = 2^{n(D_2(p‖1/2) − ε − D_2(p+δ‖1/2))}        (from (3.3))
           ≤ 2^{−nε/2}                                    (from (3.5))
           ≤ 2^{−cn}

So, we get the same as before: Pr_{E,R}(Error on X) ≤ 2^{1−cn} for the same c > 0 that depends only on p, ε. That is, for every X, with M_X = Pr_R(Error on X | E), we have

E_E(M_X) ≤ 2^{1−cn}                                                           (4.1)

Next, just as before, we wish to remove E from the randomization in the analysis. In order to do
this it helps to consider the uniform distribution over messages X and derive from Eqn. 4.1 the
weaker
E_{X,E}(M_X) ≤ 2^{1−cn}                                                       (4.2)
The reason is that this weaker guarantee is maintained even if we now modify the decoding algo-
rithm so that it commutes with translation by codewords. Specifically, no matter what the decoder
did before, set it now so that D(Y ) is uniformly sampled among “max-likelihood” decodings of Y,


which is equivalent (thanks to the uniformity over X and to the noise R being independent of X) to
those X which minimize H (E ( X ), Y ). For the uniform distribution, max-likelihood decoding min-
imizes the average probability of error, so this new decoder D also satisfies 4.2. The new decoder
has the commutation advantage that we promised: for any E ,

D(E(X) + R) = D(E(X)) + D(R)        (commutes with translation by code)
            = X + D(R)              (decoding correct on codewords)           (4.3)

As a consequence,

For all E, X_1, X_2 : Pr_R(Error on X_1 | E) = Pr_R(Error on X_2 | E).

So we can define a variable M which is a function of E,

M = Pr_R(Error on 0̄ | E) = Pr_R(Error on X | E) for all X
and we have

E_E(M) ≤ 2^{1−cn}

Since M ≥ 0, Pr_E(M > 2^{2−cn}) < 1/2 and so if we just pick linear E at random, there is probability ≥ 1/2 that (using the already-described decoder D for it), for all X the decoding-error probability is ≤ 2^{2−cn}.
What is much more elegant about this construction than about the preceding fully-random-E is that
no X’s with high error probabilities need to be thrown away. The set of codewords is always just a
linear subspace of {0, 1}n .
The code also has a very concise description, O(k2 ) bits (recall n ∈ Θ(k)); whereas the previous
full-independence approach gave a code with description size exponential in k.
One comment is that although picking a code at random is easy, checking whether it indeed satisfies the desired condition is slow: one can either do this exactly, in time exponential in n, by exhaustively considering R’s, or one can try to estimate the probability of error by sampling R; but even this requires time inversely proportional to the decoding-error probability before we see any error events and can get a good estimate of it. In particular we cannot certify a good code this way in time less than 2^{cn}.

4.1.2 Pairwise independence and the second-moment inequality


A common situation in which we use Chebyshev’s inequality, Lemma 13 is when we have many
variables which are not fully independent, but are pairwise independent (or nearly so).

Definition 64 (Pairwise and k-wise independence) A set of rvs are pairwise independent if every pair
of them are independent; this is a weaker requirement than that all be independent. Likewise, the variables are
k-wise independent if every subset of size k is independent.

Definition 65 (Covariance) The covariance of two real-valued rvs X, Y is (if well-defined) Cov( X, Y ) =
E( XY ) − E( X ) E(Y ).

Exercise: Show that if X and Y are independent then Cov( X, Y ) = 0, but that the converse need not
be true.
Exercise: If X = ∑_{i=1}^{n} X_i, then Var X = ∑_i Var(X_i) + ∑_{i≠j} Cov(X_i, X_j).
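A standard small example, not from these notes, illustrating Definitions 64–65 and the exercise (a sketch assuming Python with numpy): X, Y uniform ±1 and Z = XY are pairwise independent but not mutually independent, yet the variance of their sum still adds.

import numpy as np

rng = np.random.default_rng(8)
m = 200_000
X = rng.choice([-1, 1], size=m)
Y = rng.choice([-1, 1], size=m)
Z = X * Y                                   # pairwise independent of X and of Y, but XYZ = 1 always
for name, (A, B) in {"Cov(X,Y)": (X, Y), "Cov(X,Z)": (X, Z), "Cov(Y,Z)": (Y, Z)}.items():
    print(name, "~", np.mean(A * B) - np.mean(A) * np.mean(B))
print("Var(X+Y+Z) ~", (X + Y + Z).var(), "  (= 3, the sum of the variances)")
print("E[XYZ] =", np.mean(X * Y * Z), "  (so the three are not mutually independent)")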


Corollary 66 If X_1, . . . , X_n are pairwise independent real rvs with well-defined variances, then Var(∑ X_i) = ∑ Var(X_i). (We already mentioned this in (3.1).) If in addition they are identically distributed and X = (1/n) ∑ X_i, then E(X) = E(X_1) and Var(X) = (1/n) Var(X_1).

Exercise: Apply the Chebyshev inequality to obtain:

Lemma 67 (2nd moment inequality) If X_1, . . . , X_n are identically distributed, pairwise-independent real rvs with finite 1st and 2nd moments then P( |X − E(X)| > λ √(Var(X_1)/n) ) < 1/λ².

Corollary 68 (Weak Law) Pairwise independent rvs obey the weak law of large numbers. Specifically, if
X1 , . . . , Xn are identically distributed, pairwise-independent real rvs with finite variance then for any ε,
limn→∞ P(| X − E( X )| > ε) = 0.

So we see that the weak law holds under a much weaker condition than full independence. When
we talk about the cardinality of sample spaces, we’ll see why pairwise (or small k-wise) indepen-
dence has a huge advantage over full independence, so that it is often desirable in computational
settings to make do with limited independence.


4.2 Lecture 21 (19/Nov): G (n, p) thresholds

4.2.1 Threshold for H as a subgraph in G (n, p)


Working with low moments of random variables can be incredibly effective, even when we are
not specifically looking for limited-independence sample spaces. Here is a prototypical example.
“When” does a specific, constant-sized graph H, show up as a subgraph of a random graph selected
from the distribution G (n, p)? We have in mind that we are “turning the knob” on p. If H has any
edges then when p = 0, with probability 1 there is no subgraph isomorphic to H. When p = 1,
with probability 1 such subgraphs are everywhere 2 . In between, for any finite n, the probability is
some increasing function of p. But we won’t take n finite, we will take it tending to ∞.
So the question is,3 can we identify a function π (n) such that in the model G (n, p(n)), with JHK
denoting the event that there is an H in the random graph G,
(a) If p(n) ∈ o (π (n)), then limn Pr(JHK) = 0.
(b) If p(n) ∈ ω (π (n)), then limn Pr(JHK) = 1.
Such a function π (n) is known as the threshold for appearance of H. It follows from work of
Bollobas and Thomason [17] that monotone events—events that must hold in G 0 if they hold in
some G ⊆ G 0 —always have a threshold function.
(A related but incomparable statement: for a monotone graph property, i.e., a monotone property
invariant under vertex permutations, for any ε > 0 there is a p(n) such that Pr p(n) (property) ≤ ε
and Pr p(n)+O(1/ log n) (property) ≥ 1 − ε. See [37].)

4.2.2 Most pairs independent: threshold for K4 in G (n, p)


Let S ⊂ {1, . . . , n}, |S| = 4. Let XS be the event that K4 occurs as a subgraph of G at S—that is,
when you look at those four vertices, all the edges between them are present. Conflating XS with
its indicator function and letting X be the number of K4 ’s in G, we have

X= ∑ XS
S

and  
n 6
E( X ) = p .
4

We are interested in Pr( X > 0). Let π (n) = n−2/3 .


(a) For p(n) ∈ o (π (n)), E( X ) ∈ o (1), so Pr(JK4 K) ∈ o (1) and therefore limn Pr(JK4 K) = 0.
(b) For 1 > p(n) ∈ ω (π (n)), E( X ) ∈ ω (1). We’d like to conclude that likely X > 0 but we do
not have enough information to justify this, as it could be that X is usually 0 and occasionally very
large.4 We will exclude that possibility for K4 by studying the next moment of the distribution.
Before carrying out this calculation, though, we have to make one important note. Since the event
JK4 K is monotone, [ p ≤ p0 ] ⇒ [PrG(n,p) JK4 K ≤ PrG(n,p0 ) JK4 K]. (An easy way to see this is by choosing
reals iid uniformly in [0, 1] at each edge, and placing the edge in the graph if the rv is above the p
2 Today we focus on H = K , the 4-clique, but more generally this method will establish the probability of any fixed graph
4
H occurring as a subgraph in G, that is, ∃ injection of V ( H ) into V ( G ) carrying edges to edges. This is different from asking
that H occur as an induced subgraph of G, which requires also that non-edges be carried to non-edges. That question is
different in an essential way: the event is not monotone in G.
3 Recall p ( n ) ∈ o ( π ( n )) means that lim sup p ( n ) /π ( n ) = 0, and p ( n ) ∈ ω ( π ( n )) means that lim sup π ( n ) /p ( n ) = 0.
4 When we study not K -subgraphs, but other subgraphs, this can really happen. We’ll discuss this below.
4

78
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

or p0 threshold.) This means that it is enough to show that K4 “shows up” slightly above π. This
is useful because some of our calculations break down far above π, not because there is anything
wrong with the underlying statement but because the inequalities we use are not strong enough to
be useful there and a direct calculation would need to take account of further moments.
To simplify our remaining calculations, then, let
p = n−2/3 g(n), so n4 p6 = g6
for any sufficiently small g(n) ∈ ω (1); we’ll see how this is helpful in the calculations.
By an earlier exercise,
Var( X ) = ∑ Var(XS ) + ∑ Cov( XS , XT )
S S6= T

XS is a coin (or Bernoulli rv) with Pr( XS = 1) = p6 . The variance of such an rv is p6 (1 − p6 ).


The covariance terms are more interesting.

1. If |S ∩ T | ≤ 1, no edges are shared, so the events are independent and Cov( XS , XT ) = 0.


2. If |S ∩ T | = 2, one edge is shared, and a total of 11 specific edges must be present for both
cliques to be present. A simple way to bound the covariance is (since E( XS ), E( XT ) ≥ 0) that
Cov( XS , XT ) = E( XS XT ) − E( XS ) E( XT ) ≤ E( XS XT ) = p11 .
3. If |S ∩ T | = 3, three edges are shared, and a specific 9 edges must be present for both cliques
to be present. Similarly to the previous case, Cov( XS , XT ) ≤ p9 .
     
n 6 n n
Var( X ) ≤ p (1 − p6 ) + p11 + p9
4 2, 2, 2 3, 1, 1
∈ O(n4 p6 + n6 p11 + n5 p9 )
= O( g6 n4−4 + g11 n6−22/3 + g9 n5−6 ) from p = n−2/3 g(n)
= O( g6 + g11 n−4/3 + g9 n−1 ) (4.4)
6 5 4/3 3
= O( g ) provided g ∈ O(n ) and g ∈ O(n)

This gives us the key piece of information. For g ∈ ω (1) but not too large, we have
Var( X ) O ( g6 ) O ( g6 )
∈ = = O ( g −6 ) ⊆ o (1 )
( E( X ))2 Θ((n4 p6 )2 ) Θ( g12 )
and we have only to apply the Chebyshev inequality (Cor. 14) (or better yet Paley-Zygmund,
Lemma 74 which we haven’t proven yet) to conclude that Pr( X = 0) ∈ o (1) and so
lim Pr(JK4 K) = 1. (4.5)
n

Since JK4 K is a monotone event, (4.5) holds even for g above the range we needed for the calculation
to hold. (Note, though, since there is so much “room” in the calculation, we could even have used
the upper bound O( g11 ) on 4.4, and not resorted to this monotonicity argument.)
Exercise: Show that the threshold for appearance, as a subgraph, of the graph with 5 edges and 4
vertices is n−4/5 .
Comment: For a general H the threshold for appearance of H in G (n, p) as a subgraph is determined
not by the ratio ρ H of edges to vertices, but by the maximum of this ratio over induced subgraphs
of H, call it ρmax H . We’ll see this on a problem set (and see [8]). If these numbers are different then
above n−1/ρ H the expected number of H’s starts tending to ∞ but almost certainly we have none;
once we cross the higher threshold n−1/ρmax H , there is an “explosion” of many of these subgraphs
appearing. (They show up highly intersecting in the fewer copies of the critical induced subgraph.)

79
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.3 Lecture 22 (21/Nov): Concentration of the number of prime


factors; begin Khintchine-Kahane for 4-wise independence
Now for an application of near-pairwise independence in number theory.
Let m(k) be the number of primes dividing k. Hardy and Ramanujan showed that for large k this
number is almost always close to log log k. Specifically, let k ∈U [n], and let M be the rv

M = m ( k ).

Always M ≤ lg k. But usually this is a vast overestimate:

Theorem 69 (Hardy & Ramanujan) Let 1 ≤ λ ≤ n1/4 .


p 1 + o (1)
Pr(| M − log log k| > λ log log k) < .
λ2

p much sense for k < 16; instead more formally the failure event should be Jk < 16 OR | M −
(This doesn’t make
log log k| > λ log log kK.)

Proof: We show an elegant proof due to Turan.


Before we begin the proof in earnest let’s simplify things. The function log log,
√ besides being
monotone, is so slowly growing
√ that it
√ hardly distinguishes between n and n. Specifically,
log log n = log 2 + log log n, so for k ≥ n we have

log log n − log 2 ≤ log log k ≤ log log n

which in particular implies

| M − log log n| + log 2 ≥ | M − log log k|

Consequently:
p √
Pr(| M − log log k| > λ log log k ) ≤ Pr(k ≤ n)
p √ √
+ Pr(| M − log log n| + log 2 > λ log log n − log 2|k > n) Pr(k > n)

Use Pr( A| B) = Pr( A ∩ B)/ Pr( B) ≤ Pr( A)/ Pr( B):



. . . = Pr(k ≤ n)
p
+ Pr(| M − log log n| + log 2 > λ log log n − log 2) (4.6)

| M−log log n|+log 2 log log n−log 2+log 2
If the event in (4.6) holds for λ ≥ 1 then | M−log log n|
≤ √ . Consequently,
log log n−log 2
setting p
0 | M − log log n| log log n − log 2
λ =λ p
(| M − log log n| + log 2) log log n
we can say p p
log log n − log 2 + log 2 log log n
λ≤ p p λ0 ∈ (1 + o (1))λ0
log log n − log 2 log log n − log 2

80
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Now continuing from (4.6),


p 1
Pr(| M − log log k| > λ log log k) ≤ √ (4.7)
n
+ Pr(| M − log log n| > λ0
p
log log n) (4.8)

If λ0 > (lg n)/


p
log log n, the probability on line (4.8) is 0 because M ≤ lg n, so we’re done.

It remains to consider the range λ0 ≤ (lg n)/ log log n. In this range, the 1/ n term in 4.7 is
p
1+ o (1) 1+ o (1)
dominated by 1/λ02 . Since 1
λ 02
≤ λ2
, it remains for the theorem only to bound (4.8) by λ 02
.

Proposition 70
1 + o (1)
Pr(| M − log log n| > λ0
p
log log n) < .
λ 02

For prime p let Jp|kK be the indicator rv for p dividing k. Note M = ∑ p Jp|kK.

E(Jp|kK) = bn/pc/n

So
1/p − 1/n ≤ E(Jp|kK) ≤ 1/p

1 1
−1+ ∑ p
≤ E( M) ≤ ∑ p
(4.9)
prime p≤n prime p≤n

For k ≥ 1 let π (k ) = |{ p : p ≤ k, p prime}|. We remind ourselves of the

Theorem 71 (Prime number theorem) π (k ) ∈ (1 + o (1))k/ log k.

We use the following corollary (proof omitted in class):5

1
Corollary 72 ∑prime p≤n p ∈ (1 + o (1)) log log n.
5 Proof: For k ≥ 2, π (k ) − π (k − 1) is the indicator rv for k being prime.

1 π ( k ) − π ( k − 1)
∑ p
= ∑ k
prime p≤n 2≤ k ≤ n
 
π (n) 1 1
= + ∑ π (k) −
n 2≤ k ≤ n −1
k k+1
π (k)
= o (1) + ∑ k ( k + 1)
2≤ k ≤ n −1
1 + o (1)
= o (1) + ∑ (k + 1) log k
by the prime number theorem
2≤ k ≤ n −1

There’s now a minor exercise. We want to move the (1 + o (1)) out of the summation but inside it refers to k, outside to n.
Exercise: Moving it outside is justified for any summation which tends to ∞, which we will shortly see is true of this one. In
the process we can also subsume the additive o (1). More formally, Exercise: Let ak , bk ≥ 0 be series such that ak ∈ (1 ± ok (1))bk
and ∑1n ak ∈ ωn (1) (where the subscripts of o or ω indicate the variable tending to ∞). Then ∑1n bk ∈ (1 + on (1)) ∑1n ak .
Now applying this:
1
= (1 + o (1)) ∑ (k + 1) log k
2≤ k ≤ n −1
1
= (1 + o (1)) ∑ k log k
using same exercise again
2≤ k ≤ n −1

81
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

So, from Eqn. (4.9) and Cor. 72 we know that


E( M ) ∈ (1 + o (1)) log log n.
Now for the variance of M. The proposition will follow from showing
Var( M) = (1 + o (1)) log log n (4.10)
and an application of the Chebyshev inequality.
To show (4.10), as always we can write
Var( M ) = ∑ Var(Jp|kK) + ∑ Cov(Jp|kK, Jq|kK) (4.11)
prime p≤n primes p6=q≤n

What we will discover is that the sum of covariances is very small and so the bound on Var( M ) is
almost as if were had pairwise independence between the events Jp|kK.
As we’ve already noted on occasion, for a {0, 1}-valued rv Y, Var(Y ) = E(Y )(1 − E(Y )) ≤ E(Y ).
Applying this we have

∑ Var(Jp|kK) ≤ ∑ E(Jp|kK) ∈ (1 + o (1)) log log n. (4.12)


prime prime
p≤n p≤n

Now to handle the covariances. Observe that for primes p 6= q, Jp|kKJq|kK is the indicator rv Jpq|kK.
n 1
Just as for primes, E(Jpq|kK) = b pq c/n ≤ pq . So

Cov(Jp|kK, Jq|kK) = E(Jpq|kK) − E(Jp|kK) E(Jq|kK)


  
1 1 1 1 1
≤ − − −
pq p n q n
 
1 1 1
≤ +
n p q
This is a very low covariance, which is crucial to the theorem.
 
1 1 1
∑ Cov(Jp|kK, Jq|kK) ≤ ∑ n p + q
primes primes
p6=q≤n p6=q≤n

2 1  prime
= (1 + o (1)) π (n) ∑ number
n p
primes theorem

p≤n
2
= (1 + o (1)) π (n) log log n Cor. (72)
n
2 log log n
= (1 + o (1))
log n
By the same exercise, this can be evaluated by comparison to an integral:
Z n −1
1
= (1 + o (1)) dx
2 x log x
Z log(n−1)
1
= (1 + o (1)) eu du substitution x = eu
log 2 eu u
Z log n
1
= (1 + o (1)) du
log 2 u
= (1 + o (1)) log log n
2

82
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

This is dominated by the sum of variances in (4.11), i.e., by (4.12) (this is even tending to 0), so we
have established (4.10). 2

4.3.1 4-wise independent random walk


Earlier we quoted the CLT or Khintchine-Kahane to conclude that the value of the Gale-Berlekamp
game is Ω(n3/2 ). Specifically we used this to show that for a symmetric random walk of length n,
X = ∑1n Xi with Xi ∈U {1, −1}, E(| X |) ∈ Ω(n1/2 ). Now we will show this from first principles—and
more importantly, using only information about the 2nd and 4th moments.
This is not only of methodological interest. It makes the conclusion more robust, specifically the
conclusion holds for any 4-wise independent space, and therefore implies a poly-time deterministic
algorithm to find a Gale-Berlekamp solution of value Ω(n3/2 ), because there exist k-wise indepen-
dent sample spaces of size O(nbk/2c ), as we will show in a later lecture.

Theorem 73 Let X = ∑1n Xi where the Xi are 4-wise independent and Xi ∈U {1, −1}. Then E(| X |) ∈
Ω(n1/2 ).

Proof: We start with two calculations. These calculations are made easy by the fact that for any
b b
product of the form Xi 1 · · · Xi 4 , with i1 , . . . , i4 distinct and bi ≥ 0 integer,
1 4

b 0 if any b is odd
E ( Xi 1 · · · Xib44 ) =
1 1 otherwise

So now
E( X 2 ) = ∑ E(Xi Xj ) = ∑ E(Xi2 ) = n
i,j i

E( X 4 ) = 3 ∑ E( Xi2 X 2j ) − 2 ∑ E( Xi4 ) = 3n2 − 2n.


i,j

One is tempted to apply Chebyschev’s inequality (in the form of Cor. 14) to the rv X 2 , because
we know both its expectation and its variance. Unfortunately, the numbers are not favorable!
Var( X 2 ) = 3n2 − 2n − n2 = 2n2 − 2n > n2 = ( E( X 2 ))2 . So Cor. 14 gives us only Pr( X 2 = 0) ≤
Var( X 2 )/( E( X 2 ))2 , where 1 ≤ Var( X 2 )/( E( X 2 ))2 , which is useless. (Let alone that we would actu-
ally want to bound the larger quantity Pr( X 2 ) < cn for some c > 0.)
There are two ways to solve this, and I’ll show you both. For the strongest bound for Gale-
Berlekamp, however, one may skip the next section and proceed to Sec. 4.4.2.

83
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.4 Lecture 23 (26/Nov): Cont. Khintchine-Kahane for 4-wise in-


dependence; begin MIS in NC

4.4.1 Paley-Zygmund: solution through an in-probability bound


Paley-Zygmund is usually stated as an alternative (to Cor. 14) lower-tail bound for nonnegative rvs;
i.e., it gives a way to say that a nonnegative rv A is “often large”.
Let µi be the ith moment of A. Knowing only the first moment µ1 of A is not enough, because
for any value—even infinite—of the first moment, we can arrange, for any δ > 0, a nonnegative rv
A which equals 0 with probability 1 − δ, yet has first moment µ1 . We just have to move δ of the
probability mass out to the point µ1 /δ, or, in the infinite µ case, spread δ probability mass out in a
measure whose first moment diverges.
However, a finite second moment µ2 is enough to justify such a “usually large” statement6 :
Actually PZ can be stated for rvs which are not necessarily nonnegative, so we’ll do it that way.

Lemma 74 (Paley-Zygmund) Let A be a real rv with positive µ1 and finite µ2 . For any 0 < λ ≤ 1,

λ2 µ21
Pr( A > (1 − λ)µ1 ) > .
µ2

Proof: Let ν be the distribution


R of A. Let p = Pr(RA > (1 − λ)µ1 ). (This is what we want to lower
bound.) Decompose µ1 = [−∞,(1−λ)µ ] x dν( x ) + ((1−λ)µ ,∞] x dν( x ). Now examine each of these
1 1
terms. Z
x dν( x ) ≤ (1 − p)(1 − λ)µ1 ≤ (1 − λ)µ1 (4.13)
[−∞,(1−λ)µ1 ]

Apply Cauchy-Schwarz7 to the functions x2 and Ix>(1−λ)µ1 . These are not effectively proportional
to each other w.r.t. ν (unless ν is supported on a single point, in which case µ21 = µ2 and the Lemma
is immediate), so we get a strict inequality,

Z r Z
x dν( x ) < p x2 dν( x ) = p1/2 µ1/2
2 (4.14)
((1−λ)µ1 ,∞] −∞

Putting (4.13), (4.14) together, µ1 < (1 − λ)µ1 + p1/2 µ1/2


2

λµ1 < p1/2 µ1/2


2

as desired. (There’s not normally much to be gained by preserving the “1 − p” factor in (4.13), but
it’s at least another reason for writing strict inequality in the Lemma.) 2
6 We
don’t need to also assume anything about µ1 because µ21 ≤ µ2 , by nonnegativity of the variance (special case of the
power means inequality).
7

R
Lemma 75 (Cauchy-Schwarz) If functions f , g are square-integrable w.r.t. measure ν then f ( x ) g( x ) dν( x ) ≤
qR R
f 2 ( x ) dν( x ) · g2 ( x ) dν( x ).
RR 2
Proof:
RR Squaring and subtracting sides, it suffices to show: 0 ≤ f ( x ) g2 ( y ) dν( x )dν(y) −
f ( x ) g( x ) f (y) g(y) dν( x )dν(y). This is equivalent (by swapping the dummy variables) to showing 0 ≤
RR 2
( f ( x ) g2 (y) + f 2 (y) g2 ( x ) − 2 f ( x ) g( x ) f (y) g(y)) dν( x )dν(y) = ( f ( x ) g(y) − f (y) g( x ))2 dν( x )dν(y) which is an
R R
integral of squares. 2
Say that f and g are effectively proportional to each other w.r.t. ν if this last integral is 0; this is the condition for equality
in Cauchy-Schwarz.

84
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

µ2 −µ21 µ2 −µ21
Comment: This gives Pr( A ≤ 0) ≤ µ2 which improves on the upper bound µ21
of Cor. 14. It
should be said though that PZ does not dominate Cor. 14 in all ranges (e.g., if the variance µ2 − µ21
is very small compared to the µ21 , and λ is small).
Returning to Gale-Berlekamp: Lemma 74 is not directly usable for our purpose, i.e., we cannot plug
in the rv A = | X |, because all it will tell us is that µ1 ≥ (1 − λ)λ2 µ31 /µ2 , i.e., µ2 ≥ (1 − λ)λ2 µ21 ,
which follows already from Cauchy-Schwarz (with the better constant 1). Note, this shows how
Paley-Zygmund serves as a more flexible, albeit slightly weaker, version of Cauchy-Schwarz.
Instead, we set B = | X | and A = B2 , and then apply Paley-Zygmund to A.
This is not a technicality. It means that we are relying on 4-wise independence, not just 2-wise
independence, of the Xi ’s. And indeed, Exercise: There are for arbitrarily large n, collections of n
pairwise independent Xi ’s, uniform in ±1, s.t. Pr(| X | = 0) = 1 − 1/n, Pr(| X | = n) = 1/n.

Corollary 76 Let B be a nonnegative rv with Pr( B = 0) < 1 and fourth moment µ4 ( B) < ∞. Then
√ µ5/2 /µ4 .
E( B) ≥ 16
25 52

2 2 2
p For any θ, E( B) ≥ θ Pr( B ≥ θ ) = θ Pr( B ≥ θ ), so, applying Lemma 74 to A = B , with
Proof:
θ = µ2 ( B)/5 and λ = 4/5,
r
µ2 ( B ) (4/5)2 µ2 ( B)5/2
E( B) ≥ Pr( B2 ≥ µ2 ( B)/5) ≥ √ .
5 5 µ4 ( B )
2

4.4.2 Berger: a direct expectation bound


µ2 ( B)3/2
Lemma 77 (Berger [11]) Let B be a nonnegative rv with 0 < µ4 ( B) < ∞. Then µ1 ( B) ≥ µ4 ( B)1/2
.

This is stronger than Cor. 76 for two reasons: the constant, and perhaps more importantly, because
µ2 /µ1/2
4 ≤ 1 (power mean inequality).
Of course, this lemma does not give an in-probability bound, so it is incomparable with Lemma 74.
Lemma 77 is a special case of the following, with p = 1, q = 2, r = 4:

Lemma 78 Let 0 < p < q < r and let B be a nonnegative rv with probability measure θ, θ ({0}) < 1. For
r− p q− p
− r −q
x > 0 let µ x = E( B x ). Then µ p ( B) ≥ µq ( B) r−q µr ( B) .

Proof: A more usual way to write this is


r −q q− p
r− p r− p
µ q ≤ µ p µr (4.15)

Note the exponents sum to 1, and that the average of p and r weighted by the exponents is q, i.e.,
r−q q−p
p· +r· =q
r−p r−p

so (4.15) is a consequence of an important fact, the log-concavity of the moments (i.e., of µq as a


function of q). We’ll show this next. 2

85
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Lemma 79 (Log concavity of the moments) If θ is a probability distribution on the nonnegative reals
2
with θ ({0}) < 1 then, for all q at which yq log2 y dθ converges absolutely, ∂∂2 q log µq > 0.
R

Proof:

∂2 ∂2
Z
2
log µq = 2 log yq dθ
∂ q ∂ q
∂ yq log y dθ
R
=
∂q µq
µq (y log2 y + yq−1 ) dθ − ( yq log y dθ )2
R q R
=
µ2q
µq y log2 y dθ − ( yq log y dθ )2
R q R

µ2q
1
Z Z  
= 2 x q yq log2 y − x q yq log x log y dθ ( x ) dθ (y)
µq
1
Z Z
= x q yq (log y − log x )2 dθ ( x ) dθ (y)
2µ2q
≥0

4.4.3 cont. proof of Theorem 73


We apply Lemma 77. Substituting our known moments of the rv | X |,

n3/2 √
E(| X |) ≥ 2 1/2
≥ n/3.
(3n − 2n)

Observe that we have lost only a small constant factor here compared with the precise value ob-
tained for a fully-independent sample space from the CLT. 2

4.4.4 Maximal Independent Set in NC


Parallel complexity classes

L = log-space = problems decidable by a Turing Machine having a read-only input tape and a
read-write work tape of size (for inputs of length n) O(log n).
NC = k NC k , where NC k = languages s.t. ∃c < ∞ s.t. membership can be computed, for inputs of
S

size n, by nc processors running for time logk n.


RNC = same, but the processors are also allowed to use random bits. For x ∈ L Pr( error ) ≤ 1/2,
for x ∈
/ L Pr( error ) = 0.
L ⊆ NC1 ⊆ . . . ⊆ NC ⊆ RNC ⊆ RP.
P-Complete = problems that are in P, and that are complete for P w.r.t. reductions from a lower
complexity class (usually, log-space).

86
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Maximal Independent Set

MIS is the problem of finding a Maximal Independent Set. That is, an independent set that is not
strictly contained in any other. This does not mean it needs to be a big, let alone a maximum
cardinality set. (It is NP-complete to find an independent set of maximum size. This is more
commonly known as the problem of finding a maximum clique, in the complement graph.)
There is an obvious sequential greedy algorithm for MIS: list the vertices {1, . . . , n}. Use vertex
1. Remove it and its neighbors. Use the smallest-index vertex which remains. Remove it and its
neighbors, etc.
The independent set you get this way is called the Lexicographically First MIS. Finding it is P-
complete w.r.t. L-reductions [22]. So it is interesting that if we don’t insist on getting this particular
MIS, but are happy with any MIS, then we can solve the problem in parallel, specifically, in NC2 .
We’ll see an RNC, i.e., randomized parallel, algorithm of Luby [64] for MIS. Then, we’ll see how to
derandomize the algorithm. (Some of the ideas we’ll see also come from the papers [55, 7]).
Notation: Dv is the neighborhood of v, not including v itself. dv = | Dv |.
Luby’s MIS algorithm:
Given: a graph G = (V, E) with n vertices.
Start with I = ∅.
Repeat until the graph is empty:

1
1. Mark each vertex v pairwise independently with probability 2dv +1 .

2. For each doubly-marked edge, unmark the vertex of lower degree (break ties arbitrarily).

3. For each marked vertex v, append v to I and remove the vertices v ∪ Dv (and of course all
incident edges) from the graph.

An iteration can be implemented in parallel in time O(log n), using a processor per edge.
We’ll show that an expected constant fraction of edges is removed in each iteration (and then we’ll
show that this is enough to ensure expected logarithmically many iterations).

Definition 80 A vertex v is good if it has ≥ dv /3 neighbors of degree ≤ dv . (Let G be the set of good
vertices, and B the remaining ones which we call bad.) An edge is good if it contains a good vertex.

1
Lemma 81 If dv > 0 and v is good, then Pr(∃ marked w ∈ Dv after step (1) ) ≥ 18 .

dv 1
This follows immediately from the following, using ∑w∈ Dv Pr(w marked ) ≥ 3 2dv +1 ≥ 19 .

1
Lemma 82 If { Xi } are pairwise independent events s.t. Pr( Xi ) = pi then Pr( Xi ) ≥ min( 21 , ∑ pi ).
S
2

Compare with the pairwise-independent version of the second Borel-Cantelli lemma. Of course,
that is about guaranteeing that infinitely many events occur, here we’re just trying to get one to
occur, but the lemmas are nonetheless quite analogous.
Proof: If ∑ pi < 1/2 then consider all events, otherwise there is a subset s.t. 1/2 ≤ ∑ pi ≤ 1
(consider two cases depending on whether any pi exceeds 1/2); apply the following argument just
to that subset.

87
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

∑ pi − ∑ Pr(Xi ∩ Xj )
[
Pr( Xi ) ≥ Bonferroni level 2
i< j

= ∑ pi − ∑ pi p j
i< j
1
≥ ∑ pi − 2 ∑ pi ∑ p j
i j
1
= (∑ pi )(1 −
2∑ i
p)
1
2∑ i
≥ p.

2
So we can run the algorithm using a pairwise independent space, with the bits having various
biases 2dv1+1 .

88
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.5 Lecture 24 (28/Nov): Cont. MIS, begin derandomization from


small sample spaces

4.5.1 Cont. MIS


Lemma 83 If v is marked then the probability it is unmarked in step (2) is ≤ 1/2.

Proof: It is unmarked only if a weakly-higher-degree neighbor is marked. Each of these events


happens, conditional on v being marked, with probability at most 2dv1+1 . Apply a union bound. 2

Corollary 84 The probability that a good vertex is removed in step 3 is at least 1/36.

Proof: Immediate from the previous two lemmas. 2


Now for our measure of progress.

Lemma 85 At least half the edges in a graph (V, E) are good.

Proof: Sort the vertices from left to right so that du < dv ⇒ u < v (ties arbitrarily). Direct each edge
from lower to higher degree vertex; now we have in-degrees din v and out-degrees dv .
out

A bad vertex has > 2dv /3 neighbors with degree > dv . (4.16)

For two sets of vertices V1 , V2 let E(V1 , V2 ) be the edges directed from V1 to V2 . (In particular
E = E(V, V ).) If by Ê(V1 , V2 ) we mean the set of undirected edges associated with the directed
edges E(V1 , V2 ), then note that Ê( B, B) ∪ Ê( B, G ) ∪ Ê( G, B) ∪ Ê( G, G ) is a disjoint partition of the
edges of the graph, | E( B, B)| = | Ê( B, B)|, and

Ê( B, B) is the set of bad edges.


Ê( B, G ) ∪ Ê( G, B) ∪ Ê( G, G ) is the set of good edges.

From (4.16), every v ∈ B has at least twice as many outgoing edges as incoming edges. Each edge
in E( B, B) contributes one incoming edge to B, so there must be at least 2| E( B, B)| outgoing edges
from B; only | E( B, B)| of these can be accounted for by outgoing edges of E( B, B). The remainder
are accounted for by edges in E( B, G ), so | E( B, G )| ≥ | E( B, B)|. Consequently | E( B, B)| ≤ | E|/2.
2
Due to the corollary, each good edge is removed with probability at least 1/36. Of course the edge-
removals are correlated, but in any case, since at least half the edges are good, the expected fraction
of edges removed is at least 1/72. In the next section we analyze how long it takes this process to
terminate.
First, a comment: the analysis above was not sensitive to the precise probability 2dv1+1 with which
vertices were marked. For instance, it would be fine if each vertex were marked with some proba-
bility pv , 4d1v ≤ pv ≤ 2dv1+1 ; the only effect would be to change the “1/72” to some other constant.
1
We will actually only modify each 2dv +1 by a factor 1 + on (1) when we derandomize the algorithm.

89
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.5.2 Descent Processes


(This is not widespread terminology but things like this come up often. The coupon collector
problem which we saw in Sec. 1.3.2 is another example.)
In a descent process the state of the process is a nonnegative integer n; the process terminates
when n = 0. At n > 0, you sample a random variable X from a distribution pn on {0, . . . , n}, and
transition to state n − X. The question is, how many iterations does it take you to hit 0? Let Tn be
this random variable, when you start from state n. (So E( T0 ) = 0.) Write θn = E pn ( X ) = ∑0n ipn (i ).

Lemma 86 For i > 0 let g(i ) = mini≤m≤n θm . Then E( Tn ) ≤ ∑1n 1


g (i )
.

This bound is pretty good if θn is monotone increasing, which is the common situation. It can be a
bad bound if there are a few bottlenecks which the descent process usually avoids. In our case we
won’t know g exactly, but we will know lower bounds on it, which is enough.
Note that g is nondecreasing and g(n) = θn .
Proof: The lemma is trivial for n = 0 (the LHS is 0 and the RHS is an empty summation). For
n > 0 proceed by induction.
n
E( Tn ) = 1 + pn (0) E( Tn ) + ∑ pn (i ) E( Tn−i )
i =1

n
(1 − pn (0)) E( Tn ) = 1 + ∑ pn (i ) E( Tn−i )
i =1
n n −i
1
≤ 1 + ∑ p n (i ) ∑ induction
i =1 j =1
g( j)
!
n n n
1 1
= 1 + ∑ p n (i ) ∑ − ∑
i =1 j =1
g ( j ) j = n − i +1 g ( j )
n n n
1 1
= 1 + (1 − pn (0)) ∑ − ∑ p n (i ) ∑
j =1
g ( j ) i =1 j = n − i +1
g ( j)
!
n n n n
1 i i 1
= 1 + (1 − pn (0)) ∑ − ∑ p n (i ) + ∑ p n (i ) − ∑
j =1
g ( j ) i =1 g ( n ) i =1 g ( n ) j = n − i +1 g ( j )
n n
1 i
≤ 1 + (1 − pn (0)) ∑ − ∑ p n (i ) g nondecreasing, g(n) = θn
j =1
g ( j ) i =1 θn
n
1
= 1 + (1 − pn (0)) ∑ −1
j =1
g( j)

4.5.3 Cont. MIS


| E|
As a consequence, the expected number of iterations until the algorithm terminates is ≤ ∑1 72i ∈
O(log | E|) ∈ O(log n). Each iteration alone takes time O(log n) to do the local marking and unmark-
ing, and removing vertices and edges. This is an RNC2 algorithm, using O(| E|) processors, for MIS.

90
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

In Section 4.6.5 we’ll see how we can derandomize this using a factor of O(n2 ) more processors,
and thereby put MIS in NC2 .
Question: Here is a different parallel algorithm for MIS. At each vertex v choose uniformly a real
number rv in [0, 1]. Put a vertex in I if rv > ru for every neighbor u of v. Remove I and all its
neighbors. Repeat. (We don’t really need to pick random real numbers; we can just pick multiples
of 1/n3 , and we’re unlikely to have any ties to deal with.) This process is a bit simpler than Luby’s
algorithm because there is no “unmarking”. Question: If the rv ’s are chosen independently, is it the
case that the expected number of edges that are removed, is a constant fraction of | E|? If so, is this
still true if the rv ’s are pairwise independent?

4.5.4 Begin derandomization from small sample spaces


We discussed in an earlier lecture the notion of linear error-correcting codes. We worked over the
base field GF (2), also known as Z/2. (Which is to say, we added bit-vectors using XOR.) Encoding
of messages in such a code is simply multiplication of the message, as a vector v ∈ F m , by the
generator matrix C of the code; the result, if C is m × n, is an n-bit codeword.
 
generator matrix
(message v)   = (codeword vC )
C

The set of codewords is exactly Rowspace(C ).


The minimum weight of a linear code is the least number of 1’s in a codeword.
If the minimum weight is k + 1 then the code is k-error-detecting; this property ensures

1. Error detection up to k errors


2. Error correction up to bk/2c errors.

This property is not possessed by codes achieving near-optimal rate in Shannon’s coding theorem.
That theorem protects only against random noise, and if that is what you want, then the mininimum
weight property is too strong to allow optimally efficient codes. It protects against adversarial noise.

91
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

4.6 Lecture 25 (30/Nov): Limited linear independence, limited sta-


tistical independence, error correcting codes.
4.6.1 Generator matrix and parity check matrix
Error detection can be performed with the aid of the parity check matrix M:

Left Nullspace( M ) = Rowspace(C )

 
  parity  
generator matrix  check 
   =  0 
 matrix 
C
M

wM = 0 ⇐⇒ w ∈ Rowspace(C ) ⇐⇒ w is a codeword

 
Every vector in   Every k rows
Rowspace(C ) has ⇐⇒ of M are linearly
weight ≥ k + 1 independent
 

In coding theory terms, this is an (n, m, k + 1) code over GF (2). (Unfortunately, coding theorists
conventionally use the letters (n, k, d) but we have k + 1 reserved for the least weight, because we’re
following the conventional terminology from “k-wise independence”.)
For any fixed values of n and k, the code is most efficient when the message length m, which is the
number of rows of C, is as large as possible; equivalently, the number of columns of M, ` = n − m,
is as small as possible. So we’ll want to design a matrix M with few columns in which every k rows
are linearly independent.
But first, let’s see a connection between linear and statistical independence.
Let B be a k × ` matrix over GF (2), with full row rank. (So k ≤ `.)
If x ∈U ( GF (2))` then y = Bx ∈U ( GF (2))k ,
 
   
y B  x 
=

because the pre-image of any y is an affine subspace (a translate of the right nullspace of B). (We
already made this observation in the context of Freivalds’ verification algorithm, Theorem 26.)
Now, if we have a matrix M with n rows, of which every k are linearly independent, then every k
bits of z = Mx are uniformly distributed in ( GF (2))k .

   
 
   
   
 z  =  M  x 
   
   

We’ve exhibited dual applications of the parity check matrix:

92
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

• Action on row vectors: checking validity of a received word w as a codeword. (s = wM is


called the “syndrome” of w; in the case of non-codewords, i.e., s 6= 0, one of the ways to
decode is to maintain a table containing for every s, the least-weight vector η s.t. η M = s.
Then w − η is the closest codeword to w. This table-lookup method is practical for some very
high rate codes, where there are not many possible vectors s.)
• Action on column vectors: converting few uniform iid bits into many k-wise independent
uniform bits.

Now we can see an entire sample space on n bits that are uniform and k-wise-independent. At the
right end we place the uniform distribution on all 2` vectors of the vector space GF (2)` .

   
 
    0 0 ... 1 1
 Ω 
   
  = 
 M  ...
 unif. dist. on cols 
    0 1 ... 0 1

Ω is the uniform distribution on the columns on the LHS.


n−`
Maximizing the transmission rate m n = n of a binary, k-error-detecting code, is equivalent to
minimizing the size |Ω| = 2 of a linear k-wise independent binary sample space.
`

So how big does |Ω| have to be?

Theorem 87

1. For all n there is a sample space of size O(nbk/2c ) with n uniform k-wise independent bits.
For larger ranges one has: For all n there is a sample space of size O(2k max{m,dlg ne} ) with n k-wise
independent rvs, each uniformly distributed on [2m ].
2. For all n, any sample space on n k-wise independent bits, none of which is a.s. constant, has size
Ω(nbk/2c ).

We show Part 1 in Sec. 4.6.3; Part 2 will be on the problem set.


First though, returning to the subject of codes, there is a question worth asking even though we
don’t need it for our current purpose:

4.6.2 Constructing C from M


Suppose we have constructed a parity check matrix M. How do we then get a generator matrix C?
One should note that over a finite field, Gram-Schmidt does not work. Gram-Schmidt would have
allowed us to produce new vectors which are both orthogonal to the columns of M and linearly
independent of them. But this is generally not possible: the row space of C and the column space
of M do not necessarily span the n-dimensional space. For example over GF (2) we may have
 
 1
C= 1 1 , M=
1

However, Gaussian elimination does work over finite fields, and that is what is essential.

93
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Specifically, given n × a M and n × b N, b < n − a, N of full column rank (i.e., rank b), we show
how to construct a vector c s.t. cM = 0 and c0 is linearly independent of the columns of N. (Then
adjoin c0 to N and repeat.)
Perform Gaussian elimination on the columns of N so that it is lower triangular, with a nonzero
diagonal. (That is, allowed operations are adding columns to one another, and permuting rows.
When permuting rows of N, permute the rows of M to match.) Obviously this does not change the
column space of N (except for the simultaneous permution of rows in N and M).
Now take the submatrix of M consisting of the a + 1 rows b + 1, . . . , b + a + 1. By Gaussian elimina-
tion on these rows we can find a linear dependence among them. Extending that dependence to the
n-dimensional space with 0 coefficients elsewhere yields a vector c s.t. cM = 0 and s.t. the support
of c is disjoint from the first b coordinates. Then c is linearly independent of the column space of N
because the restriction of N to its first b rows is nonsingular, so any linear combination of the rows
of N has a nonzero value somewhere among its first b entries.

4.6.3 Proof of Thm (87) Part (1): Upper bound on the size of k-wise independent
sample spaces
(We’ll do this carefully for producing binary rvs and only mention at the end what should be done
for larger ranges.)
This construction uses the finite fields whose cardinalities are powers of 2. These are called exten-
sion fields of GF (2). If you are not familiar with this, just keep in mind that for each integer r ≥ 1
there is a (unique) field with 2r elements. We can add, subtract, multiply and divide these without
leaving the set; in particular, in the usual way of representing the elements of the field as bit strings
of length r, addition is simply XOR addition.8 Specifically, we can think of the elements of GF (2r )
as the polynomials of degree ≤ r − 1 over GF (2), taken modulo some fixed irreducible polynomial
p of degree r. That is, a field element c has the form c = cr−1 xr−1 + . . . + c1 x + c0 (mod p( x )), and
our usual way of representing this element is through the mapping β : GF (2r ) → ( GF (2))r given
by β(c) = (cr−1 , . . . , c0 ). (I.e., the list of coefficients.)
But all we really need today are three things: (a) Like GF (2), GF (2r ) is a field of characteristic 2,
i.e., 2x = 0. (b) For matrices over GF (2r ) the usual concepts of linear independence and rank apply.
(c) β is injective, linear (namely β(c) + β(c0 ) = β(c + c0 )), and β(1) = 0 . . . 01.
Now, round n up to the nearest n = 2r − 1, and let a1 , . . . , an denote the nonzero elements of the
field. Let M1 be the following Vandermonde matrix over the field GF (2r ):

a21 a1k−1
 
1 a1 ...
 1 a2 a22 ... a2k−1 
M1 = 
 ...

... ... ... ... 
1 an a2n ... akn−1
Exercise: Every k rows of M1 are linearly independent over GF (2r ). (Form any such submatrix B,
say that using the first k rows. Verify that Det( B) = ∏i< j ( a j − ai ).)
M1 is an n × k matrix over GF (2r ). Next, expand each of its entries as a row vector of bits, thus
forming an n × kr matrix M2 over GF (2):

β( a1k−1 ) = 001
 
β(1) = 001 β( a1 ) = 001 ...
 β(1) = 001 β( a2 ) = 010 ... β( a2k−1 ) = . . . 
M2 =  
 ... ... ... ... 
β(1) = 001 β( an ) = 111 ... k − 1
β( an ) = . . .
8 See any introduction to Algebra, for instance Artin [9].

94
Schulman: CS150 2018 CHAPTER 4. LIMITED INDEPENDENCE

Corollary: Every k rows of M2 are linearly independent over GF (2).


Actually it is possible to even further reduce the number of columns while retaining the corollary.
First, we can drop the leading 0’s in the first entry.
Second, we can strike out all batches of columns generated by positive even powers.

β( a31 ) = 001
 
1 β( a1 ) = 001 ... ...
 1 β( a2 ) = 010 β( a32 ) = . . . ... ... 
M3 = 
 
... ... ... ... ... 
1 β( an ) = 111 β( a3n ) = . . . ... ...

Lemma 88 Every set of rows that is linearly independent (over GF (2r )) in M1 is also linearly independent
(over GF (2)) in M3 . Hence every k rows of M3 are linearly independent.

Proof: Let a set of rows R be independent in M1 ; we show the same is true in M3 . Since M3 is
over GF (2), this is equivalent to saying that for every ∅ ⊂ S ⊆ R, the sum of the rows S in M3 is
nonzero.
So we are to show that S independent in M1 has nonzero sum in M3 . Independence in M1 implies
in particular that the sum in M2 of the rows in S is nonzero.
If |S| is odd then the same sum in M3 has a nonzero first entry and we are done.
Otherwise, let t > 0 be the smallest value such that ∑i∈S ait 6= 0; it is enough to show that t is odd.
Suppose not, so t = 2t0 . Then, since Characteristic( GF (2r )) = 2,
!2
0 0
∑ a2ti = ∑ ait
i ∈S i ∈S
0
so ∑i∈S ait 6= 0, contradicting minimality of t. 2
Finally for the binary construction, recalling that n = 2r − 1, we have |Ω| = 21+rbk/2c ∈ O(nbk/2c ).
Comment: If you want n k-wise independent bits with nonuniform marginals, then this construction
doesn’t work. The best general construction, due to Koller and Megiddo [58], is of size O(nk ).
Larger ranges: this is actually simpler because we’re not achieving the factor-2 savings in the expo-
nent. Let r, as in the statement, be r = max{m, dlg ne}. Just form the matrix M1 .

4.6.4 Back to Gale-Berlekamp


We now see a deterministic polynomial-time algorithm for playing the Gale-Berlekamp game. As
we demonstrated last time, it is enough to use a 4-wise independent sample space in order to
achieve Ω(n3/2 ) expected performance. The above construction gives us a 4-wise independent
sample space of size O(n2 ). All we have to do is exhaustively list the points of the sample space
until we find one with performance Ω(n3/2 ).

4.6.5 Back to MIS


For MIS we need only pairwise independence, but want the marking probabilities pv to be more
varied (approximately 2dv1+1 ). This, however, is easy to achieve: use the matrix M1 , with k = 2,
without modifying to M2 and M3 . This generates for each v an element in the field GF (2r ); these
elements are pairwise independent; and one can designate for each v a fraction of approximately
1
2dv +1 elements which cause v to be marked. The deterministic algorithm is therefore as described
in Sec. 4.4.4, with a space of size O(n2 ).

95
Chapter 5

Lovász local lemma

5.1 Lecture 26 (3/Dec): The Lovász local lemma


The Lovász local lemma is a fairly widely applicable tool for controlling the interactions among a
large collection of random variables.
Consider a probability space in which there is a long (possibly infinite) list of “bad” events B1 , . . .
that might occur. We may wish to show that the union of the bad events is not the entire space.
That is, with c denoting complement, we wish to show that ( Bi )c = Bic 6= ∅.
S T

There are in the probabilistic method two elementary tools to show this kind of statement:

1. The union bound. If ∑ Pr( Bi ) < 1 then ( Bi )c 6= ∅.


S

2. Independence. If Pr( Bi ) < 1 for all i, and the Bi are mutually independent, then for any finite
Bi = ∏1n (1 − Pr( Bi )) > 0.
T c
n, Pr

Let’s consider two toy examples of using just the second tool, independence.
First toy example: no matter how many fair coins you toss, it is possible for all to come up Heads.
Second toy example: Show that any finite (or even infinite but locally finite) tree has a valid 2-
coloring. Of course this is trivial (and true also without assuming local finiteness), but can you
show it by just coloring the vertices uniformly at random? Suppose the tree has n + 1 vertices.
There are n “bad” events, each corresponding to a particular edge being monochromatic; these are
mutually independent (check). So the probability that the tree is properly colored is 2−n > 0. This
shows that there is a valid coloring of the tree, even though the probability that a random coloring
is valid is vanishing in n. (For an infinite, locally finite tree, extend the argument by compactness.)
The second toy example illustrates that the use of independence is very fragile. If you insert into
the tree just one extra edge closing an odd-length cycle—no matter how long that cycle is—the de-
pendence induced among distant events is enough to ruin the 2-colorability. So even an assumption
of (n − 1)-way independence among n variables is not sufficient to imply that all good events may
intersect.
What Lovász did was to create an argument, somewhat like the independence argument we set
out above, but more robust, which still works in situations where the bad events are not entirely
independent. His argument is a wonderful combination of tools (1) and (2). We present here one
form of Lovász’s bound.

96
Schulman: CS150 2018 CHAPTER 5. LOVÁSZ LOCAL LEMMA

Definition 89 (Dependency graph) Let B1 , . . . be a finite set of events labeled by a set S . A “dependency
graph” for the events is a directed graph whose vertices are the set S , with the following property. Let Di be
the set of in-neighbors of i. (We do not include an event among its own in-neighbors.) Then for all i, Bi is
independent of the product random variable ∏ j∈S−{i}− Di Bj .

Observe that the same set of events may be assigned many different dependency graphs. In partic-
ular, any edges may be added; more significantly, there can be incomparable minimal dependency
graphs. Example: consider three events indexed by i = {1, 2, 3}, each with Pr( Bi ) = 1/2, pairwise
independent but such that the number of bad events is always even. A digraph is a dependency
graph iff every vertex has indegree at least 1.
Those familiar with the notion of a “graphical model” should note this is a different concept.1

Lemma 90 (Lovász [30]) Suppose that for all i, | Di | ≤ ∆ and Pr( Bi ) ≤ 1


Bic 6= ∅. In
T
e ( ∆ +1)
. Then i ∈S
other words, it is possible for all good events to coincide.

The factor of e here is best possible; this was shown by Shearer [79].
An application: k-SAT with restricted intersections. This is a canonical application of the Lovász lemma.
Let H = (V, E) be a SAT instance in conjunctive normal form (CNF).
That is, V is a set of boolean variables; a literal is a variable v ∈ V or its negation. E is a collection
of clauses, each T ∈ E being a set of literals, which is satisfied if at least one of them is satisfied. H
is satisfied if all T ∈ E are satisfied. So H is a conjunction of disjunctions.
We say that two clauses are neighbors if they share any variable (not necessarily literal).

Corollary 91 Suppose every clause in T ∈ E has size ≥ k and has at most d neighbors. If d + 1 ≤ 2k /e
then H is satisfiable.

Two cases of this corollary are easy: if the total number of clauses is small, or if the clauses are all
disjoint (share no variables). The local lemma uses both effects:

union bound: local (in the dependency graph)
independence: global

Proof: Make a random, uniform assignment to the variables V. For each T there is a “bad event”
BT = no literal in T is satisfied. Pr( BT ) = 2−k . After excluding the edges intersecting T, BT is
independent of the collection of all remaining events, because even finding out the exact coloring
of V within those edges (not merely which events occurred) does not affect the probability of event
BT . Now apply Lemma (90). 2
A closely related application is to NAE-SAT (“Not All Equal”, or “Property B”): Let H = (V, E) be
hypergraph (a set system whose elements we call edges). Specifically V is finite and E ⊆ 2V . H has
Property B if V can be two-colored so that no edge is monochromatic. This is just like SAT except
that two assignments, rather than one assignment, are ruled out per clause.
1 To simplify comparison let us consider only bidirected dependency graphs and undirected graphical models. So in

either case, we are considering a simple undirected graph among the events Bi , i ∈ S. In a graphical model, the condition is
that Bi be independent of ∏ j∈S−({i}∪ Di ) Bj conditional on ∏ j∈ Di Bj .
Consider as an example the following graphical model (a special case of the Ising model): given that b neighbors j of i
b−c
have Bj occur, and that the remaining c neighbors of i have Bcj occur, Pr( Bi ) = eb−ec +ec−b . (Exercise: there is a joint distribution
with this property.) The graph of the graphical model is not a dependency graph for this space. The information that a
single event Bj has occurred, even if very far away from i in the graph, will at least slightly bias toward Bi occurring.
Conversely, consider the above example of three events of which an even number always hold. One (bidirected) depen-
dency graph for this is a path of two edges. However, this graph is not valid for a graphical model on the sample space: if
the path is 1-2-3, B1 is not independent of B3 conditional on B2 . To the contrary, conditional on B2 , B3 determines B1 .

97
Schulman: CS150 2018 CHAPTER 5. LOVÁSZ LOCAL LEMMA

Corollary 92 Suppose every edge T ∈ E has size ≥ k and intersects at most d other edges. If d + 1 ≤ 2k−1 /e
then H has Property B.

Proof: of the local lemma:


  ∆ | R|
c ≥
We demonstrate more concretely that for any finite subset R of S , Pr
T
B
i∈ R i ∆ +1 .
This is a corollary of the following assertion which we prove by induction on m:
For any set of m − 1 events (w.l.o.g. relabeled as) B1 , . . . , Bm−1 and any other event (w.l.o.g. relabeled
as) Bm ,  
1
Bcj  ≤
\
Pr  Bm | . (5.1)
j ≤ m −1
∆+1
| {z }
p1

If m = 1 this is immediate from the hypothesis of the lemma. Now suppose the claim is true for
values smaller than m. Reorder B1 , . . . , Bm−1 so that for some 0 ≤ d ≤ ∆, Dm ∩ {1, . . . , m − 1} =
{m − d, . . . , m − 1} =: D . Write also D 0 = {1, . . . , m − d − 1}. If d = 0, Eqn. (5.1) is again immediate
from the hypothesis of the lemma. Otherwise, write
   

Bcj  | Bcj 
\ \
Pr  Bm ∩ 
j∈D j∈D 0
| {z }
p2
    (5.2)
Bcj  Pr  Bcj | Bcj 
\ \ \
= Pr  Bm |
j ≤ m −1 j∈D j∈D 0
| {z }| {z }
p1 p3

p2
We’re going to upper bound p1 by expressing it in the form p1 = p3 , and showing the upper bound
1
p2 ≤ e ( ∆ +1)
and the lower bound p3 ≥ 1/e.
Term p2 is the 
application of independence at the global level. We use a simple upper bound for it:
Bm ∩ Bcj ⊆ Bm , so
T
j∈D

     

Bcj  Bcj  Bcj 


\ \ \
Pr  Bm ∩  | ≤ Pr  Bm |
j∈D j∈D 0 j∈D 0
| {z }
p2
= Pr( Bm )
1
≤ .
e ( ∆ + 1)

Term p3 is the union bound at the local level. We could in fact write it explicitly as a union bound
but the lemma would suffer the slightly inferior factor of 4 in place of e, so we use the following

98
Schulman: CS150 2018 CHAPTER 5. LOVÁSZ LOCAL LEMMA

slightly slicker derivation.


   

Bcj | Bcj  ∏ Pr  Bcj | Bcj0 


\ \ \
Pr  =
j∈D j∈D 0 m − d ≤ j ≤ m −1 j0 < j
| {z }
p3
d



∆+1
∆



∆+1
> 1/e

Where the first inequality is by induction because every conditional probability on the right-hand
side is of the form (5.1) and involves at most m − 1 sets.
Combining our two bounds we obtain the following from (5.2):
1
p2 e ( ∆ +1) 1
p1 = ≤ = .
p3 1
e
∆+1

99
Schulman: CS150 2018 CHAPTER 5. LOVÁSZ LOCAL LEMMA

5.2 Lecture 27 (5/Dec): Applications and further versions of the


local lemma
5.2.1 Graph Ramsey lower bound
Ramsey’s theorem (the upper bound on the Ramsey number) runs in the opposite direction to
Property B because it establishes the existence of something monochromatic. Not surprisingly,
then, our use of the local lemma will be to provide a lower bound on Ramsey numbers. We already
saw such a lower bound: a simple union bound argument gave R(k, k) ≥ (1 − o (1)) √k 2k/2 . Now
e 2
we will see how to improve this.
k

Theorem 93 R(k, k ) ≥ max{n : e(2k )(k−
n
2) ≤ 2
(2)−1 }. Thus R ( k, k ) ≥ (1 − o (1)) k 2 2k/2 .
e

To see that the condition on n implies the conclusion, raise each side to the power k−1 2 . The e, (2k )
and −1 are inconsequential; we find that if n satisfies the following, then R(k, k ) ≥ n:
−(k−2) log(k−2)+k−2 k2 − k
(1 + o (1))ne k −2 ≤ 2 2( k −2)
k +1
ne/k ≤ (1 + o (1))2 2

Proof: As before, sample a graph from G (n, 1/2). For each set of k vertices the “bad event” of a
k
clique or independent set occurs with probability 21−(2) . For the dependency graph, connect two
subsets if they intersect in at least two vertices. The degree of this dependency graph is strictly less
−2
than (2k )(nk− 2 ) (relying on k ≥ 3, since the theorem is easy for k = 2) because this counts neighbors
with the extra information of a distinguished edge in the intersection, so ∆ + 1 ≤ (2k )(k−
n
2). 2
This bound, due to Spencer [83], improves on the union bound by a factor of only 2. While the
improvement factor is very small, qualitatively it is meaningful. It shows that a certain negative
correlation among edges is possible: you have a graph which is big enough to have on average
many copies of each graph of size k. (The union bound was tailored so that the expected number of
copies of a k-clique was just below 1/2, and the same for a k-indep-set. Now we have twice as many
places to put each of the k vertices, so we expect to see about 2k−1 of each of these subgraphs.) Yet
as you look across different subgraphs of this graph, there is a kind of negative correlation which
prevents the occurrence of these extreme graphs (the k-clique and the k-indep-set).
The FKG inequality helps illustrate that the Lovász local lemma did something truly non-local in
the probability space. Any independent sampling method would result in a monotone event such
as a specific k-clique, being at least as likely as the product of its constituent events (the indicators
of each edge in the clique).

It is a major open problem to improve on either lim inf 1k log R(k, k) ≥ 2 or lim sup 1k log R(k, k) ≤
4. Actually this gap is small by the standards of Ramsey theory. For more on the general topic
see [21].

5.2.2 van der Waerden lower bound


Here is another “eventual inevitability” theorem; as before, the local lemma will provide a counter-
point.

Theorem 94 (van der Waerden [88]) For every integer k ≥ 1 there is a finite W (k) such that every two-
coloring of {1, . . . , W (k )} contains a monochromatic arithmetic sequence of k terms.

100
Schulman: CS150 2018 CHAPTER 5. LOVÁSZ LOCAL LEMMA

The best upper bound on W (k), due to Gowers [42], is2

2k +9
22
W (k ) ≤ 2| 2 {z } .
five two’s

The gap in our knowledge for this problem is even worse than for the graph Ramsey numbers: the
k −1
current lower bound, which we’ll see below, is W (k) ≥ (k2+2)e . (A better bound is known for prime
k.) First we show an elementary lower bound:

k −1 √
Theorem 95 W (k ) > 2 2 k − 1.

Proof: Color uniformly iid. The probability of any particular sequence being monochromatic is
21−k . The union bound shows that all these events can be avoided provided

n ( n − 1 ) 1− k
2 <1 (5.3)
k−1
n −1
(count n places the sequence can start, while the step size is bounded by k −1 ). The bound n ≤
k −1 √
2 2 k − 1 implies 5.3. 2
Now for the improved bound through the local lemma:

2k −1
Theorem 96 (Lovász [30]) W (k) ≥ ( k +2) e
.

Proof: Again color uniformly iid. For a dependency graph, connect any two intersecting sequences.
The degree of this graph is bounded by
( n − 1) k 2
k−1
(k2 choices for which elements they have in common, nk− −1 possible step sizes). Thus all the bad
1
events can be avoided if
1
21−k < ,
k 2 ( n −1)
e (1 + k −1 )
which in turn is implied by the bound in the statement of the lemma.
The improvement here came because a union bound over approximately n2 /k terms was replaced
by a smaller factor of about nk. 2

5.2.3 Heterogeneous events and infinite dependency graphs


There are two generalized forms of the local lemma that come fairly easily.
2 The original bound of van der Waerden is of Ackermann type growth [2]. The first primitive recursive bound, due to
Shelah [80], is this. For any function f : N → N let fb : N → N be fb(1) = f (1), fb(k ) = f ( fb(k − 1)) (k > 1). So, letting
exp (k ) := 2k , the tower function is T = e[
2 2
b or in other words e[
xp . Shelah’s bound is of the form T xp .
[ 2

101
Schulman: CS150 2018 CHAPTER 5. LOVÁSZ LOCAL LEMMA

Heterogeneous events

It is not necessary that we use the same upper bound on Pr( Bj ) for all j. Instead, we can allow
events of various probabilities. Those which are more likely to occur, must have in-edges from
events of smaller total probability. On the other hand less likely events can tolerate more in-edges
(as measured by total probability). This is formulated, in a slightly circuitous way, in the following
version of the lemma.

Lemma 97 Let events Bj and dependency edges E be as before. If there are x j < 1 such that for all j,

Pr( Bj ) ≤ x j ∏ (1 − x k )
(k,j)∈ E

Then
Bcj ) ≥ ∏ (1 − x j ).
\
Pr(
j

The proof method is the same.


 Show inductively
 on m that (for any subcollection of m events and
c
any ordering on them), Pr Bm | j≤m−1 Bj ≤ xm .
T

Infinite dependency graphs

Typically, the restriction that S is finite can be dropped due to compactness of the probability space.
Specifically, suppose that—as in most applications—there is an underlying space of independent
rvs Xk , k ranging over some index set U, and each Xk ranging in some compact topological space Rk .
Moreover suppose that every one of the bad events Bj is a function of only finitely many of the Xk ’s,
say of k ∈ Uj ⊂ U, Uj finite. Suppose moreover that each Bj is an open set in ∏k∈Uj Rk . Then each Bcj
is a closed set in the product topology on ∏k∈U Rk . Since the product topology is itself compact by
Tychonoff’s theorem, it satisfies the Finite Intersection Property: a collection of closed sets of which
any finite subcollection has nonempty intersection, has nonempty intersection. Consequently, under
the additional topological assumptions made here—which are trivially satisfied if each Xk takes on
only finitely many values—the supposition in the local lemma (in either formulation 90 or 97) that
S is finite, may be dropped.

Example

Here is an example that takes advantage of both the above generalizations.


For a word w ∈ Σ∗ , Σ a set which we think of as an “alphabet”, let w = wn . . . w1 be the reversal of
w and let DPal(w) = n1 dHamming (w, w). Large DPal means that w is far from being a palindrome.
Consider the infinite ternary tree T3 . The local lemma implies the existence of strongly palindrome-
avoiding labelings T3 :

Theorem 98 For all δ > 0 there is an integer |Σ| < ∞, and a labeling α : Vertices( T3 ) → Σ, such that for
all words w of length > 1 that you read along a simple path in T3 , DPal(w) ≥ 1 − δ.

(The proof gives |Σ| < exp(1/δ).) This is used in communication theory [76].
Comment: it is an open problem, not only for all δ > 0 but even for any δ < 1, to explicitly construct
such a tree. Namely, we want an algorithm which for the rooted ternary tree, and a vertex v at
distance n from the root, outputs α(v) in time poly(n).

102
Schulman: CS150 2018 CHAPTER 5. LOVÁSZ LOCAL LEMMA

5.3 Lecture 28 (7/Dec): Moser-Tardos branching process algorithm for the local lemma

Now we describe an algorithm for finding the satisfying assignments whose existence the local lemma
guarantees. The algorithm works in great generality and achieves the same limiting threshold (whenever
the algorithm is applicable) as the full local lemma; however, for simplicity, we describe it here in a
slightly restricted setting. (Most notably, the dependency graph will be symmetric.)
Let H = (V, E) be a SAT instance (see definitions in Sec. 5.1); we can encode most applications of the
local lemma in these terms. As before, say that two clauses are neighbors in the dependency graph
if they share any variable (not necessarily the same literal). Write n = |V | (number of variables), m = | E|
(number of clauses). Call the clauses T1 , . . . , Tm according to an arbitrary ordering.
We have from last time a corollary of the local lemma:

Corollary 99 Suppose every clause T ∈ E has size k and has at most d neighbors. If d + 1 ≤ 2^k/e then
H is satisfiable.
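For a sense of scale (numbers of my own choosing, not from the lecture): with k = 10 we have 2^10/e ≈ 1024/2.718 ≈ 376.7, so any such instance in which every clause shares a variable with at most 375 other clauses is satisfiable.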

Moser-Tardos Algorithm [67, 68]

    Pick a random assignment to V.
    While there is an unsatisfied clause, pick the first-indexed such clause T and run Fix(T).

    Fix(T):
        Recolor the variables of T u.a.r. until it is satisfied.
        While T has an unsatisfied neighbor, pick the first-indexed such neighbor T′ and run Fix(T′).
(“First-indexed” is a mere convenience; any deterministic order is fine, even one depending on the
history of the algorithm so far.) Observe that Fix implements a recursive or stack-based exploration
analogous to Depth First Search (DFS), but it is possible for clauses to be revisited.
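Here is a minimal Python sketch of the procedure above (my own code, not from the lecture); the clause representation (a tuple of signed literals) and all helper names are choices made for illustration.

import random

# A clause is a tuple of literals; a literal is a pair (variable index, desired boolean).
# A clause is satisfied iff at least one of its literals takes its desired value.

def satisfied(clause, assignment):
    return any(assignment[v] == b for v, b in clause)

def neighbors(i, clauses):
    # Clauses sharing at least one variable with clause i (excluding i itself).
    vars_i = {v for v, _ in clauses[i]}
    return [j for j, c in enumerate(clauses)
            if j != i and any(v in vars_i for v, _ in c)]

def fix(i, clauses, assignment):
    # Recolor the variables of clause i uniformly at random until it is satisfied.
    while not satisfied(clauses[i], assignment):
        for v, _ in clauses[i]:
            assignment[v] = bool(random.getrandbits(1))
    # While clause i has an unsatisfied neighbor, fix the first-indexed such neighbor.
    while True:
        unsat = [j for j in neighbors(i, clauses) if not satisfied(clauses[j], assignment)]
        if not unsat:
            return
        fix(unsat[0], clauses, assignment)

def moser_tardos(n, clauses):
    # Pick a random assignment to the n variables.
    assignment = [bool(random.getrandbits(1)) for _ in range(n)]
    # While some clause is unsatisfied, run Fix on the first-indexed one.
    while True:
        unsat = [i for i, c in enumerate(clauses) if not satisfied(c, assignment)]
        if not unsat:
            return assignment
        fix(unsat[0], clauses, assignment)

The recursion in fix mirrors the stack-based exploration described above; for large instances one would precompute the neighbor lists and replace the recursion with an explicit stack (or raise Python's recursion limit).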

Theorem 100 If 4(d + 2) ≤ 2^k then the Moser-Tardos algorithm finds a satisfying assignment to H in time
Õ(n + mk).

We are being loose in this presentation about the leading constant of 4 and about the d + 2. These
can be improved to e and d + 1.
We’re also being a bit loose about the run-time. For a bound of O(n + mk) we’ll just keep track of
the number of random bits the algorithm uses. The actual run-time, which includes data structure
management, will be a little larger, but only by a factor of about log(nm).
Before presenting the proof, let’s see why what we are studying is very similar to a branching
process. Fixing some clause as the root, there is an implicit tree extending out first to neighboring
clauses, then to neighbors of those, and so on. (Of course there may be repetition but that works
out in our favor.) The degree of this tree is d + 1, but our DFS needs to explore only a subtree
of it, generated at random, in which the expected number of children of a node is bounded by
(d + 1)/2^k < 1. So, intuitively, a Fix call initiated by the main procedure tends to terminate after
generating a finite DFS tree.
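To make the heuristic quantitative (a back-of-the-envelope calculation of my own, not part of the formal proof): a Galton-Watson branching process whose expected number of children per node is μ = (d + 1)/2^k < 1 has expected total progeny

    ∑_{t≥0} μ^t = 1/(1 − μ),

which under the hypothesis of Corollary 99 (μ ≤ 1/e) is at most e/(e − 1) < 1.6.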
This is of course only intuition; the formal proof follows.
Proof: The algorithm is implemented with the aid of an (infinite) random bit string z = z_1 z_2 . . .. The
first n bits are used for the initial assignment. Then, successive bits are used in batches of k for the
Fix procedure.
The choice of z amounts to uniformly choosing a path down a non-degenerate binary tree (no
vertices with one child), whose leaves represent successful terminations of the algorithm. (Note,


this is the tree of random bits, and we only descend in it—it is not the graph of clauses in which we
are performing a DFS-like process!) Of course the tree is infinite (we might endlessly sample bits
badly). However, we will argue that with high probability we reach a leaf fairly soon.
A key observation is that a call to Fix(T) has the following monotonicity property. If Fix(T) terminates,
then after termination (a) T is satisfied, and (b) any clause T′ that was satisfied before the call is
still satisfied.
(a) is obvious from the text of the procedure. For (b), suppose for contradiction that T′ ends up
unsatisfied, and consider the last time after Fix(T) started at which any of the variables inside T′ were
changed. This change, which left T′ unsatisfied, occurred during a call to Fix(T′′) where T′′ is T′ or one
of its neighbors. But as we see in the procedure, we cannot have terminated Fix(T′′) while T′ is
unsatisfied. So (since this is all on a stack) we also cannot have terminated Fix(T).
Hence, the main procedure calls Fix at most m times on any z.
Let N_t be the number of nodes that the algorithm tree has at depth t. Since the algorithm always
runs the first n steps and then operates in batches of k bits, leaves of the tree occur at the levels
n + sk, after s calls to Fix (whether from the main procedure or recursively from within Fix).
For any such node which actually exists in the tree, what is the probability that the algorithm
reaches it? Since all random seeds z are equally likely, this probability is precisely 2^{−n−sk}.
So what is the expected runtime (in bits of z read)? It is the sum, over the nodes of the tree, of the
probability that we reach that node. Namely,

    ∑_{t≥0} N_t 2^{−t} = n + k ∑_{s≥1} 2^{−n−sk} N_{n+sk}.

You can see that if we had, say, N_{n+sk} = 2^{n+sk}, this sum would diverge, as of course it should, since
that tree has no leaves at all. But even if there are some leaves, the sum can readily diverge. We
need to show the tree is thin enough that the sum converges.
The method is to devise an alternative way of naming a vertex at level t = n + sk. The obvious way
is to give the bits z_1 . . . z_t, but that allows us to name 2^t vertices, and so will not do. Our naming
scheme must give all vertices distinct names, yet be such that the name space for vertices at depth
t is considerably smaller than 2^t.
Call the vertex we are focusing on Z = (z_1 . . . z_{n+sk}). Suppose that rather than being told all the bits
z_1 . . . z_{n+sk}, we're instead told, in order, the arguments (names of clauses) to Fix, plus the assignment
to all the variables at the time we reach Z.
Then we can determine z_1 . . . z_{n+sk} by “working backwards.” The last clause, before its recoloring,
had to have been in its unique unsatisfied assignment (a clause of size k is falsified by exactly one of the
2^k assignments to its variables). In turn, the penultimate clause, before its recoloring, had to have been
in its unique unsatisfied assignment. And so forth.
How many bits are required to specify Z in this alternative way?

1. n bits for the last assignment.


2. We list, in chronological order, each clause as it is pushed on the stack (i.e., is called in
Fix) and when it is done (popped off the stack). Since successive Pushes are neighbors
in the dependency graph, this requires only ⌈lg(d + 2)⌉ bits per call (reserve one symbol for
“Pop”). When we Pop all the way out to the main procedure (which is something we
know has occurred from keeping track of the stack), we just need one bit per clause
T_i, to indicate whether the main procedure calls Fix(T_i).

So,

    lg N_{n+sk} ≤ min{n + sk, n + m + s⌈lg(d + 2)⌉}.


Now, as above measuring runtime in terms of how many bits of z we read, we have

    E(runtime) = n + k ∑_{s≥1} 2^{−n−sk} N_{n+sk}
               ≤ n + k ∑_{s≥1} 2^{−n−sk} 2^{n + min{sk, m + s(1 + lg(d+2))}}
               = n + k ∑_{s≥1} 2^{min{0, m + s(1 + lg(d+2) − k)}}.

Since we have assumed k ≥ 2 + lg(d + 2), this is

               ≤ n + k ∑_{s≥1} 2^{min{0, m − s}}
               = n + k ∑_{s=1}^{m} 1 + k ∑_{s>m} 2^{m−s}
               = n + k(m + 1).
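As a concrete instance of the hypothesis (numbers of my own choosing): for k = 8, the condition 4(d + 2) ≤ 2^k allows each clause up to d = 62 neighbors, and the bound just derived says the expected number of random bits consumed is at most n + 8(m + 1).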

Bibliography

[1] I. Abraham, Y. Bartal, and O. Neiman. Advances in metric embedding theory. In STOC, 2006.
[2] W. Ackermann. Zum Hilbertschen Aufbau der reellen Zahlen. Mathematische Annalen, 99(1):118–
133, 1928. URL: http://dx.doi.org/10.1007/BF01459088.
[3] M. Adams and V. Guillemin. Measure Theory and Probability. Birkhäuser, 1996.
[4] M. Agrawal, N. Kayal, and N. Saxena. PRIMES is in P. Ann. of Math., 160:781–793, 2004.
doi:https://doi.org/10.4007/annals.2004.160.781.
[5] R. Ahlswede and D. E. Daykin. An inequality for the weights of two families of sets, their
unions and intersections. Z. Wahrscheinl. V. Geb, 43:183–185, 1978.
[6] M. Aizenman, H. Kesten, and C. Newman. Uniqueness of the infinite cluster and continuity of
connectivity functions for short and long range percolation. Comm. Math. Phys., 111:505–531,
1987.
[7] N. Alon, L. Babai, and A. Itai. A fast and simple randomized parallel algorithm for the maximal
independent set problem. J. Algorithms, 7:567–583, 1986.
[8] N. Alon and J. Spencer. The Probabilistic Method. Wiley, 3rd edition, 2008.
[9] M. Artin. Algebra. Prentice-Hall, 1991.
[10] I. Benjamini and O. Schramm. Percolation beyond Z^d, many questions and a few answers.
Electron. Commun. Probab., 1:71–82, 1996. URL: http://dx.doi.org/10.1214/ECP.v1-978,
doi:10.1214/ECP.v1-978.
[11] B. Berger. The fourth moment method. SIAM J. Comput., 26(4):1188–1207, 1997.
[12] S. Berkowitz. On computing the determinant in small parallel time using a small number of
processors. Information Processing Letters, 18:147–150, 1984.
[13] Bernstein inequality. In Encyclopedia of Mathematics. Springer and Europ. Math. Soc. URL:
http://www.encyclopediaofmath.org.
[14] S. N. Bernstein. On a modification of Chebyshev’s inequality and of the error formula of
Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math. 1, 1924.
[15] S. N. Bernstein. On certain modifications of Chebyshev’s inequality. Doklady Akademii Nauk
SSSR, 17(6):275–277, 1937.
[16] P. Billingsley. Probability and Measure. Wiley, third edition, 1995.
[17] B. Bollobás and A. G. Thomason. Threshold functions. Combinatorica, 7(1):35–38, 1987. URL:
http://dx.doi.org/10.1007/BF02579198, doi:10.1007/BF02579198.


[18] J. Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel J. Math.,
52:46–52, 1985.
[19] A. L. Chistov. Fast parallel calculation of the rank of matrices over a field of arbitrary charac-
teristic. In Proc. Conf. Foundations of Computation Theory, pages 63–69. Springer-Verlag, 1985.
[20] D. Conlon. A new upper bound for diagonal Ramsey numbers. Ann. of Math., 170:941–960,
2009.
[21] D. Conlon, J. Fox, and B. Sudakov. Recent developments in graph Ramsey theory. In Surveys
in Combinatorics, pages 49–118. Cambridge University Press, 2015.
[22] S. A. Cook. A taxonomy of problems with fast parallel algorithms. Information and Control,
64:2–22, 1985.
[23] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[24] L. Csanky. Fast parallel matrix inversion algorithms. SIAM J. Computing, 5:618–623, 1976.
[25] D. E. Daykin and L. Lovasz. The number of values of Boolean functions. J. London Math. Soc.,
2(12):225–230, 1976.
[26] R. A. DeMillo and R. J. Lipton. A probabilistic remark on algebraic program testing. Informa-
tion Processing Letters, 7(4):193 – 195, 1978. URL: http://www.sciencedirect.com/science/
article/pii/0020019078900674.
[27] L. Engebretsen, P. Indyk, and R. O’Donnell. Derandomized dimensionality reduction with
applications. In SODA, 2002.
[28] P. Erdös. Some remarks on the theory of graphs. Bull. Amer. Math. Soc., 53:292–294, 1947.
[29] P. Erdös. Graph theory and probability. Canad. J. Math., 11:34–38, 1959.
[30] P. Erdös and L. Lovász. Problems and results on 3-chromatic hypergraphs and some related
questions. In Infinite and Finite Sets. North-Holland, 1975.
[31] P. Erdös and G. Szekeres. A combinatorial problem in geometry. Compositio Math., 2:463–470,
1935.
[32] Y. Filmus. Khintchine-Kahane using Fourier Analysis. 2011. URL: http://www.cs.toronto.
edu/~yuvalf/.
[33] P. C. Fishburn and N. J. A. Sloane. The solution to Berlekamp’s switching game. Discrete
Mathematics, 74:263–290, 1989.
[34] C. M. Fortuin, P. W. Kasteleyn, and J. Ginibre. Correlation inequalities on some partially
ordered sets. Commun. Math. Phys., 22:89–103, 1971.
[35] M. Fréchet. Sur quelques points du calcul fonctionnel. Rend. Circ. Matem. Palermo, 22:1–72,
1906. doi:10.1007/BF03018603.
[36] R. Freivalds. Probabilistic machines can use less running time. In IFIP Congress, pages 839–842,
1977.
[37] E. Friedgut and G. Kalai. Every monotone graph property has a sharp threshold. Proc. Amer.
Math. Soc., 124:2993–3002, 1996.
[38] H. N. Gabow and R. E. Tarjan. Faster scaling algorithms for general graph-matching problems.
J. ACM, 38(4):815–853, 1991.


[39] F. Le Gall. Powers of tensors and fast matrix multiplication. In International Symposium on
Symbolic and Algebraic Computation (ISSAC), pages 296–303, 2014. arXiv:1401.7714.
[40] G. H. Gonnet. Expected length of the longest probe sequence in hash code searching. J. Assoc.
Comput. Mach., 28:289–304, 1981.
[41] G. H. Gonnet. Determining equivalence of expressions in random polynomial time. In Proc.
Sixteenth Annual ACM Symposium on Theory of Computing (STOC), pages 334–341. ACM, 1984.
URL: http://doi.acm.org/10.1145/800057.808698.
[42] W. T. Gowers. A new proof of Szemerédi’s theorem. Geom. Funct. Anal., 11(3):465–588, 2001.
URL: http://dx.doi.org/10.1007/s00039-001-0332-9, doi:10.1007/s00039-001-0332-9.
[43] R. L. Graham and V. Rödl. Numbers in Ramsey theory. In Surveys in Combinatorics, London
Math. Soc. Lecture Note Ser. Vol. 123, pages 111–153. Cambridge University Press, 1987.
[44] R. L. Graham, B. L. Rothschild, and J. H. Spencer. Ramsey Theory. Wiley, 2nd edition, 1990.
[45] T. E. Harris. Lower bound for the critical probability in a certain percolation process. Math.
Proc. Cambridge Philos. Soc., 56:13–20, 1960.
[46] J. Håstad. Some optimal inapproximability results. J. ACM, 48(4):798–859, 2001.
[47] M. Heydenreich and R. van der Hofstad. Progress in high-dimensional percolation and random
graphs. Springer, 2017.
[48] W. E. Hickson. In Oxford Dictionary of Quotations, page 251. Oxford University Press, 3rd
edition, 1979.
[49] R. Holley. Remarks on the FKG inequalities. Communications in Mathematical Physics, 36:227–
231, 1974. URL: http://dx.doi.org/10.1007/BF01645980.
[50] P. Indyk and J. Matoušek. Low-distortion embeddings of finite metric spaces. In Handbook of
Discrete and Computational Geometry, pages 177–196. CRC Press, 2004.
[51] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space.
Contemp. Math., 26:189–206, 1984.
[52] V. Kabanets and R. Impagliazzo. Derandomizing polynomial identity tests means proving
circuit lower bounds. Comput. Complex., 13:1–46, 2004.
[53] G. Kalai and L. J. Schulman. Quasi-random multilinear polynomials. Isr. J. Math., 2018. To
appear; arXiv:1804.04828.
[54] R. M. Karp, E. Upfal, and A. Wigderson. Constructing a Maximum Matching is in Random
NC. Combinatorica, 6(1):35–48, 1986.
[55] R. M. Karp and A. Wigderson. A fast parallel algorithm for the maximal independent set
problem. In Proc. 16th ACM STOC, pages 266–272, 1984.
[56] A. Khintchine. Über dyadische Brüche. Math. Z., 18:109–116, 1923.
[57] D. J. Kleitman. Families of non-disjoint subsets. J. Combin. Theory, 1:153–155, 1966.
[58] D. Koller and N. Megiddo. Constructing small sample spaces satisfying given constraints. SIAM
J. Discret. Math., 7:260–274, May 1994. Previously in 25th STOC pp. 268–277, 1993. URL: http:
//portal.acm.org/citation.cfm?id=178422.178455.
[59] D. C. Kozen. The design and analysis of algorithms. Springer, 1992.


[60] R. Latała and K. Oleszkiewicz. On the best constant in the Khinchin-Kahane inequality. Studia
Mathematica, 109(1):101–104, 1994. URL: http://eudml.org/doc/216056.
[61] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes.
Morgan Kaufmann, 1992.
[62] R. Lidl and H. Niederreiter. Finite Fields. Cambridge U Press, 2nd edition, 1997. (Theorem
6.13).
[63] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic
applications. Combinatorica, 15(2):215–245, 1995.
[64] M. Luby. A simple parallel algorithm for the maximal independent set problem. In Proc. 17th
ACM STOC, pages 1–10, 1985.
[65] R. Lyons and Y. Peres. Probability on Trees and Networks. Cambridge University Press, 2017.
URL: http://pages.iu.edu/~rdlyons/.
[66] S. Micali and V. V. Vazirani. An O(√(|V|·|E|)) algorithm for finding maximum matching in
general graphs. In Proc. 21st FOCS, pages 17–27. IEEE, 1980.
[67] R. A. Moser. A constructive proof of the Lovász local lemma. In Proceedings of the 41st annual
ACM symposium on Theory of computing, STOC ’09, pages 343–350, New York, NY, USA, 2009.
ACM. URL: http://doi.acm.org/10.1145/1536414.1536462.
[68] R. A. Moser and G. Tardos. A constructive proof of the general Lovász local lemma. J. ACM,
57(2):11:1–11:15, 2010. URL: http://doi.acm.org/10.1145/1667053.1667060.
[69] K. Mulmuley, U. V. Vazirani, and V. V. Vazirani. Matching is as easy as matrix inversion.
Combinatorica, 7:105–113, 1987.
[70] D. Mumford. The dawning of the age of stochasticity. In V. Arnold, M. Atiyah, P. Lax, and
B. Mazur, editors, Mathematics: Frontiers and Perspectives. AMS, 2000.
[71] C. M. Newman and L. S. Schulman. Infinite clusters in percolation models. Journal of Statistical
Physics, 26(3):613–628, 1981. URL: http://dx.doi.org/10.1007/BF01011437.
[72] V. Pan. Fast and efficient parallel algorithms for the exact inversion of integer matrices. In
S. N. Maheshwari, editor, Foundations of Software Technology and Theoretical Computer Science,
volume 206 of Lecture Notes in Computer Science, pages 504–521. Springer, 1985. URL: http:
//dx.doi.org/10.1007/3-540-16042-6_29.
[73] S. Rajagopalan and L. J. Schulman. Verification of identities. SIAM J. Comput., 29(4):1155–1163,
2000.
[74] F. P. Ramsey. On a problem of formal logic. Proc. London Math. Soc., 48:264–286, 1930.
[75] I. N. Sanov. On the probability of large deviations of random variables. Mat. Sbornik, 42:11–44,
1957.
[76] L. J. Schulman. Coding for interactive communication. IEEE Transactions on Information Theory,
42(6):1745–1756, 1996.
[77] J. Schwartz. Fast probabilistic algorithms for verification of polynomial identities. J. ACM,
27:701–717, 1980.
[78] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423;
623–656, 1948.


[79] J. B. Shearer. On a problem of Spencer. Combinatorica, 5:241–245, 1985.

[80] S. Shelah. Primitive recursive bounds for van der Waerden numbers. J. Amer. Math. Soc., 1:683–
697, 1988.
[81] L. A. Shepp. The XYZ-conjecture and the FKG-inequality. Ann. Probab., 10(3):824–827, 1982.
[82] D. Sivakumar. Algorithmic derandomization via complexity theory. In STOC, 2002.

[83] J. Spencer. Asymptotic lower bounds for Ramsey functions. Discrete Math., 20:69–76, 1977.
[84] N. Ta-Shma. A simple proof of the isolation lemma. ECCC TR15-080, 2015. URL: http:
//eccc.hpi-web.de/report/2015/080/.
[85] A. Thomason. An upper bound for some Ramsey numbers. J. Graph Theory, 12, 1988.

[86] W. T. Tutte. The factorization of linear graphs. J. London Math. Soc., s1-22(2):107–111, 1947.
URL: http://jlms.oxfordjournals.org/content/s1-22/2/107.short.
[87] L. Valiant, S. Skyum, S. Berkowitz, and C. Rackoff. Fast parallel computation of polynomials
using few processors. SIAM J. Comput., 12(4):641–644, 1983.

[88] B. L. van der Waerden. Beweis einer Baudetschen Vermutung. Nieuw. Arch. Wisk., 15:212–216,
1927.
[89] Wikipedia. Folded normal distribution. [Online; accessed 6-November-2016]. URL: https:
//en.wikipedia.org/w/index.php?title=Folded_normal_distribution&oldid=748178170.

[90] R. Zippel. Probabilistic algorithms for sparse polynomials. In E. W. Ng, editor, Symbolic and
Algebraic Computation, volume 72 of Lecture Notes in Computer Science, pages 216–226. Springer,
1979. URL: http://dx.doi.org/10.1007/3-540-09519-5_73, doi:10.1007/3-540-09519-5_
73.

