**Sequences of 0’s and 1’s**

Suppose I want to write a sequence of 0’s and 1’s that contains every possible 2-letter subsequence. This means somewhere in my sequence I need 01, 10, 00, and 11.

Obviously, gluing them all together gives a valid sequence:

**01100011**

This by construction contains each possible 2-letter subsequence. However, some of the subsequences are repeated, so it seems there should be a way to shorten it. In fact, the shortest possible valid sequence is 01100, as it contains 01 starting at the first letter, 11 starting at the second, 10 starting at the third, and 00 starting at the fourth.

**Circular Sequences**

I now allow looping, so that our sequence is a **circular sequence**. This means if I start at the last letter, the next one is the first letter. For example, the circular sequence **abcd** contains the 2-letter subsequences **ab**, **bc**, **cd**, and **da**.

With a **circular sequence**, we can shorten the sequence 01100 even further to

**0110**

which is a 4-letter sequence. The 00 subsequence can be found starting at the fourth letter and looping back around to the first.

**Larger Subsequences**

What about a sequence that contains all possible 3-letter subsequences: 000, 001, 010, 011, 100, 101, 110, 111? Note that the number of **n**-letter subsequences of 0’s and 1’s is 2** ^{n}**. This is a simple fact as each position has 2 choices and there are

**n**positions, so the number of choices for

**n**positions is 2

**.**

^{n}For **n = 3**, it be done with only 8 letters, each sequence starting on a different letter:

**00011101**

The subsequences can be listed as follows:

0 0 0 1 1 1 0 10 0 0 0 0 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 1 ... 0 0 1 ...

Let’s go one step further. How many does letters does it take to contain all 16 possible 4-letter subsequences?

It turns out there are many such examples where the length is precisely 16. Here is one:

**0000111100101101**

I encourage you to verify this sequence yourself if you are not fully convinced. You can do this by going through the list and checking that there is a different 4-letter subsequence starting at each position.

Such a shortest sequence of letters containing all possible **n**-letter subsequences is called a **De Bruijn sequence**.

**Why It Works (Proof)**

For any **n**, let all the 2^{n}^{ }**n**-digit subsequences be vertices on a directed graph. A directed edge goes from a vertex **x** to a vertex **y** (not necessarily distinct) if the last **(n – 1)** letters of **x** match the first **(n – 1)** letters of **y**. For example, when **n = 3** there is a directed edge from 1**10** to **10**1, but not from 101 to 110. Every vertex has precisely two edges outgoing (1**10** has **10**0 and **10**1), as well as two edges incoming (**11**0 has 0**11** and 1**11**). Finally, label each edge by the last letter of the target vertex. For example, the vertex from 110 to 10**1** is labeled **1**. Below is a visual demonstration for **n = 3**:

The graph has a single strongly connected component and every vertex has both outgoing and incoming degrees equal to two. By a theorem on Eulerian circuits, these conditions suffice to imply that the graph admits an Eulerian circuit, which is a path where every edge is traversed precisely once and the path ends at the same vertex as it started. The labels on this circuit complete the **De Bruijn sequence**.

Since there are 2** ^{n}** edges and every edge is traversed precisely once, the De Bruijn sequence has length 2

**.**

^{n}**Extensions**

Now suppose we need to generate all possible **n**-letter subsequences of **{0, 1, 2}** instead of only **{0, 1}**. For 2-letter subsequences, the following sequence of length 9 works:

**001121022**

As for 3-letter subsequences, the following sequence of length 27 contains all of them:

**000111210222002120201012211**

The existence of these can be proved in a similar fashion as for the case with only **{0, 1}**. In fact, it extends to all alphabets of size **k** for positive integer **k**. Construct a directed graph in the same manner as the proof earlier for the alphabet **{0, 1}**. Then it is a single strongly connected component and each vertex has outgoing and incoming degree both equal to **k**, allowing the theorem on Eulerian circuits to hold. This grants the existence of an Eulerian circuit, which in turn corresponds to a De Bruijn sequence. In general, the length of the **De Bruijn sequence** with an alphabet of **k** letters and containing all subsequences of length **n** is **k ^{n}**.

**More Info**

For more information about **De Bruijn sequences**, check out the following:

- Wiki: http://en.wikipedia.org/wiki/De_Bruijn_sequence
- Textbook: Van Lint & Wilson.
*A Course in Combinatorics*. “De Bruijn Sequences” (Chapter 8).

For more information about the background and/or theorems used, see:

- Wiki: http://en.wikipedia.org/wiki/Graph_(mathematics)
- Wiki: http://en.wikipedia.org/wiki/Directed_graph
- Wiki: http://en.wikipedia.org/wiki/Eulerian_path

Kinda reminds me of Hamming distances….

The internet tells me that De Bruijn sequences are used in genome sequencing and 3D imaging, among other things. Pretty cool!

Yes, the De Bruijn sequences are related to Hamming distances, in particular Gray codes! The relation is that they both iterate through all subsequences of a certain length with a certain restriction. In a Gray code, the restriction is that the Hamming distance between two adjacent steps is always 1. For instance, for binary 3-tuples, the following is a Gray code:

000

001

011

010

110

111

101

100

Exactly 1 digit changed every step. On the other hand, a De Bruijn sequence shifts all the digits by 1 every time:

000

001

011

111

110

101

010

100

The existence of Gray codes for

k-tuples ofncharacters is proved much differently.