#1: De Bruijn Sequences

Sequences of 0’s and 1’s

Suppose I want to write a sequence of 0’s and 1’s that contains every possible 2-letter subsequence. This means somewhere in my sequence I need 01, 10, 00, and 11.

Obviously, gluing them all together gives a valid sequence:

01100011


This by construction contains each possible 2-letter subsequence. However, some of the subsequences are repeated, so it seems there should be a way to shorten it. In fact, the shortest possible valid sequence is 01100, as it contains 01 starting at the first letter, 11 starting at the second, 10 starting at the third, and 00 starting at the fourth.

Circular Sequences

I now allow looping, so that our sequence is a circular sequence. This means if I start at the last letter, the next one is the first letter. For example, the circular sequence abcd contains the 2-letter subsequences ab, bc, cd, and da.

With a circular sequence, we can shorten the sequence 01100 even further to

0110

which is a 4-letter sequence. The 00 subsequence can be found starting at the fourth letter and looping back around to the first.

Larger Subsequences

What about a sequence that contains all possible 3-letter subsequences: 000, 001, 010, 011, 100, 101, 110, 111? Note that the number of n-letter subsequences of 0’s and 1’s is 2n. This is a simple fact as each position has 2 choices and there are n positions, so the number of choices for n positions is 2n.

For n = 3, it be done with only 8 letters, each sequence starting on a different letter:

00011101

The subsequences can be listed as follows:

0 0 0 1 1 1 0 1 
0 0 0
  0 0 1
    0 1 1
      1 1 1
        1 1 0
          1 0 1
0           0 1 ...
0 0           1 ...

Let’s go one step further. How many does letters does it take to contain all 16 possible 4-letter subsequences?

It turns out there are many such examples where the length is precisely 16. Here is one:

0000111100101101

I encourage you to verify this sequence yourself if you are not fully convinced. You can do this by going through the list and checking that there is a different 4-letter subsequence starting at each position.

Such a shortest sequence of letters containing all possible n-letter subsequences is called a De Bruijn sequence.

Why It Works (Proof)

For any n, let all the 2n n-digit subsequences be vertices on a directed graph. A directed edge goes from a vertex x to a vertex y (not necessarily distinct) if the last (n – 1) letters of x match the first (n – 1) letters of y. For example, when n = 3 there is a directed edge from 110 to 101, but not from 101 to 110. Every vertex has precisely two edges outgoing (110 has 100 and 101), as well as two edges incoming (110 has 011 and 111). Finally, label each edge by the last letter of the target vertex. For example, the vertex from 110 to 101 is labeled 1. Below is a visual demonstration for n = 3:

The graph has a single strongly connected component and every vertex has both outgoing and incoming degrees equal to two. By a theorem on Eulerian circuits, these conditions suffice to imply that the graph admits an Eulerian circuit, which is a path where every edge is traversed precisely once and the path ends at the same vertex as it started. The labels on this circuit complete the De Bruijn sequence.

Since there are 2n edges and every edge is traversed precisely once, the De Bruijn sequence has length 2n.

Extensions

Now suppose we need to generate all possible n-letter subsequences of {0, 1, 2} instead of only {0, 1}. For 2-letter subsequences, the following sequence of length 9 works:

001121022

As for 3-letter subsequences, the following sequence of length 27 contains all of them:

000111210222002120201012211

The existence of these can be proved in a similar fashion as for the case with only {0, 1}. In fact, it extends to all alphabets of size k for positive integer k. Construct a directed graph in the same manner as the proof earlier for the alphabet {0, 1}. Then it is a single strongly connected component and each vertex has outgoing and incoming degree both equal to k, allowing the theorem on Eulerian circuits to hold. This grants the existence of an Eulerian circuit, which in turn corresponds to a De Bruijn sequence. In general, the length of the De Bruijn sequence with an alphabet of k letters and containing all subsequences of length n is kn.

More Info

For more information about De Bruijn sequences, check out the following:

For more information about the background and/or theorems used, see:

Advertisements

2 thoughts on “#1: De Bruijn Sequences

  1. Kinda reminds me of Hamming distances….

    The internet tells me that De Bruijn sequences are used in genome sequencing and 3D imaging, among other things. Pretty cool!

    • Yes, the De Bruijn sequences are related to Hamming distances, in particular Gray codes! The relation is that they both iterate through all subsequences of a certain length with a certain restriction. In a Gray code, the restriction is that the Hamming distance between two adjacent steps is always 1. For instance, for binary 3-tuples, the following is a Gray code:

      000
      001
      011
      010
      110
      111
      101
      100

      Exactly 1 digit changed every step. On the other hand, a De Bruijn sequence shifts all the digits by 1 every time:

      000
      001
      011
      111
      110
      101
      010
      100

      The existence of Gray codes for k-tuples of n characters is proved much differently.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s