Weblogo bits

WebLogo[1,2] is a super mature (the sequence logo idea behind it was published in the early 90s) and neat piece of software frequently used by bioinformaticians. The tool is mostly used to ~~create consensus sequences~~ summarize sequence conservation, motif searches and nucleotide enrichment. Its input is a multiple alignment of nucleotide or amino acid sequences. The output is an eye-candy plot with positions on the x-axis and a measurement of nucleotide (or amino acid) frequency on the y-axis. There are two widely used measurements: 1) nucleotide frequency and 2) information content.

Figure 1. Example of a nucleotide WebLogo figure.

The nucleotide frequency plot is straightforward to understand: it is literally the frequency of each nucleotide at a given position of your multiple alignment. I realized, however, that the unit “bits” on the y-axis in the information content mode, despite being visually intuitive, is trickier to interpret. After a few Google searches, I was sure it had something to do with information theory, so Wikipedia was my first stop. Looking for content on the internet, though, is almost like being thirsty in the middle of an ocean: there is information everywhere, but none of it is very useful. So I decided to dump a few thoughts in this first blog post.

From information theory we need to borrow a first concept: uncertainty, which measures how much uncertainty there is in a given… message. Bear with me. Let’s use an example other than a coin toss (the first example in every single text I’ve seen so far). For a messenger RNA, what is the uncertainty at each position? Well, we could count four nucleotides, right? A, C, U, G. Information theory loves bits, though. Why use 4 if you can measure information in bits? There is actually a good reason for this, but it is not relevant at this point. Let’s use 0s and 1s, then. How many bits are needed to represent the uncertainty at each nucleotide of an RNA, if all four are equally likely? 2, right? 00=A, 01=U, 10=C, 11=G. Good. There is a simple mathematical operation for getting this result: the logarithm (log2(4)=2). What about amino acids in a protein? Well, log2(20)≈4.32. That’s great! That explains why WebLogos of nucleotide sequences have at most 2 bits of uncertainty and amino acid plots have at most approximately 4.32 bits.
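If you want to check those two numbers yourself, here is a minimal Python sketch. The helper name `max_bits` is made up for this post, and it assumes every symbol of the alphabet is equally likely.

```python
import math

def max_bits(alphabet_size):
    """Maximum uncertainty, in bits, of one position when all
    symbols of the alphabet are equally likely (an assumption)."""
    return math.log2(alphabet_size)

print(max_bits(4))             # nucleotides: 2.0
print(round(max_bits(20), 2))  # amino acids: 4.32
```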

Figure 2. Example of an amino acid WebLogo figure.

Now, WebLogo does not plot the uncertainty at a given position but rather the information content of each base at that position. So if, at the first position, G has 2 bits of information content out of a maximum of 2 bits (like in Figure 1), that means that in nearly all of the messages, or DNAs in this specific case, you will find a G at the first position, and likewise a T at the second position. What about the third position? There we learn that G and A each have approximately 0.5 bit of information content, which adds up to 1 bit (the sum of their bits); the other bit is the uncertainty that remains. That means the nucleotide at the third position is like flipping a coin: the position can be coded in one bit, a zero or a one. In other words, A and G have similar chances of being chosen, each appearing in about 50% of your messages.

Here is an example. Using the following sequences as input:

GTA

GTG

The result from WebLogo is:

Visually, one could wrongly infer that the frequency of A/G at the third position is smaller than 50%, since the sum of their bits is not 2. But that is merely an outcome of information theory (you can find more in the links below): roughly, the more information you have, the smaller your uncertainty. At the third position one bit of uncertainty remains, so the stack is only 1 bit tall, and A and G split it according to their 50% frequencies. Be careful when reading WebLogo plots!
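To make the arithmetic concrete, here is a rough Python sketch of how the letter heights for the GTA/GTG example could be computed by hand. It is not WebLogo's actual code: the function name `column_heights` is made up for this post, and the sketch assumes the plain formula (information = maximum bits minus Shannon uncertainty, letter height = frequency times information) with no small-sample correction, so a real WebLogo run on only two sequences may show different heights.

```python
import math
from collections import Counter

def column_heights(column, alphabet_size=4):
    """Letter heights (in bits) for one alignment column.

    Assumption: no small-sample correction is applied.
    """
    counts = Counter(column)
    total = len(column)
    freqs = {base: n / total for base, n in counts.items()}
    # Shannon uncertainty of the column, in bits
    uncertainty = -sum(f * math.log2(f) for f in freqs.values())
    # Information content = maximum uncertainty minus observed uncertainty
    information = math.log2(alphabet_size) - uncertainty
    # Each letter takes a share of the stack proportional to its frequency
    return {base: f * information for base, f in freqs.items()}

sequences = ["GTA", "GTG"]
for position, column in enumerate(zip(*sequences), start=1):
    print(position, column_heights("".join(column)))
# 1 {'G': 2.0}
# 2 {'T': 2.0}
# 3 {'A': 0.5, 'G': 0.5}
```

Under these assumptions, A and G each get 0.5 bit at the third position even though each one appears in 50% of the sequences.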