Monday, 22 October 2007

Why There Are Only Four Nucleotides

First published in Science Spin magazine (weblink at foot), this is a piece about the Irish scientist Dónall Mac Dónaill, who brilliantly applied a method from computer science to solve one of the puzzles of DNA.

By Lucille Redmond

TCD chemist Dónall Mac Dónaill has discovered something so blindingly obvious that it lay there under the eyes of scientists ever since DNA was known. He has proved that nature puts its own checking and correcting software in place in our DNA, to stop it producing faulty copies.

Humans - like other living things - are made of billions of cells. Each cell contains the pattern for the whole human - the set of 46 chromosomes. In theory, if you have the pattern, you can knit up the whole person from it - those chromosomes contain all the physical information about the person: blue eyes, good teeth, likely to get sickle cell anaemia.

Inside these chromosomes are genes, made of long, tightly-coiled molecules of DNA. If you uncoiled all the DNA in a single cell and stretched it out, it would be about six feet long.

So a living thing is made of cells, which contain (among other stuff) chromosomes, and those chromosomes contain genes, which are made of DNA. The chromosomes are the instruction manual, which is written in DNA.

All life, from bacteria to an elephant, has its blueprint encoded in DNA.

Our cells reproduce by making replacements of themselves. Each one cell splits into two, and that into four, and so on. And each daughter cell has its own copy of the blueprint to make the whole organism - eyelashes, teeth, leaves or whatever.

The important thing for our purposes in this article is how the DNA makes a copy of itself.

The DNA (DeoxyriboNucleic Acid) that makes up the genes is shaped like two strings twisted into a spiral, connected by 'nucleotides' - molecules mostly made from carbon, oxygen and nitrogen. This is the 'double helix' discovered 50 years ago by Watson and Crick.

Nucleotides are the letters of the words that make up our instruction manual for the leaf or elephant or microbe.

Nature uses four nucleotides to make DNA. They go by the noms de guerre A, T, C and G: adenine, thymine, cytosine and guanine.

The nucleotides are strung along the two long strings, gluing themselves on at the back, and reaching out at the front towards their opposite numbers on the opposing string.

One important thing about these nucleotides: there's a right one for everyone. A goes with T, and C with G.

Not only that, but (how different from the rest of us) they actually match perfectly with their love objects.

The way nucleotides mate (as it were) is by clamping together at three points. Each nucleotide has three points - its dangly bits, as you might say - which are either hydrogen atoms or 'lone pairs'.

(The analogy of 'mating' and sexual reproduction is, of course, only an analogy. There isn't any real mating going on here.)

If you're a C nucleotide, you'll have these points arranged in this order: hydrogen, lone pair, lone pair. And your beloved G will have them arranged in this order: lone pair, hydrogen, hydrogen. A perfect fit for each other. (Well, in fact, only two of A's and T's points work, which muddies the water a bit, but it works as if all three match in all the opposites.)

So on the two opposite strings of DNA, C is always opposite G, and T is always opposite A.

T has lone pair, hydrogen and lone pair in places one, two and three; and A has hydrogen, lone pair and hydrogen. Perfect for each other.

On each nucleotide, the hydrogen atom is attracted to the lone pair on its opposite number, and vice versa. So C and G will clamp together perfectly, and so will A and T. It's as if you had a three-pin plug and socket. But instead of the plug having three pins and the socket three holes, the plug had two pins and a hole, and the socket two holes and a pin.

Mechanically this works because a 'lone pair' is an electron-rich area, and the hydrogen atom is attracted to the electrons.

The nucleotides have another difference: they come in two sizes, large and small. A and G are large, for instance - they have two rings - and T and C are small, with one ring.

So - and if you're under 16 maybe you should stop reading now - when the DNA strands pull apart, they make babies.

Only one of these strings is used by the DNA - the purpose of the other one, its negative version, is to reproduce the useful one.

There is a long strand of DNA - two strings made of sugar and phosphate, with nucleotides strung along them - A, C, G, T, in any order, on one side, and their opposite and partner on the other side. So where you have A, C, G, T on one side, the other side will have T, G, C, A. The first, A, links up with the T opposite, the second, C, with G, the third, G, with C and the fourth, T, with A.

And so on along the double strand: always A mated with T and C with G. Life is good.

DNA carries its information in the order that the nucleotides are strung along the sugar-and-phosphate strand. ACT means something different from TAC.

When the two strands pull apart, every A has produced a T, and every T an A. Every C will have produced a G, and every G a C. So two new strands now exist - each one a mirror image of the string that produced it.
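The copying rule just described can be tried out in a few lines of Python. This is only an illustrative sketch of the pairing rule as the article states it; the names PARTNER and partner_strand are made up for the example, and it ignores everything else that real DNA copying involves.

```python
# Pairing rule from the article: A goes with T, and C goes with G.
PARTNER = {"A": "T", "T": "A", "C": "G", "G": "C"}

def partner_strand(strand):
    """Return the nucleotides that line up opposite each position of `strand`."""
    return "".join(PARTNER[n] for n in strand)

original = "ACGT"
negative = partner_strand(original)  # the strand built opposite the original
rebuilt = partner_strand(negative)   # copying the negative gives the original back
print(negative, rebuilt)             # TGCA ACGT
```

Copying the negative gives back the original, which is why either strand is enough to rebuild the other.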

That's how DNA is reproduced in all living things.

What has baffled scientists for some time is this: other nucleotides are possible - indeed others exist. So why doesn't nature use them in the DNA strands? Surely they'd increase the possibilities in coding?

Why not use, say, X - a nucleotide with its hydrogen atom and lone pairs arranged as lone pair, hydrogen, lone pair - and K - arranged as hydrogen, lone pair, hydrogen? Having an extra pair of nucleotides would enormously increase the amount of information you could string along that chain.

After all, it's actually possible to reproduce these other nucleotides in the same way DNA reproduces A, C, T and G. Indeed, around 15 years ago a scientist called JA Piccirilli successfully made X and K, and reproduced them.

There seemed to be no obvious reason why nature did not use all 16 possible patterns of hydrogens, lone pairs and sizes of nucleotides in the DNA spiral. Indeed, there was even a suspicion that nature was a big fat lazybones, and had just gone for the easiest option.

But Mac Dónaill suspected there might be a better explanation. Maybe nature was being careful.

Luckily, he had a background in computer science, as well as chemistry. So he was familiar with a system invented by a Bell Labs scientist 50 years ago.

Back in 1950, Richard Hamming of Bell Labs was one of the pioneers of computer science. Hamming invented a way of making data transmission more accurate, by adding an extra 'bit' to every chunk of information sent.

Let's say you're sending a piece of information along a wire to your pal Gene. You agree with each other that it will be sent in three-digit chunks. But somewhere along the wire there's an interruption, and one of the chunks gets scrambled. How's Gene to know?

Hamming's idea was to send, say, three-digit bursts of 1s and 0s, and to add a 'parity bit' to the end of each set of digits - an extra 0 or 1 that would make the set always have an even number of 1s. So if a set with an odd number of 1s came through, it was obviously wrong. And with a bit of fiddling, the system even allowed the errors to be corrected once they were recognised.

(Or you can set it up to always send an odd number of 1s - it doesn't matter once it's agreed between sender and receiver.)
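For readers who like to see it working, here is a minimal Python sketch of the even-parity check described above, using three-digit chunks as in the example with Gene. The function names are invented for the illustration. A single parity bit on its own only flags that a chunk is wrong; the correcting step the article mentions needs extra check bits and is not shown here.

```python
def add_parity_bit(bits):
    """Append a 0 or 1 so that the chunk ends up with an even number of 1s."""
    return bits + str(bits.count("1") % 2)

def looks_valid(chunk):
    """A received chunk passes the check if it holds an even number of 1s."""
    return chunk.count("1") % 2 == 0

sent = add_parity_bit("101")  # becomes "1010"
garbled = "1110"              # the same chunk with one digit flipped on the wire
print(sent, looks_valid(sent), looks_valid(garbled))  # 1010 True False
```

If two digits in the same chunk are flipped, the count of 1s is even again and this simple check lets the error through.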

This system is used in every form of electronic information now - credit card transactions, booking airline tickets, making phone calls, and so on. Any data transmission uses it.

Mac Dónaill had the revolutionary thought: is it possible that nature used the same method in DNA? Could that be why only the four nucleotides A, C, T and G were chosen, instead of the rich array of possible nucleotides out there?

"Piccirilli proved in the 1990s that at least in principle some of these extra nucleotides did actually work with our existing molecular machinery. He made some of the other nucleotides which are not commonly employed in nature - some of the other patterns - presented them to polymerase - the copier which copies the strands of nucleotides - and they worked. That actually gave a little bit more impetus to the question of why nature didn't use these extra nucleotides.

"My starting point was that when we look at nucleotides, we tend to see the chemical representation. We don't see the information content so clearly," says Mac Dónaill.

"So I decided to show these patterns of hydrogens and lone pairs as ones and zeroes. There are three positions where you could have a hydrogen or a lone pair - that gives you up to eight possible patterns.

"There are additionally two sizes, and each pattern could be written on a large or on a small nucleotide, giving a total of 16 distinct nucleotides," says Mac Dónaill. "I decided to show the large rings as a 0 and the small as a 1, to complete the numerical view.

"I was just looking at the patterns, and the patterns were expressed by numbers - and so all that was now left in what I was now looking at was the information. I observed almost immediately, as soon as I had made this step, that if I divided them into the two parities of odd and even, all of the natural nucleotides which nature uses in DNA have the same parity.

"A nucleotide makes a copy by making a negative, and there are four other nucleotides which go in there. You want the correct one to match perfectly, but you also want to make sure that the wrong ones will match as seldom as possible.

"Two nucleotides - one odd, the other even - will occasionally actually fit, in a large minority of the time. Whereas if you use only all even parity nucleotides or all odd parity nucleotides, you'll find that the opportunity for a mismatch to occur and actually get through is considerably reduced."

For example, Piccirilli's nucleotides, X and K, were a perfect match for each other - the pattern on X was lone pair, hydrogen, lone pair (the same as T); and K had hydrogen, lone pair, hydrogen.

But now the green-eyed monster appears: X can also mate with C. Even though their hydrogens and lone pairs don't match up perfectly, it's chemically possible because the size of their rings fools them. Large rings always mate with small, and small with large - that's how it happens.
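A small Python sketch makes the near-miss visible. It takes the patterns given in the article for C, G, X and K and counts the points at which a hydrogen faces a lone pair. The names PATTERNS and matching_points are made up for this illustration, and the count says nothing about how strong each bond actually is.

```python
H, LP = "hydrogen", "lone pair"

# Patterns at points one, two and three, as given in the article.
PATTERNS = {
    "C": (H, LP, LP),
    "G": (LP, H, H),
    "X": (LP, H, LP),
    "K": (H, LP, H),
}

def matching_points(a, b):
    """Count the points where a hydrogen on one side faces a lone pair on the other."""
    return sum(p != q for p, q in zip(PATTERNS[a], PATTERNS[b]))

for pair in [("C", "G"), ("X", "K"), ("X", "C")]:
    print(pair, matching_points(*pair), "of 3 points match")
```

C with G and X with K match at all three points; X with C matches at two of the three, which is close enough to cause trouble.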

Poor G! Is there no decency in this world, even at the molecular level?

Worse still, when X (which is odd parity) does mate with C (which is even parity), we have trouble. X may have a copy of C when the strands separate, but in subsequent copies, C is more likely to come out with a copy of G after all.

The trouble is that this means that the new string that's been created is now wrong - in effect, the blueprint is wrong. The living cell that's formed using the new DNA strand - an incorrect blueprint - might work right, but it might not work, or it could work wrong. Or it could just work differently.

"Once it's on paper in front of you as a problem, it's a simple exercise with pen and paper," says Mac Dónaill. "Quite seriously, once you've made the jump from molecular structure into binary numbers, it really is a problem that you could give at the end of secondary school or first year computer science. So I just did it with pen and paper."

The way he did the test was this: he assigned the value 1 to hydrogen, and the value 0 to a lone pair. And then he assigned the value 0 to a double ring, and 1 to a single one. That meant that C's value was 100,1 (hydrogen, lone pair, lone pair; single ring). G was 011,0 (lone pair, hydrogen, hydrogen; double ring).
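The same bookkeeping can be written out in Python. The sketch below follows the codes quoted above for C (100,1) and G (011,0), so a single ring is written as 1 and a double ring as 0. The codes for A and T come from the patterns and sizes given earlier in the article. The sizes of X and K are an assumption, inferred from the statements that X is odd parity and that large always pairs with small.

```python
# First three digits: 1 = hydrogen, 0 = lone pair.
# Last digit: 1 = small (single ring), 0 = large (double ring).
CODES = {
    "A": "1010",  # hydrogen, lone pair, hydrogen; large
    "T": "0101",  # lone pair, hydrogen, lone pair; small
    "C": "1001",  # hydrogen, lone pair, lone pair; small
    "G": "0110",  # lone pair, hydrogen, hydrogen; large
    "X": "0100",  # same pattern as T; size assumed large
    "K": "1011",  # hydrogen, lone pair, hydrogen; size assumed small
}

def parity(code):
    """Return 'even' or 'odd' according to the number of 1s in the code."""
    return "even" if code.count("1") % 2 == 0 else "odd"

for name, code in CODES.items():
    print(name, code, parity(code))
```

A, T, C and G all come out even, while X and K come out odd.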

He drew a hypercube - a cube within a cube - to give himself an image of four dimensions, and he mapped directions to those binary numbers. This is a way used in computer science to show the relationships between "code words" - sets of binary numbers - used in data transmission, and to test how easily errors can be recognised.

The small rings were shown on the inner cube, and the large rings on the outer cube. And the binary numbers of the hydrogens and lone pairs were the up, down, left and right, back and forward positions.

The first bit determined left or right (0 left, 1 right), the second bit determined front (0) or back (1); the third bit determined down (0) or up (1), and the fourth - the bit for the size of the rings - the inner (1) or outer (0) cube.
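As a rough illustration, the directions above can be turned into a short Python function. The names position and steps_apart are invented for the example, and the codes for C and G are the ones given in the article.

```python
def position(code):
    """Translate a four-bit code into the directions described in the article."""
    first, second, third, fourth = code
    return (
        "right" if first == "1" else "left",
        "back" if second == "1" else "front",
        "up" if third == "1" else "down",
        "inner cube" if fourth == "1" else "outer cube",
    )

def steps_apart(a, b):
    """Count the bits that differ: the number of edges between the two corners."""
    return sum(x != y for x, y in zip(a, b))

C, G = "1001", "0110"
print(position(C))        # ('right', 'front', 'down', 'inner cube')
print(position(G))        # ('left', 'back', 'up', 'outer cube')
print(steps_apart(C, G))  # 4
```

C and G sit at opposite corners of the hypercube, four steps apart, which is what a perfect fit looks like in this picture.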

Then he worked out what positions in the hypercube the various nucleotides would reach. And he found that it worked: A would always fit T, and C would always fit G well - whereas the likelihood of a mismatch was much greater with the other possible patterns.
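One way to see why sticking to a single parity helps is to measure how close together the corners sit. The Python sketch below is the standard coding-theory reading of the hypercube, not necessarily the exact working Mac Dónaill did on paper, and the function names are made up for the example.

```python
from itertools import combinations, product

def differing_bits(a, b):
    """Count the positions at which two four-bit codes differ."""
    return sum(x != y for x, y in zip(a, b))

def closest_pair(codes):
    """Smallest number of differing bits between any two distinct codes."""
    return min(differing_bits(a, b) for a, b in combinations(codes, 2))

corners = ["".join(bits) for bits in product("01", repeat=4)]
even_corners = [c for c in corners if c.count("1") % 2 == 0]

print(len(corners), closest_pair(corners))            # 16 1
print(len(even_corners), closest_pair(even_corners))  # 8 2
```

With all 16 patterns in play, a wrong partner can differ from the right one in a single position. Within one parity, any two patterns differ in at least two positions, so a wrong partner is a worse fit.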

He had used the hypercube model to prove that the patterns worked. But what of the chemistry?

When he looked at the nucleotides in this way, the odd parity set of eight didn't look as good a model as the even parity set. "In the odd parity set of eight, it seems that six of them are not chemically viable, or their patterns are unstable - the hydrogens would move. It's as if you have a lock where the teeth in the lock actually move about - that wouldn't be very satisfactory - sometimes the key would work, the next day it wouldn't."

In the eight nucleotides whose binary numbers work out as even parity, four were unstable. "Of the eight patterns in the even set, four of them are not chemically very stable - but this time only four. When I eliminated the unstable ones, I was left with A, C, T and G."

What he had discovered was simple - astoundingly simple, but an earth-shattering discovery in genetic terms.

What Mac Dónaill discovered was that nature uses the size of the nucleotides as a parity bit - as the extra, error-resisting, piece of information that makes sure the information transmitted is correct.
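This can be checked directly. In the Python sketch below, which is an illustration with made-up names, the hydrogen and lone pair pattern of each natural nucleotide is treated as the message, and the bit needed to bring the number of 1s up to an even count is compared with the nucleotide's size. It assumes the same convention as the quoted description: hydrogen is 1, lone pair is 0, small is 1 and large is 0.

```python
SIZE_BIT = {"large": "0", "small": "1"}

# Pattern (1 = hydrogen, 0 = lone pair) and size, as described in the article.
NATURAL = {
    "A": ("101", "large"),
    "T": ("010", "small"),
    "C": ("100", "small"),
    "G": ("011", "large"),
}

def parity_bit(pattern):
    """Bit that brings the total number of 1s up to an even count."""
    return str(pattern.count("1") % 2)

for name, (pattern, size) in NATURAL.items():
    print(name, pattern, size, parity_bit(pattern) == SIZE_BIT[size])
```

For all four natural nucleotides the size is exactly the bit an even-parity scheme would add.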

He couldn't believe what he was seeing at first. It was too simple to be true. He moved on to other problems in error coding. And he had a large burden of administrative work - at the time he was director of the computational chemistry programme in TCD. So he didn't have a lot of time available for pure science in any case.

"I didn't publish it for two reasons: I didn't have the time to verify it, and I needed to check also that nobody else had done this. But partly the solution was actually so simple - I just didn't believe that nobody had published it. I was quite frankly a little bit worried that I was going to make an eejit of myself.

"It took time - I searched the literature very carefully, and I checked this again and again, and I tried to make sure as objectively as possible that there wasn't something that I'd missed."

In June 2002 Mac Dónaill submitted his paper to the leading publication Chemical Communications, and got raves from the review board. It was picked up by Science and Nature - the world's two leading science journals.

Then the Mathematical Association of America wrote a piece on his discovery, then Science News. Then national organisations all over the world flocked in - academies of sciences, then science magazines, both popular and official, wrote shorter or longer articles. The Chinese Academy of Sciences and the Hungarian Academy of Sciences covered it. It even made its way into fundamentalist Christian publications in the US.

Mac Dónaill has now had a number of invited papers, in publications like the journal Origin of Life, and the Journal of Molecular Physics. The IEEE, one of the leading engineering societies in the world, has invited him to speak at a conference, on information theory in molecular biology.

"Some years ago Richard Dawkins made a rather controversial statement," says Mac Dónaill. "He said: 'If you want to understand life, don't think about throbbing gels, think about information technology.' So the idea has been around in broader terms for some time that life is at heart an informational, computational process.

"In a sense, nobody would confuse, to give an analogy, the program that they've got written on their CD or minidisc with the hardware of the disc itself: those are quite distinct conceptually.

"There is a suggestion that perhaps to some extent we have made that mistake in molecular biology - we are confusing the hardware of life with the life process, which may be more like - many of the aspects or features of life, the magic of life, is in the information, and the chemicals provide the medium.

"So we think of matter as living, but in a very real way we have a program encoded in matter - it's really the program which is living. But it's written in DNA. You might call it 'slimeware'.

"But at heart, whether it's written in DNA or written in a magnetic or optical material on discs or CDs does not change the fundamental nature of the information."

Mac Dónaill has made a stunning discovery - a triumph for science, and a huge step forward. And it is a discovery that has come out of an Irish university, from the thought and research of an Irish scientist. This basic, fundamental observation of the behaviour of nucleotides has revolutionised the way that we look at the coding of DNA.


First published in Science Spin magazine,
© Lucille Redmond


Anonymous said...

I am confused as to whether the FOUR base pairs are a consequence of the TRIPLET CODE or is it vice versa? Please shed some light.

Pageturners said...

The man to ask is Dónall Mac Dónaill, who you'll find in Trinity College Dublin.

BioMed said...

Great read, it’s actually going to help me answer my midterm question. I would like to point out that the base pairing as described here (AT, GC) is not “always” the case. In RNA, U sometimes pairs with G (known as the wobble pair); in DNA, 4 Gs sometimes create a very stable tetraplex complex; and other non-Watson-Crick base pairings are also possible.