Last time we left you with the mother of all molecular cliffhangers: how can it be that the simple four-letter code of DNA can carry the information to make all life? Early in that piece we’d thrown in the fact that the human genome (i.e. our DNA) is made up of three billion letters. As Watson and Crick showed back in 1953, it’s actually two intertwined molecules, each with three thousand million letters – but it’s the number that’s important because that carries all that’s needed to make you and me.
But, if you’re like me, you have real problems grasping the meaning of numbers much above 100 – so that ‘millions’, let alone ‘billions’, come across simply as ‘lots’ – and we’re left shaking our heads in bewilderment as to how it works.
A different angle
To get some sort of a grip on the scale of information that genomes can carry, it might be helpful to look at DNA from the other end, so to speak. This approach started five years ago among a group who work on applying computer technology to handling biological data – i.e. how to acquire, store, analyse and interpret the tsunami of genetic information now being produced. It’s a new field called bioinformatics.
What set the bioinformatics bods thinking is a point that will have occurred to you as an internet user (and who isn’t?). How can we deal with the unimaginable amount of info we want to store? That includes everything from your holiday snaps to the tons of scientific data, including the continuing flood of genomics. If ‘millions’ leaves you boggling, how about the estimate for the global digital archive of 44 trillion gigabytes by 2020? (Written out in bytes, that’s 44 followed by 21 zeros, or 44 zettabytes.) That’s a 10-fold increase from 2013.
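For anyone who wants to check the zero-counting, here is a quick sketch in Python. It assumes decimal units (a gigabyte as 10^9 bytes and a trillion as 10^12); the variable names are just mine for illustration.

```python
# Sanity check on "44 trillion gigabytes", assuming decimal (SI) units.
TRILLION = 10**12    # assumption: short-scale trillion
GIGABYTE = 10**9     # assumption: decimal gigabyte, in bytes
ZETTABYTE = 10**21   # decimal zettabyte, in bytes

archive_bytes = 44 * TRILLION * GIGABYTE

# Count the zeros that follow the leading "44" when written out in full.
zeros_after_44 = len(str(archive_bytes)) - len("44")

print(archive_bytes)
print(zeros_after_44, "zeros after the 44")
print(archive_bytes // ZETTABYTE, "zettabytes")
```

The zeros simply add up: 12 from the trillion and 9 from the gigabyte.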
Whatever the numbers are, they’re unimaginable but, aside from being boggled by the facts, a slight problem is that storing that amount on conventional memory sticks would use at least 10 times the amount of available silicon. So, as they say, we have a problem.
DNA to the rescue?
The boffins worked out that if you could use the storage capacity of DNA as efficiently as possible, the length you’d need to squeeze in all those trilla-bytes would correspond to about a kilogram of DNA. Put another way, the storage density of DNA is 1000 times that of flash memories. How would that work? Well, in principle it’s a simple, four-step process:
- Convert text to binary code (ones and zeroes)
- Convert binary code to a base-3 (ternary) code, whose digits are called ‘trits’: zeroes, ones and twos
- Use the trit code to make DNA (each 0, 1 or 2 is translated into a base, A, T, G or C, that differs from the one just used)
- Make the DNA in overlapping fragments (to give 4 copies of each piece of code)
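The four steps above can be sketched in a few lines of Python. This is a toy illustration, not the published method: the fixed six trits per byte, the starting ‘previous base’ of A, the particular rotation rule for picking the next base, and all the function names are my own assumptions, chosen only so that no base ever repeats and the text can be read back. The fragment length of 100 with a step of 25 is likewise just a convenient way to get four-fold overlap.

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # assumption: fixed width, since 3**6 = 729 covers all 256 byte values


def text_to_bits(text):
    """Step 1: text -> string of ones and zeroes (8 bits per UTF-8 byte)."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_trits(bits):
    """Step 2: each 8-bit byte -> six base-3 digits (trits), most significant first."""
    trits = []
    for i in range(0, len(bits), 8):
        value = int(bits[i:i + 8], 2)
        digits = []
        for _ in range(TRITS_PER_BYTE):
            value, digit = divmod(value, 3)
            digits.append(digit)
        trits.extend(reversed(digits))
    return trits


def trits_to_dna(trits, previous="A"):
    """Step 3: each trit picks one of the three bases that differ from the last one."""
    out = []
    for trit in trits:
        previous = BASES[(BASES.index(previous) + 1 + trit) % 4]
        out.append(previous)
    return "".join(out)


def overlapping_fragments(dna, length=100, step=25):
    """Step 4: cut into overlapping pieces; interior bases appear in 4 fragments."""
    starts = list(range(0, max(len(dna) - length, 0) + 1, step))
    if starts[-1] + length < len(dna):
        starts.append(len(dna) - length)  # extra fragment so the tail is not lost
    return [dna[s:s + length] for s in starts]


def dna_to_text(dna, previous="A"):
    """Reverse steps 3 to 1, to show that the encoding can be read back."""
    trits = []
    for base in dna:
        trits.append((BASES.index(base) - BASES.index(previous) - 1) % 4)
        previous = base
    data = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        data.append(value)
    return data.decode("utf-8")


sonnet_line = "Shall I compare thee to a summer's day?"
dna = trits_to_dna(bits_to_trits(text_to_bits(sonnet_line)))
pieces = overlapping_fragments(dna)

print(dna[:60] + "...")
print(len(dna), "bases in", len(pieces), "overlapping fragments")
print(dna_to_text(dna))
```

The point of never repeating a base is that runs of the same letter are harder to read accurately, and the point of the overlap is that a damaged or missing fragment can be recovered from its neighbours. A real scheme would also need to label each fragment with its position, which this sketch leaves out.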
One of the first experiments encoded Shakespeare’s sonnets in DNA, which showed that the idea was feasible – what scientists call a ‘proof of principle’. Of course, that’s only a beginning. There are big problems to overcome, like being able to make DNA strands cheaply and quickly enough and to be able to access the data required with the ease we’re used to with hard drives and flash memories. On the flip side, DNA preserved in permafrost has been sequenced from woolly mammoths tens of thousands of years old and from horses entombed for 700,000 years, so we know that as a storage medium it’s rather more durable than anything currently in use.
For the record
The key point here is that, at the moment, DNA appears to be the only option if we are not to grind to a halt on the information storage front. Whether or not the practical problems get solved, that alone gives a new perspective on the coding power of those four little bases, A, C, G and T.
Extance, A. (2016). How DNA could store all the world’s data. Nature 537, 22–24.
Goldman, N. et al. (2013). Nature 494, 77–80.
Orlando, L. et al. (2013). Nature 499, 74–78.