Friday, September 14, 2007

Genome Size, Complexity, and the C-Value Paradox

Forty years ago it was thought that the amount of DNA in a genome correlated with the complexity of an organism. Back then, you often saw graphs like the one on the left. The idea was that the more complex the species the more genes it needed. Preliminary data seemed to confirm this idea.

In the late 1960s, scientists started looking at the complexity of the genome itself. They soon discovered that large genomes were often composed of huge amounts of repetitive sequences. The amount of "unique sequence" DNA was only a few percent of the total DNA in these large genomes.1 This gave rise to the concept of junk DNA and the recognition that genome size was not a reliable indicator of the number of genes. That, plus the growing collection of genome size data, soon called into question simplistic diagrams like the one shown here from an article by John Mattick in Scientific American (Mattick, 2004). (There are many things wrong with the diagram. Can you identify all of them? See What's wrong with this figure? at Genomicron.)

Today we know that there isn't a direct correlation between genome size and complexity. Recent data, such as those from Ryan Gregory's website (right), reveal that genome sizes within many groups can vary over several orders of magnitude [Animal Genome Size Database]. Mammals don't have any more DNA in their genomes than most flowering plants (angiosperms), or even gymnosperms, for that matter.

Many of us have been teaching this basic fact for twenty years. The bottom line is ....
Anyone who states or implies that there is a significant correlation between total haploid genome size and species complexity is either ignorant or lying.
It is notoriously difficult to define complexity. That's only one of the reasons why such claims are wrong. Ryan Gregory wants everyone to know that the figure showing genome sizes in different phylogenetic groups is not meant to imply a hierarchy of complexity from algae to mammals.

A recent paper by Taft et al. (2007) says complexity can be "broadly defined as the number and different types of cells, and the degree of cellular organization." We can quibble about the definition but there's nothing better that I know of. The real question is whether organism complexity is a useful scientific concept.

Here's the problem. Have some scientists already made up their minds that mammals in general, and humans in particular, are the most complex organisms? Do they construct a definition of complexity that's guaranteed to confer the title of "most complex" on humans? Or is complexity a real scientific phenomenon that hasn't yet been defined satisfactorily?

I, for one, don't know whether humans are more complex than an owl, or an octopus, or an orchid. For all I know, humans may be less complex by many scientific measures of complexity. Plants can grow and thrive on nothing but water, some minerals, and sunlight. We humans can't even make all of our own amino acids. Does that make us less complex than plants? Certainly it does at the molecular level.

Back in the olden days, when everyone was sure that humans were at the top of the complexity tree, the lack of correlation between genome size and complexity was called the C-value paradox, where "C" stands for the haploid genome size. The term was popularized by Benjamin Lewin in his molecular biology textbooks. In Genes II (1983) he wrote:
The C value paradox takes its name from our inability to account for the content of the genome in terms of known function. One puzzling feature is the existence of huge variations in C values between species whose apparent complexity does not vary correspondingly. An extraordinary range of C values is found in amphibians where the smallest genomes are just below 10⁹ bp while the largest are almost 10¹¹ bp. It is hard to believe that this could reflect a 100-fold variation in the number of genes needed to specify different amphibians.
So, the paradox arises even if we don't know how to rank flowering plants and mammals on a complexity scale. It arises because there are so many examples of very similar species that have huge differences in the size of their genomes. Onions are another example; they are the reason why Ryan Gregory made up the Onion Test.
The onion test is a simple reality check for anyone who thinks they have come up with a universal function for non-coding DNA. Whatever your proposed function, ask yourself this question: Can I explain why an onion needs about five times more non-coding DNA for this function than a human?
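The arithmetic behind the onion test is simple. Here is a back-of-the-envelope sketch using approximate haploid genome sizes (onion roughly 16 Gb, human roughly 3.2 Gb); the coding fraction is an assumed round number, applied to both genomes purely for illustration:

```python
# Illustrative, approximate haploid genome sizes (order-of-magnitude values).
ONION_BP = 16.0e9   # Allium cepa, roughly 16 Gb
HUMAN_BP = 3.2e9    # Homo sapiens, roughly 3.2 Gb

# Protein-coding DNA is a tiny slice of each genome (~1-2% in humans).
# Using the same assumed fraction for both shows that nearly all of the
# difference between the two genomes is non-coding sequence.
CODING_FRACTION = 0.015

onion_noncoding = ONION_BP * (1 - CODING_FRACTION)
human_noncoding = HUMAN_BP * (1 - CODING_FRACTION)

print(f"onion/human non-coding ratio: {onion_noncoding / human_noncoding:.1f}")
```

Whatever function you propose for non-coding DNA has to survive that factor of five.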
Imagine the following scenario. You are absolutely convinced that humans are the most complex species but total genome size doesn't reflect your conviction. The C-value paradox is a real paradox for you. Knowing that much of our genome is possibly junk DNA still leaves room for plenty of genes. You take comfort in the fact that under all that junky genome, humans still have way more genes than simple nematodes and flowering plants. You were one of those people who wanted there to be 100,000 genes in the human genome [Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome].

But when the genomes of these species are published, it turns out that even this faint hope evaporates. Humans, Arabidopsis (wall cress, right), and nematodes all have about the same number of genes.

Oops. Now we have a G-value paradox, where "G" is the number of genes (Hahn and Wray, 2002). The only way out of this box—without abandoning your assumption about humans being the most complex animals—is to make up some stories about the function of so-called junk DNA. If it turns out that there are lots of hidden genes in that junk then maybe it will rescue your assumption. This is where we get some combination of the excuses listed in The Deflated Ego Problem.

On the other hand, maybe humans really aren't all that much more complex, in terms of number of genes, than wall cress. Maybe they should have the same number of genes. Maybe the other differences in genome size really are due to variable amounts of non-functional junk DNA.

1. Thirty years ago we had to teach undergraduates about DNA reassociation kinetics and Cot curves—the most difficult thing I've ever had to teach. I'm sure glad we don't have to do that today.
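For anyone curious about what made Cot curves so painful: ideal DNA reassociation follows second-order kinetics, so the fraction of DNA still single-stranded at a given C₀t is 1/(1 + k·C₀t), with the rate constant k inversely proportional to sequence complexity. A minimal sketch, using arbitrary illustrative rate constants rather than measured values:

```python
# Second-order reassociation kinetics: C/C0 = 1 / (1 + k * C0t).
# The rate constant k is inversely proportional to sequence complexity,
# so repetitive DNA (large k) reanneals at much lower C0t than unique DNA.
# The k values below are arbitrary, chosen only to illustrate the shape.

def fraction_single_stranded(c0t: float, k: float) -> float:
    """Fraction of DNA still single-stranded at a given C0t."""
    return 1.0 / (1.0 + k * c0t)

k_repetitive = 100.0   # low-complexity, highly repeated sequence
k_unique = 0.01        # high-complexity, single-copy sequence

for c0t in (1e-3, 1e-1, 1e1, 1e3):
    rep = fraction_single_stranded(c0t, k_repetitive)
    uniq = fraction_single_stranded(c0t, k_unique)
    print(f"C0t={c0t:8.3f}  repetitive ss={rep:.3f}  unique ss={uniq:.3f}")
```

Half the DNA has reannealed when C₀t = 1/k, which is why a genome with both fractions produces a curve with two widely separated steps.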

Hahn, M.W. and Wray, G.A. (2002) The G-value paradox. Evol. Dev. 4:73-75.

Mattick, J.S. (2004) The hidden genetic program of complex organisms. Sci Am. 291:60-67.

Taft, R.J., Pheasant, M. and Mattick, J.S. (2007) The relationship between non-protein-coding DNA and eukaryotic complexity. BioEssays 29:288-299.

[Photo Credits: The first figure is taken from a course website at the University of Miami (Molecular Genetics). The second figure is from Ryan Gregory's Animal Genome Size Database (Statistics).]


  1. Thirty years ago we had to teach undergraduates about DNA reassociation kinetics and Cot curves—the most difficult thing I've ever had to teach. I'm sure glad we don't have to do that today.

    I learned about cot curves in a molecular evolution course I took less than ten years ago. It was taught by an ol' school geneticist, so it's not representative of molecular evolution courses in general. But people do still learn about that stuff.

  2. I was also teaching it about twelve years ago when I last taught an upper-level molecular genetics course. It always amused me to see seniors who had supposedly aced calculus three years previously get the deer in the headlights look when confronted with simple algebra. Always made me wonder just what the hell the math department was actually teaching.

  3. Absolutely fascinating article Larry! Lots of challenges. You seem to be raising a question over that final human ‘conceit’ - that is, in the face of the ever-advancing principle of cosmic mediocrity that started with Copernicus, human pride might at least find some comfort in the thought that we are manifestly the most complex things this side of the nearest star! (and perhaps quite a few other stars as well!).

    A necessary condition of complexity is variety, although ‘variety’ isn’t a sufficient condition to define the working systems of life, it clearly subsets them. Well, using this necessary condition to define complexity, perhaps there is some mileage to be gained from the following line of thought:

    It has always been clear in the field of computation that complexity of output arises not just because of complex starting information (like complex DNA info), but also because of the length of time a computation takes place; that is, the complexity of the final output derives from two computational resources, namely the complexity of the initial information and the generation time.

    Hence, applying these ideas, it would seem to me that assembly time also has a bearing when considering the comparative complexity of organisms. With humans the ‘assembly time’ needs to be taken into account; and that assembly time must also include the assembly of proteins that we don’t manufacture ourselves but take on board as food. In short, being at the top of the food chain, our true DNA sequence implicitly includes parts of the DNA sequences of organisms from which we derive proteins; our effective DNA sequence is a concatenation of DNA sequences from other organisms. Moreover, a viable human also requires a lot of social training, and perhaps that should be taken into account too.

    So, for all you human complexity chauvinists out there, there is hope for you yet!

  4. Timothy V Reeves wrote: "In short being at the top of the food chain our true DNA sequence implicitly includes parts of the DNA sequence of organisms from which we derive proteins...."

    I am sure the notion that we are "at the top of the food chain" will be a great reassurance to you and any companions the next time you hike where there are grizzlies or swim where there are great white sharks.

  5. I think it is clear that the "complexity is beautiful" guys are not only burning their candle at both ends, but also holding a blowtorch to the middle.

    At one end, complexity has a funky relationship with information and descriptions. Kolmogorov complexity (algorithmic information) is a measure of the resources needed to specify an object. A random string takes the most information to specify.

    At the other end, the difficulty with defining complexity is that there is no single measure that can capture all structural characteristics.

    Taft's organism measure seems like a good measure as a first approximation to capture general complexity, but it leaves out many traits, behavioral complexity, et cetera.

    Another measure with descriptive power for biological systems is mutual information, which is linked to Shannon information. It is claimed to have been used to characterize RNA secondary structure, but in any case mutual information can be used to define neural complexity.

    This complexity is lowest for regular or random systems of elements, but highest for networks with order on all scales such as brains. (Or relationships, internets, glasses, et cetera.)

    The different meanings of complexity and their measures restrict eventual usefulness, which so far seems mainly descriptive.

    And in the middle we have the observations. Btw, all organisms have had the same time (if not always exactly the same rate) to evolve, so IMHO this nicely leaves an a priori likelihood untouched.
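[A side note on the Kolmogorov point above: the measure itself is uncomputable, but compressed length is a standard crude upper-bound proxy. A minimal sketch, using zlib purely as an illustrative stand-in rather than a real complexity measure, shows why a random string "takes the most information to specify" while a repetitive one takes almost none:]

```python
import random
import zlib

def compressed_len(data: bytes) -> int:
    """zlib-compressed length: a crude upper-bound proxy for complexity."""
    return len(zlib.compress(data, 9))

random.seed(42)
random_seq = bytes(random.randrange(256) for _ in range(10_000))
repetitive_seq = b"ACGT" * 2_500  # same length, very little information

print("random bytes:   ", compressed_len(random_seq))
print("repetitive ACGT:", compressed_len(repetitive_seq))
```

The random 10 kB barely shrinks at all, while the repetitive sequence collapses to a few dozen bytes.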

  6. Timothy Reeves:

    Your description of computational complexity is interesting. Do you have any references?

    Btw, my impression was that CS distinguished between computational cost in space (memory constraint) and time (time constraint). I thought computational complexity (and its classes) described the former and that it can be traded off for the latter?!

    In any case, I'm not sure your description of human 'assembly time' is entirely accurate.

    First, the ova and its surrounding contributes a lot of starting information. It brings the cell machinery and maternal hormones that imprints directions on the fetus early on. Second, the difference in food needs (mainly vitamins I believe) and protein expression between us and much smaller animals isn't all that great. And comparing a child and an adult it seems the main difference from growth is size. :-o

    On behavioral complexity, I believe you may be on to something simple yet powerful. Also, you can proficiently run the argument in reverse; evolutionary 'assembly time' is pretty much the same for all organisms so they should contain pretty much the same "starting information".

  7. For "evolutionary 'assembly time'" substitute "evolutionary 'assembly process'".

  8. We don't take up any proteins from food, normally. Our digestive system hacks them to pieces of 1, 2, or 3 amino acids, and those go from the gut to the blood.

    BTW, Torbjörn, the singular is ovum.

  9. Thanks for your erudite and thought provoking comments Torbjorn. Unfortunately it’s time for bed here, so I’ll have to leave a deeper study of them until tomorrow.

    In the meantime just to remind me that I’m at the top of the food chain, I’ll keep a gun and harpoon by my side in case I meet any stray bears or sharks! Shark and Chips? Yummy!

  10. Thanks, David, I botched more than the usual amount of grammar. That's what a Saturday beer gets you. :-P

  11. What I said above is exploratory – I’m not pushing it as a fact, just a subject for exploration. I admit that current ambiguities in defining complexity (as suggested by both Larry and Torbjorn) may scupper all attempts to detect and attribute any special complexity status to humans; the differences in the complexities of the higher organisms may be just too subtle to be picked up by our crude notions of complexity. (As Torbjorn suggested)

    In an attempt to circumvent these issues I have used a necessary condition for complexity rather than a sufficient condition: a necessary condition of life is that it must display a high variety of structure/configuration/behaviour. However, this condition has the disadvantage of widening the definitional net so much that it gives randomness the highest complexity ‘status’. The ‘mutual information’ notion alluded to by Torbjorn tries to eliminate this by peaking complexity somewhere between regularity and randomness. But as I am comparing organisms with organisms rather than organisms with, say, ‘gases’, the mutual information factor is inherent in the act of identifying the organism in the first place – that identification can only take place because of the mutual information that constitutes a group of cooperating molecules and cells.

    So do humans have the greatest variety in terms of some combination of structure, configuration and especially behaviour? Well, assuming they do, then computationally speaking that implies humans will require the greatest number of steps in their construction, and that construction must include the steps in the construction of off-the-peg molecules (bits of protein, according to Dave above) taken from other organisms, not to mention the long process of socialization. Hence humans can boast that they are biggest and best on at least one count – the sheer construction work!

    In spite of all that it may be that what really sets human biological configurations apart is not their structural variety but rather in some other way that, like a simple yet elegant algorithm, is best expressed as being “just damn clever”. Given the complexity of complexity space can we ever hope to fully mathematicize the notion of being ‘just damn clever’? It could be that my notion of complexity above should dispense completely with the idea of variety as a measure of complexity and perhaps I should fall back on quantity of construction steps only – that is, some biological configurations might be comparatively simple on the variety front but are extraordinarily difficult to find in the many pathways of complexity space because they are the equivalent of some distant far flung backwater that requires a lot of computational fuel to find.

    Technical note to Torbjorn: The relationship of variety and computation steps is something I’m still thinking about. Hence no references yet. Unfortunately Chaitin, the man I go to for all things algorithmic, seems much less interested in computation time than in program string length. Program strings map to an output, which may be just a simple yes or no, or it may be Omega. In short, Chaitin is interested in functions, or ‘halting programs’. I’m interested in developing systems that don’t stop, such as evolution or even a simple non-halting counting algorithm that tracks through all configurations. Moreover, what goes on in memory I consider as ‘output’.

  12. Timothy Reeves:

    The correspondence between algorithmic constructions and physics is interesting. Computer scientist Scott Aaronson has a lot to say here. The impression I get is that he claims that physical processes are algorithmic at their base. (Even if that is quantum algorithmic.)

    So you put the finger on a problem here. Algorithmic theory is interested in the resources (time and space) taken to deliver a result. While physics necessarily describes the ongoing process. It will be interesting to see if guys like Aaronson can figure out a more direct correspondence.

    About humans, to be honest I don't see any large difference between humans and other species, and certainly not a qualitative difference.

  13. Thanks for that Torbjorn. I'll look up Scott Aaronson and see what he says.

    Yes, you may be completely right about the differences (or lack of them) between humans and comparably complex organisms. I was just engaging in some rather seat-of-the-pants speculations - that's how I like my science!

    In any case the exercise may be as pointless as trying to compare the complexity of a sports car with a heavy truck - what really distinguishes the two is not so much a difference in complexity measure but a difference in function!

  14. Whatever your proposed function, ask yourself this question: Can I explain why an onion needs about five times more non-coding DNA for this function than a human?

    Or, when dealing with Hovind-level creationists: "If I were to read this argument in The Onion, would it seem out of place?"

  15. HELLO,
    It's amazing how my search about the C-value paradox brought me to your blog.
    Thanks for that dose of information, it helped me understand things well... although it was a paradox...

  16. I just heard a professor tell 150 undergraduates that as you move up the ladder of evolution and organisms get more complex, the size of the gene increases (due to more introns). The textbook doesn't say this but kind of hints at it by using carefully chosen organisms arranged from bacterium to yeast to Drosophila to human, and hey it sure looks true! I came right back to my computer and looked up this blog post in order to remind myself of the reality and set my portion of the class straight in discussion. It should be required reading. Thank you for writing it.
    (Also: "ladder of evolution"?? what century is this?? ARGH)

  17. Do you think that alternative splicing will have a major impact on the figures people are producing?

    (I have no strong opinions on this subject and certainly don't have a particular axe to grind. I found your post interesting, but I was expecting to see maybe a little on the impact of cladistic variability in the amount of alternative splicing (is there any?). Do you (or others) believe that alternative splicing is evolutionarily relevant in terms of complexity, for example?)

    1. I don't think alternative splicing is very significant. Most of what is referred to as "alternative splicing" is artifact or splicing errors.

  18. The word "complex" tends to mean "many parts" but tells us nothing about the interrelation of those parts, a functional mechanical watch is complex but also a mangled mechanical watch is comnplex, a living animal is complex but also a largely decomposed animal is complex. We are using the same word to mean opposite states! I would purpose we reserve the word "complex" for the definition "composed of many independent parts" and introduce the word "sophisticated" for the definition "composed of many interdependent parts". Complex systems may be predicted by probabilities, but sophisticated systems can only be predicted by a close tally of each part, genes are more sophisticated than complex, the environment is more complex than sophisticated, the issue here is not an argument about that which is more complex but rather it is an argument about that which is more sophisticated