2012-12-10

Evidence for common descent: DNA redundancy

Caveat

I am not a biologist. The experiment I conducted is based on a basic knowledge of molecular biology. I welcome criticism and am open to being corrected on misconceptions.

Introduction

Anyone who has looked into molecular biology knows that DNA forms a kind of "code" that is translated into proteins. Even creationists do not dispute that. The process works as follows: the DNA is copied to mRNA (messenger RNA), which is identical to the coding strand of the DNA except that the nucleotide T (thymine) is replaced by U (uracil). The translation mechanism then scans the RNA for a "start codon". A "codon" is 3 consecutive nucleotides, each of which can be A (adenine), U (uracil), C (cytosine) or G (guanine). The start codon is AUG. When this codon is encountered, the mechanism starts translating the RNA one codon at a time into amino-acids until a stop codon is reached, and these amino-acids are chained together to form the protein.
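To make the steps above concrete, here is a minimal Python sketch of the T-to-U copy and the scan for the start codon. The function names are my own, and the input is assumed to be the coding strand of the DNA. It also assumes the three standard stop codons (UAA, UAG, UGA) end the reading frame. Real translation is far more involved than this.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard stop codons


def transcribe(dna_coding_strand: str) -> str:
    """Turn a DNA coding-strand sequence into mRNA by replacing T with U."""
    return dna_coding_strand.upper().replace("T", "U")


def reading_frame(mrna: str) -> list[str]:
    """Return the codons from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon, nothing to translate
    codons = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        codons.append(codon)
    return codons


# Made-up toy sequence, only to show the mechanics.
mrna = transcribe("ccATGGCTTGGTAAgg")
print(mrna)                 # CCAUGGCUUGGUAAGG
print(reading_frame(mrna))  # ['AUG', 'GCU', 'UGG']
```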

Since there are 3 nucleotides in a codon, and each nucleotide can be one of 4 possibilities, the number of codons that can be formed is 64 (i.e. 4x4x4). However, far fewer amino-acids occur in our proteins (the standard code uses 20), so some amino-acids are encoded by multiple codons. Which codon translates into which amino-acid can be found in the following table (copied from the Wikipedia entry for "Genetic code"):
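The same mapping can also be written down as a small Python sketch. It builds the standard genetic code from the compact 64-letter string used in NCBI translation table 1 (one-letter amino-acid codes, "*" for stop) and counts how many codons map to each amino-acid. The variable names are my own.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino-acid codes for the 64 codons in UCAG order
# (third position varies fastest); "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many different codons encode each amino-acid (the redundancy).
codons_per_amino_acid = Counter(CODON_TABLE.values())

print(len(CODON_TABLE))            # 64 codons in total
print(CODON_TABLE["AUG"])          # M (methionine, the start codon)
print(codons_per_amino_acid["L"])  # 6 codons for leucine
print(codons_per_amino_acid["W"])  # 1 codon for tryptophan
```

Only methionine and tryptophan have a single codon. Every other amino-acid can be written in at least two ways, and that is the redundancy used in the rest of this post.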



DNA redundancy

The DNA sequences of genes and the resulting proteins differ between organisms. These differences have often been used to create a phylogenetic tree indicating common descent. The objection from creationists is that these differences matter and that the creator made the proteins the way they need to be for the organism. Even though I don't subscribe to that notion (as there have been experiments where proteins of organisms were replaced by similar proteins of wildly different organisms without a clear drawback), I am willing to grant it for the sake of argument. However, this defense does not hold for a variation in DNA that leads to the identical amino-acid. If common descent is false, we should not see a pattern where organisms that are more closely related according to evolutionary theory show more similarity than organisms that are further apart in said tree. There is no reason why there should be a variation in the first place, and if there is a variation, it should be assumed to have formed after the creation of the organisms, so we should not expect to find a pattern in this variation. If we do find a pattern, we can only come to two viable conclusions: either common descent is true or the creator of the organisms purposefully added a pattern where none was needed. The latter case is what I would qualify as deceptive.
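As an illustration of what "variation in DNA that leads to the identical amino-acid" means, the Python sketch below translates three made-up coding sequences and counts the codon positions where two of them use different codons for the same amino-acid. The sequences and function names are invented for the example. They are not taken from the data set.

```python
from itertools import product

# Standard genetic code, one-letter amino-acid codes, "*" for stop.
CODON_TABLE = dict(zip(
    ("".join(c) for c in product("UCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))


def translate(cds: str) -> str:
    """Translate a coding sequence (starting at the start codon) to amino-acids."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def synonymous_differences(cds_a: str, cds_b: str) -> int:
    """Count codon positions with the same amino-acid but a different codon."""
    count = 0
    for i in range(0, min(len(cds_a), len(cds_b)) - 2, 3):
        a, b = cds_a[i:i + 3], cds_b[i:i + 3]
        if a != b and CODON_TABLE[a] == CODON_TABLE[b]:
            count += 1
    return count


# Invented toy sequences: all three encode the peptide M-L-G-K.
seq_a = "AUGCUUGGAAAA"
seq_b = "AUGCUCGGGAAA"
seq_c = "AUGUUGGGCAAG"

print(translate(seq_a), translate(seq_b), translate(seq_c))  # MLGK MLGK MLGK
print(synonymous_differences(seq_a, seq_b))  # 2
print(synonymous_differences(seq_a, seq_c))  # 3
```

The proteins are identical in all three cases, so any argument from protein function applies to them equally. Only the choice among synonymous codons differs, and that choice is what the experiment compares between organisms.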

Methodology

Genetic information is freely available on the net. I followed the instructions by YouTube user "C0nc0rdance" in his video "The Joy of Phylogeny: How To Make Your Own Phylogram" to retrieve both protein and mRNA sequences for the same gene in different organisms in FASTA format. For my experiment I edited both files so that the same organism has an identical FASTA label in both files. I used easier names like "man", "chimp" and "roundworm" instead of the official Genus-Species name, but the editing was also necessary because the protein and mRNA sequences use different FASTA labels. Where multiple versions of the gene were present in the genome, I appended a digit to the label. For the sake of completeness, here is the list of labels that I used for the different organisms and the official Genus-Species name:

bread_mould      Neurospora crassa
chimp            Pan troglodytes
cotton_mould     Ashbya gossypii
cow              Bos taurus
dog              Canis lupus familiaris
fission_yeast    Schizosaccharomyces pombe
fowl             Gallus gallus
fruitfly         Drosophila melanogaster
man              Homo sapiens
mosquito         Anopheles gambiae
mouse            Mus musculus
rat              Rattus norvegicus
rhesus_monkey    Macaca mulatta
rice_fungus      Magnaporthe oryzae
rice             Oryza sativa
roundworm        Caenorhabditis elegans
thale_cress      Arabidopsis thaliana
yeast_kl         Kluyveromyces lactis
yeast_sa         Saccharomyces cerevisiae
zebrafish        Danio rerio
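I did the relabelling by hand, but it could be automated along the lines of the Python sketch below. It assumes that each FASTA header contains the Genus-Species name somewhere in its text, and that a second copy of a gene gets the digit 2 appended, a third copy the digit 3, and so on. The exact numbering scheme here is an assumption for illustration. Only three of the labels are included to keep it short, and the headers in the example are hypothetical, not real NCBI headers.

```python
# Subset of the label table above: Genus-Species name -> short label.
LABELS = {
    "Homo sapiens": "man",
    "Pan troglodytes": "chimp",
    "Caenorhabditis elegans": "roundworm",
}


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def relabel(records, labels=LABELS):
    """Replace each header by the short label of the species it mentions."""
    seen = {}
    relabelled = []
    for header, sequence in records:
        label = next((short for species, short in labels.items()
                      if species in header), None)
        if label is None:
            raise ValueError(f"no label known for header: {header}")
        seen[label] = seen.get(label, 0) + 1
        # Assumed scheme: first copy keeps the plain label, later copies get a digit.
        name = label if seen[label] == 1 else f"{label}{seen[label]}"
        relabelled.append((name, sequence))
    return relabelled


# Hypothetical headers and sequences, for illustration only.
example = """>hypothetical_id_1 some protein [Homo sapiens]
MGK
VKV
>hypothetical_id_2 some protein [Pan troglodytes]
MGKVKV
>hypothetical_id_3 another copy [Homo sapiens]
MGKIKV
"""
for name, sequence in relabel(parse_fasta(example)):
    print(f">{name}\n{sequence}")
```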

The first step is to find the common amino-acids in the protein. I used the program "ClustalW" to do a multiple alignment of the protein sequences. ClustalW adds a "summary" line indicating the similarity between amino-acids in the same position among the organisms. If all organisms share an identical amino-acid in the same position, it is denoted with an asterisk (*).
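The asterisk positions can also be recomputed from the aligned sequences themselves, which avoids having to line up the summary line with the sequence columns. The Python sketch below assumes the usual ClustalW ".aln" layout (a header line starting with "CLUSTAL", then blocks of "label sequence" lines, each block followed by an indented summary line). The function names are my own and the alignment in the example is a made-up toy.

```python
def parse_clustal(text):
    """Collect the gapped sequence of every label from ClustalW .aln text."""
    aligned = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and the indented summary lines.
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        aligned[parts[0]] = aligned.get(parts[0], "") + parts[1]
    return aligned


def conserved_columns(aligned):
    """Return the alignment columns where all sequences share one amino-acid."""
    sequences = list(aligned.values())
    return [
        column
        for column, residues in enumerate(zip(*sequences))
        if residues[0] != "-" and len(set(residues)) == 1
    ]


# Made-up toy alignment in ClustalW layout.
toy_alignment = "\n".join([
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "man        MGKV-KV",
    "chimp      MGKV-KV",
    "mouse      MAKVTKV",
    "           * ** **",
])
print(conserved_columns(parse_clustal(toy_alignment)))  # [0, 2, 3, 5, 6]
```

The columns returned are exactly those marked with an asterisk in the toy summary line.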

I wrote a Perl script that takes both the output of the protein alignment and the mRNA file as input and generates a new FASTA file containing only those RNA segments that code for amino-acids that are identical among all organisms. By building a phylogenetic tree based on these strings, only the variation between codons that code for the same amino-acid is measured. I used ClustalW to align the generated strings and calculate a phylogenetic tree. The resulting trees are screen dumps of the TreeView program.
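For readers who prefer Python, the core idea of that step can be sketched as follows. This is not a port of sameacid.pl, only an illustration of the same idea. It assumes that each RNA sequence is the coding sequence starting exactly at the start codon (no untranslated region in front of it), and the sequences in the example are invented.

```python
def conserved_codons(aligned_proteins, coding_seqs):
    """For each label, join the codons of the amino-acids shared by all organisms.

    aligned_proteins: label -> gapped protein sequence from the alignment
    coding_seqs:      label -> coding sequence, assumed to start at the start codon
    """
    sequences = list(aligned_proteins.values())
    # Columns where every organism has the same amino-acid (the "*" columns).
    conserved = [
        column
        for column, residues in enumerate(zip(*sequences))
        if residues[0] != "-" and len(set(residues)) == 1
    ]
    stripped = {}
    for label, protein in aligned_proteins.items():
        cds = coding_seqs[label]
        # Map each alignment column to the residue number in this organism,
        # skipping the gaps that the alignment inserted.
        column_to_residue = {}
        residue_index = 0
        for column, residue in enumerate(protein):
            if residue != "-":
                column_to_residue[column] = residue_index
                residue_index += 1
        stripped[label] = "".join(
            cds[3 * column_to_residue[column]: 3 * column_to_residue[column] + 3]
            for column in conserved
        )
    return stripped


# Invented toy data: only the first and last amino-acids are shared by all three.
aligned = {"man": "MG-K", "chimp": "MG-K", "mouse": "MATK"}
coding = {
    "man": "AUGGGAAAA",
    "chimp": "AUGGGCAAA",
    "mouse": "AUGGCUACAAAG",
}
print(conserved_codons(aligned, coding))
# {'man': 'AUGAAA', 'chimp': 'AUGAAA', 'mouse': 'AUGAAG'}
```

In the toy output the strings for "man" and "chimp" are identical, while "mouse" uses AAG instead of AAA for the shared lysine. Differences of this kind are the only ones that reach the tree-building step.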

The proteins I used are GAPDH, Actin Beta, Actin Gamma, Catalase, HPRT and UBE2K.

Resulting trees



[Tree images: GAPDH; Actin, Beta; Actin, Gamma; Catalase; HPRT; UBE2K]

Conclusion

Even though the data set is not very large, a clear pattern emerges: organisms that are more closely related are also more similar in the redundant nucleotides of their DNA. As far as I can tell, this pattern cannot be explained by a creation model in which species do not share a common ancestry.

Data availability

The data used for this experiment was retrieved from the HomoloGene database on the NCBI website (select HomoloGene from the pulldown menu and search for the protein name). A zip file containing, for all genes displayed here, the RNA and protein sequences, the ClustalW output, the script output and the ClustalW output for the script output, as well as the script itself, can be found via this link. Content of the zip file: the Perl script is named "sameacid.pl". For each gene we have the following files:
<genename>_prot.txt: protein sequence, retrieved from NCBI
<genename>_rna.txt: mRNA sequence, retrieved from NCBI
<genename>_prot.aln: protein alignment output from ClustalW
<genename>_prot.dnd: protein tree output from ClustalW (Newick format)
<genename>_strip.txt: mRNA subset, output from sameacid.pl
<genename>_strip.aln: mRNA subset alignment by ClustalW
<genename>_strip.dnd: mRNA subset tree by ClustalW (Newick format)

The reader is encouraged to try this with other genes, preferably long genes with a high conservation among organisms.