A Computational Analysis of Hybrid Genome Assembly Strategies
Document Type: Restricted
This paper is restricted to users on the Connecticut College campus until May 22, 2024.
The central dogma of molecular biology states that DNA is transcribed into RNA, which is then translated into proteins. Since DNA is the starting material for many of biology’s macromolecules, it has been referred to as “nature’s instruction book.” The sum of all DNA in a cell is referred to as the genome, and genome sequencing is how we read that DNA.
Due to the limitations of currently available technology, it is not possible to retrieve an entire genome as one contiguous sequence. Genome sequencing is therefore also a computer science problem: sequencing “reads” must be stitched together to obtain the complete genome sequence. There are two types of reads, short, accurate ones and long, inaccurate ones, and it is currently unclear how to combine them most efficiently. This is especially true of large genomes, where data acquisition is more expensive and the assembly step is harder. We were therefore motivated to simulate the process of genome sequencing on various organisms and then to reassemble their genomes from varying levels of short- and long-read coverage.
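To make the idea of simulated sequencing concrete, the following is a minimal sketch of drawing reads from a known genome at a chosen coverage. It is not the simulator used in this work. The function name `simulate_reads`, the read lengths, the error rates, and the coverage values are all illustrative assumptions, and errors are modeled as substitutions only, whereas real long reads also contain insertions and deletions.

```python
import random

def simulate_reads(genome, read_len, coverage, error_rate, rng):
    """Sample fixed-length reads uniformly from `genome` at a target coverage.

    Hypothetical helper for illustration: coverage = n_reads * read_len / genome length.
    Errors are substitutions only (a simplification).
    """
    n_reads = max(1, round(coverage * len(genome) / read_len))
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with a different one to mimic a sequencing error.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads

rng = random.Random(0)  # fixed seed so the sketch is reproducible
genome = "".join(rng.choice("ACGT") for _ in range(10_000))  # toy random genome

# Assumed parameters: short, accurate reads versus long, error-prone reads.
short_reads = simulate_reads(genome, read_len=100, coverage=20, error_rate=0.001, rng=rng)
long_reads = simulate_reads(genome, read_len=2_000, coverage=10, error_rate=0.10, rng=rng)

print(len(short_reads), "short reads;", len(long_reads), "long reads")
```

Varying the two `coverage` arguments and passing the resulting read sets to an assembler is one way an experiment of this kind could be organized.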
Our results, while incomplete due to the nature of genomic data, show that approximately 25X short, accurate read coverage and 14X long, inaccurate read coverage are sufficient to assemble most large (>100 Mbp) genomes. Critically, the amount of coverage required stays relatively constant, even as genome size increases by over an order of magnitude.
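As a rough illustration of what these coverage levels mean in practice, coverage can be written as (number of reads × read length) / genome size, so the number of reads needed is coverage × genome size / read length. The sketch below applies that arithmetic. The read lengths (150 bp short, 10,000 bp long) and the helper name `reads_needed` are assumptions chosen for illustration, not values from this work.

```python
def reads_needed(genome_size_bp, coverage, read_len_bp):
    """Smallest number of reads whose total length reaches the target coverage."""
    total_bases = coverage * genome_size_bp
    return (total_bases + read_len_bp - 1) // read_len_bp  # integer ceiling

SHORT_LEN = 150      # assumed short-read length in bp
LONG_LEN = 10_000    # assumed long-read length in bp

for genome_size in (100_000_000, 1_000_000_000):  # 100 Mbp and 1 Gbp
    n_short = reads_needed(genome_size, 25, SHORT_LEN)
    n_long = reads_needed(genome_size, 14, LONG_LEN)
    print(f"{genome_size:>13,} bp: {n_short:,} short reads, {n_long:,} long reads")
```

Note that a constant coverage requirement still means the total amount of sequence data grows linearly with genome size; what stays fixed is the depth per base.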
This surprising finding suggests that large genomes may be slightly easier to assemble than previously thought. As the cost of sequencing continues to fall, the bioinformatics community should continue to invest heavily in genomics, and we hope our results help make that work as efficient as possible.
The views expressed in this paper are solely those of the author.