A Look Back at the History of the Human Genome Project

Today’s blog post comes from guest writer Stanislav Volik, who has worked in genomics since the 1990s. His PhD thesis was one of the first genomics theses defended in Russia. Stanislav’s genomics focus is cancer studies, specifically breast and prostate cancer. In the early 2000s, before NGS made it feasible to sequence a tumor’s DNA, he and two colleagues invented and patented a paired-end sequencing approach for deciphering the structure of tumor genomes.
Read on to learn how this work began; it is foundational to how we approach tumor testing at Avitia.
A Look Back at the Human Genome Project
With the recent release of a complete human genome by the Telomere-to-Telomere (T2T) Consortium, I found myself reflecting on the history of our collective efforts to achieve a better understanding of our genetic heritage. We could say that this year marks the coming of age for the Human Genome Project (HGP). Twenty-one years ago, the first drafts of the human genome sequence were published by the public, National Institutes of Health-led International Human Genome Sequencing Consortium and by Celera Genomics, the commercial entity founded by Craig Venter. The “First Draft,” of course, was exactly that: only about 90% of the euchromatic (generally gene-rich) regions had been analyzed. This prompted a string of follow-up press conferences and articles describing ever more complete versions of the whole genome sequence, until, about three years later, on October 21, 2004, the International Human Genome Sequencing Consortium published what would prove to be the penultimate paper, titled “Finishing the euchromatic sequence of the human genome.” By any measure, this is one of the most towering scientific and technological achievements of the past century. One of the most interesting aspects of its completion is the way in which available technology shaped the strategy, and even the politics, of this monumental endeavor.

HGP: Where it All Began
The HGP timeline is still available on the Oak Ridge National Laboratory website archive. Even in its current, barely functioning form, it reveals a fascinating story of an idea that seemed impossible when, in 1984, a group of 19 scientists found themselves snowed in at a ski resort in Alta, Utah. They grappled with the problem of identifying DNA mutations in survivors of the Hiroshima and Nagasaki nuclear attacks and in their children. Existing methods could not detect the then-expected number of mutations, but the advent of molecular cloning, pulsed-field gel electrophoresis, and other wonders of technology gave everybody the feeling that a solution was within reach. Charles DeLisi, the newly appointed director of the Office of Health and Environmental Research at the Department of Energy (DOE), read a draft of the Alta report in October 1985, and it was while reading it that he first conceived of a dedicated human genome project. The next year, the DOE proposed the Human Genome Initiative after a feasibility workshop in Santa Fe, New Mexico. In 1987, the initiative was endorsed and the first budget estimate appeared. Finally, in 1990, the National Institutes of Health (NIH) and the DOE announced the first five-year plan, titled “Understanding Our Genetic Inheritance: The US Human Genome Project.” The project was announced with an approximate annual budget of $200M and a stated goal of completing the sequencing of the first human genome in 15 years, for a total of $3B in 1990 dollars, the equivalent of approximately $6B today.

The Maxam, Gilbert, and Sanger Race
In 1985, the concept of sequencing the whole human genome was truly revolutionary scientific thinking at its best, since no appropriate technology was ready for such a task. Only four years had passed since the 1980 Nobel Prize in Chemistry was shared between P. Berg, for his “fundamental studies of the biochemistry of nucleic acids, with particular regard to recombinant-DNA,” and W. Gilbert and F. Sanger (the second Nobel Prize for the latter), for “their contributions concerning the determination of base sequences in nucleic acids.” However, it was not yet clear which of Gilbert’s or Sanger’s approaches to sequencing would prove the more efficient. Maxam and Gilbert had developed a purely chemical method of sequencing nucleic acids that required many chemical steps but could be performed on double-stranded DNA. Sanger’s approach, on the other hand, required single-stranded DNA. In the early and mid-1980s, both methods were still widely used, and the advantages of Sanger’s approach, namely its reliability (given access to high-quality enzymes and nucleotides) and its longer reads, were just being established.

Both approaches had limited read length (approximately 200-250 bases for Maxam-Gilbert and 350-500 bases for Sanger) and required the genomic DNA to be fragmented prior to analysis. Given the realities of fully manual slab gel sequencing, this meant that determining the sequence of a single average human mRNA was an achievement worthy of publication in a fairly high-impact journal. With an average analysis time of ~6 hours per ready-to-sequence DNA fragment, an average read length of 350-500 bases, and 10-20 DNA fragments analyzed per slab gel, the throughput of a qualified post-doc at that time reached a whopping 1.7-2.0 kb per hour. With a haploid human genome size of ~3 billion bases, one was looking at a bare minimum of 171 years for a single station to sequence perfectly ordered, minimally overlapping fragments that could then be assembled into the final reference sequence.
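For the curious, that back-of-envelope arithmetic is easy to reproduce. The short sketch below uses the optimistic end of the ranges quoted above and assumes round-the-clock operation:

```python
# Back-of-envelope check of manual slab-gel sequencing throughput.
# Assumptions (optimistic end of the ranges in the text):
# 6 hours per gel, 20 fragments per gel, 500 bases per read.
hours_per_gel = 6
fragments_per_gel = 20
bases_per_read = 500

bases_per_hour = fragments_per_gel * bases_per_read / hours_per_gel
print(f"{bases_per_hour:.0f} bases/hour")  # ~1,667, i.e. the ~1.7-2.0 kb/h quoted

# Time for one station to cover a haploid genome, assuming the
# 2.0 kb/h ceiling and nonstop, 24/7 operation.
genome_size = 3e9  # bases
hours_needed = genome_size / 2000
print(f"{hours_needed / (24 * 365.25):.0f} years")  # ~171 years
```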
Mapping it Out
There was one caveat: this set of minimally overlapping genomic DNA fragments did not exist yet. It was not immediately clear whether anybody could create one, or how such fragments could be ordered into a full sequence. Given that the human genome contained numerous highly repetitive sequences longer than the average read length of existing technologies, it became apparent that an absolute prerequisite for creating a human genome reference sequence was a physical map of the genome: a catalog of the order and physical spacing of genomic features that could be identified in sequenceable fragments, which would make it possible to order the multitude of reads necessary to determine the human genome sequence. Consequently, much effort was spent by the broad scientific community over the course of the next 14 or so years, counting from the fateful Alta meeting, to develop ever more detailed physical maps of the human genome, along with ever more complete libraries of ever larger DNA fragments (clones) that were produced and mapped back to the genome using increasingly sophisticated molecular biology techniques. This work was very much supported by the scientific community, not only because it was deemed absolutely necessary for the success of the project, but also because it was “fair,” allowing even relatively small groups to meaningfully contribute to the success of this huge endeavor.
Sanger Wins and Gets Automated
In parallel with the massive efforts to create a comprehensive physical map of the human genome, much work was focused on streamlining and then automating DNA sequencing in order to drastically increase its throughput. Sanger sequencing won this battle, since it proved easier to automate: no complicated chemical reactions were required, and as an additional bonus it offered longer read lengths. The most important factor, however, was that the biological machinery for DNA synthesis used by this technology proved sufficiently robust and versatile to allow labeling the nucleotides, first with biotin and later with fluorescent dyes, which obviated the need for radioactive labeling. In 1984, Fritz Pohl reported the first method for non-radioactive colorimetric DNA sequencing. In 1986, Leroy Hood’s group published a method for automated fluorescence-based DNA sequence analysis. This technology allowed Applied Biosystems to offer the first automated DNA sequencers (ABI 370/373), machines that enabled massive sequencing projects, including efforts to catalog all expressed human genes using “Expressed Sequence Tags” (ESTs). In 1995, another breakthrough instrument, the ABI Prism 310, was released; it did away with the pesky problem of pouring flawless, large, thin (down to 0.4 mm thick) gels, which greatly simplified and sped up the sequencing process. Finally, in 1997, the ABI3700 capillary sequencer was released, boasting 96 capillaries. This configuration gave “the 3700 system the capacity to analyze simultaneously 96 samples as often as 16 times per day for a total of 16 × 96 = 1,536 samples a day,” as the ABI brochure touted. In other words, users could expect to receive a whopping 768 kb of sequence daily.
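Running the same back-of-envelope numbers for the ABI3700 shows just how large the jump was; the read length of ~500 bases below is my assumption, typical for capillary Sanger reads of that era:

```python
# ABI3700 throughput from the brochure figures quoted above,
# assuming ~500 bases per read (typical for the era).
reads_per_run, runs_per_day = 96, 16
bases_per_read = 500

daily_yield = reads_per_run * runs_per_day * bases_per_read
print(f"{daily_yield:,} bases/day")  # 768,000, the "768 kb daily" figure

# At that rate, 1x coverage of a haploid genome on a single machine
# drops from centuries to roughly a decade.
genome_size = 3e9
days = genome_size / daily_yield
print(f"~{days / 365.25:.0f} years for 1x coverage on one machine")  # ~11
```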
Venter Causes Outrage
This unprecedented increase in sequencing capacity suddenly made another approach feasible: de novo sequencing of complex genomes without constructing ordered genomic fragment libraries, and without the long and very expensive process of physical mapping. This approach came to be known as “shotgun” sequencing. Its theoretical feasibility was established in 1995 by Leroy Hood’s team. In a paper titled “Pairwise End Sequencing: A Unified Approach to Genomic Mapping and Sequencing,” they demonstrated that a large, complex genome could be sequenced using just a collection of randomly cloned fragments of at least two very different sizes; these fragments would be randomly subcloned and sequenced, and the contigs assembled from the subclones would then be ordered by identifying the paired-end sequences within them (the toy sketch below illustrates the idea). A mere two years later, in 1997, Craig Venter, the founder of The Institute for Genomic Research and later of Celera Genomics, announced that his team would single-handedly sequence the human genome in just three years for $300M, one-tenth of the originally estimated cost of the public International Human Genome Project.
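To make the core idea concrete, here is a deliberately simplified sketch, not the actual 1995 algorithm, with made-up contig names and data: if the two end-reads of a clone of known insert size land in different contigs, those contigs must be neighbors, roughly an insert length apart.

```python
# Toy illustration of paired-end scaffolding: clone end-pairs that
# bridge two contigs imply those contigs are adjacent. All data here
# is hypothetical.
from collections import Counter, defaultdict

# Which contig each end-read of each clone assembled into.
clone_end_pairs = [
    ("A", "B"), ("A", "B"),  # two independent clones bridge A and B
    ("B", "C"), ("B", "C"),  # two clones bridge B and C
    ("A", "C"),              # a single (possibly chimeric) A-C link
]

# Count independent links; requiring >= 2 supporting clones is a
# common heuristic against chimeric clones and misassemblies.
link_counts = Counter(tuple(sorted(p)) for p in clone_end_pairs)
graph = defaultdict(set)
for (x, y), n in link_counts.items():
    if n >= 2:
        graph[x].add(y)
        graph[y].add(x)

# Walk the link graph from a contig with a single neighbor (a
# scaffold end) to recover the contig order.
start = next(c for c in graph if len(graph[c]) == 1)
order, prev = [start], None
while True:
    nxt = [c for c in graph[order[-1]] if c != prev]
    if not nxt:
        break
    prev = order[-1]
    order.append(nxt[0])

print(" -> ".join(order))  # A -> B -> C
```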
Needless to say, Venter’s announcement caused an uproar in the genomics community. First, it appeared to render obsolete all the huge effort spent on constructing physical maps and ordered clone libraries. Second, it put the leaders and political supporters of the public HGP in a really bad light: after spending ten times Venter’s budget and working for seven years since the project’s official launch in 1990, the public effort’s proposed date for releasing the draft sequence was still seven years away (2005). And finally, the scientific community was outraged by Venter’s plans to offer paid access to the genomic sequence to commercial entities. I still remember the charged atmosphere at the Cold Spring Harbor meeting in 1997 when Venter made his announcement. Nobody knew the details; without the internet as we know it today, there were only rumors about closed-door talks between the NIH and the Wellcome Trust. It was very late that day, around 10 pm, when Craig Venter took the podium to present his idea. He was essentially booed off the stage by the outraged audience. Francis Collins, then director of the National Human Genome Research Institute at the NIH, and the then-head of the Wellcome Trust came to the podium and proclaimed that the public HGP would not be beaten, and that the Wellcome Trust would devote whatever resources were needed to ensure the “competitiveness” of the public HGP, guaranteeing everybody free and unfettered access to its results.

Be that as it may, Venter’s initiative did result in a substantial reevaluation of the HGP strategy. In the end, both teams (Venter’s and the public HGP’s) used a hybrid approach, combining shotgun data with physical mapping information, for the first human genome assemblies. This resulted in two groundbreaking, simultaneous publications in 2001. The animosity towards Craig Venter didn’t last long in the genomics community: a few years later, many of the people who had booed in 1997 were applauding his talk, to the same audience, on the first large-scale metagenomics project.
Final Comments
Looking back over the many years of my professional life, witnessing the completion of the first HGP was surely the experience of a lifetime. Essentially, the HGP set a new paradigm in biological studies and served as a prime catalyst for developing revolutionary new technologies. These new technologies became tectonic forces in their own right, obsoleting some massive efforts yet opening many new paths. The pattern continued with the field’s next challenge: characterizing the actual genetic diversity of humans and using this knowledge to meaningfully impact our lives. That could not be accomplished with the first-generation sequencing technologies that enabled the HGP’s success. The next phase of breakthroughs led to the emergence of next-generation sequencing (NGS) technologies, which finally made it routine not only to sequence individual genomes but also to study single-cell genomes and transcriptomes.
