At a time when the genomes of many species have been sequenced completely, a fundamental
resource expected by many researchers is a simple list of all of an organism's genes. A gene list,
together with associated physical reagents and electronic information, allows one to begin to
investigate the ways in which many genes interact in the complex system of the organism. However,
many species of medical and agricultural importance have not yet been prioritized for
genomic sequencing, and for these species expressed cDNAs have provided the primary source of gene
sequences. Furthermore, when the genomic sequence of an organism becomes available, a collection
of cDNA sequences provides the best tool for identifying genes within the DNA sequence.
Thus, we can anticipate that the sequencing of transcribed products will remain a significant area
of interest well into the future.
The era of high-throughput cDNA sequencing was initiated in 1991 by a landmark study from
Venter and his colleagues. The basic strategy involves selecting cDNA clones at random and
performing a single, automated sequencing read from one or both ends of their inserts. The authors
introduced the term expressed sequence tag (EST) to refer to this new class of sequence, which is characterized by being
short (typically about 400–600 bases) and relatively inaccurate (around 2% error). The use of
single-pass sequencing was an important aspect of making the approach cost effective. In most
cases, there is no initial attempt to identify or characterize the clones. Instead, they are identified
using only the small amount of sequence data obtained, which is compared with the sequences of known
genes and other ESTs. It is fully expected that many clones will be redundant with others already
sampled and that a smaller number will represent various sorts of contaminants or cloning artifacts.
There is little point in incurring the expense of high-quality sequencing until later in the
process, when clones can be validated and a non-redundant set selected.
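The arithmetic behind this trade-off can be sketched in a few lines of Python. The sketch below is illustrative only and is not taken from any EST project: the function names, the toy library of 10 abundant and 990 rare transcripts, and the assumption that each clone is picked independently in proportion to transcript abundance are all hypothetical. It estimates how many distinct transcripts a given number of random picks is expected to sample, and how many base errors a single-pass read is expected to contain at the roughly 2% error rate quoted above.

```python
def expected_distinct(abundances, n_picks):
    """Expected number of distinct transcripts seen after n_picks random clones.

    Assumes (hypothetically) that each pick is independent and that a
    transcript is chosen with probability proportional to its abundance.
    """
    total = sum(abundances)
    # A transcript with pick probability p is missed by all n picks with
    # probability (1 - p) ** n, so it is seen at least once with 1 - (1 - p) ** n.
    return sum(1.0 - (1.0 - a / total) ** n_picks for a in abundances)


def expected_errors(read_length, error_rate=0.02):
    """Expected number of erroneous bases in one single-pass read."""
    return read_length * error_rate


if __name__ == "__main__":
    # Hypothetical cDNA library: 10 highly expressed transcripts and
    # 990 rare ones (relative abundances, not real data).
    library = [100.0] * 10 + [1.0] * 990

    for n_picks in (100, 1000, 10000):
        distinct = expected_distinct(library, n_picks)
        redundant = n_picks - distinct
        print(f"{n_picks:6d} picks: ~{distinct:7.1f} distinct, "
              f"~{redundant:7.1f} redundant")

    # Expected errors for reads at the ends of the typical length range.
    for length in (400, 600):
        print(f"{length}-base read: ~{expected_errors(length):.0f} expected errors")
```

Under these assumptions, abundant transcripts are resampled many times before rare ones appear at all, which is why redundancy is expected and why high-quality sequencing is deferred until a non-redundant set has been chosen.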
Despite their fragmentary and inaccurate nature, ESTs were found to be an invaluable
resource for the discovery of new genes, particularly those involved in human disease processes. After the initial demonstration of the utility and cost effectiveness of the EST approach,
many similar projects were initiated, resulting in an ever-increasing number of human ESTs.
In addition, large-scale EST projects were launched for several other organisms of experimental
interest. In 1992, a database called dbEST was established to serve as a collection point for
ESTs, which are then distributed to the scientific community as the EST division of GenBank.
The EST division continues to dominate GenBank, accounting for roughly two-thirds of all submissions.
