Why Biologists Can’t Count: An Overview of the Gene-Finding Problem

Five years after draft assemblies of the human genome became available, the number of genes it contains remains uncertain. Parallel approaches yield divergent estimates of the total gene number, and it is not yet clear how to reconcile this variability. To illustrate the challenges involved in identifying and counting genes, I review three major approaches: sequence analysis, transcriptomic analysis, and genome-wide localization of the basal transcriptional machinery.

In the summer of 2000, the race to sequence the human genome was reaching its peak. The evening news broadcasts were tracking the race between the private and public consortiums. Business publications ran articles predicting the start of a “golden age” for biotech. Politicians focused on the long-term medical breakthroughs that sequencing would yield, perhaps as a tangential way of justifying the cost to their constituents.

That summer, I was crossing into the United States. As usual, the customs agent asked where I was going. When I explained that I was a biologist headed to a university in Michigan, he asked: “So, you’re off to understand that whole genome thing, eh?”

That was a remarkably perceptive question. Media coverage was focusing on the medical and financial benefits. Nevertheless, the immediate challenge facing researchers was to understand the genome. There was a host of simple questions that needed to be answered. How many genes are there? Where are they? How many of them are conserved in mice? In rats? In chimpanzees? And, most importantly, what exactly makes a gene a gene in the first place?

Five years later, we still cannot answer these basic questions. This review aims to demonstrate why the answers to these questions have remained elusive by showing how hard it is to get consistent estimates of gene-numbers. Three of the major approaches for identifying and counting genes will be described: in silico gene searching, transcriptomic searches, and genome-wide localization studies.

Approach #1: Sequence-Searching
In the early days of genome-sequencing, many people believed that the combination of high-powered computers and DNA sequence would be sufficient to identify all genes. It was thought that the characteristic traits of a gene – TATA boxes and INR elements, codon-biases and splice-site sequences – would be sufficient to define all genes in silico. This approach might be called “structure-based gene-finding” (1).

Most groups developing these structure-based gene-finders chose not to precisely define the exact structure of a gene. Instead, they first compiled a list of structural features, like splice acceptor/donor sites. These lists of features were then fed into a very general statistical model termed a “Hidden Markov Model” (HMM). These HMMs were given a large set of annotated genes, which they could use in a process called “training” (2). During “training”, the HMMs would “learn” from real data how important each structural feature truly was in predicting genes correctly. By attaching statistical parameters to each structural feature, the HMM could quantify this importance, allowing for estimations of the probability that a given region contains a gene. This “trained” HMM could then be used to predict new genes de novo from genomic sequence (3, 4).
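As a rough illustration of how a trained HMM assigns a label to each base, the following Python sketch decodes a sequence with a two-state model (non-coding versus coding) using the Viterbi algorithm. The state names, every probability, and the function name `viterbi` are assumptions made up for this example; they are not the parameters of any real gene-finder, which would use many more states (exons, introns, splice sites) and would estimate its parameters from annotated genes during training.

```python
import math

# Hypothetical two-state model: "N" = non-coding, "C" = coding.
# All probabilities below are invented for illustration, not trained values.
STATES = ("N", "C")
START = {"N": 0.9, "C": 0.1}
TRANS = {"N": {"N": 0.9, "C": 0.1},
         "C": {"N": 0.1, "C": 0.9}}
# Assumed emissions: the coding state is GC-rich, the non-coding state AT-rich.
EMIT = {"N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}


def viterbi(seq):
    """Return the most probable state path for seq as a string of state labels."""
    # Work in log-space so long sequences do not underflow.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev]
                             + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Under these assumed parameters, a GC-rich stretch flanked by AT-rich sequence, such as `viterbi("AT" * 5 + "GC" * 5 + "AT" * 5)`, is labelled as a coding block surrounded by non-coding sequence. Changing the transition or emission values moves the predicted boundaries, which hints at why differently trained programs can disagree.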

This basic strategy is followed by many gene-finding programs. Despite this underlying similarity, however, different programs frequently yield wildly different results, often because of differences in how they formulate or train their HMMs. For example, the UCSC (University of California, Santa Cruz) Genome Browser includes data (termed “tracks”) for several gene-finding programs. On the most recent build (hg17; May 2004), one widely used program called GENSCAN predicts 42,807 genes while another HMM-based program called Augustus predicts 36,314 (5) – a difference of almost 6,500 genes!

When additional eukaryotic genomes were sequenced, it became possible to improve structure-based gene-finding programs. Evolutionary-relatedness is a powerful tool in bioinformatics because it allows software to focus on regions of sequence that are most likely to contain true hits and least likely to harbor false-positives. For example, when searching for transcription-factor binding-sites it has been shown that the 20% of DNA best conserved between human and mouse contains over 90% of the (known) functional binding sites (6).
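The idea of restricting a search to the best-conserved fraction of the sequence can be sketched in a few lines of Python. This is only a minimal illustration: the function name `most_conserved`, the window names, and the conservation scores are all hypothetical, and real pipelines derive such scores from genome alignments.

```python
def most_conserved(windows, keep_fraction=0.2):
    """Return the windows in the top keep_fraction by conservation score.

    windows is a list of (name, score) pairs; higher scores are assumed
    to mean stronger human-mouse conservation.
    """
    ranked = sorted(windows, key=lambda w: w[1], reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical windows with made-up conservation scores.
example_windows = [("w%d" % i, score) for i, score in
                   enumerate([0.91, 0.12, 0.40, 0.88, 0.05,
                              0.33, 0.27, 0.61, 0.19, 0.48])]
candidates = most_conserved(example_windows)  # keeps the 2 best of 10 windows
```

Any downstream search (for binding sites or for genes) would then run only on `candidates`, trading some missed true hits for far fewer false positives.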

In 2001, Michael Brent’s group at Washington University published a new gene-finding program called TWINSCAN (7). Derived from the well-established GENSCAN algorithm, TWINSCAN incorporates inter-genome homology into its HMM. Using this additional information, TWINSCAN improved the prediction accuracy by nearly 60%. Yet despite this improvement, only about 25% of known genes were predicted perfectly — structural errors were made in the predictions for the remaining genes. This low success rate is largely thought to result from the huge variability inherent in gene-structures. In particular, single-exon genes have proven notoriously difficult to predict (1, 8).

On the current build of the human genome, TWINSCAN predicts 25,633 genes – over 10,000 fewer than the structure-only predictions of GENSCAN or Augustus. While it seems very likely that TWINSCAN predictions are more accurate (reduced false positives) than those of other programs, it is not clear just how many genes are being missed (false negatives) by the use of stringent evolutionary criteria.

So while sequence-based gene-prediction seems ideal, in practice, different gene-finding programs predict very different numbers of genes, and are highly error-prone in defining the exact genomic structure of a transcribed region.

Approach #2: Identifying Transcribed Regions
Perhaps the previous section suggests a natural alternative to error-prone gene-finding programs: experimental delineation of transcribed regions. Ultimately, a gene is a transcribed region of the genome, so experiments that identify all transcribed regions should accurately identify all genes in a genome.

There are three major approaches to identifying genes by profiling transcribed regions: array approaches, SAGE (serial analysis of gene expression) experiments, and mining of EST (expressed sequence tag) libraries. For brevity this review will focus on the array approaches, but this is not meant to indicate that they are either more reliable (9) or more common than EST (10-12) or SAGE (13) experiments.

Theoretically there are two ways to assess the transcriptome of a cell with microarrays. First, one could construct an array with all possible sequences of n base-pairs. This type of array, called a universal n-mer array, could theoretically be used to assess the transcriptome of any organism, independent of the sequencing of its genome (14). Despite their efficiency, such arrays are not yet commercially available. Further, I am not aware of any publications exploiting them for studying higher eukaryotes.

Instead, the transcriptome has largely been studied with so-called “tiling arrays”. These arrays – or more correctly series of arrays, since it can take over 100 arrays to cover an entire genome (15) – tile across the human genome with sequences that either overlap slightly or simply neighbour one another closely. By covering the entire genome, any transcribed region can be detected as hybridizing to the array. It should be noted that repetitive regions of the genome (as much as 50% of total genomic sequence depending on the definition of “repetitive”) are excluded from tiling arrays.
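The tiling idea can be sketched as follows. This minimal Python example assumes that repetitive sequence is marked in lowercase (soft-masking, a common convention in genome sequence files); the function name `tile_probes` and the default probe length and step are hypothetical choices for illustration, not the design of any particular array platform.

```python
def tile_probes(sequence, probe_len=25, step=25):
    """Yield (start, probe) pairs that tile across sequence.

    step == probe_len gives neighbouring, end-to-end probes;
    step < probe_len gives overlapping probes.
    Probes containing any lowercase base (assumed to mark repeats) are skipped.
    """
    for start in range(0, len(sequence) - probe_len + 1, step):
        probe = sequence[start:start + probe_len]
        if probe.isupper():
            yield start, probe


# Toy sequence: 20 unique bases, 20 repeat-masked bases, 20 unique bases.
toy = "ACGT" * 5 + "acgt" * 5 + "ACGT" * 5
end_to_end = [start for start, _ in tile_probes(toy, probe_len=10, step=10)]
overlapping = [start for start, _ in tile_probes(toy, probe_len=10, step=5)]
```

Here `end_to_end` is `[0, 10, 40, 50]` and `overlapping` is `[0, 5, 10, 40, 45, 50]`: no probe touches the masked middle region, so transcription from repetitive sequence would go undetected by such a design.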

A number of different tiling experiments have been performed over the past several years, employing several different array platforms (reviewed in (16)). Some of these studies have been limited to chromosomes 20, 21, and 22 so their gene estimates have been scaled up to the entire genome (17).

The different tiling studies are as discordant in their predictions of gene-number as the different gene-prediction programs were. For example, a group at Rosetta Inpharmatics combined computational and array approaches to make an estimate of 25,000 to 30,000 total genes (17) based on the study of transcription on two chromosomes under a range of conditions. Yet a separate approach employed by the Snyder lab at Yale focused on genome-wide hepatic transcription and identified 13,889 transcribed exons (15). Surprisingly, only about 5,000 corresponded to known exons and the remainder were novel. While much of this may be a result of alternative splicing, the magnitude of novel transcribed regions suggests that there may be well over 10,000 hepatically-expressed genes, leading to an overall gene number of between 30,000 and 40,000.

It would be very helpful to compare transcribed regions across species, but tiling studies are quite expensive and only a handful of extensive murine transcriptome studies are available (18, 19), none of which employ tiling arrays.

Approach #3: Identifying Binding of the Transcriptional Apparatus
A third approach to identifying genes is to consider the cellular information that lies between raw genomic sequence and actual transcription – the regulatory structures that lead to transcription. Two recent papers by a group at UCSD have employed genome-wide localization studies (ChIP-on-chip) to identify binding of the pre-initiation complex (PIC) that is thought to be assembled on every active promoter in the cell (20, 21). This approach identifies the specific genomic regions to which the transcriptional apparatus binds. However, it only identifies binding and cannot determine whether or not transcription actually occurs, how long a gene product might be, or what the underlying gene structure is.

In their first paper, the group focused on the regions specific to the ENCODE project (22), which covers about 1% of the genome. In a single cell-line (IMR90) they detected about 252 PIC binding-sites. Given the presence of multiple transcription-start sites for many genes, this implies approximately 15,000 distinct transcribed regions in the entire genome in this single cell-line. It is challenging to extrapolate an estimate of gene-number from a single cell-line to the entire organism because many PICs may exist pre-charged on the chromatin, but remain transcriptionally inactive until required (20). One guess might be that a third of all genes are active in this cell-line, leading to 45,000 distinct PIC binding sites. Given the frequency of multiple start-sites this might represent about 30,000 genes.
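The chain of extrapolations in this paragraph can be written out explicitly. The Python sketch below uses only the figures quoted above (252 sites, 1% of the genome, 15,000 regions, one third of genes active); the ratio of 1.5 binding sites per gene is an assumption inferred from the step from 45,000 sites to 30,000 genes, and the names `scale_up` and `genes_from_regions` are hypothetical.

```python
from fractions import Fraction


def scale_up(count, fraction_observed):
    """Extrapolate a count seen in part of a whole to the entire whole."""
    return count / fraction_observed


def genes_from_regions(regions, active_fraction, sites_per_gene):
    """Estimate total genes from regions active in one cell line."""
    return scale_up(regions, active_fraction) / sites_per_gene


# Figures quoted in the text.
encode_sites = 252                    # PIC sites detected in the ENCODE regions
encode_fraction = Fraction(1, 100)    # ENCODE covers about 1% of the genome
regions_in_cell_line = 15_000         # distinct transcribed regions implied

# Assumptions: the "one guess" from the text, and a ratio inferred from it.
active_fraction = Fraction(1, 3)      # share of genes active in this cell line
sites_per_gene = Fraction(3, 2)       # inferred from 45,000 sites -> 30,000 genes

raw_genome_sites = scale_up(encode_sites, encode_fraction)          # 25,200
implied_sites_per_region = raw_genome_sites / regions_in_cell_line  # 1.68
total_sites = scale_up(regions_in_cell_line, active_fraction)       # 45,000
estimated_genes = genes_from_regions(regions_in_cell_line,
                                     active_fraction, sites_per_gene)  # 30,000
```

The result is very sensitive to the guessed active fraction: calling `genes_from_regions(15_000, Fraction(1, 2), Fraction(3, 2))` gives 20,000 genes rather than 30,000, which illustrates why such extrapolations must be treated cautiously.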

To help clarify the genome-wide applicability of their results, the group then extended their analysis using genome-wide arrays that covered nearly the entire genome (21), again using IMR90 cells. After collapsing multiple start-sites, this study suggested about 10,000 distinct promoters for about 8,000 unique genes. While the statistical analysis between the two studies was somewhat different, the striking variability between the two studies suggests that extrapolating predictions from the ENCODE region to the entire genome must be done cautiously.

Most intriguingly, the authors suggest that their results imply that about 10% of all genes have yet to be identified – leading to a prediction of about 20,000 to 25,000 total genes. They suggest that some of the discrepancy between their estimate of only 20,000 binding sites for the PIC and the estimates of 30,000 distinct transcribed regions may be the result of weak or highly transient binding. But if transient binding of the PIC is important, then what is the nature of these sites, and can they be detected by sequence-search algorithms? Further, are these transient binding events regulated, or are they simply random occurrences? And to what extent are these binding sites conserved across species? Answers to these types of questions will require several more genome-wide localization screens across a range of human and mouse tissues.


A brief outline of some of the estimates of gene-number is found in Table 1. The huge degree of variability remains unaccounted for, and raises a number of critical questions. Ultimately these questions revolve around our definition of the term “gene”. To accurately count genes, we need to determine what precisely they are. Is it sufficient that a region has the potential to be transcribed, or must it do so in a regulated fashion? Must the transcript be functional?

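| Approach | Specific Method | Estimate | Comments |
| --- | --- | --- | --- |
| Manual Annotation | RefSeq Database | 19,784 | Distinct TSSs |
| Sequence Analysis | GENSCAN | 42,807 | Considers structures only |
| Sequence Analysis | Augustus | 36,314 | Considers structures only |
| Sequence Analysis | TWINSCAN | 25,633 | Incorporates homology |
| Transcriptome Analysis | EST Sequencing | 27,000 | Ref #12 |
| Transcriptome Analysis | Long SAGE | 23,000 | Ref #13 |
| Transcriptome Analysis | Tiling Array | 30,000-40,000 | Ref #16 |
| PIC Localization | ENCODE Region | 30,000 | Ref #20 |
| PIC Localization | Whole Genome | 20,000-25,000 | Ref #21 |
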
Table 1: Summary of Gene Number Estimates. This table summarizes nine gene-number estimates, scattered across the basic classes of approach discussed in the text. The estimates range from just under 20,000 to just over 40,000 genes and are widely distributed within this range.
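To make the spread in Table 1 concrete, the following Python sketch records each estimate as a (low, high) pair and computes the overall range. The figures are copied from the table; the dictionary name `ESTIMATES` and the choice to compare the lowest and highest values are simply one way to summarize them.

```python
# Gene-number estimates from Table 1 as (low, high) pairs.
# Single-valued estimates use the same number for both ends.
ESTIMATES = {
    "RefSeq Database": (19_784, 19_784),
    "GENSCAN": (42_807, 42_807),
    "Augustus": (36_314, 36_314),
    "TWINSCAN": (25_633, 25_633),
    "EST Sequencing": (27_000, 27_000),
    "Long SAGE": (23_000, 23_000),
    "Tiling Array": (30_000, 40_000),
    "PIC Localization, ENCODE Region": (30_000, 30_000),
    "PIC Localization, Whole Genome": (20_000, 25_000),
}

lowest = min(low for low, _ in ESTIMATES.values())
highest = max(high for _, high in ESTIMATES.values())
fold_range = highest / lowest  # how many times larger the top estimate is

print(f"{len(ESTIMATES)} estimates, from {lowest:,} to {highest:,} "
      f"({fold_range:.2f}-fold)")
```

The highest estimate is a little more than twice the lowest, a gap of roughly 23,000 genes between approaches that are all nominally counting the same thing.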

The complexity of these questions makes it likely that several more years will pass before we can say just how many genes there are, and many more after that before we can really begin to “understand that whole genome thing.”


1. V. Makarov, Brief Bioinform 3, 195-9 (Jun, 2002).

2. R. O. Duda, P. E. Hart, D. G. Stork, Pattern classification (Wiley, New York, ed. 2nd, 2001).

3. R. N. Mantegna et al., Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 52, 2939-50 (Sep, 1995).

4. S. R. Eddy, Curr Opin Struct Biol 6, 361-5 (Jun, 1996).

5. D. Karolchik et al., Nucleic Acids Res 31, 51-4 (Jan 1, 2003).

6. W. W. Wasserman, M. Palumbo, W. Thompson, J. W. Fickett, C. E. Lawrence, Nat Genet 26, 225-8 (Oct, 2000).

7. I. Korf, P. Flicek, D. Duan, M. R. Brent, Bioinformatics 17 Suppl 1, S140-8 (2001).

8. M. Scherf et al., Genome Res 11, 333-40 (Mar, 2001).

9. S. J. Evans et al., Eur J Neurosci 16, 409-13 (Aug, 2002).

10. T. P. Larsson, C. G. Murray, T. Hill, R. Fredriksson, H. B. Schioth, FEBS Lett 579, 690-8 (Jan 31, 2005).

11. V. Brendel, L. Xing, W. Zhu, Bioinformatics 20, 1157-69 (May 1, 2004).

12. T. Ota et al., Nat Genet 36, 40-5 (Jan, 2004).

13. S. Saha et al., Nat Biotechnol 20, 508-12 (May, 2002).

14. R. M. van Dam, S. R. Quake, Genome Res 12, 145-52 (Jan, 2002).

15. P. Bertone et al., Science 306, 2242-6 (Dec 24, 2004).

16. J. M. Johnson, S. Edwards, D. Shoemaker, E. E. Schadt, Trends Genet 21, 93-102 (Feb, 2005).

17. E. E. Schadt et al., Genome Biol 5, R73 (2004).

18. W. Zhang et al., J Biol 3, 21 (2004).

19. A. I. Su et al., Proc Natl Acad Sci U S A 101, 6062-7 (Apr 20, 2004).

20. T. H. Kim et al., Genome Res 15, 830-9 (Jun, 2005).

21. T. H. Kim et al., Nature (Jun 29, 2005).

22. Science 306, 636-40 (Oct 22, 2004).
