An Introduction to Effective BLASTing

Abstract
Sequence-alignments in general, and BLAST searches in particular, have become a ubiquitous part of molecular biology. Despite BLAST's popularity, its vast array of tools and parameter choices can overwhelm the user. Yet accepting the default parameters can greatly reduce search sensitivity and accuracy. This review focuses on the major parameters for BLASTN and BLASTP searches, and discusses both their default values and how they can be tuned to improve query results.


Introduction
Computational biology can be defined as the use of quantitative, mathematical models to study biological questions (1). This covers a broad spectrum of research questions, ranging from phylogenetic studies (2) to the discovery of new genes (3) and splice-variants (4), and from the prediction of transcription-factor binding-sites (5) to the integration of large transcriptomic and genomic datasets (6).

With such a diverse set of questions, it might be surprising that there is a common “battery” of computational and mathematical techniques used in their solutions. In a way, this is akin to standard molecular techniques like PCR or Western Blotting, which are broadly applied to answer many distinct research questions from a molecular perspective.

This “toolbox” of computational techniques includes pattern-recognition techniques like clustering (7), sequence-modeling procedures like Hidden Markov Models (8), and a wide range of statistical and mathematical procedures (9-11).

Perhaps the most ubiquitous computational technique, however, is sequence-alignment. The most common sequence-alignment program, NCBI BLAST, is used tens of thousands of times each day (12).

This review aims to introduce the reader to the many parameters available for tuning and effectively using NCBI BLAST. Following a brief overview of BLAST, the two canonical forms of BLAST are introduced. The parameters for each form are detailed, and recommendations on parameter selection are given.

The Problem of Sequence Alignments and the BLAST Solution
Sequence-alignments are a core element of the computational biology tool-box and are extensively used to study the primary structure of proteins and nucleic acids. Fundamentally, a sequence-alignment is a way of comparing sequences to one another. Thus, sequence-alignments can find use whenever sequences are being studied. Typical uses of sequence-alignments include the identification of cDNA clones in a library (13), the discovery of splice-variants in large sequence-databases (14), functional characterization of uncharacterized genes (15), and evolutionary studies of specific proteins or genes (16).

Regardless of the application, sequence-alignments are a way of determining “how similar” sequences are to one another. Regions that are similar can be overlapped, or “aligned”. The graphical display of this alignment gives the technique its name. Sequences that are very similar will show “strong” alignments, meaning that few mismatches or gaps exist. Algorithms exist to align pairs of sequences (pair-wise sequence alignment) as well as to align larger numbers of sequences (multiple sequence alignment). This review focuses exclusively on pair-wise sequence alignments.

Because of its similarity to classical computer science problems, pair-wise sequence-alignment has been extensively studied. An optimal algorithm exists to align two sequences to one another based on a computational technique called dynamic programming. Unfortunately these optimal alignments – often called Smith/Waterman alignments – are extremely slow. Even on very fast computers, comprehensive database searches using optimal Smith/Waterman alignments can run prohibitively slowly (17).
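The dynamic-programming idea can be illustrated with a minimal sketch of Smith/Waterman scoring. The match, mismatch and gap values are illustrative assumptions (with a simple linear gap penalty), not the settings of any particular program. The nested loops visit every pair of positions, which is why the cost grows with the product of the two sequence lengths and becomes prohibitive across a whole database.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions; gaps use a linear penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally (match or mismatch) ...
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # ... or open/extend a gap in one of the two sequences.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Only the best score is returned here; recovering the alignment itself would require keeping the full table and tracing back from the best-scoring cell.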

This is where BLAST comes in. The basic local alignment search tool is based on a statistical approximation used to speed up Smith/Waterman alignments. By assuming that the best local alignment will contain a small, exact match (Figure 1), the execution time of large database searches can be dramatically reduced (17).

Figure 1: Overview of the BLAST algorithm. The most common use of BLAST is to find the best matches to a probe sequence in a large database (step 1). For example, if a novel cDNA clone is isolated from a library, it can be identified by using BLASTN against a transcriptomic database like dbEST. The first step in a BLAST search is to identify those sequences that have short, exact matches with the probe sequence. In this case, the longest exact match is underlined for each sequence in the database. Only those sequences with exact matches longer than 3 base pairs are carried on for full alignment (step 3). In this case two weak alignments (score = 2) would not be detected by the BLAST algorithm.
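The seeding step described in Figure 1 can be sketched as follows. This is a simplified illustration of the idea, not the NCBI implementation: the function names and the toy database are hypothetical, and only exact word matches are considered.

```python
from collections import defaultdict


def index_words(seq, word_size):
    """Map every word of length word_size in seq to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - word_size + 1):
        index[seq[i:i + word_size]].append(i)
    return index


def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where an exact word match starts."""
    index = index_words(query, word_size)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def candidates(query, database, word_size):
    """Keep only database entries that share at least one exact word with the query."""
    return [name for name, seq in database.items()
            if find_seeds(query, seq, word_size)]


# Hypothetical toy database used only for illustration.
toy_db = {"s1": "CCGATTCC", "s2": "TTTTTTTT", "s3": "AATACAGG"}
print(candidates("GATTACA", toy_db, word_size=4))
print(candidates("GATTACA", toy_db, word_size=5))
```

In this toy example two sequences pass the filter at a word size of 4, but none pass at a word size of 5, which mirrors the trade-off between speed and sensitivity discussed under the word-size parameter.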

Further, based on a statistical advance by Karlin et al. (18) the original BLAST software was able to provide estimates of statistical-significance. In other words, it was able to say “how likely was this alignment to happen by chance”. This is given by the “E-value” for a BLAST alignment. The E-value estimates how many times a similarity this strong would occur by chance alone in a search of this database.
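The E-value can be made concrete with a small sketch of the Karlin-Altschul estimate, E = K·m·n·exp(-λS), where S is the raw alignment score, m the query length and n the database length. The constants λ and K depend on the scoring system; the values used below are illustrative placeholders, not values taken from BLAST.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# lam and k are placeholder assumptions chosen only for illustration.
for score in (30, 40, 60):
    e = expect_value(score, query_len=300, db_len=1_000_000, lam=0.3, k=0.1)
    print(score, f"{e:.3g}")
```

Two properties follow directly from the formula: a higher score gives an exponentially smaller E-value, and doubling the database size doubles the E-value of the same alignment. The latter is one reason why restricting a search to a smaller, more relevant database changes which hits pass a given expect threshold.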

Different Flavours of BLAST
The original BLAST program (17) was limited to comparing protein or DNA sequences against larger databases. Over time, however, many specialized versions of BLAST have been developed, including PSI-BLAST (12) and Blast-2-Sequences (19). At the time of writing, the main BLAST webpage at NCBI provided no fewer than 25 different “flavours” of the BLAST algorithm.

Despite this variety, two major versions of BLAST remain the most widely used. First, BLASTN is used to compare a nucleotide sequence against a database of nucleotide sequences. Second, BLASTP is used to compare a protein sequence against a database of protein sequences. The following two sections review the common uses and the major parameters of each of those programs.

BLASTN: Comparing Nucleotide Sequences

Program Overview
Nucleotide alignments are occasionally used in phylogenetic studies, but the three main uses of pair-wise nucleotide alignments are sequence identification, primer design, and genomic mapping.

Sequence identification is often the end result of functional screens or expression array experiments that identify one or more sequences associated with a given phenotype. A BLASTN search is then used to identify the gene or transcript corresponding to the experimentally identified sequence (13).

In designing primers for PCR studies it is critical that the primers have minimal cross-reactivity and only anneal to a single template. BLASTN searches are used with the candidate primer sequences to identify potential cross-hybridization problems (20, 21).

Genomic mapping is necessary both for characterizing the results of some types of high-throughput polymorphism studies (22) and for interpreting the results of ChIP-Chip experiments (23).

For all these applications, standard BLASTN is the best tool to use. The other major nucleotide BLAST tool offered by NCBI is megablast, which is mainly used in assembly and searching of genomic trace sequences.

Major Parameters
Database: There are three major nucleotide databases that can be searched: dbEST, RefSeq, and Genomic databases. Expressed Sequence Tags (ESTs) are generated by single-pass sequencing of mRNAs. While this single-pass sequencing may introduce errors, EST databases are extremely large and thus provide a way of looking at the transcriptome of many species whose genome has not yet been sequenced: at the time of writing, there were 416 species with at least 1000 EST sequences available. The RefSeq databases are also derived from mRNA sequences, but are manually curated by NCBI workers to ensure that they accurately represent a single gene. Genomic databases present fragments of genomes from whole chromosomes, as well as trace-files and partial contigs from the genome-assembly process.
Default Value: non-redundant database (nr)
Rationale: nr contains portions of all three major nucleotide databases, making it useful for almost any BLASTN search
Tuning Suggestion: Pick the database most suitable for the search in question. For transcriptomic searches in well-characterized genes use a RefSeq database. For transcriptomic searches in poorly-characterized genes or species use an EST database. These specialized searches will both find some matches not present in the nr database as well as avoid spurious matches found by mixing genomic and mRNA sequences.

Organism: The option exists to limit any BLAST search to a specific species or taxonomic group.
Default Value: Search all organisms
Rationale: Searching all organisms maximizes the number of hits found.
Tuning Suggestion: It is almost always appropriate to specify a single organism or group of organisms. This reduces the number of low-significance or uninformative hits returned. Further, it can dramatically reduce BLAST execution time. Organisms and taxonomic groups are identified by their Latin names, such as Rodentia (rodents), Mammalia (mammals), Homo sapiens (humans), Mus musculus (mouse) and Rattus norvegicus (rat).

Expect: The expect parameter acts like a p-value threshold: it sets the least significant hit that BLAST will return. The value is the number of times a hit of that quality would be expected to occur by chance when searching the database. Longer matches are inherently less likely to occur by chance, and thus have lower expect values.
Default Value: 10
Rationale: An expectation of 10 is capable of detecting most long matches as well as some short, inexact matches.
Tuning Suggestion: Many BLAST searches return hundreds of hits. It can be helpful to reduce the expect to 0.001 to remove lower-quality hits. This can somewhat speed up BLAST execution. When searching with short sequences such as PCR primers it is occasionally necessary to increase the expectation to find inexact hits. For example, identifying an exact 12 bp alignment can require an expect as high as 25.

Word Size: The first step of the BLAST algorithm is to find short exact matches between the search sequence and each sequence in the database. Complete alignments are then performed only on sequences from the database that contain a short exact match (Figure 1). This greatly speeds up BLAST execution, but will occasionally miss some hits, especially with shorter query sequences. The length of the exact match required is called the “word size”. A word-size of 1 is essentially identical to a comprehensive (but slow) Smith-Waterman alignment.
Default Value: 11
Rationale: A word-size of 11 allows searches to execute reasonably rapidly, but will clearly miss some relevant hits (17, 24).
Tuning Suggestion: The word-size should always be reduced to the lowest possible value (7 for nucleotide BLASTs) to maximize sensitivity. This selection is particularly critical for short probe sequences.
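For readers who run BLAST locally rather than through the NCBI web form, the tuning suggestions above can be sketched as a command line. The sketch assumes that the NCBI BLAST+ `blastn` program is installed and that a local database is available; the file name and database name are hypothetical. The function only builds the argument list and does not run the search.

```python
def blastn_command(query_fasta, database, evalue=0.001, word_size=7):
    """Build (but do not run) a blastn argument list with tuned parameters."""
    return [
        "blastn",
        "-task", "blastn",          # standard BLASTN rather than megablast
        "-query", query_fasta,      # FASTA file holding the probe sequence
        "-db", database,            # database chosen to suit the search
        "-evalue", str(evalue),     # expect threshold
        "-word_size", str(word_size),
        "-outfmt", "6",             # tabular output, one hit per line
    ]


# Hypothetical file and database names, for illustration only.
cmd = blastn_command("probe.fa", "refseq_rna")
print(" ".join(cmd))
# To run the search: import subprocess; subprocess.run(cmd, check=True)
```

For a short probe such as a PCR primer, the same function could be called with a larger expect value, in line with the suggestion above that short exact matches may need an expect well above the default.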

BLASTP: Comparing Protein Sequences

Program Overview
The uses of protein alignments are quite different from those of nucleotide alignments. Alignments are rarely used to identify proteins; a major exception is large-scale mass-spectrometry experiments. Protein alignments are used in phylogenetic studies much more frequently than nucleotide alignments (25). In addition, protein alignments are commonly used in functional analyses to characterize protein sequences.

Functional analyses involve searching a protein sequence for either conserved domains (26) or for homology to proteins of known function. For example, if the function of a human protein is unknown, but it has strong homology to a murine protein of known function, then a hypothesis about the function can be made. This general approach has been used extensively in recent years to allow cross-species prediction of protein-complexes and protein-protein interactions (27).

NCBI offers five distinct types of BLAST-based protein-protein sequence alignment tools. Both PHI-BLAST and PSI-BLAST involve “profiles” characterizing a family of proteins and are used in some phylogenetic studies (28). Both rpsblast and cdart are specialized to identify conserved domains and functional motifs. Despite these options, for most protein-alignments, BLASTP remains the most appropriate choice.

Major Parameters
Database: There are two major classes of databases available for protein searches. The RefSeq database is, again, a highly curated database of protein sequences. The PDB database contains all protein sequences whose 3D structure has been solved and is available.
Default: As with nucleotide searches, the default is the highly inclusive non-redundant (nr) database.
Rationale: To encompass all known proteins
Tuning Suggestion: If searching only for well characterized proteins consider restricting the search to the RefSeq database. For most other applications nr is appropriate for protein searches.

Do CD Search: The BLASTP program gives the option of simultaneously searching the sequence for conserved domains (CDs) or functional motifs.
Default: Yes
Rationale: To provide as much data as possible
Tuning Suggestion: Leave the CD search enabled. The search only adds a marginal performance penalty, and indeed the CD search usually returns its results well before the BLASTP search, thus giving the user something to start interpreting immediately.

Species: As with BLASTN, the option exists to specify which species should be considered. Only sequences from the specified species will be aligned with the probe sequence.
Default: All species
Rationale: Maximize sensitivity
Tuning Suggestion: As with BLASTN (see above) choosing a specific species can greatly improve execution speed and remove spurious hits, leaving a much more easily interpreted result.

Expect: As with BLASTN searches (see above) the expectation value serves as a threshold. Any hits less significant than this expectation value will not be returned by the program.
Default Value: 10
Rationale: This number is something of a compromise between long query sequences (for which it returns many poor matches) and short query sequences (for which it may remove some informative matches).
Tuning Suggestion: As with BLASTN searches it is often absolutely necessary to increase the Expect to identify short matches. For longer matches, it can be helpful to reduce the expect to return a more manageable number of hits, but this is not critical.

Word Size: As with BLASTN searches, the word-size reflects the initial filtering size in a BLASTP search (see Figure 1).
Default: 3
Rationale: A compromise between execution time and search sensitivity
Tuning Suggestion: As with nucleotide alignments it is always beneficial to reduce the word-size to the smallest value possible (2 for BLASTP searches). The increase in execution time is usually compensated by specifying a species, and the additional true hits returned can be of great biological importance.

Figure 3: The Perils of Filtering. A portion of the BLASTN alignment between two isoforms of the Mxi1 gene (RefSeq mRNA accessions: NM_130439 and NM_005962). The alignment was performed twice, once with filtering (a) and once without (b). A repetitive AC-rich region of 25 nucleotides (underlined) has been filtered out (a), but nevertheless provides an exact match with the alternate isoform (b). In cases like these, filtering can obscure true alignments and can be turned off.

Matrix & Gap Costs: These are the core elements of the “scoring system” in a protein alignment. Recall that the goal of a local sequence-alignment algorithm is to compare two sequences and identify similar regions. One key to solving this problem is defining “what makes two sequences similar”. This definition of sequence-similarity is embedded in the “scoring system”, and has two major parts: substitution-scores and gap penalties (Figure 2). When two residues are aligned together a score is assigned based on how similar they are believed to be. Exact matches and very conservative mismatches are given positive scores, while other mismatches receive negative scores. The magnitude of the score reflects how conservative or radical a change might be, and the full set of scores is stored in a table called a “scoring matrix” or a “substitution matrix” (29).

In some cases a residue in one sequence has no matching residue on the other sequence (Figure 2). This is called a gap; gaps receive negative scores to penalize this lack of similarity between the two sequences. The penalties assigned to gaps are typically “affine”. This means that a gap is penalized twice: once for existing and once based on its length. The penalty for gap existence is usually larger than the gap extension penalty, reflecting the idea that the insertion or deletion that leads to a gap could easily involve multiple residues (30).

Figure 2: Overview of the BLAST scoring system. The core of a sequence-alignment algorithm is the scoring system, which answers the question “what is a good match?”. Each pair of aligned residues is given a score: matches receive positive scores and mismatches (usually) receive negative ones. The scores can be calculated in a number of ways, both probabilistically and heuristically, and are typically stored in a “scoring matrix”. Gaps between the two sequences receive negative scores, typically with a large penalty for the presence of a gap and a smaller penalty for each additional residue in the gap. The total score for the alignment is obtained by adding all the positive (match) and negative (mismatch and gap) scores. The overall score can then be compared to a distribution to determine statistical significance.
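The scoring scheme summarized in Figure 2 can be sketched in a few lines. The substitution scores and gap penalties below are illustrative assumptions, and the sketch assumes the convention that a gap of length k costs the existence penalty plus k times the extension penalty; other programs may count the first gapped residue differently.

```python
def gap_cost(length, gap_open, gap_extend):
    """Affine penalty: a fixed cost for existing plus a cost per gapped residue."""
    return gap_open + gap_extend * length


def score_alignment(row_a, row_b, substitution, gap_open, gap_extend):
    """Score two equal-length aligned rows in which '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    run = 0          # length of the gap currently being read
    run_row = None   # "a" or "b": the row that holds the current gap
    for x, y in zip(row_a, row_b):
        row = "a" if x == "-" else "b" if y == "-" else None
        if row is not None and row == run_row:
            run += 1  # the current gap continues
            continue
        if run:
            # The previous gap has ended, so charge it once as a whole.
            total -= gap_cost(run, gap_open, gap_extend)
        run, run_row = (1, row) if row else (0, None)
        if row is None:
            total += substitution(x, y)
    if run:
        total -= gap_cost(run, gap_open, gap_extend)
    return total


def simple_dna_score(x, y):
    """Illustrative nucleotide scores: +1 for a match, -3 for a mismatch."""
    return 1 if x == y else -3


print(score_alignment("ACGTTGCA", "ACG--GCA", simple_dna_score, 5, 2))
```

In this example six matches contribute +6 and the single two-residue gap costs 5 + 2 × 2 = 9, for a total of -3. Two separate one-residue gaps would cost 14 under the same penalties, which is the sense in which affine penalties favour one longer gap over several short ones.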

The selection of a scoring system is critical in any sequence-alignment. For example, careful selection of the substitution parameters has proved invaluable when working with membrane proteins that have unusual amino-acid compositions (31).

While DNA-based scoring systems are fairly simple (24, 32), protein-based systems can be highly complex. Substitution matrices can be based on an estimated evolutionary distance, and a matrix optimized for identifying very similar proteins may not work well for identifying weak similarities. Common matrices include the PAM and BLOSUM series. Empirically some matrices, e.g. BLOSUM62, appear to be better for weaker alignments, while the PAM series (PAM30 and PAM70) are thought to be superior for shorter query sequences. Similarly, the penalties assigned to the opening and extension of a gap can be adjusted. Smaller penalties can be used to detect weaker similarities that may have diverged through the insertion or deletion of significant regions.
Default: BLASTP defaults to using the BLOSUM62 matrix with a large penalty for opening a gap and a small one for extending it.
Rationale: Most BLASTP alignment searches involve longer sequences, and BLOSUM62 is optimal for these searches.
Tuning Suggestions: For many cases the BLOSUM62 matrix is appropriate. For very short sequences, the PAM30 matrix may be more sensitive. Similarly, gap penalties should normally be increased for longer or more closely related sequences, as this reflects the reduced likelihood of an insertion.

An Aside: Filtering
One other option available for all BLAST programs is the use of a filter. This filter prevents low complexity regions from driving the overall alignment. For example, in proteins, acidic- or proline-rich regions would be removed from consideration, while for DNA, poly-A regions and highly repetitive sequences are masked out. Unfortunately this filtering can occasionally remove interesting regions (Figure 3). It is very difficult to identify those rare cases where filtering is harmful, but if a BLAST query returns absolutely no hits it is possible that filtering is the culprit.
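The idea behind low-complexity filtering can be illustrated with a simple sliding-window sketch based on Shannon entropy. This is a stand-in chosen for brevity and is not the filtering algorithm used by BLAST; the window size and threshold are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.2, mask_char="N"):
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            masked[i:i + window] = mask_char * window
    return "".join(masked)


print(mask_low_complexity("ACACACACACAC"))  # AC-rich repeat, as in Figure 3
print(mask_low_complexity("ACGTACGTACGT"))  # all four bases equally represented
```

With these assumed settings an AC-rich repeat is masked in full, which shows how a filter of this kind can hide a region that would otherwise align exactly.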

Summary
The original BLAST algorithm became popular because of its speed advantages, free availability on NCBI servers, and improved statistical estimations. Just as important, the algorithm has been extensively studied and improved over the years. Statistical estimation has been improved (24, 30, 33) and new types of searches have been introduced (12, 19, 28, 34). This continued development has extended the scope of BLAST, and is continuing with enhanced integration between BLAST results and genomic annotation and resources. BLAST can be expected to remain a critical tool for solving computational biologists' problems well into the future.

Top 3 Tips for Effective BLASTing

  1. Minimize word-size
    Always use the smallest word-size possible, as larger values may miss real, biologically relevant hits.
  2. Specify a Species and Database
    The default options search most sequences for every species. By specifying these options, the number of uninformative hits drops dramatically, as does execution time. This is particularly important for nucleotide searches, where the default nr database includes both genomic and transcriptomic sequences.
  3. Record & Repeat
    Trying to repeat a BLAST search can be a frustrating experience. The continual addition of sequences to public databases can result in new hits arising and make older results harder to find. It is helpful to record relevant accession numbers, the BLAST parameters you used, and the date on which you performed your search. Repeating a query in the future may identify novel hits that were not previously available in sequence databases.
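The record-keeping suggested in the third tip can be as simple as a small structured note saved alongside the results. The sketch below is one possible layout; the field names, query identifier and date are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class BlastSearchRecord:
    """Everything needed to repeat a search and compare it with a later run."""
    program: str
    database: str
    query_id: str
    parameters: dict
    hit_accessions: list
    # Defaults to the day the record is created.
    search_date: str = field(default_factory=lambda: date.today().isoformat())


def to_json(record):
    """Serialize a record as readable JSON for a lab notebook or results folder."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Hypothetical example values.
record = BlastSearchRecord(
    program="blastn",
    database="refseq_rna",
    query_id="clone_17",
    parameters={"expect": 0.001, "word_size": 7, "filter": False},
    hit_accessions=["NM_130439", "NM_005962"],
    search_date="2005-06-01",
)
print(to_json(record))
```

Comparing the stored accession list with the hits from a repeated search makes it easy to see which matches are new since the recorded date.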

The parameters discussed here are generally applicable beyond just BLASTP and BLASTN and extend to most flavours of BLAST. By careful tuning, BLAST query results can be greatly improved, and this critical tool can be used more effectively.

Other Sources of Information
Sequence-alignment is a critical part of computational biology. This review focused only on pair-wise alignment. There are several good reviews of multiple alignments, including (25). A recent text by Durbin et al. gives an excellent theoretical and mathematical introduction to the field of sequence-alignments beyond BLAST searches (35).

Acknowledgments: The author thanks the two anonymous reviewers for helpful suggestions.

References

1. D. Noble, Nat Rev Mol Cell Biol 3, 459 (2002).

2. S. L. Baldauf, Trends Genet 19, 345 (2003).

3. A. Siepel, D. Haussler, J Comput Biol 11, 413 (2004).

4. G. Yeo et al., Genome Biol 5, R74 (2004).

5. W. W. Wasserman, A. Sandelin, Nat Rev Genet 5, 276 (2004).

6. C. H. Kim et al., Proteomics 3, 2454 (2003).

7. R. O. Duda, P. E. Hart, D. G. Stork, Pattern classification (Wiley, New York, ed. 2, 2001).

8. S. R. Eddy, Curr Opin Struct Biol 6, 361 (1996).

9. C. Workman et al., Genome Biol 3, research0048 (2002).

10. R. Jansen et al., Science 302, 449 (2003).

11. G. Didier et al., Bioinformatics 18, 490 (2002).

12. S. F. Altschul et al., Nucleic Acids Res 25, 3389 (1997).

13. R. G. Halgren et al., Nucleic Acids Res 29, 582 (2001).

14. T. P. Larsson et al., FEBS Lett 579, 690 (2005).

15. S. Khan et al., Bioinformatics 19, 2484 (2003).

16. J. W. Thornton, E. Need, D. Crews, Science 301, 1714 (2003).

17. S. F. Altschul et al., J Mol Biol 215, 403 (1990).

18. S. Karlin, S. F. Altschul, Proc Natl Acad Sci U S A 87, 2264 (1990).

19. T. A. Tatusova, T. L. Madden, FEMS Microbiol Lett 174, 247 (1999).

20. P. C. Boutros, A. B. Okey, Bioinformatics 20, 2399 (2004).

21. M. Lexa, J. Horak, B. Brzobohaty, Bioinformatics 17, 192 (2001).

22. R. Sachidanandam et al., Nature 409, 928 (2001).

23. L. E. Heisler et al., Nucleic Acids Res 33, 2952 (2005).

24. D. J. States, W. Gish, S. F. Altschul, METHODS: A Companion to Methods in Enzymology 3, 66 (1991).

25. A. Phillips, D. Janies, W. Wheeler, Mol Phylogenet Evol 16, 317 (2000).

26. A. Marchler-Bauer et al., Nucleic Acids Res 31, 383 (2003).

27. K. R. Brown, I. Jurisica, Bioinformatics 21, 2076 (2005).

28. D. T. Jones, M. B. Swindells, Trends Biochem Sci 27, 161 (2002).

29. M. R. Gribskov, J. Devereux, Sequence analysis primer, UWBC biotechnical resource series (Stockton Press; Macmillan Publishers, New York; Basingstoke, Hants, England, 1991).

30. S. F. Altschul, W. Gish, Methods Enzymol 266, 460 (1996).

31. T. Muller, S. Rahmann, M. Rehmsmeier, Bioinformatics 17 Suppl 1, S182 (2001).

32. F. Chiaromonte, V. B. Yap, W. Miller, Pac Symp Biocomput, 115 (2002).

33. S. F. Altschul, R. Bundschuh, R. Olsen, T. Hwa, Nucleic Acids Res 29, 351 (2001).

34. A. A. Schaffer et al., Nucleic Acids Res 29, 2994 (2001).

35. R. Durbin et al., Biological sequence analysis: probabilistic models of proteins and nucleic acids (Cambridge University Press, Cambridge, UK; New York, 1998).
