To Cluster Or Not To Cluster: The Uses and Misuses of Clustering Algorithms

Abstract
Clustering is a standard part of microarray data analysis. Indeed, the heatmaps and dendrograms characteristic of clustering algorithms are nearly synonymous with microarrays. But not all papers apply clustering algorithms in a valid or meaningful way. This review offers some general guidelines to help readers distinguish appropriate cluster analyses from those that are uninformative or invalid.


Introduction
Clustering algorithms, with their associated heatmaps and dendrogram representations, are nearly inseparable from microarray-based papers. But while these graphical depictions of huge microarray datasets can be visually impressive, they also raise a number of questions. How should they be interpreted? What types of problems should clustering be applied to? Are there alternative approaches to clustering? How can a biologist distinguish a poor or inappropriate use of clustering from a well-founded application?

What is clustering?
Clustering algorithms are a part of the more general field of “unsupervised pattern recognition”. These algorithms are capable of finding patterns in data (pattern recognition) without any prior information about the structure or organization of the data (unsupervised).

Given the immense popularity of clustering algorithms, it might be surprising to learn that unsupervised pattern recognition is a very small field. The overall field of pattern recognition mainly comprises algorithms that involve prior information about the data: so-called “supervised pattern recognition”. For example, in the major text on pattern recognition (1), only a single chapter is dedicated to unsupervised algorithms!

While there are many varieties of clustering algorithms, they can broadly be grouped into two major classes: hierarchical algorithms and partitioning algorithms. These two classes differ in the basic assumptions they make about the underlying data. Both types of algorithm are frequently used in microarray data analysis.

A hierarchical algorithm assumes that there is an ordered structure – somewhat like a ranking – inherent to the dataset. These ordered structures are usually represented by “dendrograms” (Figure 1). The ordering can be calculated either by splitting the entire dataset into successively smaller groups (divisive hierarchical clustering) or by successively aggregating individual data-points into larger groups (agglomerative hierarchical clustering). For a variety of reasons, agglomerative hierarchical clustering can be extremely computationally intensive, and is unmanageable for some very large datasets.
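For readers who want a more concrete feel for agglomerative clustering, the sketch below runs it on a small toy matrix with SciPy; the simulated data, linkage method, and distance metric are illustrative assumptions, not recommendations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(4)
samples = rng.normal(size=(12, 100))  # 12 samples x 100 genes (toy data)

# "average" linkage repeatedly merges the two groups with the smallest
# mean pairwise distance; the returned matrix encodes the merge tree.
tree = linkage(samples, method="average", metric="euclidean")

# The leaf order of the dendrogram is the sample ordering a clustered
# heatmap would display.
order = dendrogram(tree, no_plot=True)["leaves"]
print(order)
```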

Figure 1: The anatomy of a clustering diagram. A sample clustering of a public microarray dataset, using agglomerative hierarchical clustering. The tree-like structures on the top and left are called “dendrograms”, and they give the ordering of samples determined by clustering. The longer the line that joins a pair of samples, the less related the expression profiles of those two samples are. The shaded blocks in the centre of the picture form a “heatmap”. Heatmaps often use a red/green colour-scale to represent up- and down-regulation, but more recently a broad variety of gray-scale, yellow/blue, and blue/red heatmaps have also been used. Often heatmaps are given without a scale, which may indicate that the data within each row and column have been standardized to a (0,1) scale.
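A figure of this general kind can be reproduced in outline with a few lines of code. The sketch below uses seaborn's clustermap on simulated data; the colour-map and the row-wise standardization are illustrative assumptions only.

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
expression = rng.normal(size=(50, 10))  # 50 genes x 10 samples (toy data)

# z_score=0 standardizes each row to mean 0 and unit variance before
# plotting; this is one reason published heatmaps often lack a scale.
sns.clustermap(expression, method="average", metric="euclidean",
               z_score=0, cmap="RdYlGn_r")
```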

Partitioning algorithms, by contrast, do not posit any specific ordering of the dataset. Instead, they assume that the number of distinct groups within the dataset is known a priori. While a number of techniques have been designed to estimate this cluster-number in advance, in general the selection remains somewhat arbitrary and usually needs to be guided by some knowledge of the underlying dataset. Once the number of clusters is determined, partitioning algorithms work in a straightforward way. They first randomly assign data-points to different clusters. Next, they try moving points one-by-one from one cluster to the next, seeking to improve the “separation” of the clusters from one another. Once the algorithm reaches a stage where moving a point from one cluster to another cannot improve separation, it stops.

Two very common partitioning algorithms are k-means clustering and self-organizing maps (SOMs).
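To make the partitioning loop concrete, here is a minimal toy k-means in its common batch (“Lloyd”) form, which reassigns all points at once rather than strictly one-by-one; it is a sketch for intuition, not production code, and every choice in it is an assumption.

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """Toy batch k-means; illustrative only."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k data points as the starting centres.
    centres = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest cluster centre.
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its assigned points (empty
        # clusters keep their old centre).
        new_centres = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Stop when no move improves the configuration any further.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```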

While partitioning algorithms are much less computationally demanding than hierarchical algorithms, they are also acutely sensitive both to the number of clusters selected and to the initial random assignment.

Clustering algorithms have been reviewed extensively elsewhere, and this brief introduction only provides a few key details. The classic text on statistical pattern-recognition, Duda & Hart, includes a lucid overview of clustering algorithms (1). Several excellent reviews discussing the mathematics of unsupervised pattern-recognition from a biological perspective are available (2-7). Further, clustering has been employed in a wide range of research fields (8-12). Finally, a number of groups have proposed methods aimed at improving the visualization (13, 14) or accuracy (14-18) of clustering algorithms, or at providing techniques for validating (19-22) the performance of these algorithms on microarray data.

How is clustering misused?

Misuse #1: Clustering “pre-selected” datasets
The strength of clustering algorithms is their ability to find patterns in data without employing prior knowledge of the dataset. As such, if a dataset has been pre-selected (filtered), the results of the clustering algorithm become much less useful.

For example, consider a microarray experiment examining the response of various cell-lines to some treatment. Clustering this dataset might identify groups of genes with a similar behaviour across the various cell-lines. However, if the microarray data are first subjected to a statistical analysis to identify genes that differ between the cell-lines, then clustering will add no additional information. Instead, clustering is now a visualization method, not a method for revealing new biological insights. Indeed, in such cases, it is often acceptable to order the data manually in an appropriate way, rather than to employ clustering algorithms (23).

Thus, clustering is often an inappropriate choice when all that is really needed is a visualization technique.

Misuse #2: Clustering to detect differential expression
Clustering algorithms are often used to identify genes that are differentially expressed between two conditions. For example, if a cell-line is treated with either a drug or a control, and microarrays are run for each condition, some groups will cluster the resulting data in the hope of isolating drug-responsive genes.

This approach is flawed for two reasons. First, clustering algorithms will find groups even in random data – in other words, they are incapable of making statistical assessments of reproducibility. Second, clustering algorithms can miss small but replicable effects, particularly when multiple classes (e.g., multiple drugs) are being considered. Thus, clustering algorithms lack both specificity and sensitivity for the detection of differential expression.

In fact, a wide variety of extremely well-established statistical techniques exist to solve this problem. Simple techniques like t-tests and ANOVAs (F-tests) are often appropriate when combined with multiple-testing adjustments (24). More complex Bayesian methods are also available (25, 26).
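As an illustration of this standard alternative, the sketch below applies a per-gene t-test followed by a Benjamini-Hochberg false-discovery-rate adjustment (one adjustment in the spirit of (24); Storey's q-value method is closely related). The data are simulated and all parameters are assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
treated = rng.normal(size=(1000, 5))  # 1000 genes x 5 treated arrays (toy)
control = rng.normal(size=(1000, 5))  # 1000 genes x 5 control arrays (toy)

# Per-gene two-sample t-test across the replicate arrays.
_, p_values = ttest_ind(treated, control, axis=1)

# Benjamini-Hochberg adjustment to control the false-discovery rate.
rejected, q_values, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"{rejected.sum()} genes called differentially expressed at 5% FDR")
```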

Misuse #3: Clustering to recapitulate a biological trend
Consider a time-course microarray experiment, where the effect of some drug on transcript levels is assessed across a range of time-points. These data might then be clustered, with the aim of showing both the reproducibility of the microarray analysis and the biological groupings of specific time-points. Clustering is in fact an appropriate way of demonstrating both results, but most such analyses have two key flaws.

First, it is essential that the complete dataset be used (see Misuse #1); otherwise the conclusions drawn will be true only for the small subset of changed genes, rather than for the entire cell. Using only pre-selected differentially expressed genes turns the cluster-analysis into a mere visualization and prevents new conclusions from being drawn.

Second, it is not sufficient to show a perfect or near-perfect clustering as evidence that a known biological trend could be recapitulated from RNA-level data. Some assessment of the chance that this could occur at random is required. Not doing so is somewhat analogous to showing data without error-bars. There are a variety of techniques available for assessing “cluster homogeneity”, but perhaps the most common is the “adjusted Rand Index” (16). Just as a p-value should be associated with most claims of differential expression, so too should Rand Index values be associated with claims that microarray clusters have recapitulated a known biological trend.
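Computing such a score is straightforward; the sketch below uses scikit-learn's implementation of the adjusted Rand index on made-up labels, purely for illustration.

```python
from sklearn.metrics import adjusted_rand_score

known_groups   = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # e.g. early/mid/late time-points
cluster_labels = [0, 0, 1, 1, 1, 1, 2, 2, 2]  # labels produced by clustering

# 1.0 indicates perfect agreement; values near 0 are expected by chance.
ari = adjusted_rand_score(known_groups, cluster_labels)
print(f"adjusted Rand index = {ari:.2f}")
```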

Misuse #4: Unverified clustering reproducibility
In any study using partitioning algorithms there are two elements that must be discussed for the work to have credibility. First, it is critical that the selection of the number of subsets be justified on some basis – be it prior biological knowledge, empirical assessment of the data using some external criteria, or use of a size-determination metric. Without this information it becomes unclear to what extent observed effects are merely artifacts of parameter-choice (27).
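One possible size-determination metric (an assumption here, since no specific metric is prescribed above) is the silhouette score; a minimal sketch of choosing the cluster number this way follows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
data = rng.normal(size=(200, 20))  # toy expression matrix

# Score each candidate cluster number and keep the best-scoring one.
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

best_k = max(scores, key=scores.get)
print(f"silhouette analysis favours k = {best_k}")
```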

Second, it is equally critical that any partitioning algorithm be repeated (iterated) many times to help ensure that an appropriate solution was found. This is particularly important for complex datasets. A simple analogy may help clarify the importance of this often-omitted step. Consider a simple method of trying to find the highest point on the earth: start at a random point and keep walking successively higher until you can see no higher point in the distance. If one starts near a mountain-range this approach will work well, but if one starts in a flat region this approach will work poorly.

Partitioning algorithms work in much this manner, and thus it is always necessary to iterate k-means or SOM clustering many times (often tens of thousands of runs). If the same “highest point” is found several times, one can have some confidence that it is indeed the peak, although even this is not guaranteed.
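The sketch below illustrates this restart procedure with scikit-learn's k-means: rerun the algorithm from many random starting points, keep the solution with the lowest within-cluster sum of squares (inertia), and check how often that best solution recurs. The restart count and data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
data = rng.normal(size=(200, 20))  # toy expression matrix

# One k-means run per random seed; real analyses may use far more.
inertias = [KMeans(n_clusters=4, n_init=1, random_state=seed).fit(data).inertia_
            for seed in range(1000)]

best = min(inertias)
n_best = sum(np.isclose(i, best, rtol=1e-3) for i in inertias)
print(f"best inertia {best:.1f} recovered in {n_best}/1000 restarts")
```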

Thus, this step is critical. Unfortunately, when it is not reported in a manuscript, it is often unclear whether the authors simply omitted to mention this part of the analysis, or whether they never performed it at all.

How can clustering be used appropriately?

Use #1: Clustering to predict class assignment
One of the classic uses of clustering algorithms is to predict new properties of a gene, based on its co-clustering partners. This was first done by Hughes et al. in yeast, where a panel of 300 yeast microarray experiments was clustered (28). Each cluster was then tested for functional homogeneity using a statistical algorithm. Many clusters were found to be highly enriched for genes with a particular biological function, such as RNA-binding. Uncharacterized genes within these clusters were then predicted to also have this function, and in a number of cases Hughes et al. verified these predictions experimentally. Clustering can thus be used to predict function or properties through enrichment-analysis.
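The specific statistical algorithm used by Hughes et al. is not reproduced here, but a common test for this kind of functional enrichment is the hypergeometric test, sketched below with purely illustrative counts.

```python
from scipy.stats import hypergeom

N = 6000  # genes on the array (illustrative)
K = 120   # genes annotated with the function, e.g. RNA-binding
n = 50    # genes in the cluster
k = 15    # annotated genes observed in the cluster

# Probability of observing >= k annotated genes in a cluster of size n.
p_enrichment = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value = {p_enrichment:.2e}")
```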

Use #2: Clustering to bootstrap other analyses
One of the assumptions underlying many clustering analyses is the idea that co-expressed genes are also co-regulated. This idea has been extensively exploited by Nir Friedman’s group to predict novel transcription-factor binding-sites. Their approach was elegant: genes were clustered together based on expression profiles. The promoter regions of the resulting clusters were then analyzed for known or novel DNA motifs that might represent transcription-factor binding-sites (29, 30). Again, some of the predictions were experimentally verified. This approach has since been applied to human datasets (31), and the algorithm has been adapted to accommodate experimental transcription-factor binding (ChIP-chip) data (32).

Use #3: Data integration
In theory, clustering analyses can be exploited for merging disparate types of data. For example, gene-expression measures from different technologies such as microarrays and SAGE can be combined using clustering algorithms (33). Similarly, broad-based data-integration techniques employing supervised pattern-recognition methods have also been proposed (34).

Conclusions

Clustering algorithms can be a critical tool for the analysis of complex, multi-dimensional datasets. Despite the problems noted here, it is likely that unsupervised analysis approaches will only become more important as high-throughput experiments become a standard part of molecular biology. As such, developing the ability to detect poorly applied clustering techniques is a valuable skill.

References

1. R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd ed. (Wiley, New York, 2001).

2. G. Sherlock, Brief Bioinform 2, 350 (2001).

3. F. Azuaje, Brief Bioinform 4, 31 (2003).

4. R. B. Altman, S. Raychaudhuri, Curr Opin Struct Biol 11, 340 (2001).

5. S. Raychaudhuri, P. D. Sutphin, J. T. Chang, R. B. Altman, Trends Biotechnol 19, 189 (2001).

6. R. Simon, M. D. Radmacher, K. Dobbin, L. M. McShane, J Natl Cancer Inst 95, 14 (2003).

7. P. C. Boutros, A. B. Okey, Brief Bioinform 6, 331 (2005).

8. C. W. Whitfield, A. M. Cziko, G. E. Robinson, Science 302, 296 (2003).

9. J. M. Stuart, E. Segal, D. Koller, S. K. Kim, Science 302, 249 (2003).

10. S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, G. M. Church, Nat Genet 22, 281 (1999).

11. S. S. Daoud et al., Cancer Res 63, 2782 (2003).

12. K. C. Fertuck, J. E. Eckel, C. Gennings, T. R. Zacharewski, Physiol Genomics 15, 127 (2003).

13. D. R. Gilbert, M. Schroeder, J. van Helden, Trends Biotechnol 18, 487 (2000).

14. M. Sultan et al., Bioinformatics 18, Suppl 1, S111 (2002).

15. D. Hanisch, A. Zien, R. Zimmer, T. Lengauer, Bioinformatics 18, Suppl 1, S145 (2002).

16. K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, W. L. Ruzzo, Bioinformatics 17, 977 (2001).

17. G. J. McLachlan, R. W. Bean, D. Peel, Bioinformatics 18, 413 (2002).

18. G. Getz, E. Levine, E. Domany, Proc Natl Acad Sci U S A 97, 12079 (2000).

19. S. Datta, Bioinformatics 19, 459 (2003).

20. I. Gat-Viks, R. Sharan, R. Shamir, Bioinformatics 19, 2381 (2003).

21. M. K. Kerr, G. A. Churchill, Proc Natl Acad Sci U S A 98, 8961 (2001).

22. L. M. McShane et al., Bioinformatics 18, 1462 (2002).

23. A. P. Davierwala et al., Nat Genet 37, 1147 (2005).

24. J. D. Storey, R. Tibshirani, Proc Natl Acad Sci U S A 100, 9440 (2003).

25. G. K. Smyth, Stat Appl Genet Mol Biol 3, 1 (2004).

26. W. Pan, Bioinformatics 18, 546 (2002).

27. K. H. Pan, C. J. Lih, S. N. Cohen, Proc Natl Acad Sci U S A 102, 8961 (2005).

28. T. R. Hughes et al., Cell 102, 109 (2000).

29. E. Segal et al., Nat Genet 34, 166 (2003).

30. E. Segal, R. Yelensky, D. Koller, Bioinformatics 19, Suppl 1, I273 (2003).

31. R. Elkon, C. Linhart, R. Sharan, R. Shamir, Y. Shiloh, Genome Res 13, 773 (2003).

32. I. Nachman, A. Regev, N. Friedman, Bioinformatics 20, Suppl 1, I248 (2004).

33. P. M. Haverty, L. L. Hsiao, S. R. Gullans, U. Hansen, Z. Weng, Bioinformatics 20, 3431 (2004).

34. G. R. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, W. S. Noble, Bioinformatics 20, 2626 (2004).
