Clustering is the process
of deciding which genes behave similarly and should be thus considered
members of a cluster of genes, based on their observed expression profiles.
Clustering arbitrary data into clusters of similar
items presents the difficulty of deciding what constitutes a good clustering.
It can be shown that there is no absolute best criterion which
would be independent of the final aim of the clustering. Consequently,
it is the user which must supply this criterion, in such a way that the
result of the clustering will suit their needs.
In the particular case of gene expression clustering,
the desirable outcome of clustering is identification of functional
groups of genes, i.e., identification of clusters of genes involved
in common biological processes.
ArrayMiner uses two intuitively appealing criteria
of clustering quality, namely clustering for minimal total variance
of the clusters or a maximum probability of a Gaussian mixture model of
the data.
It may seem that defining a good criterion for
clustering quality would be sufficient for obtaining good clusters of
genes. However, this is very far from being the case: finding clusters
of genes of minimal total variance or maximal Gaussian probability is
not easy, and not all tools are equally good at the task.
In fact, clustering for minimal total variance
or maximal Gaussian probability is a provably difficult computational
problem (for those well versed with computer science: the problem is NP-Hard).
This implies in particular, that fast one-shot heuristics, such as the
popular method of k-Means
(Web Based), simply cannot be expected
to perform well on this task. As a result, they will supply clusterings
of poor quality with high probability.
For the biologist who runs the clustering software,
the quality of the clustering is of significant importance, as he or she
interprets the clusters as associations of genes that behave similarly.
Hence considering a clustering of poor quality will lead the biologist
into a painful examination (and, hopefully, rejection) of hypotheses purportedly
explaining the bogus associations suggested by the ill formed clusters
in that clustering, causing a potentially serious waste of time and resources.
Conversely, a poor clustering obviously means that better ones are not
supplied. It thus prevents the biologist from examining the probably more
useful associations which would be otherwise suggested by a high-quality
clustering. This may cause the biologist to miss important biological
phenomena, a potentially serious hindrance on their research.
Following the above observations, ArrayMiner
puts a premium on the quality of the clusterings it supplies. Indeed,
we are convinced that it is far better for the biologist to wait a few
minutes for a high-quality clustering that will constitute a solid and
reliable basis for their research, than to get a poor solution in seconds
and then waste months of futile effort researching the bogus associations
of genes suggested by the poor clustering.
To achieve the goal of high quality, ArrayMiner
relies on the proprietary technique of Grouping Genetic Algorithms (GGA).
Developed by Optimal Design over the last decade specifically for clustering
problems, and fine-tuned within ArrayMiner for the specific task of clustering
of gene expression profiles, the GGA supplies clusterings of excellent
quality within reasonable execution times. The GGA is a special variety
of the Genetic Algorithm
(Web Based) technique.
The difficulty of the expression profile clustering
problem has one practical consequence: it is impossible to know with certainty
whether a given clustering is the best possible one. The only way to achieve
that certainty would be to examine all possible clusterings there could
be, yet that is impossible to do, as there are billions of them for even
modest numbers of expression profiles. Since ArrayMiner strives to find
the best clustering of the profiles, it will always try to improve the
one it currently has, even if it is the best possible one. Several options
are available for deciding when to stop the algorithm and adopt the currently
available clustering as the final one.
It is worthy noting that the problem of deciding
when to stop the optimization process is irrelevant in most other clustering
software. The reason of this is that the clustering methods built into
that software are in most, if not all, cases simple heuristics (e.g. the
k-Means method), rather than a well defined optimization procedure like
the GGA. Those methods are therefore simply stopped when they go
out of steam and are unable to search for better clusterings. The
obvious flip side is that they often end up with poor quality clusterings,
with the potentially serious consequences to the user discussed above.
|