|
With recent advances in DNA microarray
technologies, the expression levels of thousands of genes
can be measured simultaneously. The obtained data are usually
organized as a matrix where the columns represent genes (usually
genes of the whole genome), and the rows correspond to the
samples (e.g. various tissues, experimental conditions, or
time points). Given this wealth of gene expression data,
it is essential to extract the knowledge hidden in this matrix.
One of the key steps in gene expression analysis is clustering
genes that show similar expression patterns. By identifying
a set of gene clusters, we can hypothesize that the genes
clustered together tend to be functionally related. Thus,
gene expression clustering may be useful in identifying mechanisms
of gene regulation and interaction, which can be used to understand
the function of a cell.
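
As a minimal sketch of the data layout described above, the following stores a tiny expression matrix with samples as rows and genes as columns, then compares gene profiles. The gene names and values are hypothetical, and Pearson correlation is only one common way to quantify "similar patterns"; the text does not say which similarity measure is actually used.

```python
import math

# Hypothetical toy data: rows are samples (e.g. time points),
# columns are genes, matching the layout described in the text.
gene_names = ["gene_a", "gene_b", "gene_c"]
expression = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 2.0],
    [3.0, 6.0, 1.0],
    [4.0, 8.0, 0.0],
]


def gene_profile(matrix, column):
    """Return one gene's expression values across all samples."""
    return [row[column] for row in matrix]


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Assumes neither profile is constant (non-zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


profile_a = gene_profile(expression, 0)
profile_b = gene_profile(expression, 1)
profile_c = gene_profile(expression, 2)
```

Here `gene_a` and `gene_b` rise together across the samples while `gene_c` falls, so a correlation-based measure would treat the first two as candidates for the same cluster.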

Because gene expression data consist of measurements across
various conditions (or time points), they are high-dimensional,
large in volume, and noisy. Clustering algorithms must therefore
address and exploit these characteristics. Recent database
mining research has proposed density-based clustering algorithms,
which are well suited to noisy multi-dimensional datasets. To
address the limitations of previous density-based clustering
methods, we present a KNN (k-nearest neighbor) density estimation
clustering algorithm designed to produce co-expressed
gene clusters. In addition, to reduce the complexity of the
presented clustering technique, we explored different optimization
schemes. Preliminary experimental results indicate that the
proposed method successfully identifies co-expressed gene
clusters in yeast time-series datasets.
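
To make the idea of k-nearest-neighbor density estimation concrete, here is a generic sketch, not the algorithm presented in this work. It assumes Euclidean distance, estimates each point's density as the inverse of the distance to its k-th nearest neighbor, and then lets every point follow its densest neighbor until it reaches a local density peak. The function names and the peak-following step are illustrative assumptions.

```python
import math


def knn_density(points, k):
    """Estimate density per point as 1 / (distance to its k-th nearest neighbor)."""
    densities, neighbors = [], []
    for i, p in enumerate(points):
        # Brute-force neighbor search: O(n^2) overall, fine for a sketch.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        kth_distance = nearest[-1][0]
        densities.append(1.0 / (kth_distance + 1e-12))  # guard against zero distance
        neighbors.append([j for _, j in nearest])
    return densities, neighbors


def knn_mode_clustering(points, k):
    """Assign each point to the density peak reached by following denser neighbors."""
    densities, neighbors = knn_density(points, k)
    parent = []
    for i in range(len(points)):
        # Densest point among i and its k neighbors; ties go to the lower index.
        best = max([i] + neighbors[i], key=lambda j: (densities[j], -j))
        parent.append(best)
    labels = []
    for i in range(len(points)):
        root = i
        while parent[root] != root:  # climb until a local density peak
            root = parent[root]
        labels.append(root)
    return labels


# Hypothetical two-condition profiles forming two well-separated groups.
profiles = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = knn_mode_clustering(profiles, k=2)
```

Each label is the index of the density peak that a point ends up at, so points sharing a label form one cluster. The brute-force neighbor search is the dominant cost here, which is the kind of step an optimization scheme would typically target.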

The above figure shows sample co-expressed gene clusters
identified by our clustering algorithm. As shown, the
algorithm captures co-expressed genes
successfully. In addition, to verify that the produced clusters
are biologically meaningful, we examined the GO annotations
of genes within the same cluster. The following figure shows
the GO annotation of the sample genes in the co-expressed
gene cluster, as generated by GoTermFinder. As illustrated,
the concepts corresponding to the genes within the same cluster
are semantically similar; thus, our clustering algorithm can
also capture biologically meaningful clusters.

Finally, based on the observation that a priori biological
knowledge is available in bioinformatics, we
developed a constrained clustering algorithm that incorporates
background knowledge early in the clustering process. The
Gene Ontology is used to derive relevant background knowledge,
which is expressed as pairwise instance-level constraints (e.g.,
two genes must belong to the same cluster). Given
a small amount of such background knowledge, we perform gene
expression clustering to produce gene clusters that are both
biologically meaningful and co-expressed. A brief overview can
be found in our poster.
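
The pairwise instance-level constraints mentioned above can be illustrated with a small sketch. It assumes must-link constraints are given as pairs of gene identifiers (the names below are hypothetical), merges them transitively with a union-find structure, and checks a candidate clustering against them. How the constraints are derived from the Gene Ontology, and how the clustering itself uses them, is not shown here.

```python
def must_link_groups(genes, must_link):
    """Merge genes connected by must-link pairs into groups (transitive closure)."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving keeps trees shallow
            g = parent[g]
        return g

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), set()).add(g)
    return list(groups.values())


def violated_constraints(labels, must_link):
    """Return the must-link pairs whose genes ended up in different clusters."""
    return [(a, b) for a, b in must_link if labels[a] != labels[b]]


# Hypothetical gene identifiers and constraints, for illustration only.
genes = ["g1", "g2", "g3", "g4"]
must_link = [("g1", "g2"), ("g2", "g3")]

groups = must_link_groups(genes, must_link)
candidate_labels = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
violations = violated_constraints(candidate_labels, must_link)
```

In this toy case the constraints imply that `g1`, `g2`, and `g3` belong together, so the candidate clustering violates the pair `("g2", "g3")`. Merging constrained genes up front is one simple way to apply such knowledge early in the process.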
|