|
Motivation and Technology Background
With the rapid progress in data acquisition, sensing, and
storage technologies, the size of available datasets is increasing
at a rapid rate. Given this overwhelming amount of data, data
mining, which transforms raw datasets into useful higher-level
knowledge, becomes a must in analyzing and understanding the
information. The raw dataset is consists of objects that are
usually represented as a vector of numeric values in multidimensional
space. Higher-level of knowledge can be extracted by discovering
groups (clusters) of similar objects from the raw dataset.
And, traditional clustering methods, which consider all of
the dimensions (attributes) for measuring similarity between
objects, have taken a leading role in data mining. However,
there are several weaknesses for dealing with high dimensional
data with traditional clustering frameworks. This is caused
by inherent characteristics of high dimensional raw data,
which are the increasing irrelevance of some dimensions, and
the tendency to be nearly equidistant between objects as the
number of dimensions in the dataset increases. For example,
in Figure1, same objects that don’t show strong correlation
in left side graph can have correlation in right side graph.
Feature selection methods have been used for discovering relevant
dimensions from high dimensional space, and clusters on them,
however the feature selection methods have a critical limit
that uncovers relations between objects in multiple, overlapping
sub-dimensional spaces.
Figure 1. Full dimension vs. Subspace
Subspace clustering algorithms are suggested to overcome inherent
problems in full-dimensional clustering frameworks and feature
selection methods. The main purpose of subspace clustering
algorithms is to find meaningful groups of objects in multiple
sub-dimensions that may be overlapped, because clusters may
be embedded in lower dimensional subspaces. In addition, different
sets of dimensions may be relevant for different sets of objects,
and an object can have multiple characteristics in different
sub-spaces.
There are two main streams of previous approaches in subspace
clustering. Top-down algorithms (e.g., PROCLUS, FINDIT, etc)
find full-dimensional clusters first, and then refine them
to find subspace clusters. Bottom-up strategies (e.g., CLIQUE,
MAFIA, etc.) start from finding low dimensional dense regions,
and then use them to make relevant subspace clusters. However,
both of them have a major weakness on scalability of dimensions.
For example, the number of dimensions (m) for gene expression
data usually varies from 20 to 100, and the number of dimensions
of feature vectors for documents is well over 100,000. When
m is equal to 50, the number of possible subspaces is 250
– 1. It is computationally expensive to search all subspaces
to identify clusters. To cope with this problem, bottom-up
type of algorithms first identify clusters in low dimensional
spaces, and use them to find clusters in higher dimensional
spaces based on the a priori algorithm. However, this approach
is not scalable with the number of dimensions in general.
This motivates the necessity of a subspace clustering algorithm
that is scalable with the number of dimensions. To this end,
we develop a novel subspace clustering algorithm that utilizes
domain transformation from objects to object-object relations.
Current efforts and capabilities
Our efforts on data mining are mainly focused on gene expression
analysis [1, 2]. This is because there can be valuable hidden
high-level knowledge in gene expression datasets that consist
of the expression levels of thousands of genes under hundreds
of conditions. The other reason that we use gene expression
data is that most genes are expected to be strongly correlated
under certain sub-conditions (e.g. Figure1), and these genes
have similar biological functions. In addition to gene expression
analysis, we also focus our research on word subspace clustering.
Figure 2. Overview of SCLUDOT
Our current capabilities on data mining include the following:
-
Identifying coherent patterns on subsets of gene expression
datasets [1]: We propose a novel Subspace clustering algorithm,
referred to as SCLUDOT (Subspace CLUstering based on DOmain
Transformation), for identifying coherent sub-patterns of
gene expression datasets. The overview of our proposed subspace
clustering algorithm is shown on Figure2. By performing
discretization (step1 in Figre2) to gene expression profiles,
each gene can be expressed as a sequence of symbols instead
of a sequence of numeric values.
As gene pairs may be similar in certain conditions and not in other conditions, the maximum
number of same symbols in gene pairs would be less than
or equal to total number of conditions, even though all
genes have same length of sequence of symbols (total number
of conditions). Now, the similarity between two genes is
transformed as a sequence of symbols that represents the
maximal subspace cluster for the gene pair (step2 in Figure2).
This domain transformation (from genes into gene-gene relations)
allows us to make the number of possible subspace clusters
dependent on the number of genes (objects) instead of number
of conditions (dimensions).
Based on the symbolized genes,
we present an efficient subspace clustering algorithm that
is linearly scalable to the number of dimensions. In addition,
the running time can be drastically reduced by utilizing
inverted indexing and pruning non-interesting subspaces.
Experimental results so far indicate that the proposed method
efficiently identifies co-expressed gene subspace clusters
for a yeast cell cycle dataset. However, SCLUDOT is not
limited to gene expression mining, but applicable to subspace
clustering in general for high-dimensional massive datasets.
-
SCLUDOT-based metadata management: To strengthen our
work in terms of generality, we plan to apply SCLUDOT to
word subspace clustering. In order to retrieve rich semantic
information from the overwhelming amount of unstructured
data, metadata (e.g., ontologies) can be employed. Since
manually building and maintaining such metadata is nearly
impossible, an automated dynamic metadata management technique
will play an important role.
We proposed a novel algorithm
called WebSim (Web-based term Similarity metric) [4,5],
whose feature extraction and similarity model is based on
a conventional Web search engine.
There are two main benefits
of utilizing a Web search engine. First, we can obtain the
freshest content for each term that represents the up-to-date
knowledge on the term. Second, in comparison with the approaches
that use the certain amount of crawled Web documents as
a corpus, our method is less sensitive to the problem of
data sparseness because we access as much content as possible
using a search engine.
A new approach that combines WebSim
with SCLUDOT gives a promising dynamic metadata management
model. Even though WebSim provides more reliable metrics
for relationships between terms or concepts, it is still
suffering from dealing with ambiguity of terms.
Because SCLUDOT finds clusters in multiple, overlapped subspaces
in the manner of being scalable to the number of dimensions,
it has the capability to extract multiple relations between
terms or concepts under different subsets of features in
very high dimensional feature vector space extracted by
WebSim. And, these subspace clusters reveal hidden term
relationships that cannot be found by full-dimensional term
similarity metrics.
At the core of our current research efforts, SCLUDOT plays
a key role. An efficient subspace clustering algorithm promises
huge benefits on diverse problem domains, and SCLUDOT may
be a stepping stone to the successful subspace clustering
framework.
Possible problem domains of subspace clustering
Although all the high dimensional data can be considered as
input data of subspace clustering, datasets in certain problem
domains have recently received scientific attention:
-
Gene expression analysis: Since gene expression datasets
consist of measurements across various conditions (or time
points), they are characterized by high dimensional, huge
size volume, and noisy data. It is well known that genes
can manifest a coherent pattern under subsets of experimental
conditions, and these patterns are closely related with
certain functionality of genes. Therefore, It is essential
to identify such local patterns in gene expression datasets,
which is a key to revealing biologically meaningful clusters.
-
E-commerce: With the fast growing online market, finding
user groups who show similar interests in products is more
important than any other era in history. Also, users can
be identified by diverse characteristics, and current technologies
allow substantial user information to be gathered. However,
it is difficult to analyze the user information data because
of its high dimensionality and the heterogeneity of users’
interests. A subspace clustering with the ability of finding
locally correlated patterns is a promising method for recommendation
systems and target marketing in e-commerce.
? Besides gene expression analysis and e-commerce, there
are several problem domains that are suitable for subspace
clustering. For example, creating a hierarchy of data sources
in an information integration system and organizing unstructured
web search results are possible problem domains where subspace
clustering can be utilized.
Research Plan
Regarding our subspace clustering algorithm (SCLUDOT), we
plan to investigate the following:
-
Improving discritization step: Discritization on each
condition is the first step in SCLUDOT, and plays a key
role in improving the quality of subspace clusters. Currently,
we utilize a k-means algorithm for this step, but it limits
number of possible symbols in each condition (dimension)
to a pre-defined fixed number. But, there can be different
characteristics (e.g. density distribution of values, pre-determined
importance, etc.) of each condition: this means the algorithm
should have an ability to freely determine the number of
symbols in each dimension. So, we plan to apply different
known algorithms for this step, and conduct experiments
to find the most efficient method.
-
Finding concrete subspace clusters: There are still more
issues vis-a-vis SCLUDOT – how we can systematically
determine the bound of number of objects in a subspace cluster,
and what is relevant minimum number of dimensions that provides
meaningful clusters? Currently, a probabilistic model is
used to determine the minimum number of objects in a cluster.
And, the minimum number of dimensions is determined experimentally.
These two issues strongly affect the average running time
and quality of result, because the threshold for the minimum
number of objects in a cluster is related with the pruning
step and total number of obtained subspace clusters, and
the minimum number of dimensions determines the size of
search space. We plan to investigate these issues in diverse
ways.
-
Extending SCLUDOT to pattern-based subspace clustering:
Our current subspace clustering algorithm utilizes value
similarity for finding subspace clusters, but there is high
possibility to extent it to pattern-based clustering. Pattern-based
clustering is performed by pattern similarity, and it has
been widely studied for the purpose of dealing with time-series
data. By adding numeric orders of symbols as another factor
for the clustering, we can achieve pattern-based subspace
clustering.
-
Efficient analysis for the result: In order to interpret
obtained clusters, external knowledge needs to be involved.
It depends on problem domain, because available external
knowledge will be different domain by domain. For gene expression
analysis, we plan to explore the methodology to quantify
the relationship between subspace clusters and known biological
knowledge by utilizing a Gene Ontology.
-
Comprehensive evaluation of the proposed algorithm on
different datasets: For the purpose of qualitatively testing
SCLUDOT, we plan to conduct more comprehensive experiments
on diverse datasets as well as compare our approach with
other competitive algorithms.
Besides improving our subspace clustering algorithm itself,
we plan to extend our research to apply SCLUDOT to other application
domains. SCLUDOT is a general-purpose subspace clustering
algorithm, therefore it may be suitable for e-commerce and
association rule mining by combining SCUDOT with a classification
method. So, we plan to investigate whether SCLUDOT is suitable
to domains such as association rule mining, and recommendation
systems.
|