Introduction
As the Web continues to grow as a vehicle for the distribution
of information, many news organizations are providing newswire
services through the Internet. Given the popularity of
Web news services, the primary goal of this research is to
develop methodologies for mining news streams. Toward this
end, we have been developing topic mining, which effectively
identifies useful patterns (e.g., metadata, topics, and events
that are instances of topics) in news streams. Although
we have worked on the news domain, the topic mining framework
can be extended to other kinds of information streams (e.g.,
emails).
Approach
Web news articles are composed of hyperlinks, audio, video,
images, and text. However, not all news stories have
corresponding multimedia data, so we focus on text, which can
be a rich source of information about the news. Given that text
is unstructured data, efficient text mining and access methods
are required to obtain the valuable knowledge embedded in text
documents.
Therefore, to build a novel framework for an intelligent news
database management and navigation scheme, we utilize techniques
in information retrieval, data mining, machine learning, and
natural language processing.
The above figure illustrates the main parts of the proposed
framework. Topic mining is composed of four components: information
gathering, information preprocessing, information analysis,
and information presentation. A Web crawler retrieves a set
of news documents from a news Web site (e.g., CNN) in the
information gathering stage. Developing an intelligent Web
crawler is another research area, and it is not our main focus.
Thus, we implement a simple Web spider, which downloads news
articles from a news Web site on a daily basis. The retrieved
documents are processed by data mining tools to produce useful
higher-level knowledge (e.g., a document hierarchy, a topic
ontology, etc.), which is stored in a content description database.
Instead of interacting with a Web news service directly, an
information delivery agent can exploit the knowledge in this
database to present an answer in response to a user request.
Achievement
Current capabilities of topic mining on news stream datasets
include the following:
• Efficient incremental hierarchical news document
clustering. Since several hundred news articles are published
every day at a single Web news site, efficient incremental data
mining algorithms are needed to cope with such a dynamic
environment. Despite the large body of research
on document clustering, little work has been conducted in
the context of incremental hierarchical news document clustering.
Our clustering algorithm, which is based on neighborhood
search, has several key advantages, including scalability with
high dimensionality, the capability to discover clusters
with different shapes and sizes, and the ability to provide
succinct descriptions of clusters.
• Topic detection and tracking. Due to the overwhelming
amount of information involved, it is crucial to provide
an intelligent agent that can identify novel information
and track related information for a user. Given a stream
of news articles, topic mining identifies whether a new
document belongs to an existing topic or a new topic. Topic
mining also tracks events of interest based on a sample news
story. For example, it associates incoming news stories
with related stories that were discussed earlier,
or it can monitor the news stream for further stories
on the same topic.
• Topic ontology learning from a news stream. In order
to achieve rich semantic information retrieval, metadata
(e.g., ontological information) should be employed. Since
manually building and maintaining such metadata is nearly
impossible, we developed a prototype system for learning
topic ontologies. A topic ontology is a collection of concepts
and relations. One view of a concept is as a set of terms
that characterize a topic. We employ two generic kinds of
relations, specialization and generalization. The former
is useful when refining a query while the latter can be
used when we generalize the query to increase recall or
broaden the search.
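
The topic detection step described above (deciding whether an incoming article belongs to an existing topic or starts a new one) can be sketched as a single-pass procedure. This is only an illustrative sketch, not the neighborhood-search algorithm developed in this project: it assumes plain term-frequency vectors, cosine similarity, and a fixed similarity threshold, and the names TopicTracker, term_vector, and threshold are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for one article (assumed representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TopicTracker:
    """Hypothetical single-pass detector: one summed term vector per topic."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed cut-off, would need tuning
        self.topics = []

    def add(self, text):
        """Assign an article to its most similar topic, or open a new topic.

        Returns (topic_id, is_new_topic).
        """
        vec = term_vector(text)
        best_id, best_sim = None, 0.0
        for topic_id, centroid in enumerate(self.topics):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = topic_id, sim
        if best_id is not None and best_sim >= self.threshold:
            # Incremental update: fold the article into the existing topic.
            self.topics[best_id].update(vec)
            return best_id, False
        self.topics.append(vec)
        return len(self.topics) - 1, True


tracker = TopicTracker(threshold=0.3)
for story in [
    "Earthquake strikes coastal city, rescue teams respond",
    "Rescue teams search coastal city after earthquake",
    "Central bank raises interest rates again",
]:
    print(tracker.add(story))
```

Each article is processed once and only the matching topic is updated, which is the property that makes this kind of procedure suitable for a daily stream. A flat list of topics is used here for brevity; a hierarchical variant would apply the same decision at each level of the cluster tree.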

The above figure shows a possible outcome of topic mining.
Each node in a document cluster hierarchy can be associated
with a set of terms, which is referred to as a topic ontology
node. As shown, topic ontologies can characterize a news topic
at multiple levels of generality.
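
As a rough illustration of how such a structure might be represented, the sketch below attaches a set of terms to each node of a hierarchy and exposes the two relations mentioned earlier, generalization and specialization. The class name OntologyNode, its methods, and the sample terms are hypothetical and are not taken from the prototype system.

```python
class OntologyNode:
    """Hypothetical topic ontology node: a set of terms plus tree links."""

    def __init__(self, terms, parent=None):
        self.terms = list(terms)
        self.parent = parent
        self.children = []

    def add_child(self, terms):
        """Attach a more specific concept below this node and return it."""
        child = OntologyNode(terms, parent=self)
        self.children.append(child)
        return child

    def generalize(self):
        """Terms of the parent concept, usable to broaden a query."""
        return list(self.parent.terms) if self.parent else []

    def specialize(self):
        """Term sets of the child concepts, usable to refine a query."""
        return [list(child.terms) for child in self.children]


# Made-up example hierarchy, for illustration only.
root = OntologyNode(["sports"])
soccer = root.add_child(["soccer", "world cup"])
tennis = root.add_child(["tennis", "wimbledon"])

print(soccer.generalize())
print(root.specialize())
```

Walking up the tree yields broader terms that can increase recall, while walking down yields narrower term sets that can refine a query.
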
An experimental prototype system has been implemented
and tested to demonstrate the effectiveness of the topic mining
framework. The results show that the proposed clustering algorithm
produces a high-quality document cluster hierarchy, and that the
obtained topic ontology provides an interpretation of the news
topics at different levels of abstraction.
One possible application of topic mining is to utilize it
for Web search. For example, the incremental document clustering
algorithm can be applied to a stream of Web pages returned
by a search engine. Since topic mining can build a document
cluster hierarchy incrementally, a user can browse a document
cluster hierarchy instead of examining a flat list of documents.
In addition, topic ontologies can be used to suggest alternative
query terms to refine the query.
Please see our
recent presentation at ODBASE, 2003.