Modern computers generally come with some ability to send
spam. The only necessary ingredient is the list of addresses
to target. Spammers obtain email addresses by a number of
means: harvesting addresses from Usenet postings, DNS listings,
or Web pages; guessing common names at known domains (known
as a dictionary attack); and "e-pending" or searching
for email addresses corresponding to specific persons, such
as residents in an area. Many spammers use programs called
web spiders to find email addresses on web pages, although
a simple spider can be fooled by replacing the "@" symbol
with another symbol, for example "#", when posting an email
address. Because addresses are so easy to collect, users
waste valuable time deleting spam emails. Moreover,
because spam emails can quickly fill up the storage space of
a file server, they can cause a severe problem for
websites with thousands of users.
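The effect of replacing "@" can be illustrated with a minimal
sketch. The regular expression below is an assumed, simplified
stand-in for what a harvesting spider might use; real harvesters
vary and may undo simple substitutions. The function names and
sample addresses are hypothetical.

```python
import re

# Simplified pattern a naive harvesting spider might use
# (an assumption for illustration, not a real harvester's rule).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(page_text: str) -> list[str]:
    """Return every substring that looks like an email address."""
    return EMAIL_PATTERN.findall(page_text)


def obfuscate(address: str, replacement: str = "#") -> str:
    """Replace the "@" sign so the naive pattern no longer matches."""
    return address.replace("@", replacement)


# One address is posted as-is, the other with "@" replaced by "#".
page = "Contact alice@example.org or " + obfuscate("bob@example.org")
found = harvest(page)
print(found)  # only the unobfuscated address is harvested
```

Only the address posted with a literal "@" is picked up; the
obfuscated one stays readable to a person but is skipped by this
particular pattern.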
Currently, much work on spam email filtering has been done
using techniques such as decision trees, Naive Bayesian
classifiers, and neural networks. To address the growing
volume of unsolicited email, many different filtering methods
are being deployed in commercial products. We constructed a
framework for efficient email filtering using an ontology.
Ontologies provide machine-understandable semantics for data,
so they can be shared and reused across systems. Sharing this
information is important for more effective spam filtering.
Thus, it is necessary to build both an ontology and a
framework for efficient email filtering. With an ontology
designed specifically to filter spam, much unsolicited bulk
email could be filtered out by the system. We used the Waikato
Environment for Knowledge Analysis (Weka) Explorer and Jena
to build the ontology from a sample dataset.
Figure 1. SPONGY Architecture
Emails can be classified using different methods.
Different people or email agents may maintain their own personal
email classifiers and rules. The problem of spam filtering
is not a new one, and a dozen different approaches to it have
already been implemented. The problem has mostly been studied
in areas such as artificial intelligence and machine
learning. The implementations, based on techniques such as
decision trees, Naive Bayesian classifiers, and neural
networks, have various trade-offs, different performance
metrics, and different classification efficiencies.
Figure 1 shows our framework to filter spam. The training
dataset is the set of emails that gives us a classification
result. The test data is the email that runs through
our system, which we check to see whether it is classified
correctly as spam or not. This is an ongoing test process,
so the test data is not finite: because of the learning
procedure, the test data will sometimes merge with the
training data. The training dataset was used as input to J48
classification. To do that, the training dataset had to be
converted to a compatible input format. The J48 classification
procedure then produced a classification result.
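As an illustration of the format conversion step, the sketch below
writes feature rows as ARFF text, the file format Weka reads. It
assumes that ARFF was the compatible format in question, which the
text does not state. The feature names, the relation name, and the
spam/ham class labels are hypothetical.

```python
import io


def to_arff(relation, feature_names, rows):
    """Serialize numeric feature rows plus a class label as ARFF text.

    rows is a list of (features, label) pairs, where features is a
    list of numbers in the same order as feature_names.
    """
    out = io.StringIO()
    out.write(f"@relation {relation}\n\n")
    for name in feature_names:
        out.write(f"@attribute {name} numeric\n")
    # The class attribute is nominal; the label set is an assumption.
    out.write("@attribute class {spam,ham}\n\n@data\n")
    for features, label in rows:
        values = ",".join(str(v) for v in features)
        out.write(f"{values},{label}\n")
    return out.getvalue()


# Hypothetical features: frequency of the word "free" and of "$".
sample_rows = [([0.9, 0.2], "spam"), ([0.0, 0.0], "ham")]
arff_text = to_arff("emails", ["word_freq_free", "char_freq_dollar"], sample_rows)
print(arff_text)
```

A file with this content could then be loaded in the Weka Explorer
and passed to the J48 classifier.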
To query the test email in Jena, an ontology had to be created
based on the classification result. Creating an ontology
requires an ontology language, and RDF was used for this
purpose. The classification result, in the form of an RDF
file, was input to Jena and deployed through it, which
created the ontology. The ontology, generated
in the form of an RDF data model, is the base against which
incoming mail is checked for legitimacy. Depending on the
assertions that we can conclude from the outputs of Jena, the
email is labeled as spam or not. The test email is supplied
in a format that Jena will take in (i.e., CSV) and is run
through the ontology, which yields a result of spam or not
spam.
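One way to picture the mapping from a classification result to RDF
is to write each root-to-leaf path of the decision tree as a set of
triples. The sketch below emits N-Triples, one of the RDF
serializations Jena can read. The namespace, the property names
(hasLabel, hasCondition, onFeature, operator, threshold), and the
rule shape are all assumptions made for illustration; the text does
not describe the actual vocabulary used.

```python
# Hypothetical namespace and vocabulary, chosen for this sketch only.
NS = "http://example.org/spam#"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def rule_to_ntriples(rule_id, conditions, label):
    """Describe one decision-tree path as N-Triples lines.

    conditions is a list of (feature, op, threshold) tuples, where
    op is "le" (<=) or "gt" (>), matching the binary splits a
    decision tree makes on numeric attributes.
    """
    subject = f"<{NS}rule{rule_id}>"
    lines = [f'{subject} <{NS}hasLabel> "{label}" .']
    for i, (feature, op, threshold) in enumerate(conditions):
        cond = f"<{NS}rule{rule_id}_cond{i}>"
        lines.append(f"{subject} <{NS}hasCondition> {cond} .")
        lines.append(f"{cond} <{NS}onFeature> <{NS}{feature}> .")
        lines.append(f'{cond} <{NS}operator> "{op}" .')
        lines.append(f'{cond} <{NS}threshold> "{threshold}"^^<{XSD_DOUBLE}> .')
    return lines


# A single hypothetical path: word_freq_free > 0.5 leads to "spam".
triples = rule_to_ntriples(1, [("word_freq_free", "gt", 0.5)], "spam")
print("\n".join(triples))
```

Each tree path becomes one rule resource with a label and a list of
conditions, which keeps the RDF close to the structure of the tree.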
The SPONGY system periodically updates the dataset with the
emails classified as spam when a user submits a spam report.
The modified training dataset is then input to Weka to get a
new classification result. Based on that classification result,
we can get a new ontology, which can be used as a second spam
filter. Through this procedure, the number of ontologies
increases. Finally, the spam filtering ontology is
customized for each user. User-customized ontology filters
differ from each other depending on each user’s
background, preferences, hobbies, etc. That means one email might
be spam for person A but not for person B. The SPONGY system
provides an evolving spam filter based on the user’s
preferences, so the user can get better spam filtering results.
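The per-user update cycle can be sketched as a small piece of
bookkeeping: spam reports accumulate, are folded into that user's
training data, and each fold signals that a new classification
result and ontology should be built. The class and field names
below are hypothetical, and the retraining itself (Weka and Jena)
is left out.

```python
from dataclasses import dataclass, field


@dataclass
class UserFilterState:
    """Training data and pending spam reports for one user (hypothetical)."""
    training_rows: list = field(default_factory=list)    # (features, label)
    pending_reports: list = field(default_factory=list)  # not yet merged
    ontology_count: int = 0  # ontologies generated so far for this user

    def report_spam(self, features):
        """Record an email the user reported as spam."""
        self.pending_reports.append((features, "spam"))

    def update(self):
        """Merge pending reports into the training data.

        Returns True when the data changed, meaning a new
        classification result and ontology would be generated.
        """
        if not self.pending_reports:
            return False
        self.training_rows.extend(self.pending_reports)
        self.pending_reports.clear()
        self.ontology_count += 1
        return True


# Separate state per user: the same email may be spam for A only.
states = {"A": UserFilterState(), "B": UserFilterState()}
states["A"].report_spam([0.9, 0.2])
changed_a = states["A"].update()
changed_b = states["B"].update()
print(changed_a, changed_b)
```

Keeping one state object per user is what lets the filters diverge
according to each user's reports.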
The input to the system is mainly the training dataset and
then the test email. The test email is the first set of emails
that the system classifies and learns from; after a certain
time, the system takes a variety of emails as input to
be filtered as spam or not. The training dataset that we
used, which had classification values for the features on
which the decision tree classifies, is given first
to obtain the classification result. The classification results
need to be converted to an ontology. The decision result that
we obtained from J48 classification was mapped into an RDF
file. This was given as input to Jena, which then mapped the
ontology for us. This ontology enabled us to decide how the
different headers and the data inside the email are linked,
based on the frequencies of words or characters in the dataset.
The mapping also enabled us to obtain assertions about the
legitimacy or non-legitimacy of the emails. The next part
was using this ontology to decide whether a new email is
spam or not. This required querying the obtained ontology,
which was again done through Jena. The output obtained after
querying was the decision on whether the new email is spam or
not.
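The final decision step can be sketched in plain Python as a
stand-in for the Jena query: a new email's feature values are
checked against rules derived from the decision tree, and the first
rule whose conditions all hold gives the label. This is not the
Jena API; the rule shape, feature names, thresholds, and the
default label are assumptions for illustration.

```python
import operator

# "le" and "gt" mirror the binary splits of a decision tree.
OPS = {"le": operator.le, "gt": operator.gt}


def classify(email_features, rules, default="ham"):
    """Return the label of the first rule whose conditions all hold.

    email_features maps feature name to value; a missing feature
    is treated as 0.0. rules is a list of (conditions, label).
    """
    for conditions, label in rules:
        if all(
            OPS[op](email_features.get(feature, 0.0), threshold)
            for feature, op, threshold in conditions
        ):
            return label
    return default


# Hypothetical rules, one per root-to-leaf path of a small tree.
rules = [
    ([("word_freq_free", "gt", 0.5)], "spam"),
    ([("word_freq_free", "le", 0.5), ("char_freq_dollar", "gt", 0.1)], "spam"),
    ([("word_freq_free", "le", 0.5), ("char_freq_dollar", "le", 0.1)], "ham"),
]

print(classify({"word_freq_free": 0.9}, rules))
print(classify({"word_freq_free": 0.1, "char_freq_dollar": 0.0}, rules))
```

In the system described here, the same lookup would be expressed as
a query over the RDF model held in Jena.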
The primary way for the user to let the system know whether
an email is spam would be through a GUI or a command-line
input with a simple ‘yes’ or ‘no’. This would all be part of
a full-fledged working system, as opposed to our prototype,
which is a basic research model.