EXPERIMENT 3
Information Search Research & Design of ID3 algorithm rule-based anti-spam email filtering
What follows is a short report on the ID3 algorithm with respect to anti-spam email filtering
ID3 algorithm for anti-spam email filtering
Introduction: With the popularity of the Internet, email has become a common and important communication tool for computer users. However, one of the annoying side effects of email abuse is the proliferation of unsolicited commercial email. Unsolicited commercial email is defined as an electronic message in which (1) the recipient's personal identity and context are irrelevant; (2) the recipient has not verifiably granted deliberate, explicit, and still revocable permission for it to be sent; and (3) the transmission and reception of the message appears to the recipient to give a disproportionate benefit to the sender. Other names for such messages include unsolicited bulk mail, excessive multi-posting, bulk email, junk email, or spam. Conversely, normal email is referred to as ham. Internet users unavoidably incur the cost of time spent consulting, identifying, and deleting spam. This does not include the cost of potential damage caused by viruses, consumer fraud, and identity theft. In addition, spam causes costly productivity loss and wastes human resources on detecting and handling it. Anti-spamming has therefore become an important issue for Internet users. Early spam filtering tools rely on blacklists, whitelists, and hand-crafted rules that search for particular keywords, phrases, or suspicious patterns in the headers or message bodies.
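The early list-and-rule approach described above can be sketched in a few lines. This is only an illustration: the addresses, the keyword list, and the function name classify_by_rules are hypothetical and do not come from any particular filtering tool.

```python
# Hypothetical lists and keywords, chosen only for illustration.
BLACKLIST = {"offers@bulk-sender.example"}
WHITELIST = {"colleague@university.example"}
SPAM_KEYWORDS = ("free money", "act now", "winner")


def classify_by_rules(sender: str, subject: str, body: str) -> str:
    """Return "ham" or "spam" using list lookups first, then keyword rules."""
    sender = sender.strip().lower()
    if sender in WHITELIST:      # trusted senders always pass
        return "ham"
    if sender in BLACKLIST:      # known spammers are always rejected
        return "spam"
    text = f"{subject} {body}".lower()
    if any(keyword in text for keyword in SPAM_KEYWORDS):
        return "spam"            # a hand-crafted keyword rule matched
    return "ham"
```

Such rules are easy to write but have to be maintained by hand, which is the motivation for learning rules from data instead.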
Spam is commonly defined as unsolicited email messages, and the goal of spam categorization is to distinguish between spam and legitimate email messages. The economics of spam dictate that the spammer has to target many recipients with identical or similar email messages. As a result, an effective defense based on dynamic knowledge sharing has to be designed against a substantial fraction of spam, one that alleviates the burden of frequently training a stand-alone spam filter. A weighted, email-attribute-based classification is proposed to address these issues in the normal email system. This type of classification makes more effective use of the email system by combining the concepts of the Bayesian Spam Filtering Algorithm, the Iterative Dichotomiser 3 (ID3) Algorithm, and the Bloom Filter. The details captured by the system are processed to track the original senders causing disturbances and to block further mail from them. We have tested the effectiveness of our scheme by collecting offline data from Yahoo Mail and Gmail dumps. This proposal is implemented using .NET, with a sample user ID for the knowledge base.
Email: Electronic mail, commonly called email or e-mail, is a method of exchanging digital messages across the Internet or other computer networks. Originally, email was transmitted directly from one user's computer to another's. This required both computers to be online at the same time, much like instant messaging. Today's email systems are based on a store-and-forward model. Email servers accept, forward, deliver, and store messages. Users no longer need to be online simultaneously; they only need to connect briefly, typically to an email server, for as long as it takes to send or receive messages. An email message consists of two components: the message header and the message body, which is the email's content. The message header contains control information, including, at minimum, an originator's email address and one or more recipient addresses. The body contains the message itself as unstructured text, sometimes with a signature block at the end. This is exactly the same as the body of a regular letter. The header is separated from the body by a blank line.
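The header and body split described above can be illustrated with Python's standard email package. The message text, the addresses, and the helper name split_message are made up for this sketch.

```python
from email import message_from_string

# A made-up message: header lines, a blank line, then the body.
RAW_MESSAGE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lab report\n"
    "\n"
    "Hi Bob,\n"
    "The report is attached.\n"
)


def split_message(raw: str):
    """Parse a raw message and return (headers as a dict, body text)."""
    msg = message_from_string(raw)
    headers = dict(msg.items())   # control information such as From and To
    body = msg.get_payload()      # unstructured text after the blank line
    return headers, body


HEADERS, BODY = split_message(RAW_MESSAGE)
```

Header fields extracted this way are the kind of attributes a rule-based filter can inspect.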
Spamming Behaviour: The objective of sending spam is to sell products or services to potential customers on the Internet. For this purpose, spam is dispatched massively and repeatedly in order to reach as many potential customers as possible. However, in order not to be detected, spam is elaborately disguised as ham. The disguising tricks usually share a set of common characteristics. A spamming behavior is a method that spammers use for composing or delivering spam for specific purposes, such as the ones mentioned above. In most cases, hams are composed and delivered normally, and the values of all fields in the headers and syslogs are given correctly. Conversely, the headers and syslogs associated with spam may contain inconsistent or abnormal information, such as the boxed parts in Figure 1, which reveal the existence of spamming behaviors. The concept of spamming behaviors has been presented previously, with the claim that such behaviors can be used for identifying spam. Based on this concept, we apply induction-based machine learning techniques to construct classification rules for spam filtering.
ID3 Algorithm: The ID3 algorithm is an example of symbolic learning and rule induction. It is a supervised learner, which means it makes its decisions by looking at examples in a training data set. It was developed by J. Ross Quinlan in 1979. It builds a decision tree based on mathematical calculations of entropy and gain.
1. Decision Trees:
A decision tree classifies data using its attributes. It is drawn upside down, with the root at the top. The tree has decision nodes and leaf nodes. In Fig 3, the “linkFromAcademias” attribute is a decision node and the “author” attribute is a leaf node. A leaf node holds homogeneous data, which means further classification is not necessary. The ID3 algorithm keeps building such a tree until all the leaf nodes are homogeneous.
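As a rough illustration of decision nodes and leaf nodes, a tree can be represented as nested dictionaries and walked from the root down to a leaf. The attribute names sender_known and has_link and the tree itself are hypothetical; they are not taken from the figures in this report.

```python
# Decision nodes are dicts: {"attribute": name, "branches": {value: subtree}}.
# Leaf nodes are plain strings holding the class label.
TREE = {
    "attribute": "sender_known",
    "branches": {
        "yes": "ham",
        "no": {
            "attribute": "has_link",
            "branches": {"yes": "spam", "no": "ham"},
        },
    },
}


def classify(tree, example):
    """Follow the branch matching each attribute value until a leaf is reached."""
    node = tree
    while isinstance(node, dict):          # still at a decision node
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node                            # a leaf: the predicted class


# Example: an unknown sender whose message contains a link.
PREDICTION = classify(TREE, {"sender_known": "no", "has_link": "yes"})
```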
2. Training Data and Set: The ID3 algorithm is a supervised learner. It needs training data sets to make decisions. The training set lists the attributes and their possible values. ID3 doesn't deal with continuous, numeric data, which means we have to discretize it. An attribute such as age, which can take values from 1 to 100, is instead listed as young or old.

Attributes    Values
Age           Young, Old
Height        Tall, Short
Employed      Yes, No

The training data is the list of data containing actual values.
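A minimal sketch of the discretization step, assuming a cut-off of 40 years between Young and Old. The threshold, the sample records, and the function name discretize_age are assumptions made for illustration only.

```python
def discretize_age(age: int, threshold: int = 40) -> str:
    """Map a numeric age onto the discrete values used in the training set."""
    return "Young" if age < threshold else "Old"


# Hypothetical raw records with a continuous age value.
RAW_RECORDS = [
    {"age": 23, "height": "Tall", "employed": "Yes"},
    {"age": 67, "height": "Short", "employed": "No"},
]

# Training data: the same records with age replaced by its discrete value.
TRAINING_DATA = [
    {**record, "age": discretize_age(record["age"])} for record in RAW_RECORDS
]
```

Where the threshold is placed affects the tree ID3 builds, so in practice it would be chosen from the data rather than fixed by hand.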
3. Entropy:
Entropy refers to the randomness of the data. For two classes it ranges from 0 to 1. A data set with entropy 1 is as random as it can get. A data set with entropy 0 is homogeneous. In Fig 6, the root of the tree has a collection of data. It has high entropy, which means the data is random. The set of data is eventually divided into subsets 3, 4, 5 and 6, where it is now homogeneous and the entropy is 0 or close to 0. Entropy is calculated by the formula: E(S) = -(p+) * log2(p+) - (p-) * log2(p-)
“S” represents the set, “p+” is the proportion of elements in the set “S” with positive values, and “p-” is the proportion of elements with negative values. The purpose of the ID3 algorithm is to classify data using decision trees such that the resulting leaf nodes are all homogeneous, with zero entropy.
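The entropy formula can be checked with a short function. This is a sketch: it takes a list of class labels, computes the class proportions itself, and generalizes the two-class formula to any number of classes. The name entropy and the spam/ham labels are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """E(S) = -sum of p * log2(p) over the class proportions p in S."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for count in Counter(labels).values():
        p = count / total          # proportion of this class in the set
        result -= p * log2(p)
    return result


MIXED = entropy(["spam", "ham"])     # evenly split: as random as it can get
PURE = entropy(["ham", "ham", "ham"])  # homogeneous set
```

An evenly split two-class set gives entropy 1, and a homogeneous set gives entropy 0, matching the description above.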
4. Gain: In decision trees, nodes are created by singling out an attribute. ID3's aim is to create leaf nodes with homogeneous data. That means it has to choose the attribute that best fulfils this requirement. ID3 calculates the “Gain” of each individual attribute. The attribute with the highest gain results in nodes with the smallest entropy. To calculate Gain we use: Gain(S, A) = Entropy(S) - Σ ((|Sv| / |S|) * Entropy(Sv)), where the sum runs over every value 'v' of attribute 'A'. In the formula, 'S' is the set and 'A' is the attribute. 'Sv' is the subset of 'S' where attribute 'A' has value 'v'. '|S|' is the number of elements in set 'S' and '|Sv|' is the number of elements in subset 'Sv'. ID3 chooses the attribute with the highest gain to create nodes in the decision tree. If the resulting subsets do not have entropy equal to or close to zero, it chooses one of the remaining attributes to create further nodes, until all the subsets are homogeneous.
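The gain calculation and the resulting tree-building loop can be sketched as follows. This is a minimal illustration under stated assumptions, not a full implementation: the four training rows and the attribute names has_link and sender_known are hypothetical, and the majority-label fallback used when no attributes remain is a common convention that the text above does not specify.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return base - remainder


def id3(rows, attributes, target):
    """Build a tree of nested dicts; leaves are class labels."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:      # homogeneous subset: make a leaf
        return labels[0]
    if not attributes:             # assumption: fall back to the majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, remaining, target)
    return {"attribute": best, "branches": branches}


# Hypothetical training rows, for illustration only.
ROWS = [
    {"has_link": "yes", "sender_known": "no", "label": "spam"},
    {"has_link": "yes", "sender_known": "yes", "label": "ham"},
    {"has_link": "no", "sender_known": "no", "label": "spam"},
    {"has_link": "no", "sender_known": "yes", "label": "ham"},
]
TREE = id3(ROWS, ["has_link", "sender_known"], "label")
```

In these rows sender_known separates the classes completely while has_link does not separate them at all, so ID3 splits on sender_known and stops after one level.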
Weaknesses of ID3 Algorithm: ID3 uses training data sets to make decisions. This means it relies entirely on the training data, which is supplied by the programmer. Whatever is in the training data is its base knowledge. Any adulteration of the training data will result in wrong classification. It cannot handle continuous data such as numeric values, so the values of the attributes need to be discrete. At each node it considers only the single attribute with the highest gain. It doesn't consider other attributes with less gain. It also doesn't backtrack to revisit its nodes, which is why it is called a greedy algorithm. Because of this greedy strategy, it tends to produce shorter trees. Sometimes we might need to consider two attributes at once as a combination, but ID3 does not support this. For example, in a bank loan application we might need to consider attributes like age and earnings together. Young applicants with lower earnings may have better chances of promotion and higher pay, which would result in a higher credit rating.
EXPERIMENTAL RESULTS AND PERFORMANCE EVALUATION: The spambase dataset is downloaded from the UCI machine learning repository [23] in the form of a text file. This dataset contains 57 input attributes in continuous format and 1 target attribute in discrete format. Feature construction is then done for feature transformation. Since the training dataset contains continuous input attributes and a discrete target attribute, four feature selection algorithms, namely Fisher filtering, ReliefF, Runs filtering, and Step disc, are executed on this dataset to retrieve relevant features. Classification algorithms such as Naive Bayes continuous, ID3, K-NN, multilayer perceptron, CSVC, linear discriminant analysis, CS-MC4, Rnd tree, PLS-LDA, PLS-DA, etc. are applied to the features selected by each of the above filtering algorithms. Fisher filtering produces above 95% accuracy for 3 classifiers (the C4.5, CS-MC4, and Rnd tree classification algorithms), above 90% accuracy for 8 classifiers, and above 85% for 6 classifiers. ReliefF filtering produces above 95% accuracy for only 1 classifier (Rnd tree), above 90% accuracy for 6 classifiers, above 85% for 6 classifiers, and above 80% for 4 classifiers. Runs filtering and stepwise discriminant analysis provide the best result for 2 classifiers (C4.5 and CS-MC4), above 90% for 10 classifiers, and above 85% accuracy for 5 classifiers. The Runs filtering and Step disc feature selection algorithms provide almost the same result. From the results, Rnd tree classification is considered the best classifier, as it produced 99% accuracy with Fisher filtering feature selection.
Conclusion: Email spam classification has received tremendous attention because it helps to identify unwanted information and threats. Many researchers have therefore focused on finding the best classifier for detecting spam emails. From the obtained results, the Fisher filtering and Runs filtering feature selection algorithms give better classification for many classifiers. The Rnd tree classification algorithm, applied to the relevant features after Fisher filtering, produced more than 99% accuracy in spam detection. This Rnd tree classifier was also tested with the test dataset, where it gives more accurate results than the other classifiers for this spam dataset.