AN EFFECTIVE APPROACH FOR SMS SPAM FILTERING
Venkat Nischala Avirineni, Student ID: 17481794, Department of Computer Science, vavirineni@latrobe.edu.au
ABSTRACT:
Cell phones, texting and email have become the most important means of exchanging information. Spam is a serious problem for all of these channels. It consists of junk SMS or email messages sent in bulk to many users, usually to advertise products, and such messages often carry viruses, Trojan horses etc. Email spam is comparatively easy to detect, and various techniques exist for filtering it. SMS spam is a bigger problem because an SMS is very short, which makes spam very difficult to detect. This report proposes a new approach that can be applied to filter SMS spam.
KEYWORDS:
Pattern Classification: Spam messages exhibiting similar properties are classified under the same pattern.
SMS Spam: Bulk messages sent to innocent users through SMS.
Legitimate message: A message which is a normal message but is misunderstood as spam.
Tokenization: Dividing a sequence of text into tokens at special characters.
INTRODUCTION:
This report gives a detailed understanding of spam filtering and the techniques used in it. Spam refers to unwanted bulk messages generated online. Information sharing through phones, SMS, email, social networking etc. has become the most important means of communication. Today almost every organization is web based and shares knowledge through emails and text messages, which require a high level of confidentiality and security to protect the organization's competitive advantage. Spam puts this valuable information at risk.

To address this problem, spam filtering techniques are employed. Spam filtering provides mechanisms that detect spam messages or mails and protect the information and the system from being corrupted. Various methods have been proposed to implement spam filtering effectively, and the type of technique depends on the way in which the spam is delivered, i.e., through mails or messages. Examples include context based spam filtering and Bayesian spam filtering. These techniques rely on spam filters, which detect spam and prevent it from reaching users. Spam filters assign priorities to mails or messages depending on the user's requirements and previous data, and they detect spam using previously classified datasets.

SMS spam is a serious threat to mobile networks because it endangers the confidential information received through SMS. Spam filtering techniques are used to keep this problem from affecting an organization's secured information. A spam filter is usually a piece of code that has to be installed as part of the texting program. Different spam filters work in different ways depending on the technique chosen.
CURRENT RESEARCH WORK:

SMS spam filtering has become a major field of research. SMS spam messages do not have the same framework or format as email spam, since an SMS is very short, with a limit of 160 characters. An SMS is usually composed of alphanumeric characters, symbols, abbreviations etc. It is very difficult to detect spam in such messages because what counts as spam can differ between user communities. Various experiments have been conducted to detect SMS spam. Some of them are:

Experiment 1: This experiment was conducted to provide good baseline results for comparison, since the performance of existing email spam filters degrades on short text messages. Text messages consist of characters structured in different ways in different messages, which may affect a filter's accuracy. To carry on with this work, two different tokenizations were defined:

Tokenization 1: Tokens start with a printable character followed by any number of alphanumeric characters, excluding special characters from the middle of the pattern. As a result, domain names and mail addresses are split at dots. With this technique the classifier can easily recognize a domain even if some content in its subdomain is varied. Example: she is doing her assignment. This statement does not contain any special characters within its words, so she, is, doing, her and assignment are all considered tokens.

Tokenization 2: Tokens are any sequence of characters separated by special characters such as dots, commas etc. This set of tokens helps in differentiating spam from non-spam messages. Example: type@message,delivered.

This categorization establishes a baseline for the ham class, and the leftover messages are spam messages. This is the filtering of spam messages. Gain score is a measure used to detect spam: the tokens with the highest gain score are filtered, so that only the ham class is left.
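The two tokenizations from Experiment 1 can be sketched in Python as follows. This is a minimal illustration under assumptions, not the implementation used in the experiment: the function names tokenize_1 and tokenize_2 are hypothetical, and the separator set (whitespace, dots, commas and colons) is an assumed reading of "special characters such as dots, commas etc."

```python
import re

# Assumed separator set: whitespace, dot, comma, colon.
# Tokenization 1: one printable non-separator character, then any run of
# alphanumeric characters. Separators themselves are skipped.
_TOK1 = re.compile(r"[^\s.,:][A-Za-z0-9]*")

# Tokenization 2: split only at the separators, keep everything else together.
_TOK2_SEPARATORS = re.compile(r"[\s.,:]+")


def tokenize_1(message):
    """Return tokens made of a leading printable character plus alphanumerics."""
    return _TOK1.findall(message)


def tokenize_2(message):
    """Return the character sequences found between separators."""
    return [token for token in _TOK2_SEPARATORS.split(message) if token]


if __name__ == "__main__":
    for text in ("she is doing her assignment.", "type@message,delivered"):
        print(tokenize_1(text), tokenize_2(text))
```

Under these assumptions the first tokenization breaks type@message into type and @message, while the second keeps it as one token, which is the kind of difference the experiment compares.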
Experiment 2: When the SMS spam filtering program is running on the device, it records a huge number of normal and spam messages, which are stored in the system log. This log can be analysed to learn the time and domain features of spam and normal messages, from which the actual behaviour of spam messages can be found. The time and frequency domain in which spam messages occur is called the frequent time-domain area (FTDA). In this experiment a given day is divided into time slices and analysed for the slices in which the frequency of spam messages is high, so that the SMS spam filtering technique can take the FTDA into account. Experiments were conducted incorporating this technique to classify training data and testing data for normal and spam messages, in order to enlarge the datasets.

Precision is the ratio of the messages correctly classified as spam (or normal) to all the messages classified as spam (or normal). Recall is the ratio of the messages correctly classified as spam (or normal) to all the real spam (or normal) messages.

A few algorithms can be used to separate spam messages from normal messages. These algorithms are machine learning based spam filters. Some of these algorithms are:

C4.5: This algorithm uses decision trees and produces average results on text classification. The advantage of this technique is that it is simple and easily understood.
Support Vector Machines: This algorithm produces vectors that try to separate the spam class from the ham class. Its major advantage is that it shows excellent results in text classification. Its major limitation is that its results are difficult to analyse.

Bayesian filtering is more appropriate for SMS spam filtering because it is very accurate and immediate. Content based spam filters built by applying machine learning algorithms to pre-classified messages are called Bayesian filters. They use a probability based approach, which allows the filter to learn changes in spam, and they are the base for many filters. Bayesian filters automatically induce, or learn, a spam classifier from a set of manually classified examples of spam and legitimate messages. The learning process takes the training collection as input and consists of the following steps:

PREPROCESSING: deletion of irrelevant elements and selection of the segments suitable for processing.
TOKENIZATION: dividing the message into semantically coherent segments.
REPRESENTATION: conversion of a message into a vector of attribute-value pairs, where the attributes are the previously defined tokens and their values can be binary, frequencies etc.
SELECTION: statistical deletion of less predictive attributes.
LEARNING: automatically building a classification model from the collection of messages, as they have been previously represented. The shape of the classifier depends on the learning algorithm used, ranging from decision trees or classification rules to statistical linear models, neural networks, genetic algorithms etc.

Each new target message is pre-processed, tokenized, represented and fed into the classifier in order to take a classification decision on it. The limitations of this method are:
a. Words that appear in large quantities in spam can be transformed by spammers, and Bayesian filters can hardly cope with this.
b. Some of the words in a text can be replaced by small pictures, which are difficult for filters to recognize.

These limitations can be overcome by having a separate training dataset for SMS messages, which would be a proper collection of spam messages, and by employing a technique to add new spam to the ongoing process as required. This approach should have a better tokenization technique to detect unusual messages containing pictures or unusual text formats. It should have an analysis stage where the spam level of an SMS can be decided. It should be able to incorporate user feedback into its system. Conducting these experiments directed the research on spam filtering into a new dimension.
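The Bayesian learning steps listed in this section can be sketched with a small naive Bayes classifier in Python. This is only an illustration under assumptions: the class name NaiveBayesFilter, the toy messages, the add-one smoothing and the simple frequency cutoff used for the SELECTION step are all hypothetical choices, not taken from the cited experiments. The add_example method shows one way new spam or user feedback could be added to the ongoing process.

```python
import math
import re
from collections import Counter

# Hypothetical toy training collection (manually labelled examples).
TOY_MESSAGES = [
    ("Win a free prize now, call 0800", "spam"),
    ("Free entry: claim your prize", "spam"),
    ("Are we still meeting for lunch", "ham"),
    ("Call me when you get home", "ham"),
]


def tokenize(message):
    # PREPROCESSING + TOKENIZATION: lowercase, split at whitespace and punctuation.
    return [t for t in re.split(r"[\s.,:!?]+", message.lower()) if t]


class NaiveBayesFilter:
    def __init__(self, min_count=1):
        self.min_count = min_count  # SELECTION cutoff (assumed, frequency based)
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}
        self.vocabulary = set()

    def train(self, labelled_messages):
        # REPRESENTATION: each class keeps token frequencies.
        for text, label in labelled_messages:
            self.message_counts[label] += 1
            self.token_counts[label].update(tokenize(text))
        # SELECTION: keep only tokens seen at least min_count times overall.
        totals = self.token_counts["spam"] + self.token_counts["ham"]
        self.vocabulary = {t for t, n in totals.items() if n >= self.min_count}

    def add_example(self, text, label):
        # New spam or user feedback is folded into the existing counts.
        self.train([(text, label)])

    def log_score(self, text, label):
        # LEARNING result: log prior plus log likelihood, both add-one smoothed.
        total = sum(self.message_counts.values())
        score = math.log((self.message_counts[label] + 1) / (total + 2))
        counts = self.token_counts[label]
        denominator = sum(counts[t] for t in self.vocabulary) + len(self.vocabulary)
        for token in tokenize(text):
            if token in self.vocabulary:
                score += math.log((counts[token] + 1) / denominator)
        return score

    def classify(self, text):
        # "ham" is listed first so that an exact tie is delivered, not filtered.
        return max(("ham", "spam"), key=lambda label: self.log_score(text, label))


if __name__ == "__main__":
    spam_filter = NaiveBayesFilter()
    spam_filter.train(TOY_MESSAGES)
    print(spam_filter.classify("claim your free prize"))
    print(spam_filter.classify("see you at lunch"))
```

A sketch like this does not address the two limitations above: a spammer who respells frequent spam words, or replaces them with pictures, produces tokens the vocabulary has never seen.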
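The time-slice analysis of Experiment 2 and the precision and recall measures can also be sketched in Python. The log format (a list of timestamp and is-spam pairs), the function names, the four-hour slice length and the 50% share threshold are all assumptions made for illustration; the source does not state how slices or thresholds were chosen.

```python
from collections import Counter
from datetime import datetime


def spam_counts_by_slice(log, slice_hours=4):
    """Count spam messages per time slice; slice 0 covers hours 0 to slice_hours-1."""
    counts = Counter()
    for timestamp, is_spam in log:
        if is_spam:
            counts[timestamp.hour // slice_hours] += 1
    return counts


def frequent_slices(log, slice_hours=4, min_share=0.5):
    """Return the slices holding at least min_share of all logged spam."""
    counts = spam_counts_by_slice(log, slice_hours)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(s for s, n in counts.items() if n / total >= min_share)


def precision_recall(predicted, actual):
    """Precision and recall for the spam class, from parallel lists of booleans."""
    true_positives = sum(1 for p, a in zip(predicted, actual) if p and a)
    predicted_spam = sum(1 for p in predicted if p)
    actual_spam = sum(1 for a in actual if a)
    precision = true_positives / predicted_spam if predicted_spam else 0.0
    recall = true_positives / actual_spam if actual_spam else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical system log: (time received, was the message spam?)
    example_log = [
        (datetime(2024, 1, 1, 2, 15), True),
        (datetime(2024, 1, 1, 3, 40), True),
        (datetime(2024, 1, 1, 14, 5), True),
        (datetime(2024, 1, 1, 9, 0), False),
    ]
    print(frequent_slices(example_log))
    print(precision_recall([True, True, False, False], [True, False, True, False]))
```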
OPINION:
Electronic gadgets play a key role in information exchange nowadays. Some of the common means are SMS, email, social networking sites etc. Organizations, students and people in every sector prefer to use SMS as a means of communication for information exchange. For example, in the banking sector SMS is the preferred way to confirm a transaction to a client. So SMS is playing a key role in present society. SMS can be sent through any means, such as the web, mobile phones etc. Spam is acting as a big monster for SMS users, because spammers send bulk messages for advertising. Various spam filtering techniques have been employed to protect SMS from spam. Since SMS is now most commonly used on smart phones, I suggest that a mobile application should be developed which can be supported in every environment. The mobile application should consist of different logic layers providing different functionality, and its operational process should be as follows:

1. The first layer should include a database consisting of the training data (i.e., previously recorded spam). Whenever a new message is received, the application should check the database and compare the message with previously classified spam. If at least a sub-pattern of it matches, it should delete the message automatically and notify the user that a spam message has been detected and deleted.
2. The next layer should be an analysis layer, where the incoming message is analysed by tokenizing the text according to predefined formats.
3. The next layer should be a security layer, where every incoming SMS is checked for its source. The security layer should check whether the incoming SMS is from a trusted site or user. It should include encryption/decryption mechanisms; if a message cannot be decrypted, it should be marked as spam. The security layer should check for the digital signature of the incoming SMS source if it is not from the contact list. It can include SSL protocol level security, for example the one provided by a company called Message media, which offers various SMS services to clients with SSL protocol security.

If the incoming message is from an unknown user, i.e., not from the contact list, the application should ask the user whether he/she wants to read the message. If not, the message should be marked as spam and stored as a pattern in the database layer. Organizing the different functionalities into layers would avoid confusion between filters. The application should have a feature to update its database with the latest spam suggested by the network provider. It should be flexible, reliable, cost efficient and time efficient.
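The flow through the proposed layers can be sketched in Python as follows. This is a rough illustration under assumptions, not a design for the actual application: the class name LayeredSpamFilter, the ask_user callback and the returned status strings are hypothetical, the stored pattern is simply the tokenized text, and the encryption, digital signature and SSL checks of the security layer are left out.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LayeredSpamFilter:
    contacts: set = field(default_factory=set)       # trusted senders
    spam_patterns: set = field(default_factory=set)  # previously recorded spam

    def database_layer(self, text):
        # Layer 1: does any stored spam pattern occur inside the message?
        lowered = text.lower()
        return any(pattern in lowered for pattern in self.spam_patterns)

    def analysis_layer(self, text):
        # Layer 2: tokenize the message (assumed separators: space and punctuation).
        return [t for t in re.split(r"[\s.,:!?]+", text.lower()) if t]

    def security_layer(self, sender):
        # Layer 3 (simplified): only the contact list is checked here.
        return sender in self.contacts

    def handle(self, sender, text, ask_user):
        """Return 'deleted', 'delivered' or 'spam' for one incoming SMS."""
        if self.database_layer(text):
            return "deleted"
        tokens = self.analysis_layer(text)
        if self.security_layer(sender):
            return "delivered"
        # Unknown sender: let the user decide whether to read the message.
        if ask_user(sender, text):
            return "delivered"
        # Declined messages are stored as a new pattern in the database layer.
        self.spam_patterns.add(" ".join(tokens))
        return "spam"


if __name__ == "__main__":
    app = LayeredSpamFilter(contacts={"+61400000001"}, spam_patterns={"free prize"})
    decline = lambda sender, text: False
    print(app.handle("+61400000001", "Lunch at noon?", decline))
    print(app.handle("+61400000009", "Cheap loans today", decline))
    print(app.handle("+61400000009", "cheap loans today!!", decline))
```

In this sketch a message the user declines once is deleted automatically the next time it arrives, which mirrors the idea of storing rejected messages as patterns in the database layer.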
JUSTIFICATION:
The reason for proposing this solution is that smart phone usage has become quite common nowadays. Users prefer installing an application on their smart phones to installing software manually. Being a smart phone user myself, I would prefer installing an application to avoid spam over installing a separate filter in my texting program. It is easy to operate, since the application alerts the user about spam and asks what to do with a message from an unknown user.
CONCLUSION:
This article is about SMS spam filtering techniques. It details the use of filtering to reduce SMS spam. It covers a few experiments conducted by researchers, such as tokenizing the incoming SMS and detecting spam with a time domain approach, and various algorithms such as SVM and the Bayesian approach. Finally, it proposes developing an application, and describes its operation, as a solution to reduce spam.
REFERENCES:
1. Contributions to the Study of SMS Spam Filtering: New Collection and Results - Tiago A. Almeida, José María Gómez Hidalgo and Akebo Yamakami
2. SMSAssassin: Crowd sourcing Driven Mobile-based System for SMS Spam Filtering - Kuldeep Yadav, Ponnurangam Kumaraguru, Atul Goyal, Ashish Gupta and Vinayak Naik
3. The Impact of Feature Extraction and Selection on SMS Spam Filtering - Alper Kursat Uysal, Serkan Gunal, Semih Ergin, Efnan Sora Gunal
4. SMS Spam Filtering Technique Based on Artificial Immune System - Tarek M Mahmoud and Ahmed M Mahfouz, Computer Science Department, Faculty of Science, Minia University, El Minia, Egypt
5. SMS spam filtering: Methods and data - Sarah Jane Delany, Mark Buckley, Derek Greene
6. Sampling Of MassSMS Filtering Algorithm Based On Frequent Time-Domain Area - XIA Hu, FU Yan
7. T. A. Almeida, J. Almeida, and A. Yamakami. Spam Filtering: How the Dimensionality Reduction Affects the Accuracy of Naive Bayes Classifiers. JISA, 1(3):183-200, 2011.
8. Cormack et al., Spam filtering for short messages, Proceedings of the sixteenth ACM Conference on Information and Knowledge Management, November 06-10, 2007, Lisbon, Portugal.
9. Mobile SMS Marketing (December 2010), available: http://www.mobilesmsmarketing.com/live_examples.php
10. Domingos, P. 1999. Metacost: A general method for making classifiers cost-sensitive. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining.
