Review Paper for Spam Detection URL and Image Spam Filtering Using Machine Learning

Categories: Artificial Intelligence

Published: Jul 30, 2019

Introduction
Related works
Proposed Idea

Idea 1: Pseudo-OCR for Image Spam Filtering

Idea 2: Key-Point Based Character Feature

Idea 3: Image spam filtering

Idea 4: Detecting Spam URL using SVM algorithm

Conclusion

The increasing volume of malicious content in social media requires automated methods to detect and eliminate such content. This paper describes a supervised machine learning classification model built to detect the distribution of malicious content in online social networks (OSNs). Multisource features are used to detect social network posts that contain malicious Uniform Resource Locators (URLs). These URLs can direct users to websites that host malicious content, drive-by download attacks, phishing, spam, and scams. For the data collection stage, the Twitter streaming application programming interface (API) was used, and VirusTotal was used to label the dataset. A random forest classification model was used with a combination of features derived from a range of sources.

Sending fraudulent emails is a criminal scheme to obtain a user’s personal data, login credentials, and other confidential information. Known as phishing, it acquires users’ private information such as passwords, bank account details, credit card numbers, and financial usernames, which the attacker can later misuse. We aim to use fundamental visual features of a web page’s appearance as the basis for detecting page similarities, and we propose a novel solution to efficiently detect phishing web pages. Page layouts and contents are fundamental features of a web page’s appearance. Since the standard way to specify page layouts is through Cascading Style Sheets (CSS), we develop an algorithm to detect similarities in key elements related to CSS.

In this paper, we propose a system that uses the SVM technique along with image spam filtering and the MapReduce paradigm to achieve higher accuracy in the detection of spam URLs and image spam. After further investigation, and after applying parameter tuning and feature selection methods, we were able to improve the classifier performance.

Introduction

The main challenges for social network security administrators are not only protecting the social network management system and database, but also protecting online social network (OSN) users from being exposed to malicious content that is spread over those social networks. Sixty percent of social network users have received or been exposed to malicious content such as spam, scams, and drive-by downloads. A number of OSNs are now developing malicious content detection systems for such attacks; for example, the Facebook Immune System detects suspicious activities such as like-jacking, social bots, and fake content.

A malicious URL is a URL created for malicious purposes, such as downloading malware to the affected computer. Such URLs can be contained in spam or phishing messages, and can even improve their position in search engines by using black hat SEO techniques.

The Smart Malicious URL Detection System is an anti-phishing technique to safeguard our web experiences. Alongside Chinese image spam filtering, our approach uses the lexical features, host-based features, and site popularity features of a website to detect any suspicious or phishing website. These features are obtained from the source code by taking the URL as input, and they are then fed to the classifier algorithm. The results obtained from our experiment show that our proposed methodology is very effective at preventing such attacks; the performance of all the classifiers was measured using a confusion matrix.
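To make the confusion-matrix evaluation concrete, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names `confusion_matrix` and `summary_metrics` and the toy labels are assumptions chosen for illustration, with 1 meaning malicious and 0 meaning benign.

```python
from typing import Dict, Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (tp, fp, tn, fn) for binary labels, where 1 = malicious and 0 = benign."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1          # malicious, flagged
        elif not truth and pred:
            fp += 1          # benign, wrongly flagged
        elif truth and not pred:
            fn += 1          # malicious, missed
        else:
            tn += 1          # benign, passed
    return tp, fp, tn, fn


def summary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Derive precision, recall, false positive rate and F-measure from the counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "false_positive_rate": fpr, "f_measure": f_measure}


# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_matrix(y_true, y_pred)
metrics = summary_metrics(*counts)
```

The same four counts yield every metric, which is why a single confusion matrix per classifier is enough to compare several classifiers side by side.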

Related works

The majority of studies in this area aim to find the most predictive features that they can acquire and the best algorithm with which to develop a classifier model. Researchers in this field focus mainly on finding novel features with high discriminative power, in addition to coming up with the most accurate machine learning model. Finding highly discriminative features in the area of Internet security and social networks is quite a challenge because of the variation in attacks and techniques used by spammers. Because of the inventiveness of spammers, detection systems are bypassed after some time, and the set of features used for spam detection has to be revised regularly. Just as security researchers study attacks, spammers and hackers investigate detection systems; they can therefore change user properties, content, or the distribution mechanism to bypass certain restrictions or detection rules. For example, a study on detecting spam on Twitter reported that the number of followers is one of the features with the highest discriminative power. That feature’s discriminative power has been increasingly weakened, though, by spammers making their accounts more popular. They do this by conducting spam campaigns in which their “fake” accounts connect with other fake accounts, increasing the follower and following numbers.

Burnap et al. used an entirely different method to detect malicious URLs. They deployed a high-interaction honeynet to collect system state changes, such as the sending and receiving of packets and CPU usage. The training dataset contained 2,000 examples with a 1:1 ratio of spam to non-spam. Ten attributes were used to build a classifier that reflected system status changes after opening the tweet's URL. Burnap et al. investigated the shortest time required to give a preliminary warning of the existence of malicious content in a particular URL. The best result was reported for a Multilayer Perceptron (MLP) using features acquired after 210 seconds (an F-measure of 0.723). The features used by Burnap et al. require complex data analysis; however, they make it difficult for spammer sites to disguise their true nature. Although the recent literature has compared several algorithms, there is a lack of information about important stages in building a machine learning model. In particular, little information is provided about how feature selection methods are managed and how parameter tuning is conducted. We address this issue in section IV.

Another paper introduces a method that combines a fingerprint technique with big data processing to detect spam emails. Support Vector Machine (SVM) is the machine learning technique used for spam filtering. Because SVM training is a very expensive process, the MapReduce platform was used for spam filter training. A further paper uses content-based spam filtering, in which an email is classified as spam or ham based on the data present in the content of the mail, so the header section is ignored. That paper specifically compares implementations of the Fisher-Robinson inverse chi-square function, the AdaBoost classifier, and the KNN classifier.
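For readers unfamiliar with the Fisher-Robinson inverse chi-square function named above, the following is a minimal sketch of the commonly described chi-square combining rule. It is not taken from the reviewed paper: the function names `chi2q` and `combine_token_probabilities`, the clamping constant, and the example probabilities are all assumptions. Each input is the estimated probability that a single token indicates spam.

```python
import math
from typing import Sequence


def chi2q(x2: float, dof: int) -> float:
    """Probability that a chi-square variable with an even number of
    degrees of freedom exceeds x2 (the "inverse chi-square" function)."""
    if dof % 2 != 0:
        raise ValueError("dof must be even")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine_token_probabilities(probs: Sequence[float]) -> float:
    """Combine per-token spam probabilities into one score in [0, 1].
    Scores near 1 suggest spam, near 0 suggest ham, near 0.5 are uncertain."""
    if not probs:
        return 0.5
    # Clamp away from 0 and 1 so the logarithms stay finite (an assumption).
    clamped = [min(max(p, 1e-9), 1.0 - 1e-9) for p in probs]
    n = len(clamped)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in clamped), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in clamped), 2 * n)
    return (1.0 + spamminess - hamminess) / 2.0


# Hypothetical token probabilities, for illustration only.
spam_like = combine_token_probabilities([0.99] * 5)
ham_like = combine_token_probabilities([0.01] * 5)
neutral = combine_token_probabilities([0.5] * 4)
```

Because only the message tokens feed the score, this fits the content-based setting described above, where the header section is ignored.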

Proposed Idea

This section describes in detail the main stages of this study, starting with the data collection and labelling of the dataset, followed by a brief comparison of the most common techniques used in related studies. The main purpose of the system is not only to protect the social network management system and database, but also to protect OSN users from being exposed to malicious content that is spread over those social networks, since many social network users have received or been exposed to malicious content such as spam, scams, and drive-by downloads.

Idea 1: Pseudo-OCR for Image Spam Filtering

Image spam manufacturing technology makes spam images more similar to ham ones, and thus more difficult to identify directly from image features without any content information. More seriously, for some advanced applications, the spam image filtering process actually requires more contextual information than a simple filtering result. Hence we believe it is essential for an anti-spam system to obtain enough content information about the current image, which apparently could only be obtained through long-established OCR-based methods. However, because of its disadvantages, traditional OCR is not our best choice. The idea of pseudo-OCR is therefore proposed to avoid such defects while still being able to extract enough content information. Compared with the long-established technology, our proposed pseudo-OCR exhibits the following improvements for Chinese image spam filtering. Firstly, pseudo-OCR places a looser requirement on the character reader: determining whether or not a given character feature belongs to a spam image, rather than recognizing the character, is sufficient. Secondly, pseudo-OCR can effectively process a much wider range of images, even ones with complex backgrounds and human interference, which are usually difficult for traditional OCR-based methods to handle. Lastly, for Chinese character recognition, the proposed pseudo-OCR generates template features from certain training images instead of from a set of standard Chinese characters.

Feedback gives the system the learning ability to keep a suitably high performance over a long period. It is well known in anti-spam communities that spammers tend to modify their image spam templates over time, which results in an inevitable degradation of performance for near-duplicate based methods. Although our proposed method is not strictly near-duplicate based, it adopts a similar methodology for extracting template character features from known spam images. To handle this foreseeable defect, a feedback mechanism is introduced in our system. By using detected spam as an additional source of template characters, it is possible to replace obsolete template character features with new ones and thereby sustain better performance.
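One way to picture the feedback mechanism is a bounded template store in which character features from newly detected spam displace the oldest templates. The sketch below is only an illustration under stated assumptions: the class name `TemplateStore`, the fixed capacity, and the oldest-first replacement policy are hypothetical, since the text does not say how obsolete templates are chosen.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple

Feature = Tuple[float, ...]


class TemplateStore:
    """Bounded store of spam template character features.

    Assumption: a template is treated as obsolete simply because it is the
    oldest one, so new features from detected spam push the oldest out."""

    def __init__(self, initial: Iterable[Feature], capacity: int) -> None:
        self._templates: Deque[Feature] = deque(
            (tuple(f) for f in initial), maxlen=capacity
        )

    def feedback(self, detected_spam_features: Iterable[Feature]) -> None:
        """Add character features taken from an image just classified as spam."""
        for feature in detected_spam_features:
            # deque(maxlen=...) silently drops the oldest entry when full.
            self._templates.append(tuple(feature))

    def templates(self) -> List[Feature]:
        return list(self._templates)


# Toy two-dimensional features, for illustration only.
store = TemplateStore([(0.0, 0.0), (1.0, 1.0)], capacity=3)
store.feedback([(2.0, 2.0), (3.0, 3.0)])
```

A real system would likely need a guard against false positives feeding back into the templates; that concern is outside what this sketch shows.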

Idea 2: Key-Point Based Character Feature

To meet the requirements of pseudo-OCR, the Chinese character feature that is extracted should also be modified. Considering only certain key-points of a character, we devised a novel character feature, which is probably unsuitable for traditional character recognition yet sufficient to preserve enough content information for pseudo-OCR. The core of extracting such a feature is a two-phase procedure. During the first phase, the key-points and their connectivity information are extracted and stored as an adjacency matrix using a DFS-based algorithm; the actual feature is then calculated from this adjacency matrix in the second phase. To identify image spam using this feature, every character feature extracted from a given image is first compared with the template features to determine its category information, and then the distribution of all these characters’ category information is used for the final judgement.
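The first phase can be sketched as follows, under several loudly stated assumptions that are not in the text: the character is given as a one-pixel-wide skeleton (a set of stroke pixels), pixels are 4-connected, and a key-point is any stroke pixel that does not have exactly two stroke neighbours (a stroke end or a junction). The function names are hypothetical, and the second phase, which turns the matrix into the actual feature, is not specified in the text and is not attempted here.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col) of a stroke pixel


def neighbours(p: Pixel, skeleton: Set[Pixel]) -> List[Pixel]:
    """4-connected stroke pixels next to p (connectivity is an assumption)."""
    r, c = p
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [q for q in candidates if q in skeleton]


def find_key_points(skeleton: Set[Pixel]) -> List[Pixel]:
    """Assumed definition: stroke ends and junctions are key-points."""
    return sorted(p for p in skeleton if len(neighbours(p, skeleton)) != 2)


def key_point_adjacency(skeleton: Set[Pixel]) -> Tuple[List[Pixel], List[List[int]]]:
    """Return the key-points and a symmetric 0/1 adjacency matrix.

    Two key-points are adjacent when a stroke path joins them without
    passing through a third key-point."""
    keys = find_key_points(skeleton)
    index = {p: i for i, p in enumerate(keys)}
    adj = [[0] * len(keys) for _ in keys]
    for start in keys:
        stack, seen = [start], {start}
        while stack:                      # iterative depth-first search
            p = stack.pop()
            for q in neighbours(p, skeleton):
                if q in seen:
                    continue
                seen.add(q)
                if q in index:            # reached another key-point: record the edge
                    adj[index[start]][index[q]] = 1
                    adj[index[q]][index[start]] = 1
                else:                     # ordinary stroke pixel: keep walking
                    stack.append(q)
    return keys, adj


# A "T"-shaped toy stroke: a horizontal bar with a vertical stem below its middle.
t_shape = {(0, c) for c in range(5)} | {(r, 2) for r in range(1, 4)}
keys, adj = key_point_adjacency(t_shape)
```

For the toy "T", the three stroke ends each connect only to the central junction, which is the kind of connectivity information the adjacency matrix is meant to capture.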

Idea 3: Image spam filtering

Through the feature extraction described above, any input image is converted into a set of 20-dimension key-point based character features. To use such features for image spam filtering, their category information has to be obtained first. For a given character feature, the minimal L1 distance between it and all template features is calculated and compared with a certain threshold to determine its category. This threshold is named the category threshold to distinguish it from the predefined threshold that follows. Given the category information of all character features of an image, their distribution is used to make the final judgement. All the template character features in our implemented system fall into two categories, spam or ham. By comparing the spam feature proportion with a predefined threshold, we are able to determine whether or not the image is a spam image. This predefined threshold is calculated during the training process by choosing the minimal spam image feature ratio of all training spam images. In our system, a minimal threshold of 0.25 is picked out of a total of 82 training spam images. Experimental results show that our proposed Chinese image spam filtering system using pseudo-OCR usually achieves better performance than the traditional OCR-based method.
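The decision procedure described above can be sketched in a few lines. This is an illustrative reading rather than the authors' implementation: the function names are hypothetical, features farther than the category threshold from every template are assumed to stay uncategorised, and the spam proportion is assumed to be taken over all character features of the image. The toy features have 3 dimensions instead of 20 to keep the example short.

```python
from typing import List, Optional, Sequence, Tuple

Feature = Sequence[float]
Template = Tuple[Feature, str]  # (feature vector, "spam" or "ham")


def l1_distance(a: Feature, b: Feature) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def categorize(feature: Feature, templates: List[Template],
               category_threshold: float) -> Optional[str]:
    """Label of the nearest template by L1 distance, or None when even the
    nearest template is farther away than the category threshold."""
    best_label, best_dist = None, float("inf")
    for template, label in templates:
        dist = l1_distance(feature, template)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= category_threshold else None


def is_spam_image(features: List[Feature], templates: List[Template],
                  category_threshold: float,
                  spam_ratio_threshold: float = 0.25) -> bool:
    """Flag the image when the share of spam-category character features
    reaches the predefined threshold (0.25 is the value quoted in the text)."""
    if not features:
        return False
    labels = [categorize(f, templates, category_threshold) for f in features]
    return labels.count("spam") / len(labels) >= spam_ratio_threshold


# Toy templates and character features, for illustration only.
templates = [((0, 0, 0), "spam"), ((10, 10, 10), "ham")]
image_a = [(0, 1, 0), (10, 9, 10), (10, 10, 11), (50, 50, 50)]
image_b = [(10, 10, 10), (9, 10, 10), (10, 11, 10)]
result_a = is_spam_image(image_a, templates, category_threshold=2)
result_b = is_spam_image(image_b, templates, category_threshold=2)
```

In `image_a`, one of the four character features matches a spam template, giving a proportion of exactly 0.25, so the image is flagged; `image_b` has no spam-category features and is not.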

Idea 4: Detecting Spam URL using SVM algorithm

Phishing is a form of identity theft that occurs when a malicious website masquerades as a legitimate one. Such theft is carried out in order to procure sensitive information such as passwords, bank account details, or credit card numbers. Phishing makes use of spoofed emails that look exactly like authentic ones. These emails are sent to a large number of users and appear to come from legitimate sources such as banks, e-commerce sites, and payment gateways. Machine learning is a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to learn. It is therefore critical to build a quick and accurate phishing recognition tool.

Statistics about phishing activity and PhishTank usage

The GUI of our phishing detection system engages end users and provides them with an environment for detecting malicious sites. The aim is to provide a user-friendly, effective, and efficient way to protect internet users from phishing attacks and malicious sites.

We propose a URL-based phishing detection system that uses lexical features, site popularity features, and host-based features with algorithms such as ANN, K-Nearest Neighbours classifier, Support Vector Machine (SVM) classifier, Logistic Regression, Decision Tree, Bagging classifier, Random Forest, and Gradient Boosting classifier.

We have implemented a malicious-URL-based phishing detection system for end users, in which the GUI engages end users and provides a user-friendly experience. The system analyzes the Uniform Resource Locator (URL) itself without accessing the websites, so there is no runtime delay.
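As an illustration of analyzing the URL string alone, the sketch below derives a handful of lexical features using only the Python standard library. The text does not list its lexical features, so the feature set and the function name `lexical_features` are assumptions; host-based and site popularity features would need external lookups and are not shown.

```python
import re
from typing import Dict, Union
from urllib.parse import urlparse


def lexical_features(url: str) -> Dict[str, Union[int, bool]]:
    """Compute simple lexical features from the URL text, with no network access.
    The particular features are illustrative choices, not the paper's list."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        # A raw IPv4 address in place of a domain name is a common warning sign.
        "host_is_ip": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "uses_https": parsed.scheme == "https",
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    }


# Hypothetical URLs, for illustration only.
suspicious = lexical_features("http://192.168.0.1/login/update.php?id=1")
with_userinfo = lexical_features("https://user@example.com/")
```

The resulting dictionaries can be converted to numeric vectors and passed to any of the classifiers named above, such as an SVM.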

In future work, we aim to develop a browser plug-in that can work online. In addition, we aim to incorporate different aspects of online learning and to gather data in order to capture new patterns in phishing activity.

Conclusion

In order to extract enough content information and avoid the defects of traditional OCR-based methods, we propose the idea of pseudo-OCR, which preserves the structure of traditional OCR yet has a looser recognition requirement, the ability to process a wider range of input images, and a more data-oriented mechanism for generating template features. Furthermore, a pseudo-OCR based Chinese image spam filtering system with automatic learning ability is proposed in this paper. In the implementation part, we also create a novel Chinese key-point based character feature suitable for pseudo-OCR, which, we believe, could also be used in applications such as image spam clustering. We measure the comprehensive performance of the proposed system in terms of precision, recall, and false positive rate, and we analyze the influence of two important parameters. This analysis not only makes optimization of our implemented system possible, but could also provide instructive information if the application circumstances change. Finally, a comparison with a traditional OCR-based image spam filtering method shows that our proposed method obtains much better performance and has the potential for practical use.

We also propose a URL-based phishing detection system that uses lexical features, site popularity features, and host-based features with algorithms such as ANN, K-Nearest Neighbours classifier, Support Vector Machine (SVM) classifier, Logistic Regression, Decision Tree, Bagging classifier, Random Forest, and Gradient Boosting classifier.

We will also implement a malicious-URL-based phishing detection system for end users, in which the GUI engages end users and provides a user-friendly experience. The system analyzes the Uniform Resource Locator (URL) itself without accessing the websites, so there is no runtime delay. In future work, we aim to develop a browser plug-in that can work online. In addition, we aim to incorporate different aspects of online learning and to gather data in order to capture new patterns in phishing activity.

This essay was reviewed by

Alex Wood

Review Paper For Spam Detection URL And Image Spam Filtering Using Machine Learning. (2019, July 10). GradesFixer. Retrieved April 26, 2024, from https://gradesfixer.com/free-essay-examples/review-paper-for-spam-detection-url-and-image-spam-filtering-using-machine-learning/
