Malware Classification Using Machine Learning

Malware is a common problem on almost all mobile phones, laptops, memory cards, and similar devices. One of the techniques malware uses to escape detection is binary obfuscation, either by encryption (polymorphism) or by metamorphic attacks (different code for the same functionality). To detect malware quickly and effectively, samples should be grouped according to their family. This gives rise to a growing need for an automated, self-learning, fast, and efficient technique that is robust to these attacks. In this paper, we intend only to classify malware into their respective families, not to detect it (that is, to identify whether a file is malware or not). A criterion of 500 counts of an observed value is used to select features for the dataset fed to our machine learning algorithms. We focus on novel data visualization techniques, such as image representation of the malware, and on classification based on Artificial Neural Networks and K-Nearest Neighbour.

Malware analysis is usually done as static analysis, dynamic analysis, or signature-based analysis. In static analysis, the disassembled code files are analyzed for malicious system calls, and a model is built from the control flow graphs. In dynamic malware analysis, by contrast, the sample is analyzed in a controlled environment along with its traces (system logs). This process is extremely slow and consumes both resources and time. Both techniques work well, but static code analysis suffers from differences in malware implementation, while dynamic malware analysis is limited by the environment and by the malware's triggering conditions, and hence it is not a scalable option. To analyze the malware signature, the signature has to be built using N-Gram techniques. The malware disassembly is analyzed for the most frequently repeated op-codes, and N-Grams are built on top of them.

To visualize the data, we make use of malware visualization techniques. We will convert the byte code of every malware sample to a grayscale image. The basic principle is that malware from the same family is similar in visual appearance. These images are used for image-based classification. The op-code counts are calculated from the disassembly code.

The purpose of this paper is to implement machine learning algorithms to classify malware into their respective families. The data is taken from www.kaggle.com, where it was provided by Microsoft, and contains 10868 malware samples from nine different malware families, namely Ramnit, Lollipop, Kelihos ver3, Vundo, Simda, Tracur, Kelihos ver1, Obfuscator.ACY, and Gatak. The first objective is to analyze and visualize the malware and to parse the data beforehand. The second is to develop a new integrated model which takes advantage of all the models.

Problem Definition:

Extensive work has been done on analyzing malware. Static, dynamic, and signature-based malware analysis techniques have been researched in many papers. One publication presents image-based malware visualization as a preferred approach [1]; it explains how to form an image out of binary malware files and how to visualize those images. An alternate approach extracts data from disassembly code for use in classification [2], but its data accuracy was not optimal. That paper suggests a way to extract novel features based on N-Grams, code sections, op-code sequences, and DLL calls. But even before we can develop signatures for malware, certain tasks have to be carried out within the scope of malware detection and classification.

Related Work:

There has been extensive work on analyzing malware. Many published papers describe static, dynamic, and signature-based malware analysis techniques. One publication presents image-based malware visualization as a preferred approach [1]. This paper explains how to form an image out of binary malware files and how to visualize those images. Thus, machine learning is used for image-based classification. We also referred to a paper which defines how to extract data from disassembly code for use in classification [2]. This paper suggests a way to extract novel features based on N-Grams, code sections, op-code sequences, and DLL calls. But even before we can develop signatures for malware, certain tasks have to be carried out within the scope of malware detection and classification.

Analysis: We have studied a few papers which use the same principles as ours to classify malware into families. It has been observed that, in the case of missing data, the multi-layer perceptron model and logistic regression perform well.

Image visualization techniques gave an average predictive accuracy of 95% using a Deep Neural Network. We also found that this methodology gives the best results when compared to the other techniques available, whereas Machine Learning based Malware Classification for Android applications using Multimodal Image Representations [3] is somewhat slow at data processing.

Proposed Methodology:

To analyze the signature, the signature is built using N-Gram techniques. The malware disassembly is analyzed for the most frequently repeated op-codes, and N-Grams are built on top of them.
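As a minimal sketch of the N-Gram step, the snippet below counts every run of n consecutive op-codes in one disassembly. The function name `opcode_ngrams` and the sample op-code sequence are illustrative assumptions, not part of the original work.

```python
from collections import Counter

def opcode_ngrams(opcodes, n=2):
    """Count every run of n consecutive op-codes in one disassembly."""
    return Counter(
        tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    )

# Hypothetical op-code sequence parsed from a disassembly listing.
sample = ["push", "mov", "call", "push", "mov", "call", "ret"]

bigrams = opcode_ngrams(sample, n=2)
trigrams = opcode_ngrams(sample, n=3)

# The most repeated N-Grams are the candidates for signature features.
top_bigrams = bigrams.most_common(2)
```

Counting per file and then summing the counters across files gives the corpus-wide frequencies from which the most repeated N-Grams can be picked.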

We propose to use malware visualization techniques. We aim to convert the byte code of every malware sample to a grayscale image. In research and analysis, it was observed that malware from the same family is similar in visual appearance, which presents an opportunity to exploit this weakness. These malware images will be used for image-based classification. From the disassembly code provided, we will compute the op-code counts, DLLs, and section counts. The top features parsed out of all the assembly files are used for classification of the malware. A criterion of 500 counts of an observed value is used to select features for the dataset fed to our machine learning algorithms.
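The 500-count criterion can be sketched as follows, under the assumption that it means "keep a feature only if its total count across all files is at least 500"; the text does not spell out the exact rule. The names `select_features` and `to_matrix`, and the toy counts, are hypothetical.

```python
from collections import Counter

def select_features(per_file_counts, min_total=500):
    """Keep features whose count summed over all files reaches min_total."""
    totals = Counter()
    for counts in per_file_counts:
        totals.update(counts)
    return sorted(name for name, total in totals.items() if total >= min_total)

def to_matrix(per_file_counts, features):
    """Build one row per file, one column per selected feature."""
    return [[counts.get(f, 0) for f in features] for counts in per_file_counts]

# Hypothetical op-code counts parsed from two disassembly files.
files = [
    {"mov": 400, "push": 10},
    {"mov": 200, "xor": 499},
]

features = select_features(files, min_total=500)
matrix = to_matrix(files, features)
```

The resulting matrix has a fixed width regardless of how many distinct op-codes, DLLs, or sections each file contains, which is what a classifier needs as input.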

These parsed data sets will be used for classification, which is performed with MATLAB's machine learning toolbox. In this paper, we describe the data visualization methods, the parsing, the selection of classification algorithms, and the outputs obtained.

Data Visualization:

As suggested in the malware image generation and classification technique [1], every byte of data is converted into a grayscale pixel, and the array or byte stream is converted into an image. This image representation produces very distinctive images of malware. Polymorphic malware appears similar, with slight deviations in the code segments. The images show the segments (fragments) of the binary file. In figure (3), the parts of the sections can be observed. The .text segment contains all the code and the zero padding.

The .rdata segment contains all the constants. The .data segment contains all the initialized data. The .rsrc section contains the icons of the files. The author of [1] also managed to obtain various pictures from this segment on his test data. In our observation, we did not get any icons for the malware. Every class of data gave a unique texture to the image, which helped feature selection.
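The byte-to-pixel conversion described above can be sketched with numpy as follows. The fixed row width of 256 pixels and the zero padding of the last row are assumptions made for illustration; the text does not state how the image width is chosen.

```python
import numpy as np

def bytes_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Map each byte (0-255) to one grayscale pixel in a 2-D array.

    The width is an assumed parameter; the last row is zero-padded
    when the file size is not a multiple of the width.
    """
    arr = np.frombuffer(data, dtype=np.uint8)
    height = -(-arr.size // width)  # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:arr.size] = arr
    return padded.reshape(height, width)

# Tiny stand-in for the contents of a malware binary.
img = bytes_to_image(bytes(range(10)), width=4)
```

For a real file, the bytes would come from `open(path, "rb").read()`, and the array can be written out with any image library to inspect the texture of each section.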

Data Parsing:

The malware binary files will be converted to grayscale images. Since these images vary in size and have very high dimensions, we will first shrink them to a constant, small dimension. We will use Python's numpy library to form the images and reduce their dimensions. On these small images, we compute GIST to summarize the gradient information of the images. GIST applies Gabor filters, whose responses measure texture similarity in images; since malware produces similar textures in images, these features can be used as data. From the disassembly, we extract all the lines of the code segments, all the op-codes, and the DLL calls. After the extraction, each column is summed to find the highest-valued features, and the 321 features with the highest frequencies are selected. This is a hybrid data retrieval model [4]. We will obtain an open, BSD-licensed parser written in Python that does the same.
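Two of the parsing steps above can be sketched with numpy: shrinking a variable-size image to a constant size, and keeping the columns with the highest sums. The nearest-neighbour sampling, the 32 by 32 default, and the function names are assumptions; the text does not say which resampling method or target size was used, and the GIST step is not shown.

```python
import numpy as np

def shrink(img: np.ndarray, out_h: int = 32, out_w: int = 32) -> np.ndarray:
    """Resample an image to a fixed size by nearest-neighbour sampling."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[np.ix_(rows, cols)]

def top_k_columns(counts: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k columns with the largest column sums."""
    totals = counts.sum(axis=0)
    best = np.argsort(totals)[::-1][:k]
    return np.sort(best)

# Toy image: 8 rows by 4 columns, shrunk to 4 by 2.
img = np.arange(32).reshape(8, 4)
small = shrink(img, out_h=4, out_w=2)

# Toy count matrix: 2 files by 3 features; keep the 2 most frequent.
counts = np.array([[1, 50, 7], [2, 60, 0]])
kept = top_k_columns(counts, k=2)
```

With k set to 321, `top_k_columns` corresponds to the selection of the top-valued frequencies described above.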

Proposed results and Analysis:

We will try running KNN with various neighbour values and distances. For this, we will use both of the data files we obtained, i.e. one from the disassembly files and one from the images.

Fig. 3. ANN tuning on disassembly data [1]. ANN tuning on image data [1].

We compiled the same on MATLAB's toolbox for 10 folds, and the results obtained were as follows. We wrote MATLAB code to analyse the performance of the KNN algorithm for different neighbour values and distances. From the tuning data obtained on the image data, we got the least objective function value for Cityblock with distance 1. The code we developed gave 91.26% accuracy, which is very close to the output we obtained from the machine learning toolbox in MATLAB. From the tuning data obtained on the .asm data, we got the least objective function value for Spearman with distance 1. The code we developed gave 98.8% accuracy, which is also very close to the output we obtained from the machine learning toolbox in MATLAB.
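The tuning procedure can be illustrated with a small pure-Python sketch of KNN using the Cityblock (Manhattan) distance and k-fold cross-validation. This is not the MATLAB code referred to above; the function names, the interleaved fold assignment, and the toy two-family data are all assumptions made for illustration.

```python
from collections import Counter

def cityblock(a, b):
    """Cityblock (Manhattan) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=1, distance=cityblock):
    """Predict the majority label among the k nearest training samples."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: distance(pair[0], query)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(X, y, k=1, folds=10, distance=cityblock):
    """Estimate accuracy with interleaved folds (sample i is in fold i % folds)."""
    correct = 0
    for fold in range(folds):
        test_idx = set(range(fold, len(X), folds))
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [t for i, t in enumerate(y) if i not in test_idx]
        for i in test_idx:
            correct += knn_predict(train_X, train_y, X[i], k, distance) == y[i]
    return correct / len(X)

# Toy data: two well-separated hypothetical families.
X = [[i, i] for i in range(10)] + [[100 + i, 100 + i] for i in range(10)]
y = ["A"] * 10 + ["B"] * 10

accuracy = cross_val_accuracy(X, y, k=1, folds=10)
```

Tuning then amounts to calling `cross_val_accuracy` for each combination of neighbour value and distance function and keeping the combination with the best score.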

We will be able to understand the functioning of different machine learning algorithms and deduce which of them produces optimal results.

We will also learn novel data extraction techniques to convert malware files into grayscale images and classify them into their respective families based on their signatures. This project would be very beneficial for AV vendors, as it is a self-learning and automated classification process.

Among the classifiers mentioned in this project, we hope to achieve optimal results from ANN. We will study, analyze, and compare two machine learning algorithms, i.e. ANN and KNN. The accuracy expected from ANN is around 95%. We plan to achieve better results with ANN by using novel data extraction techniques. This is because images obtained from malware files of the same family are so similar that the distance between them is small; files separated by a greater distance are considered to belong to different families. We also hope to get a detection rate of 91% along with a false-positive rate of 0.1%. Apart from this, our approach needs only modest computation to perform and to analyze. It also comes to attention that a model can be trained on the dataset to automatically classify malware into their respective families on the basis of given or self-defined parameters.
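The text does not define how the detection rate and false-positive rate are computed for a nine-class problem. One common reading, assumed here, is one-vs-rest: for a chosen family, the detection rate is TP / (TP + FN) and the false-positive rate is FP / (FP + TN). The function name and the toy labels below are hypothetical.

```python
def one_vs_rest_rates(y_true, y_pred, positive):
    """Return (detection rate, false-positive rate) for one family."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return detection_rate, false_positive_rate

# Hypothetical true and predicted family labels for five samples.
y_true = ["Ramnit", "Ramnit", "Gatak", "Gatak", "Vundo"]
y_pred = ["Ramnit", "Gatak", "Gatak", "Ramnit", "Vundo"]

detection, fpr = one_vs_rest_rates(y_true, y_pred, positive="Ramnit")
```

Averaging these per-family rates over all nine families would give single summary figures comparable to the targets stated above.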

We plan to implement XGBoost and ensembles to combine the results of the best-performing models we tried. There is also the possibility of working on the disassembly code and simulating the malware in a controlled environment: we can collect the system calls, and the logs could be used as another data set. All of this could be combined with the other models in the ensembles. We can also try to extract N-Grams from the hex data and combine them with the data obtained from the disassembly code to build a training data set. According to the publications, this should produce better results.

Malware Classification Using Machine Learning. (2018, April 16). GradesFixer. Retrieved September 23, 2022, from https://gradesfixer.com/free-essay-examples/malware-classification-using-machine-learning/