Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection


This essay gives an overview of machine learning (ML) and data mining (DM) methods for cyber analytics in support of intrusion detection. ML lets a computer learn from data without being explicitly programmed, whereas DM explores data to discover previously unknown properties in it.

Cyber Security

Cyber security is the set of technologies and processes designed to protect computers, networks, programs, and data from external and internal attacks or unauthorized access. A cyber security system typically includes a firewall, antivirus software, and an intrusion detection system (IDS). An IDS helps to identify unauthorized access. There are three main types of cyber analytics in support of IDSs: misuse-based, anomaly-based, and hybrid.

  • Misuse-based systems are designed to detect known attacks. They generate few false alarms, but they cannot recognize novel (zero-day) attacks.
  • Anomaly-based systems model normal behavior and flag deviations from it. The profile of normal behavior is tailored to each system, and this approach can also detect novel (zero-day) attacks.
  • Hybrid systems combine misuse and anomaly detection. They are used to raise the detection rate and to lower the false positive (FP) rate for unknown attacks.
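The three detection styles listed above can be illustrated with a minimal sketch. The signature strings, the baseline traffic values, and the three-standard-deviation threshold are all hypothetical choices made for this example; a real IDS uses far richer signatures and behavior profiles.

```python
from statistics import mean, stdev

# Hypothetical signatures of known attacks (misuse-based detection).
KNOWN_SIGNATURES = {"/etc/passwd", "' OR 1=1"}


def misuse_alert(payload: str) -> bool:
    """Flag a payload that contains any known attack signature."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)


def anomaly_alert(value: float, baseline: list, threshold: float = 3.0) -> bool:
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(value - mu) > threshold * sigma


def hybrid_alert(payload: str, value: float, baseline: list) -> bool:
    """Combine both detectors: alert if either one fires."""
    return misuse_alert(payload) or anomaly_alert(value, baseline)


# Hypothetical baseline: requests per minute observed during normal operation.
baseline = [10, 12, 11, 9, 13, 10, 11]
print(misuse_alert("GET /etc/passwd"))                 # known signature
print(anomaly_alert(50, baseline))                     # far from normal rate
print(hybrid_alert("GET /index.html", 50, baseline))   # no signature, odd rate
```

The misuse detector only fires on payloads it already knows, while the anomaly detector can fire on traffic it has never seen, which mirrors the trade-off described in the list.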

IDSs are also classified as network-based or host-based. A network-based IDS identifies intrusions by monitoring traffic through network devices, whereas a host-based IDS monitors process and file activities on a host.

ML/DM methods follow one of three approaches: unsupervised, semi-supervised, or supervised. In the unsupervised approach, the main task is to find patterns and structures in unlabeled data. In the semi-supervised approach, part of the data is labeled by experts to help solve the problem. In the supervised approach, the data are fully labeled and the task is to find a model that explains the data.

An ML approach usually has three phases: training, validation, and testing. The steps usually performed are:

  1. Identifying the features (properties) from the training data.
  2. Reducing the dimensionality, that is, selecting the subset of features needed.
  3. Learning the model from the training data.
  4. Using the trained model to classify unknown data.
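As a rough sketch of the steps above, the following example splits a synthetic data set into training, validation, and test parts, drops the lowest-variance feature, learns a nearest-centroid model, and classifies held-out data. The data, the variance-based feature selection, and the nearest-centroid classifier are assumptions chosen for brevity; they are not methods named in the essay.

```python
import random
from statistics import mean, pvariance


def split(rows, train=0.6, val=0.2, seed=0):
    """Shuffle and split the rows into training, validation, and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    a = round(len(rows) * train)
    b = a + round(len(rows) * val)
    return rows[:a], rows[a:b], rows[b:]


def select_features(train_rows, k):
    """Crude dimensionality reduction: keep the k highest-variance features."""
    n = len(train_rows[0][0])
    variances = [pvariance([x[i] for x, _ in train_rows]) for i in range(n)]
    top = sorted(range(n), key=lambda i: variances[i], reverse=True)[:k]
    return sorted(top)


def fit(train_rows, features):
    """Learn one centroid per class from the training data."""
    model = {}
    for label in {y for _, y in train_rows}:
        members = [x for x, y in train_rows if y == label]
        model[label] = [mean(x[i] for x in members) for i in features]
    return model


def predict(model, features, x):
    """Classify x as the class whose centroid is nearest."""
    def sq_dist(centroid):
        return sum((x[i] - c) ** 2 for i, c in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def accuracy(model, features, rows):
    return sum(predict(model, features, x) == y for x, y in rows) / len(rows)


# Synthetic data: two informative features and one constant (useless) feature.
rng = random.Random(42)
data = [([5 * (i % 2) + rng.uniform(-1, 1),
          5 * (i % 2) + rng.uniform(-1, 1),
          1.0], i % 2) for i in range(50)]

train_rows, val_rows, test_rows = split(data)
features = select_features(train_rows, k=2)
model = fit(train_rows, features)
print("selected features:", features)
print("validation accuracy:", accuracy(model, features, val_rows))
print("test accuracy:", accuracy(model, features, test_rows))
```

The validation set would normally guide choices such as `k`; the test set is touched only once, at the end.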

DM involves six main phases:

  1. Defining the problem
  2. Preparing the data
  3. Exploring the data
  4. Modeling
  5. Evaluating the model
  6. Deploying and updating the model

The CRISP-DM model describes these phases for solving DM problems.

Business understanding defines the DM problem, whereas data understanding collects and examines the data. The next phase, data preparation, produces the final data set. In modeling, DM and ML methods are applied and tuned to fit the best model. The evaluation phase then assesses the method with appropriate metrics, and deployment ranges from submitting a report to a full implementation of the solution. Usually the data analyst carries out the phases up to deployment, while the customer performs the deployment phase.

Cyber-security data sets for ML and DM

This section covers the types of data used by ML and DM approaches: packet-level data, NetFlow data, and public data sets.

  • Packet-Level Data: The Internet Engineering Task Force (IETF) lists about 144 Internet protocols, among them the most widely used ones. The purpose of these protocols is to transfer packets throughout the network. Network packets are transmitted and received at a physical interface and can be captured on computers through an application programming interface (API) known as pcap.
  • NetFlow Data: NetFlow was introduced as a router feature by Cisco. Version 5 of Cisco's NetFlow defines a flow as a unidirectional sequence of packets that share seven attributes: ingress interface, source IP address, destination IP address, IP protocol, source port, destination port, and type of service.
  • Public Data Sets: The data sets provided by the Defense Advanced Research Projects Agency (DARPA) in 1998 and 1999 are widely used in experiments and publications; they contain basic features captured by pcap. The DARPA 1998 data set covers four types of attacks: remote-to-local (R2L), user-to-root (U2R), denial of service (DoS), and probe or scan.
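To make the NetFlow idea concrete, the sketch below groups packets into unidirectional flows keyed on the seven attributes named above. The `Packet` record, its field names, and the sample packets are hypothetical; real packets would come from a capture such as pcap.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet record; the first seven fields form the flow key."""
    ingress_if: int
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int
    tos: int
    size: int


def aggregate_flows(packets):
    """Group packets into unidirectional flows and count packets and bytes."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = p[:7]  # the seven shared attributes
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)


packets = [
    Packet(1, "10.0.0.1", "10.0.0.2", 6, 40000, 80, 0, 60),
    Packet(1, "10.0.0.1", "10.0.0.2", 6, 40000, 80, 0, 1500),
    Packet(2, "10.0.0.2", "10.0.0.1", 6, 80, 40000, 0, 40),  # reply: separate flow
]
flows = aggregate_flows(packets)
for key, stats in flows.items():
    print(key, stats)
```

Because flows are unidirectional, the reply packet lands in its own flow even though it belongs to the same conversation.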

ML and DM Methods for Cyber Security

The following ML and DM methods are used for cyber security:

Artificial Neural Network

An artificial neural network (ANN) is a network of neurons in which the output of one node is the input to another. An ANN can act as a multi-category classifier for intrusion detection, that is, for misuse, anomaly, and hybrid detection. The nine features used in the data processing stage are: protocol ID, source address, destination address, source port, destination port, ICMP code, ICMP type, raw data, and data length.

Association Rules and Fuzzy Association Rules

An association rule describes how frequently a given relationship appears in the data, whereas fuzzy association rules also handle numerical and categorical variables.
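How frequently a relationship appears is usually measured with support and confidence. The sketch below computes both for crisp (non-fuzzy) rules over a small set of transactions; the event names are hypothetical, and fuzzy rules are not covered here.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical audit events grouped per session.
transactions = [
    {"port_scan", "login_fail"},
    {"port_scan", "login_fail", "root_shell"},
    {"login_fail"},
    {"port_scan", "login_fail"},
    {"http_get"},
]

print(support(transactions, {"port_scan", "login_fail"}))
print(confidence(transactions, {"port_scan"}, {"login_fail"}))
print(confidence(transactions, {"login_fail"}, {"port_scan"}))
```

A rule is normally kept only when both its support and its confidence exceed user-chosen minimums.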

Bayesian Network

A Bayesian network is a probabilistic graphical model that represents variables and the relationships between them. The network is built with nodes as discrete or continuous random variables and directed edges as the relationships between them, forming a directed acyclic graph.

Clustering

Clustering is a set of techniques for finding patterns in high-dimensional unlabeled data. A major advantage of clustering for intrusion detection is that it can learn from audit data without requiring explicit descriptions of attack classes from the system administrator.
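The essay does not name a specific clustering algorithm, so the sketch below uses k-means as one common example. The points are hypothetical two-dimensional features, and the centroids are initialized from the first k points purely for reproducibility; random initialization is more usual.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = list(points[:k])  # deterministic start (an assumption of this sketch)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters


# Unlabeled points: no attack labels are supplied anywhere.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No labels are passed in at any point, which is the property that makes clustering attractive when attack descriptions are unavailable.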

Decision Trees

A decision tree is a tree-like structure whose leaves represent classifications (groups) and whose branches represent the combinations of features that lead to those classifications. An example is classified by testing its feature values against the nodes of the decision tree. The ID3 and C4.5 algorithms are commonly used to build decision trees automatically. The main advantages of decision trees are intuitive knowledge expression, high classification accuracy, and simple implementation. The main disadvantage arises with data that include categorical variables with different numbers of levels: information gain values are then biased in favor of features with more levels.
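ID3 chooses the feature to split on by information gain, the reduction in entropy that the split produces. The sketch below computes that criterion only; it does not build a full tree. The connection records and their feature names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on `feature`."""
    n = len(rows)
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def best_feature(rows, labels):
    """The feature ID3 would split on first: the one with the highest gain."""
    return max(rows[0], key=lambda f: information_gain(rows, labels, f))


# Hypothetical connection records.
rows = [
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "udp", "flag": "SF"},
    {"protocol": "udp", "flag": "S0"},
]
labels = ["attack", "normal", "normal", "attack"]

for feature in rows[0]:
    print(feature, information_gain(rows, labels, feature))
print("split on:", best_feature(rows, labels))
```

A feature with many distinct values tends to produce small, pure subsets and therefore a high gain, which is the bias mentioned above.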

Ensemble Learning

Ensemble methods combine several hypotheses and try to form a better hypothesis than the best of the individual ones. Usually, ensemble methods use several weak learners to build a strong learner. Boosting is one family of ensemble algorithms that trains multiple learners. Other popular ensemble algorithms include bagging and Random Forest. Bagging (bootstrap aggregating) is a technique that improves the generality of a predictive model in order to reduce over-fitting. It is based on model averaging and is known to improve 1-nearest-neighbor clustering performance.

The Random Forest classifier is an ML method that combines ensemble learning with decision trees. The input features of each tree are picked at random, and the variance is controlled. Advantages of Random Forests include a low number of control parameters, resistance to over-fitting, and no need for feature selection. Another advantage is that the variance of the model decreases as the number of trees in the forest increases. Random Forests also have disadvantages: low model interpretability, performance loss due to correlated variables, and dependence on the random generator.
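The bagging idea can be sketched with very weak learners. In the example below, each learner is a one-threshold decision stump trained on a bootstrap sample, and the ensemble predicts by majority vote. The one-dimensional data, the stump learner, and the number of models are assumptions made for illustration.

```python
import random
from collections import Counter


def train_stump(sample):
    """Weak learner: the threshold t with fewest errors for 'predict 1 if x > t'."""
    candidates = [float("-inf")] + sorted({x for x, _ in sample})

    def errors(t):
        return sum((x > t) != bool(y) for x, y in sample)

    return min(candidates, key=errors)


def bagging(data, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(data, k=len(data))) for _ in range(n_models)]


def predict(stumps, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]


# Hypothetical 1-D data: values below 5 are class 0, the rest are class 1.
data = [(x, int(x >= 5)) for x in range(10)]
stumps = bagging(data)
print([predict(stumps, x) for x in (0, 1, 8, 9)])
```

Individual stumps disagree near the class boundary because each saw a different bootstrap sample; averaging their votes is what smooths that variance out.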

Evolutionary Computation

Evolutionary computation comprises six major algorithms: Genetic Algorithms (GA), Genetic Programming (GP), Evolution Strategies, Particle Swarm Optimization, Ant Colony Optimization, and Artificial Immune Systems. This subsection highlights the two most commonly used practices, GA and GP. Both are based on the principle of survival of the fittest. They operate on a population of individuals that evolves using specific operators. Commonly used operators are selection, crossover, and mutation.

GA and GP differ in how individuals are represented. In GA they are represented as bit strings, so the crossover and mutation operations are very simple. In GP they are represented as programs, that is, trees with operators such as addition, subtraction, multiplication, division, not, and or. The crossover and mutation operators in GP are much more complicated than those used in GA.
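The GA loop of selection, crossover, and mutation on bit strings can be sketched as follows. The fitness function (count of 1-bits), the tournament selection, the parameter values, and the elitism step that copies the best individual into the next generation are assumptions made for this example.

```python
import random


def evolve(fitness, n_bits=16, pop_size=20, generations=40, p_mut=0.02, seed=0):
    """Minimal genetic algorithm over bit strings; returns the best individual."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        children = [best[:]]  # elitism: the best individual always survives
        while len(children) < pop_size:
            # Selection: the fitter of two random individuals becomes a parent.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Crossover: single cut point, head from p1 and tail from p2.
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve(sum)  # fitness = number of 1-bits
print(best, history[0], history[-1])
```

Because individuals are plain bit lists, crossover is a slice and mutation is a bit flip; GP would need tree-aware versions of both operators.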

Hidden Markov Models

A Markov chain is a set of states interconnected through transition probabilities that determine the topology of the model. The system being modeled by a hidden Markov model (HMM) is assumed to be a Markov process with unknown parameters. In the example considered here, each host is modeled by four states: Good, Probed, Attacked, and Compromised. An edge from one node to another represents a transition from a source state to a destination state.
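For the four host states named above, the likelihood of an observation sequence can be computed with the standard forward algorithm. Every number in the sketch below (start, transition, and emission probabilities) and the three observation symbols are hypothetical values invented for illustration; in practice they would be estimated from data.

```python
STATES = ["Good", "Probed", "Attacked", "Compromised"]
OBSERVATIONS = ["quiet", "scan", "exploit"]  # hypothetical sensor readings

# All probabilities below are made-up illustration values.
START = {"Good": 1.0, "Probed": 0.0, "Attacked": 0.0, "Compromised": 0.0}
TRANS = {
    "Good":        {"Good": 0.9, "Probed": 0.1, "Attacked": 0.0, "Compromised": 0.0},
    "Probed":      {"Good": 0.3, "Probed": 0.5, "Attacked": 0.2, "Compromised": 0.0},
    "Attacked":    {"Good": 0.1, "Probed": 0.0, "Attacked": 0.6, "Compromised": 0.3},
    "Compromised": {"Good": 0.0, "Probed": 0.0, "Attacked": 0.0, "Compromised": 1.0},
}
EMIT = {
    "Good":        {"quiet": 0.9, "scan": 0.1, "exploit": 0.0},
    "Probed":      {"quiet": 0.2, "scan": 0.7, "exploit": 0.1},
    "Attacked":    {"quiet": 0.1, "scan": 0.2, "exploit": 0.7},
    "Compromised": {"quiet": 0.5, "scan": 0.1, "exploit": 0.4},
}


def forward(observations):
    """Forward algorithm: likelihood of the observation sequence under the HMM."""
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMIT[s][obs] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


print(forward(["quiet", "scan", "exploit"]))
print(forward(["quiet", "quiet", "quiet"]))
```

The host state itself is never observed directly; only the sensor readings are, which is what makes the model "hidden".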

Inductive Learning

Two basic techniques for inferring information from data are deduction and induction. Deduction infers information through a logical sequence, moving from the top down, whereas inductive reasoning moves in the opposite direction, from the bottom up. In inductive learning, one begins with specific observations and measures, starts to detect patterns and regularities, formulates tentative hypotheses to be explored, and ultimately ends up developing some general conclusions or theories. Many ML algorithms are inductive, but when researchers refer to inductive learning they usually mean Repeated Incremental Pruning to Produce Error Reduction (RIPPER) and the quasi-optimal (AQ) algorithm. RIPPER builds rules using a separate-and-conquer approach. It learns one rule at a time, so that each rule covers a maximal set of examples in the current training set.

Naive Bayes

The Naive Bayes classifier is based on Bayes' theorem. The name comes from the naive assumption that the input features are independent, which reduces a high-dimensional density estimation task to one-dimensional kernel density estimation. The Naive Bayes classifier has limitations, because it is an optimal classifier only when the features really are independent. One of its major benefits is that it is an online algorithm whose training can be completed in linear time.
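The independence assumption and the linear-time training described above can be seen in a minimal categorical Naive Bayes sketch. Training is a single counting pass over the data. The connection records are hypothetical, and add-one (Laplace) smoothing is an assumption added so that unseen feature values do not produce zero probabilities.

```python
from collections import Counter, defaultdict
from math import log


def train_nb(rows, labels):
    """One pass over the data: count classes and per-class feature values."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(label, i)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def predict_nb(model, row):
    """Pick the class with the highest log posterior, treating features as independent."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())

    def log_posterior(label):
        score = log(class_counts[label] / total)
        for i, v in enumerate(row):
            count = feature_counts[(label, i)][v]
            # Add-one smoothing over the distinct values of feature i.
            score += log((count + 1) / (class_counts[label] + len(values[i])))
        return score

    return max(class_counts, key=log_posterior)


# Hypothetical records: (protocol, connection flag).
rows = [("tcp", "S0"), ("tcp", "S0"), ("udp", "SF"), ("tcp", "SF"), ("icmp", "SF")]
labels = ["attack", "attack", "normal", "normal", "normal"]
model = train_nb(rows, labels)
print(predict_nb(model, ("tcp", "S0")))
print(predict_nb(model, ("udp", "SF")))
```

Since the model is only a table of counts, new labeled records can be folded in by incrementing those counts, which is what makes the method usable online.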

Sequential Pattern Mining

Sequential pattern mining is an important DM method that works on a transactional database with temporal IDs, user IDs, and an itemset. An itemset is a set of items in a binary representation: each item either was or was not acquired. A sequence is an ordered list of itemsets. The number of itemsets in a sequence defines its length, and their order is given by the time IDs. A sequence A of length n is contained in another sequence B of length m if all the itemsets of A are subsets of itemsets of B, in the same order. Itemsets in B that are not supersets of an itemset in A are allowed in between. Now consider a database D containing p sequences. If one of the sequences of D, D(p), contains A, then D(p) supports A. A large sequence is one that meets a minimum support threshold. Finding the maximal sequences is therefore the major problem in sequence mining.
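The containment and support definitions above can be written down directly. In this sketch a sequence is a list of Python sets, support is taken as the fraction of database sequences that contain the pattern, and the command names are hypothetical.

```python
def contains(b, a):
    """True if sequence `a` is contained in sequence `b`.

    Each itemset of `a` must be a subset of some itemset of `b`, in the same
    order; extra itemsets in `b` are allowed in between.
    """
    i = 0
    for itemset in b:
        if i < len(a) and a[i] <= itemset:
            i += 1
    return i == len(a)


def support(database, a):
    """Fraction of sequences in the database that contain `a`."""
    return sum(contains(seq, a) for seq in database) / len(database)


# Hypothetical per-user command sequences, already ordered by time ID.
database = [
    [{"login", "ssh"}, {"ls"}, {"sudo", "cat"}],
    [{"sudo"}, {"login"}],
    [{"login"}, {"sudo"}],
    [{"ls"}],
]
pattern = [{"login"}, {"sudo"}]

print(contains(database[0], pattern))
print(contains(database[1], pattern))  # same items, wrong order
print(support(database, pattern))
```

Matching each itemset of the pattern at the earliest possible position is enough to decide containment, so one left-to-right pass suffices.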

Support Vector Machine

A support vector machine (SVM) is a classifier that finds a separating hyperplane that maximizes the distance between the hyperplane and the closest data points of each class. The approach is based on minimized classification risk rather than on optimal classification. SVMs are especially useful when the number of features is higher than the number of data points. Several types of classification surface can be used, such as linear, polynomial, Gaussian radial basis function (RBF), and hyperbolic tangent.
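The four classification surfaces named above correspond to kernel functions, which can be written in a few lines each. The parameter names and default values (`degree`, `coef0`, `gamma`, `kappa`) are assumptions for illustration, and the sketch shows only the kernels, not SVM training.

```python
from math import exp, tanh


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian radial basis function: decays with squared distance."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def tanh_kernel(x, y, kappa=0.1, coef0=0.0):
    """Hyperbolic tangent (sigmoid) kernel."""
    return tanh(kappa * dot(x, y) + coef0)


x, y = [1.0, 2.0], [3.0, 4.0]
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel, tanh_kernel):
    print(kernel.__name__, kernel(x, y))
```

Swapping the kernel changes the shape of the decision surface without changing the margin-maximization idea.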

Factors Affecting the Computational Complexity of ML and DM Methods

Three major factors affect the computational complexity of ML and DM methods: time complexity, incremental update capability, and generalization capacity.

For incremental update capability, clustering algorithms, statistical methods, and ensemble models can easily be updated incrementally.

Good generalization capacity is required so that the trained model does not deviate drastically from the starting model. Most ML and DM methods have good generalization capacity.

In conclusion, ML and DM methods are used for cyber security, and different ML and DM methods in the cyber domain can be used for both misuse detection and anomaly detection. Some peculiarities of the cyber problem make ML and DM methods harder to use, in particular how often the model needs to be retrained. In most ML and DM applications, a model is trained and then used for a long time without any changes, whereas an intrusion detection model may need to be retrained more often.

Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. (2019, March 12). GradesFixer. Retrieved September 28, 2022, from