About this sample
Words: 2272 | Pages: 5 | 12 min read
Published: Mar 28, 2019
Abstract— Connecting everything on Earth by means of the web is regarded as a difficult mission, yet the Internet of Things (IoT) will immensely change our lives. The data extracted from the Internet of Things tends to be very valuable. Data mining can be used to reduce the complexity and improve the efficiency of IoT and big data. Today's data mining techniques resemble machine learning techniques that use algorithms to detect hidden events in big data. This paper discusses data mining techniques and algorithms applied to the huge amounts of data produced by IoT, along with related and future work in data mining executed with Apache Spark and Hadoop using the MapReduce framework.
Data mining refers to the discovery of patterns in large data sets. With the fast development of emerging applications such as web analysis and community network analysis, there has been a drastic increase in the data being processed. The Internet of Things is a new phenomenon that gives users the ability to connect sensors and other devices to collect real-time data from the environment. Big data is the huge amount of data extracted from IoT, a term applied to data that cannot be handled using customary tools. Recently, the Internet of Things has gained rapid advancement, making it possible to survey and control each object on Earth by means of the Internet. Data mining is the process of extracting useful and valuable information from large databases to discover patterns, and is also referred to as Knowledge Discovery in Databases (KDD).
Data mining overlaps with other fields such as machine learning, statistics, and AI, but it mainly concerns automating the handling of huge amounts of information, algorithms, and large numbers of instances. The above figure illustrates the process of data mining, which includes selection, pre-processing, transformation, data mining, and interpretation/evaluation.

As the number of web applications increases, sensor-related information in the Internet of Things increases as well. This enormous expansion of sensors raises the problem of data handling, which turns out to be one of the important issues in IoT framework applications. This huge amount of data needs to be filtered and cleaned so that it can be viewed by the user and collected in the form of patterns. Data mining is a sensible approach for searching through huge amounts of data and extracting valuable information from it.

Until recently the pattern-finding approach was not fully utilized, so the extracted information sat static in databases. As new methods of finding patterns emerged, the use of this information increased as well, improving both business and community applications. The question now is how to convert the data extracted from IoT into valuable knowledge. In our daily routine, every one of us has become reliant on IoT through handy technologies and instruments. As IoT is integrated with networks, it offers a clear prospect of reducing the complexity of monitoring the things around us, which in turn provides huge amounts of data. Data mining is used here to make IoT more intelligent, which requires a great deal of data analysis.
Big data deals with complex, large-volume data sets drawn from multiple sources. As data storage and networking technologies develop rapidly, big data is expanding quickly in science and engineering, including the biomedical sciences. At present, many industries can use big data to obtain valuable insights.
Data mining systems need computationally intensive units to examine the stored data accurately. To achieve this, they work with two resources: the system's processors and the information itself. Data mining algorithms are redesigned so that they can collect data from different sources in the system and run the mining processes in parallel. Algorithms such as parallel K-means, parallel classifiers, and parallel association rule mining are all used in processing distributed data.
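To make the idea of a parallel mining algorithm concrete, the following is a minimal sketch (not from the paper) of how one K-means iteration can be split across data partitions: each "node" computes per-cluster partial sums over its local data, and the partials are then combined to update the centroids. The 1-D data, partition layout, and function names are all illustrative.

```python
# Sketch: one K-means update parallelized in the style of distributed mining.
# Each partition computes partial (sum, count) per cluster; the partials
# are merged to recompute the centroids.

def assign_partial(points, centroids):
    """'Map' step: per-cluster (sum, count) for one data partition."""
    sums = {i: 0.0 for i in range(len(centroids))}
    counts = {i: 0 for i in range(len(centroids))}
    for p in points:
        i = min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
        sums[i] += p
        counts[i] += 1
    return sums, counts

def combine(partials, centroids):
    """'Reduce' step: merge partials and recompute centroids."""
    total_sum = {i: 0.0 for i in range(len(centroids))}
    total_cnt = {i: 0 for i in range(len(centroids))}
    for sums, counts in partials:
        for i in sums:
            total_sum[i] += sums[i]
            total_cnt[i] += counts[i]
    return [total_sum[i] / total_cnt[i] if total_cnt[i] else centroids[i]
            for i in range(len(centroids))]

partitions = [[1.0, 1.2, 0.8], [9.0, 9.5], [1.1, 8.8]]  # data on 3 "nodes"
centroids = [0.0, 10.0]
for _ in range(5):
    partials = [assign_partial(part, centroids) for part in partitions]
    centroids = combine(partials, centroids)
print(centroids)  # two cluster centres near 1.0 and 9.1
```

The key design point is that only the small (sum, count) pairs cross node boundaries, never the raw data, which is what makes the algorithm suitable for distributed data.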
The process of big data mining can be categorized into three parts:
(1) Data Privacy and knowledge
(2) Algorithms for Big Data Mining
(3) Data Access.
As the volume of data increases, the complexity of data mining algorithms increases as well; the complexity of machine learning and statistics grows because of the large datasets. The MapReduce algorithm, Apache Spark, and Apache Hadoop are the methods that have been implemented for big data.

A. MapReduce

MapReduce is a programming model, or framework, used to implement and process big data sets with parallel and distributed algorithms on a cluster. Feng Li et al. came out with a wide range of proposals and processes that focus on distributed data management and implementation using the MapReduce framework. The two MapReduce kernel elements used in the programming model are the mappers and reducers. The map function generates temporary key/value pairs, whereas the reduce function combines the values of a key. MapReduce ensures that every map and reduce job node is independent of its parallel job nodes, which operate on different data and keys.

Map(), Shuffle() and Reduce() are the main functions of MapReduce.

1) Map(): The map function works on the local data of a worker node and generates output in temporary storage as key/value pairs such as (k1, v1), (k2, v2), ... The master node combines all the output key values.

2) Reduce(): This function processes node data in parallel with other nodes and performs the appropriate reduce job on it, which runs only once for each k2 value.

3) Shuffle(): The output of the map function is sent to the reduce step, where it is assigned a new key value and data with the same key are moved to the same worker node.
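The map/shuffle/reduce flow described above can be simulated in plain Python with the classic word-count example. This is only an illustration of the data flow, not the Hadoop API; there is no cluster here, and the three phases simply run one after another in a single process.

```python
# Simulated Map -> Shuffle -> Reduce word count (single process, illustrative).
from collections import defaultdict

def map_phase(document):
    """Map(): emit temporary (key, value) pairs like (word, 1)."""
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group values carrying the same key onto one 'node'."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce(): combine the values of each key (runs once per key)."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data mining", "data mining for iot", "big data"]
pairs = [kv for d in docs for kv in map_phase(d)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts["data"])  # 3
```

In a real cluster, each map_phase call would run on the node holding that document, and the shuffle would physically move pairs between nodes.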
B. Apache Spark
Apache Spark is an open-source cluster computing framework built as part of the Hadoop ecosystem. Spark is the first distributed framework to support general-purpose programming languages for processing and computing big data on cluster nodes. Three components are implemented in Spark: the Spark context, parallel operations, and resilient distributed datasets.
1) Resilient Distributed Datasets: Resilient distributed datasets (RDDs) are Apache Spark's data abstraction, and the features they hold are responsible for their significant accuracy. An RDD is a large collection of objects that Spark partitions and distributes across Hadoop cluster nodes. RDDs are read-only datasets, and languages such as Java and Scala are used to work with Spark's RDD objects. An RDD can be created in three ways: by loading a file stored in HDFS, by using the SparkContext parallelize() method, or by applying transformation operations such as flatMap or map to an existing RDD. RDD storage levels include DISK_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_2, and so on.
2) SparkContext: A SparkContext is a connection to a Spark cluster, used to create RDDs and broadcast variables on that cluster. It can be deployed using various cluster managers such as Mesos or YARN, or it can use Spark's own standalone cluster manager.
3) Parallel Operations: The parallel operations performed on RDDs by Spark include transformations and actions. A transformation creates a new RDD by passing every element of an RDD through a user-defined function. An action returns the desired result to the driver program.
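The distinction between transformations and actions can be mimicked in plain Python with lazy iterators; this is only an analogy, not the Spark API. A chain of map/filter calls builds a description of the computation without running it, just as Spark transformations build a new RDD lazily, and only the final collection step (the "action") forces evaluation.

```python
# Plain-Python analogy for Spark's operation types (not the Spark API):
# transformations are lazy; an action forces evaluation on the "driver".

data = range(1, 6)                              # stand-in for an RDD
squared = map(lambda x: x * x, data)            # "transformation": nothing runs yet
evens = filter(lambda x: x % 2 == 0, squared)   # chained transformation, still lazy

result = list(evens)                            # "action": evaluation happens here
print(result)  # [4, 16]
```

Laziness matters in Spark because it lets the scheduler fuse a whole chain of transformations into a single pass over each partition before any data moves.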
C. Apache Hadoop
Apache Hadoop is a collection of open-source programming models that harnesses a network of many computers to solve problems involving huge amounts of data. Apache Hadoop processes big data using the MapReduce programming model.
Anita Brigit Mathew et al. proposed a new indexing design approach, named LIndex and HIndex, which supports HDFS indexing and the MapReduce system without altering the existing Hadoop framework. In every big company data plays a very important role, and it takes a lot of time to process the data both in real time and in a historical context. To overcome this issue, researchers came out with solutions comparing the performance of data-handling approaches in Hadoop, opening a wide area for further big data research. Hadoop is divided into two parts: HDFS and the MapReduce framework.
1) HDFS: HDFS is mainly used to scale Hadoop clusters to hundreds or thousands of nodes. The huge amount of data in the cluster is partitioned into small pieces called blocks, which are spread across the nodes. The input file is divided into blocks with a default size of 64 MB; by comparison, the block size on a disk is 512 bytes, and in a relational database it is 4 KB to 32 KB.

D. Job Tracker

The JobTracker is a service in Hadoop that farms out MapReduce tasks to specific nodes in the cluster. Client applications submit jobs to the JobTracker. The NameNode helps the JobTracker determine the location of the data, and the JobTracker then selects a nearby TaskTracker node for the work. The TaskTracker is monitored: when a task fails, the TaskTracker notifies the JobTracker, and when a task completes, the JobTracker updates its status. The client application can then retrieve the result from the JobTracker.

The Internet of Things has rapidly gained popularity in recent years. It can identify and control each and every thing on Earth with the help of the Internet. The concept of IoT was coined in 1999 by Kevin Ashton.

Data mining is also called Knowledge Discovery from Data (KDD): the automated extraction of patterns representing knowledge captured in large databases, warehouses, and other massive information repositories or data streams. KDD is used in various domains to uncover hidden information in data, and it has proved to be a strong base for many information systems. IoT gathers data from several places, and that data may itself contain further data. When KDD is applied to IoT, the data collected by IoT is converted into useful information that is later turned into "knowledge". It is important to keep track of the KDD procedure, as each stage is affected by the stage before it. Not all parts of the information are important for mining, which is why feature selection is used to pick the valuable attributes from each record in the database.
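One simple way to realize the feature-selection step mentioned above is a variance filter: attributes that are nearly constant across records carry little information for mining and can be dropped. The following sketch is illustrative only; the records, column meanings, and threshold are invented for the example.

```python
# Minimal feature-selection sketch for the KDD pre-processing step:
# keep only attributes whose variance exceeds a threshold, on the
# assumption that near-constant fields carry little information.

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

records = [  # rows: sensor readings; columns: temperature, status, humidity
    [21.0, 1.0, 40.0],
    [25.0, 1.0, 55.0],
    [19.0, 1.0, 35.0],
]
threshold = 0.5
columns = list(zip(*records))
keep = [i for i, col in enumerate(columns) if variance(col) > threshold]
print(keep)  # the constant "status" column is dropped; columns 0 and 2 remain
```

Real KDD pipelines use richer criteria (information gain, correlation with the target), but the shape of the step is the same: score each attribute, keep the valuable ones.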
A. Basic Idea of Using Data Mining for IoT
Creating data is much simpler than analyzing it. A number of studies have been done to solve the problem of big data in IoT. One possible solution is to apply KDD to IoT with the help of hardware and cloud computing. For a high-performance KDD data mining module for IoT, three major aspects need to be addressed by KDD, as follows.
Objective (O): There must be clear assumptions, measurements of the problem, and clear limitations.
Data (D): The characteristics of the data, such as its size, distribution, and representation, are the second most important aspect in data mining.
Mining Algorithm (A): With the objectives and data settled, the data mining algorithm can be clearly determined.
B. Data Mining for IoT Applications
As IoT devices and Internet sensors grow rapidly, applications in this field are increasing as well. Applications such as the smart city, which includes traffic control, residential e-meters, and pipeline leak detection, and applications such as health care have been successful so far.
C. A Hybrid Data Mining Algorithm
Many methods have been introduced to solve the classification problem; one of them is the hybrid classifier, implemented by combining non-evolutionary and evolutionary algorithms.
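As a hedged sketch of what such a hybrid might look like, the snippet below pairs a tiny evolutionary search (a mutation-only genetic algorithm over feature masks) with a non-evolutionary learner (nearest-centroid classification). The toy data, fitness function, and GA settings are all invented for the example and do not come from the paper.

```python
# Illustrative hybrid classifier: an evolutionary search chooses a feature
# subset, and a non-evolutionary nearest-centroid learner is evaluated on it.
import random

random.seed(0)

# toy data: feature 0 separates the classes, feature 1 is noise
X = [[1.0, 5.0], [1.2, -3.0], [0.9, 4.0], [9.0, 4.9], [9.2, -2.8], [8.8, 0.1]]
y = [0, 0, 0, 1, 1, 1]

def accuracy(mask):
    """Fitness: training accuracy of nearest-centroid on the masked features."""
    if not any(mask):
        return 0.0
    cents = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cents[label] = [sum(r[j] for r in rows) / len(rows)
                        for j in range(len(mask)) if mask[j]]
    correct = 0
    for x, lab in zip(X, y):
        feats = [x[j] for j in range(len(mask)) if mask[j]]
        pred = min(cents, key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(feats, cents[c])))
        correct += pred == lab
    return correct / len(y)

# tiny GA: random population, then mutation-only refinement of the best mask
population = [[random.randint(0, 1) for _ in range(2)] for _ in range(4)]
best = max(population, key=accuracy)
for _ in range(20):
    child = [bit ^ (random.random() < 0.3) for bit in best]
    if accuracy(child) >= accuracy(best):
        best = child
print(best, accuracy(best))
```

The division of labor is the point: the evolutionary part explores the combinatorial space of feature subsets, while the non-evolutionary learner supplies a cheap, deterministic fitness score for each candidate.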
In this article, various data mining technologies for IoT and big data, implemented using various algorithms, have been surveyed. First, the paper covers the concept of data mining, including big data mining. Second, existing data mining algorithms implemented with big data methods such as Apache Hadoop, MapReduce, and Apache Spark are discussed, along with a hybrid data mining algorithm. Moreover, papers from the last five years are surveyed, covering the current algorithms used to process large amounts of data. Hadoop was the first big data method invented, whereas Apache Spark is newly introduced. This paper also covers the data mining techniques used for IoT, including clustering and pattern mining technologies. IoT development is a new and ongoing process. Data mining algorithms need modification, whether in their mathematical operations, by combining algorithms in a hybrid approach, or by implementing a parallel approach on Hadoop and Spark. This paper thus gives an overall view of data mining techniques for big data and IoT.