Big data often refers to the use of predictive analytics, user behavior analytics, or other advanced data analysis methods that extract value from data, rather than to any particular size of data set. The amount of data being created and stored globally is almost unimaginable, and it keeps growing. That means there is ever more potential to draw key insights from business information, yet only a small percentage of data is actually analyzed. What does that mean for businesses? How can they make better use of the raw information that flows into their organizations every day? Analysis of such data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on". Data that exhibits the four characteristics of Volume, Variety, Velocity, and Veracity is called big data, and it requires real-time distributed processing. Hadoop was devised to process this enormous amount of flowing data.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs. Hadoop has two main systems: the Hadoop Distributed File System (HDFS) and the MapReduce engine. A key advantage of HDFS is its support for non-structured data, such as shopping patterns, where an RDBMS breaks down at higher volumes even with features like BLOB data types. So, when we have a small amount of structured or semi-structured data and regular DML operations are required, it is better to use a traditional RDBMS; when the amount of data is large and only storage of data is required, it is better to use HDFS, as in a search engine.
This paper focuses on the comparison of Hadoop with a traditional relational database. It also covers the characteristics and advantages of Hadoop. Big Data is a familiar term describing voluminous amounts of structured, semi-structured, and unstructured data that have the potential to be mined for information. Although big data does not refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data.
In the big data world, the sheer volume, velocity, and variety of data render most ordinary technologies ineffective. To overcome this, companies like Google and Yahoo! needed to find ways to manage all the data their servers were gathering in an efficient, cost-effective way. Hadoop was originally created by Doug Cutting, who later became a Yahoo! engineer, drawing on Google's published work on distributed data storage and processing. Hadoop was Yahoo!'s attempt to break the big data problem into small pieces that could be processed in parallel. Hadoop is now an open-source project available under the Apache License 2.0 and is widely used by many companies to manage large volumes of data successfully.
In the early 2000s, search engines were created to locate relevant information in the web's text-based content. In the early years, search results were curated by humans, but as the web grew from dozens to millions of pages, automation was needed, and search engines like Yahoo! and AltaVista were introduced.
One such project was an open-source web search engine called Nutch, the brainchild of Doug Cutting and Mike Cafarella. They wanted to return web search results faster by distributing data and calculations across different computers, so that multiple tasks could be performed simultaneously. During this time, Google's search engine project was also in progress, based on the same concept: storing and processing data in a distributed, automated way so that relevant web search results could be returned faster.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided – the web crawler portion remained as Nutch and the distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.
Unlike Hadoop, a traditional RDBMS cannot be used to process and store very large amounts of data, or simply big data. The following are some differences between Hadoop and a traditional RDBMS.
For example, if you want to write code to transfer money from one bank account to another, you have to handle every scenario yourself, such as what happens if money is withdrawn from one account but a failure occurs before it is deposited into the other. In an RDBMS, a transaction guarantees that either both steps happen or neither does.
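As a rough sketch of that guarantee on the RDBMS side, the JDBC fragment below (the accounts table, its columns, and the connection URL passed in as url are hypothetical) wraps the debit and the credit in one transaction, so a failure between the two updates rolls the debit back as well:

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferFunds {
  // Moves `amount` between two accounts atomically: either both the debit
  // and the credit happen, or neither does.
  public static void transfer(String url, long fromId, long toId, BigDecimal amount)
      throws SQLException {
    try (Connection conn = DriverManager.getConnection(url)) {
      conn.setAutoCommit(false);  // start an explicit transaction
      try (PreparedStatement debit = conn.prepareStatement(
               "UPDATE accounts SET balance = balance - ? WHERE id = ?");
           PreparedStatement credit = conn.prepareStatement(
               "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
        debit.setBigDecimal(1, amount);
        debit.setLong(2, fromId);
        debit.executeUpdate();

        credit.setBigDecimal(1, amount);
        credit.setLong(2, toId);
        credit.executeUpdate();

        conn.commit();            // both updates become visible together
      } catch (SQLException e) {
        conn.rollback();          // a failure mid-way undoes the debit as well
        throw e;
      }
    }
  }
}
```

Without such a transaction mechanism, as in plain HDFS storage, the application itself has to detect and repair a half-completed transfer.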
Hadoop has two main systems: the Hadoop Distributed File System (HDFS) and the MapReduce engine.
Data written to a server with the Hadoop Distributed File System is written once and can subsequently be read and re-used many times. This write-once, read-many model, compared with the read/write patterns of other file systems, explains much of the speed with which Hadoop operates. It is the reason HDFS is an excellent choice for the high volumes and velocity of data required today.
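The snippet below is a minimal sketch of this write-once, read-many pattern using Hadoop's Java FileSystem API; the /user/demo/events.log path is hypothetical, and it assumes a cluster reachable through the fs.defaultFS setting in core-site.xml:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    // Connects to whichever cluster is named in core-site.xml (fs.defaultFS).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/events.log");  // hypothetical path

    // Write once: create the file and stream bytes into it.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("first event record\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many times: any number of readers can open the same blocks later.
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.close();
  }
}
```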
Each HDFS cluster contains the following:
NameNode: Runs on a "master node" that tracks and directs the storage of the cluster.

DataNode: Runs on "slave nodes," which make up the majority of the machines within a cluster. The NameNode instructs data files to be split into blocks, each of which is replicated three times and stored on machines across the cluster. These replicas ensure the entire system won't go down if one server fails or is taken offline, a property known as "fault tolerance."

Client machine: Neither a NameNode nor a DataNode, client machines have Hadoop installed on them. They are responsible for loading data into the cluster, submitting MapReduce jobs, and viewing the results once the jobs complete.
In HDFS, there is a main "NameNode" and multiple "data nodes" on a commodity hardware cluster. All the nodes are usually organized within the same physical rack in the data center. Data is then broken down into separate "blocks" that are distributed among the various data nodes for storage. Blocks are also replicated across nodes to reduce the impact of failures.
The NameNode is the "smart" node in the cluster. It knows exactly which blocks are located within which data node and where the data nodes are located within the machine cluster. The NameNode also manages access to the files, including reads, writes, creates, deletes and replication of data blocks across different data nodes.
To complete a certain task, the data nodes constantly communicate with the NameNode. This constant communication ensures that the NameNode is aware of each data node's status at all times. Since the NameNode assigns tasks to the individual data nodes, should it realize that a data node is not functioning properly, it can immediately re-assign that node's task to a different node containing the same data block. Data nodes also communicate with each other so they can cooperate during normal file operations. Clearly, the NameNode is critical to the whole system and should be replicated to prevent system failure.
Again, data blocks are replicated across multiple data nodes and access is managed by the NameNode. This means that when a data node no longer sends a "life signal" to the NameNode, the NameNode unmaps the data node from the cluster and keeps operating with the other data nodes as if nothing had happened. When this data node comes back to life or a different (new) data node is detected, that new data node is (re-)added to the system. That is what makes HDFS resilient and self-healing. Since data blocks are replicated across several data nodes, the failure of one server will not corrupt a file. The degree of replication and the number of data nodes are set when the cluster is implemented, and they can be dynamically adjusted while the cluster is operating.
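As a hedged illustration of adjusting replication while the cluster is running, the FileSystem API lets a client ask the NameNode for a different replication factor on a per-file basis; the path below is hypothetical, and the cluster-wide default is normally set through the dfs.replication property:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Ask the NameNode to keep 5 replicas of this file's blocks instead of
    // the cluster default (typically 3); the change is applied while the
    // cluster keeps running.
    boolean accepted = fs.setReplication(new Path("/user/demo/events.log"), (short) 5);
    System.out.println("replication change accepted: " + accepted);

    fs.close();
  }
}
```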
Data integrity is also carefully monitored by HDFS’s many capabilities. HDFS uses transaction logs and validations to ensure integrity across the cluster. Usually, there is one NameNode and possibly a data node running on a physical server in the rack, while all other servers run data nodes only.
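One way a client can surface those integrity checks is to ask HDFS for the checksum it derives from its block-level checksums; the sketch below uses a hypothetical path, and the returned value can be compared against a previously recorded one to detect corruption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifyIntegrity {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // HDFS maintains block-level checksums; this asks the cluster for the
    // file-level checksum it derives from them.
    FileChecksum checksum = fs.getFileChecksum(new Path("/user/demo/events.log"));
    System.out.println(checksum);

    fs.close();
  }
}
```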
Hadoop MapReduce is an implementation of the MapReduce algorithm developed and maintained by the Apache Hadoop project. The general idea of the MapReduce algorithm is to break down the data into smaller manageable pieces, process the data in parallel on your distributed cluster, and subsequently combine it into the desired result or output.
Hadoop MapReduce includes several stages, each with an important set of operations designed to handle big data. The first step is for the program to locate and read the "input file" containing the raw data. Since the file format is arbitrary, the data must be converted to something the program can process. This is the function of "InputFormat" and "RecordReader" (RR). InputFormat decides how to split the file into smaller pieces (using a function called InputSplit). Then the RecordReader transforms the raw data for processing by the map. The result is a sequence of "key" and "value" pairs.
Once the data is in a form acceptable to map, each key-value pair of data is processed by the mapping function. To keep track of and collect the output data, the program uses an "OutputCollector". Another function called "Reporter" provides information that lets you know when the individual mapping tasks are complete.
Once all the mapping is done, the Reduce function performs its task on each output key-value pair. Finally, an OutputFormat feature takes those key-value pairs and organizes the output for writing to HDFS, which is the last step of the program.
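To make these stages concrete, below is a minimal sketch of the classic word-count job, written against the older org.apache.hadoop.mapred API whose OutputCollector and Reporter interfaces are the ones described above; the input and output paths are taken from the command line, and the class and job names are only illustrative:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

  // Map stage: the RecordReader hands each line to map() as a
  // (byte offset, line text) pair; we emit (word, 1) pairs via the OutputCollector.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce stage: all counts for the same word arrive together; we sum them.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    // InputFormat splits the input file and supplies the RecordReader;
    // OutputFormat writes the final (word, count) pairs back to HDFS.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
```

Run against a directory of text files, the job writes one (word, count) line per distinct word into the output directory on HDFS.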
Hadoop MapReduce is the heart of the Hadoop system. It is able to process the data in a highly resilient, fault-tolerant manner. Obviously, this is just an overview of a larger and growing ecosystem with tools and technologies adapted to manage modern big data problems.
Other software components that can run on top of or alongside Hadoop and have achieved top-level Apache project status include Hive, HBase, Pig, ZooKeeper, and Spark.
Future scope of Hadoop Technology:- Hadoop is among the major big data technologies and has a vast scope in the future. Being cost-effective, scalable, and reliable, it is employed by many organizations around the world. Its capabilities include storing data on a cluster that keeps working despite machine or hardware failures, adding new hardware to the nodes, and so on.
As the size of data increases, the demand for Hadoop technology will also increase, and more Hadoop developers will be needed to deal with big data challenges.
Following are the different profiles of Hadoop developers according to their expertise and experience in Hadoop technology.
Nowadays, with the rapid growth of data volumes, the storage and processing of big data have become among the most pressing needs of enterprises. Hadoop, as an open-source distributed computing platform, has become an excellent choice for business. Due to its high performance, Hadoop has been widely adopted by many companies.
1. Hadoop in Yahoo!:- Yahoo! is the leader in Hadoop technology research and applications. It applies Hadoop to various products, including data analysis, content optimization, its anti-spam email system, and advertising optimization. Hadoop has also been fully used in user interest prediction, search ranking, and advertising placement. For Yahoo! homepage personalization, the real-time service system reads data from the database into the interest mapping through Apache. Every 5 minutes, the system rearranges the contents based on the Hadoop cluster and updates the contents every 7 minutes. Concerning spam emails, Yahoo! uses the Hadoop cluster to score emails. Every couple of hours, Yahoo! improves the anti-spam email model in the Hadoop clusters, and the clusters handle the delivery of 5 billion emails every day. At present, the largest application of Hadoop is Yahoo!'s Search Webmap, which runs on more than 10,000 Linux cluster machines.
2. Hadoop in Facebook:- Facebook is known to be the largest social network in the world. Between 2004 and 2009 it grew to over 800 million active users, and the data created every day is huge. This means that Facebook faces a big data processing problem that spans content maintenance, photo sharing, comments, and users' access histories. These data are not easy to process, so Facebook has adopted Hadoop and HBase to handle them.
The availability of big data, low-cost hardware, and new information management and analytic software has produced a unique moment in the history of data analysis. In general, a variety of data can now be processed: structured, semi-structured, and unstructured. A traditional RDBMS is suited only to managing structured and semi-structured data, not unstructured data. Hadoop has the ability to store and process all varieties of data, whether structured, semi-structured, or unstructured, and it is mostly used to process large amounts of unstructured data. So we can say that Hadoop is far better suited than a traditional relational database management system for these workloads.