Friday, June 10, 2011

Introducing Hadoop

Today every action we perform is recorded as data in one or more locations, since many web applications are integrated with one another and share data easily. A single action can therefore produce multiple records. We are dealing with petabytes of data these days, and traditional technologies are struggling to process data at this scale.

Google was the first to introduce MapReduce, a framework for processing large-scale data. Doug Cutting developed an open source implementation of this MapReduce system called Hadoop. Today web giants like Yahoo and Facebook use Hadoop for large-scale data processing. Large-scale distributed data processing has become an essential skill for programmers, and many universities have added Hadoop to their curricula.

Hadoop is an open source framework for developing and running distributed applications that process large amounts of data. It is accessible, robust, scalable and simple. Hadoop runs on large clusters of machines, including cloud computing services. Because it is intended to run on commodity hardware, Hadoop is architected with the assumption that hardware will malfunction frequently. It scales to handle larger data simply by adding more nodes to the cluster, and it allows developers to quickly write efficient parallel code.

Hadoop has its own file system, the “Hadoop Distributed File System (HDFS)”, which breaks data into smaller blocks (64 MB by default) and distributes them among the machines in the cluster.
Hadoop uses key/value pairs as its basic data unit. Data can originate in any form, but it is eventually transformed into key/value pairs for the processing functions to work on. This makes the Hadoop MapReduce platform very flexible: it works just as readily with unstructured text and logs as with typical structured relational data.
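A little arithmetic makes the default split concrete. This is only an illustrative sketch (plain Python, not the Hadoop API); the function name is my own:

```python
import math

# HDFS default block size at the time of writing (illustrative constant).
BLOCK_SIZE_MB = 64

def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """How many HDFS blocks a file of the given size occupies:
    ceil(file_size / block_size)."""
    return math.ceil(file_size_mb / block_size_mb)

print(num_blocks(1024))  # a 1 GB file -> 16 blocks
print(num_blocks(100))   # a 100 MB file -> 2 blocks (64 MB + 36 MB)
```

Each of those blocks can live on a different machine, which is what lets the cluster read and process one large file in parallel.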

The MapReduce model works with two methods, named ‘map’ and ‘reduce’. The ‘map’ method processes the input data, and the ‘reduce’ method processes the outputs of the mappers. The ‘map’ calls run in parallel and in isolation from one another, which is the secret of MapReduce’s good performance.
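The map/reduce flow above can be sketched on a single machine with the classic word-count example. This is a conceptual simulation in plain Python, not the Hadoop API, and the function names are my own:

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: fold all counts emitted for one word into a total."""
    return (key, sum(values))

def map_reduce(lines):
    # Shuffle step: group all mapper outputs by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce step: one reduce call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

result = map_reduce(["hello hadoop", "hello world"])
print(result)  # {'hello': 2, 'hadoop': 1, 'world': 1}
```

Because each `map_fn` call touches only its own line and each `reduce_fn` call touches only its own key’s values, a framework like Hadoop can run them on different machines without any coordination between the calls.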

Once an application is written using the MapReduce framework, it can scale to run on tens of thousands of machines in a cluster. This is what has attracted developers around the world to Hadoop.
