Hadoop and MongoDB
What is Hadoop?
Hadoop is a software technology designed for storing and processing large volumes of data using a cluster of commodity servers and commodity storage. Hadoop is an open-source Apache project started in 2005 by engineers at Yahoo. It consists of a distributed file system, called HDFS, and a data processing and execution model called MapReduce.
A Brief History of Hadoop
Like a number of innovative software technologies developed over the past decade, Hadoop’s design is based on published research describing products created at Google. In 2003 and 2004 Google published two seminal papers – the first on their distributed file system, GFS, and the second on their software for executing data processing tasks across a large numbers of servers, MapReduce. Both papers built on decades of computer science and Google’s unique combination of data and computing resources to describe an architecture for reliably and cost-effectively processing large volumes of data.
Like Google, Yahoo was also in the business of maintaining an index of the internet’s data, and in 2005 members of their engineering began to build a system based on the ideas presented by Google’s research papers. Doug Cutting, inventor of Lucene, Nutch and other software projects, was the lead engineer at Yahoo and named the new project after his son’s stuffed yellow elephant, Hadoop.
Many other software projects have been created to complement core Hadoop, including tools that move data into and out of the system, and languages that make it easier to process the data. Today several software companies have been created to provide support for Hadoop as well as value-added tools to make Hadoop easier to install, manage and use.
Applications submit work to Hadoop as jobs. Each job is expressed in the form of MapReduce. Jobs are submitted to a Master Node in the Hadoop cluster, to a centralized process called the JobTracker. One notable aspect of Hadoop’s design is that processing is moved to the data rather than data being moved to the processing. Accordingly, the JobTracker compiles jobs into parallel tasks that are distributed across the copies of data stored in HDFS. The JobTracker maintains the state of tasks and coordinates the result of the job from across the nodes in the cluster. Hadoop 2.0 introduces YARN, a refining of the JobTracker role that allows for better resource management within the system.
Hadoop determines how best to distribute work across resources in the cluster, and how to deal with potential failures in system components should they arise. A natural property of the system is that work tends to be uniformly distributed – Hadoop maintains multiple copies of the data on different nodes, and each copy of the data requests work to perform based on its own availability to perform tasks. Copies with more capacity tend to request more work to perform.
Properties of the System
There are several architectural properties of Hadoop that help to determine the types of applications suitable for the system:
- Hadoop processes data stored in HDFS.
- HDFS provides a write-once-read-many access model for data.
- HDFS is optimized for large files (64MB by default).
- HDFS maintains multiple copies of the data for fault tolerance.
- HDFS is designed for high-throughput rather than low-latency.
- HDFS is not schema-based; data of any type can be stored.
- Hadoop jobs define a schema for reading the data within the scope of the job.
- Hadoop does not use indexes. Data is scanned for each query.
- Hadoop jobs tend to execute over several minutes.
How Organizations Are Using Hadoop
Hadoop is designed for data and processing tasks that exceed the capacity of a single server, and applications tend to be based on large volumes of data. Organizations use Hadoop for retrospective, read-only applications such as:
- Cost-effective data archiving
- Data Analysis
- Data Aggregation
- Data Transformation
- Pattern Matching
Other tools have been developed to simplify the access and processing of data in Hadoop:
- Hive – SQL-like access to data
- Pig – a scripting language for accessing and transforming data
- Sqoop – a tool for moving data between relational databases and Hadoop
- Flume – a tool for collecting data from log files into HDFS
- MongoDB Hadoop Connector – a framework for accessing data in MongoDB from MapReduce jobs, including HIVE and Pig, and incrementally writing output back to collections in MongoDB.