Apache Hadoop is a framework that facilitates the processing of large data sets across multiple computers using a simple programming model: the map/reduce paradigm.
Instead of relying on high-availability hardware, the framework itself is designed to detect and handle failures at the application layer.
It is an open-source project under the Apache Software Foundation, with a global community of contributors, the most significant of which has been Yahoo!.
Hadoop was created by Doug Cutting, who named it after his son’s plush toy. It was initially developed in 2004 to provide a distributed system for the Nutch search engine, and it is based on the papers about the Google File System (GFS) and MapReduce that Google had made public at that time.
Apache Hadoop is a freely licensed software framework for distributed computing that lets you write and run distributed applications that process large amounts of data. Applications can run on hundreds of independent machines and can process petabytes of information.
Hadoop includes the Hadoop Common utilities, which provide access to the Hadoop file system. The Hadoop Common package contains the JAR files and the scripts needed to launch Hadoop; the distribution also includes source code, documentation, and a contrib section that brings together projects from the Hadoop developer community.
Hadoop has two main components: HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator), and, as we have seen above, it uses the MapReduce computational paradigm. This paradigm is the basis of Hadoop, because the system was conceived around the idea of data locality: it is much quicker and easier to send the computation – think of JARs or even bash scripts – to where the data is than to bring the data – think of terabytes – across the network to where those scripts are. Seen this way, the decision to have two main subsystems becomes clearer: you need one system to handle data and another to handle computation and resource management, as we will see shortly.
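To make the paradigm concrete, here is a minimal in-memory sketch of map/reduce as word count. This is not the Hadoop API – class and method names are illustrative – it only shows the two phases: a map step that emits (key, value) pairs and a reduce step that merges values for the same key.

```java
import java.util.*;
import java.util.stream.*;

// A toy, in-memory sketch of the map/reduce paradigm (word count).
// Not the Hadoop API; it only illustrates the two phases.
public class MapReduceSketch {
    // Map phase: turn one input line into (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1));
    }

    // Reduce phase: sum the values emitted for the same key.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.toMap(
                Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hadoop stores data", "hadoop moves computation");
        Map<String, Integer> counts = reduce(lines.stream().flatMap(MapReduceSketch::map));
        System.out.println(counts.get("hadoop")); // 2
    }
}
```

In real Hadoop the map and reduce phases run as separate tasks on different nodes, with a shuffle phase in between that routes all pairs with the same key to the same reducer.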
Without further ado, let’s talk a bit about the HDFS component. As its name suggests, HDFS is a distributed file system, implementing a “relaxed” POSIX standard, that was designed to run on commodity hardware. The simplest HDFS cluster has two types of daemons, Namenode and Datanode, in a master-slave architecture. HDFS exposes files that are stored as blocks (128 MB by default) on Datanodes. Typically there is only one Namenode in the cluster, whose role is to keep the file system’s metadata. It regulates users’ access to files, as well as operations such as opening, closing, and renaming files, and at the same time it maintains a mapping of blocks to Datanodes. The Datanodes hold all the blocks the file system is composed of and are responsible for serving read and write requests.
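The block layout is simple arithmetic: a file is cut into fixed-size blocks and only the last one may be shorter. A small sketch, assuming the default 128 MB block size (the class and method names are illustrative):

```java
// Sketch of how HDFS splits a file into fixed-size blocks.
// 128 MB is the default block size; class/method names are illustrative.
public class BlockCount {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MiB

    static long blocksFor(long fileSizeBytes) {
        // Ceiling division: the last block may be smaller than BLOCK_SIZE.
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blocksFor(oneGiB)); // 8 blocks for a 1 GiB file
    }
}
```

The Namenode only tracks which Datanodes hold each of those blocks; the block contents themselves never pass through it.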
HDFS was designed with fault tolerance in mind, replicating blocks across Datanodes – a design decision that also aids the distribution of computations, but we will talk about this later. HDFS can be configured for High Availability, which allows “rolling restarts” and at the same time prevents cluster downtime in case of an unexpected Namenode failure.
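The replication factor is set per cluster through the `dfs.replication` property in `hdfs-site.xml`; the fragment below shows the default value of 3 (two replicas on other nodes besides the original block):

```xml
<!-- hdfs-site.xml: how many copies of each block to keep (default: 3) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```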
The second component of Hadoop, the one that deals with computations, is YARN (also known as MapReduce 2.0), available only from version 2.x onward. YARN splits its functionality across three types of daemons: ResourceManager, NodeManager, and ApplicationMaster. The YARN architecture also works on the master/slave principle: the cluster has a single ResourceManager that serves resource requests (memory and, from 2.6, CPUs) for the entire cluster, while NodeManagers are spread across all hosts in the cluster. In general, we can think of the ResourceManager as an operating system’s CPU scheduler and of the NodeManagers as the processes asking for time on the processor – a view known as the Datacenter as a Computer.
To see all the parts as a whole, we can analyze the life cycle of a YARN job. The unit of computation in YARN is the container, which is essentially a JVM. A container can take one of two roles: it can be a mapper or a reducer. Once a job is submitted for execution, the ResourceManager allocates an ApplicationMaster on one of the nodes in the cluster, which is in turn responsible for executing the job.
The ApplicationMaster negotiates with the ResourceManager for more resources, and with the NodeManagers, which launch the containers needed to complete the job. Because of the data locality principle, the ApplicationMaster may ask several NodeManagers to work on the same task (when those NodeManagers run on nodes that hold replicas of the same blocks). This approach is all the more useful in large clusters, which have a higher failure rate, because it makes those failures invisible to the running jobs. NodeManagers also interact with the ResourceManager, but only to report their available resources.
Some features of YARN are worth mentioning. The cluster can be configured for High Availability, which, starting with 2.6, also transfers the state of running tasks from the active ResourceManager to the standby one in case of failure; this means the cluster can operate at normal parameters during a rolling restart or unexpected downtime. Another feature is the existence of queues, which allow the cluster to be shared by several teams at once, with some safeguards.
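With the CapacityScheduler, such queues are declared in `capacity-scheduler.xml`. The fragment below is an illustrative two-queue setup (the queue names and percentages are assumptions, not defaults); the capacities of sibling queues must sum to 100:

```xml
<!-- capacity-scheduler.xml: two illustrative queues sharing the cluster -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,analytics</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>30</value>
</property>
```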