Any discussion of data and storage has to mention Apache Hadoop. Apache Hadoop is an open-source software framework for storing and processing big data. It deals with the complexities of high volume, velocity and variety of data, and it provides storage and large-scale processing of data sets on clusters of commodity hardware.
Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
It was originally developed to support distribution for the Nutch search engine project. Doug, who is now chief architect of Cloudera, was working at Yahoo at that time, and it is popularly known that he named the project after his son's toy elephant.
It is said that the two-year-old, who had just begun to talk, was playing with his favourite toy and, like any other child, gave it a name. Cutting’s son called his yellow stuffed elephant ‘Hadoop’. Now that the name has become so popular, there are rumours that the father and his son, who is now 12 years old, share royalties from it.
The Apache Hadoop framework is composed of the following modules:
Hadoop Common: Also known as Hadoop Core, it acts as a supporting system that contains the libraries and utilities needed by the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. It is also the primary data storage system used by Hadoop applications.
Hadoop YARN: A resource-management platform that manages the computing resources in a cluster and uses them to schedule users’ applications.
Hadoop MapReduce: A programming model for large-scale data processing.
All of these modules are designed with one fundamental assumption: hardware failures (of individual machines or of whole racks of machines) are common and should be handled automatically in software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively.
For end users, MapReduce code is most commonly written in Java. With Hadoop Streaming, however, any programming language can be used to implement the map and reduce parts of the user’s program. Apache Pig and Apache Hive are related projects that expose higher-level user interfaces, namely Pig Latin and a SQL variant respectively. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
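To make the Hadoop Streaming idea concrete, here is a minimal word-count sketch in Python. It assumes the usual Streaming convention of tab-separated key/value text records, with reducer input sorted by key. The function names and the `run_locally` helper are hypothetical: they only imitate the map, sort and reduce steps inside one process and are not part of Hadoop.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive records that share a key.

    This relies on the records arriving sorted by key.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Hypothetical stand-in for the framework: map, sort by key, reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))


result = run_locally(["big data", "Big cluster"])
```

In a real Streaming job the mapper and reducer would run as separate scripts that read records from standard input and print them to standard output, and the sort between the two steps would be done by the framework rather than by `sorted`.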
HDFS and MapReduce: HDFS and MapReduce are open-source projects inspired by technologies created inside Google. They form the two main components at the core of Apache Hadoop 1.x: the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework.
The Hadoop Distributed File System (HDFS) is a scalable, distributed and portable file system written in Java for the Hadoop framework. An HDFS cluster typically consists of a single namenode plus a cluster of datanodes, although not every node in a Hadoop system has to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients communicate using Remote Procedure Calls (RPC).
HDFS is used to store large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and so does not require RAID storage on the hosts. With the default replication value of 3, data is stored on three nodes: two on the same rack, and one on a different rack. Datanodes talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements of a POSIX file system differ from the target goals of a Hadoop application. The tradeoff of not having a fully POSIX-compliant file system is increased performance for data throughput and support for non-POSIX operations such as Append.
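The block and replication arithmetic described above can be sketched in a few lines of Python. This is a toy model, not the placement policy HDFS actually implements: the 128 MB block size, the 300 MB file, the rack and datanode names, and both helper functions are assumptions chosen for illustration, and the placement simply takes the first nodes that satisfy the "two on one rack, one on another" rule.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes, in bytes, of the blocks a file is cut into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(topology):
    """Choose three datanodes: two on one rack and one on a different rack.

    topology maps a rack name to the list of datanodes on that rack.
    Picks are deterministic (first fit) to keep the sketch easy to follow.
    """
    for rack, nodes in topology.items():
        if len(nodes) < 2:
            continue
        for other_rack, other_nodes in topology.items():
            if other_rack != rack and other_nodes:
                return [nodes[0], nodes[1], other_nodes[0]]
    raise ValueError("need one rack with two nodes and another rack with one")


MB = 1024 * 1024
# Hypothetical two-rack cluster.
topology = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}

blocks = split_into_blocks(300 * MB, 128 * MB)  # assumed sizes
replicas = place_replicas(topology)
raw_bytes = sum(blocks) * 3  # every block is stored three times
```

Under these assumptions the 300 MB file becomes two full blocks and one 44 MB block, and it occupies 900 MB of raw disk space across the cluster. A real cluster would spread the blocks over many different datanodes instead of reusing the same three.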
Apache Hadoop NextGen MapReduce (YARN): MapReduce 2.0 (MRv2), or YARN, is the result of the complete overhaul that MapReduce underwent in hadoop-0.23. Apache™ Hadoop® YARN is a sub-project of Hadoop at the Apache Software Foundation, introduced in Hadoop 2.0, that separates the resource management and processing components. YARN was born of the need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce.
What does YARN do? YARN enhances the power of a Hadoop compute cluster in the following ways:
Scalability: The processing power in data centres continues to grow quickly. Because the YARN Resource Manager focuses exclusively on scheduling, it can manage these larger clusters much more easily.
Compatibility with MapReduce: Existing MapReduce applications and users can run on top of YARN without disruption to their existing processes.
Improved cluster utilization: The Resource Manager is a pure scheduler that optimizes cluster utilization according to criteria such as capacity guarantees, fairness, and SLAs.
Support for workloads other than MapReduce: Additional programming models, such as graph processing and iterative modelling, are now possible for data processing. These models allow enterprises to achieve near real-time processing and a better ROI on their Hadoop investments.
Agility: With MapReduce becoming a user-land library, it can evolve independently of the underlying resource manager layer and in a much more agile manner.
Functionality of YARN
YARN was developed to split up the two major responsibilities of the Job Tracker/Task Tracker, resource management and job scheduling/monitoring, into separate entities:
1) a global Resource Manager,
2) a per-application Application Master,
3) a per-node slave Node Manager, and
4) a per-application container running on a Node Manager.
The Node Manager and the Resource Manager form the new, generic system for managing applications in a distributed manner. The Resource Manager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application Application Master is a framework-specific entity, tasked with negotiating resources from the Resource Manager and working with the Node Manager(s) to execute and monitor the component tasks. With the help of a scheduler, the Resource Manager allocates resources to the various running applications according to constraints such as queue capacities and user limits. The scheduler performs its scheduling function based on the resource requirements of the applications.
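As a rough illustration of scheduling under queue constraints, the sketch below grants a memory request only when both the requesting queue and the cluster as a whole still have room. It is a deliberately simplified model, not YARN's scheduler: the class names, the queue names, the memory figures and the treatment of a queue capacity as a hard upper limit are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    capacity_mb: int   # most memory this queue may hold at once (assumed hard cap)
    used_mb: int = 0


@dataclass
class ToyScheduler:
    cluster_free_mb: int
    queues: dict = field(default_factory=dict)

    def allocate(self, queue_name, request_mb):
        """Grant the request only if the queue and the cluster both have room."""
        queue = self.queues[queue_name]
        fits_queue = queue.used_mb + request_mb <= queue.capacity_mb
        fits_cluster = request_mb <= self.cluster_free_mb
        if not (fits_queue and fits_cluster):
            return False
        queue.used_mb += request_mb
        self.cluster_free_mb -= request_mb
        return True


# Hypothetical cluster with 8 GB of memory shared by two queues.
scheduler = ToyScheduler(
    cluster_free_mb=8192,
    queues={"etl": Queue(capacity_mb=6144), "adhoc": Queue(capacity_mb=2048)},
)
granted = [
    scheduler.allocate("etl", 4096),    # fits the queue and the cluster
    scheduler.allocate("adhoc", 4096),  # exceeds the adhoc queue capacity
    scheduler.allocate("adhoc", 2048),  # fits exactly
    scheduler.allocate("etl", 4096),    # would push etl past its capacity
]
```

The decision depends only on the size of each request and the configured limits, which mirrors the point that the scheduler works from the applications' resource requirements.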
The Node Manager is the per-machine slave. It takes care of launching the applications’ containers, monitoring their resource usage (CPU, memory, disk and network), and reporting the same to the Resource Manager. Each Application Master has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring their progress. From the system perspective, the Application Master itself runs as a normal container.
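The reporting relationship between a Node Manager and the Resource Manager can be pictured with the toy model below. It tracks memory only, whereas the text also mentions CPU, disk and network. The class names, the container identifiers and the report format are hypothetical and do not correspond to YARN's real interfaces. Note that the Application Master is launched like any other container.

```python
class ToyNodeManager:
    """Per-machine agent that launches containers and reports memory usage."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.memory_mb = memory_mb
        self.containers = {}  # container id -> memory in MB

    def used_mb(self):
        return sum(self.containers.values())

    def launch(self, container_id, memory_mb):
        """Start a container if the node still has enough free memory."""
        if memory_mb > self.memory_mb - self.used_mb():
            raise RuntimeError(f"{self.name}: not enough memory for {container_id}")
        self.containers[container_id] = memory_mb

    def report(self):
        """Build the status report that is sent to the Resource Manager."""
        return {
            "node": self.name,
            "used_mb": self.used_mb(),
            "free_mb": self.memory_mb - self.used_mb(),
            "containers": sorted(self.containers),
        }


class ToyResourceManager:
    """Keeps the latest report from every node."""

    def __init__(self):
        self.reports = {}

    def receive(self, report):
        self.reports[report["node"]] = report

    def free_mb(self):
        return sum(r["free_mb"] for r in self.reports.values())


node = ToyNodeManager("node1", memory_mb=4096)  # hypothetical node
node.launch("app1_master", 512)   # the Application Master is a normal container
node.launch("app1_task1", 1024)   # a task container requested by that master

resource_manager = ToyResourceManager()
resource_manager.receive(node.report())
```

After the report is received, the Resource Manager in this sketch knows that `node1` runs two containers and has 2560 MB free, which is the kind of information a scheduler needs before granting further containers.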