Important Hadoop Terminology

CHAPTER 4: Important Hadoop Terminology

Important Hadoop Terminology is the fourth chapter in the HDFS Tutorial Series. In this section, I will talk about some of the important terms used in HDFS, or you can say in Hadoop. These terms are the building blocks, and you will use them throughout Hadoop, so please try to REALLY UNDERSTAND them.

If you are clear on these basics, learning Hadoop will be fun; otherwise you will never enjoy it and will always be wondering how things actually happen.

So let’s start with the basic Hadoop terminology, one by one-

1. Cluster- A cluster is a set of computers made up of a NameNode and DataNodes.
2. NameNode- The NameNode is a single computer, usually high-end hardware. It is the NameNode that acts like a monitor and supervises the operations performed by the DataNodes. It also stores the metadata of the files.
3. DataNode- A DataNode is also a computer, but DataNodes are mainly commodity hardware. It is the DataNodes that actually store the files and process them.
4. Mapper- Suppose our data is spread across multiple DataNodes and we have to find particular data in the entire cluster…how will we do it?
Here comes the role of the Mapper. The Mapper is the code that Hadoop runs on each DataNode holding a piece of the input, so the code goes to the data instead of the data being moved to the code. Each copy of the Mapper works on the data stored locally and emits intermediate results.
5. Reducer- Once the Mappers have run, the Reducer’s job is to collect the results from each Mapper and combine them into the final output.
6. JobTracker- The JobTracker is the service within Hadoop 1 that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.
7. TaskTracker- It is basically the JobTracker’s representative on the DataNodes. A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if there is none, it looks for an empty slot on a machine in the same rack.

8. Block- A block is the unit into which HDFS splits files. By default the block size is 64 MB in Hadoop 1 and 128 MB in Hadoop 2, and it can be changed as required.

9. Secondary NameNode- Many people assume it plays its role in the case of a fault. Suppose files are being processed on the DataNodes and suddenly a DataNode becomes faulty, so the data will be lost…right?

NO…HDFS is highly fault tolerant. Every block is replicated on more than one DataNode, so whenever a DataNode becomes faulty, the NameNode has the missing blocks copied again from the surviving replicas onto other DataNodes.

The Secondary NameNode is not a backup node for DataNodes, and despite its name it is neither a standby NameNode nor the Safemode of Hadoop. Its job is checkpointing: it periodically fetches the NameNode’s metadata files, merges them and hands the merged copy back, so that the edit log does not keep growing and a NameNode restart stays fast.

The metadata is kept in the below two files-

• FS Image- A snapshot of the file system metadata (the directory tree and the mapping of files to blocks)
• Edit Log- A record of every change made to the metadata since the last FS Image was written

10. Safemode- On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting.

Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
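
To make the Safemode exit rule described above concrete, here is a minimal Python sketch of the bookkeeping involved. It is only an illustration and not Hadoop code: the names BlockStatus, safe_fraction, can_leave_safemode and blocks_to_replicate are made up for this example, the threshold and replica counts are illustrative values, and the extra 30-second wait is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockStatus:
    """Hypothetical record of what the NameNode knows about one block."""
    block_id: str
    reported_replicas: int  # replicas seen so far in DataNode Blockreports


def safe_fraction(blocks: List[BlockStatus], min_replicas: int = 1) -> float:
    """Fraction of blocks whose reported replicas reach the minimum."""
    if not blocks:
        return 1.0  # nothing to wait for in an empty file system
    safe = sum(1 for b in blocks if b.reported_replicas >= min_replicas)
    return safe / len(blocks)


def can_leave_safemode(blocks: List[BlockStatus],
                       threshold: float = 0.999,
                       min_replicas: int = 1) -> bool:
    """True once the configured percentage of blocks is safely replicated.

    The additional waiting period after the threshold is reached is not modelled.
    """
    return safe_fraction(blocks, min_replicas) >= threshold


def blocks_to_replicate(blocks: List[BlockStatus],
                        target_replicas: int = 3) -> List[str]:
    """After Safemode: ids of blocks that still have too few replicas."""
    return [b.block_id for b in blocks if b.reported_replicas < target_replicas]


if __name__ == "__main__":
    reported = [
        BlockStatus("blk_1", 3),
        BlockStatus("blk_2", 1),
        BlockStatus("blk_3", 0),  # no DataNode has reported this block yet
        BlockStatus("blk_4", 2),
    ]
    print("safe fraction:", safe_fraction(reported))
    print("can leave Safemode:", can_leave_safemode(reported))
    print("to replicate later:", blocks_to_replicate(reported))
```

The point of keeping the two checks separate is that they use different numbers: leaving Safemode only needs the minimum replica count per block, while the replication that follows aims for the full specified number of replicas.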

A few things to remember here-

• In a cluster, there can be only one NameNode but many DataNodes
• The NameNode and the DataNodes are nothing but computers
• The number of blocks depends on the file size. All the blocks will be of the same size except the last one, which holds whatever data remains. For example, a 300 MB file with a 128 MB block size becomes three blocks of 128 MB, 128 MB and 44 MB.
• The number of Mappers is the same as the number of input splits, and by default there is usually one input split per block
• Hadoop can work without a Reducer but not without a Mapper
• The number of output files will be equal to the number of Reducers
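
The points above can be tied together with a toy word count job. The sketch below is plain Python and not the Hadoop API: every function name is hypothetical, a "block" here holds a fixed number of lines rather than a number of megabytes, and the partition function is a simple stand-in for the hash-based partitioning Hadoop uses.

```python
from collections import defaultdict


def split_into_blocks(records, block_size):
    """Cut the input into fixed-size blocks; only the last one may be smaller."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def mapper(block):
    """Emit a (word, 1) pair for every word found in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def partition(key, num_reducers):
    """Choose the reducer for a key (deterministic stand-in for a hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(pairs):
    """Add up the counts of every word routed to this reducer."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(records, block_size, num_reducers):
    """Return (blocks, outputs) for a toy word count job."""
    blocks = split_into_blocks(records, block_size)
    map_outputs = [mapper(block) for block in blocks]  # one mapper per block
    if num_reducers == 0:
        # Map-only job: each mapper's output is a final output.
        return blocks, map_outputs
    shuffled = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for word, count in pairs:
            shuffled[partition(word, num_reducers)].append((word, count))
    # One output per reducer, even if a reducer received no keys.
    return blocks, [reducer(pairs) for pairs in shuffled]


if __name__ == "__main__":
    lines = [
        "big data is big",
        "hadoop stores big data",
        "hadoop runs mappers",
        "reducers merge results",
        "hdfs splits files",
    ]
    blocks, outputs = run_job(lines, block_size=2, num_reducers=2)
    print("block sizes:", [len(b) for b in blocks])
    print("number of outputs:", len(outputs))
    for i, part in enumerate(outputs):
        print("reducer", i, part)
```

Calling run_job with num_reducers=0 shows the map-only case: the job still produces results, one per Mapper, which is why a job can run without a Reducer but not without a Mapper.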

Now let me take you through a couple of interesting and VERY IMPORTANT chapters.
