Advanced Analytics Forum


[Solved] Top 10 Hadoop Interview Question And Answers - Experts Choice


Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Originally designed for computer clusters built from commodity hardware, which is still the common use, it has also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

1) What is Hadoop and Big Data?


Big Data and Hadoop are technologies used to handle large amounts of data. Big Data refers to large volumes of structured and unstructured data that cannot be stored or processed with traditional data storage techniques. Hadoop, on the other hand, is a tool used to handle Big Data.

2) What is Hadoop? What are the primary components of Hadoop?


Hadoop is an infrastructure equipped with the tools and services required to process and store Big Data, and it is a widely used solution to Big Data challenges. Furthermore, the Hadoop framework also helps organizations analyze Big Data and make better business decisions.

The primary components of Hadoop are:

  • HDFS
  • Hadoop MapReduce
  • Hadoop Common
  • YARN
  • Pig and Hive – the data access components
  • HBase – for data storage
  • Ambari, Oozie and ZooKeeper – data management and monitoring components
  • Thrift and Avro – data serialization components
  • Apache Flume, Sqoop and Chukwa – the data integration components
  • Apache Mahout and Drill – data intelligence components

3) Name some practical applications of Hadoop.

Here are some real-life instances where Hadoop is making a difference:

  • Managing street traffic
  • Detecting and preventing fraud
  • Analyzing customer data in real time to improve customer service
  • Accessing unstructured medical data from physicians, HCPs, etc., to improve healthcare services

4) What are the various Hadoop daemons, and what are their roles in a Hadoop cluster?

Generally, approach this question by first explaining the HDFS daemons, i.e. NameNode, DataNode and Secondary NameNode, then moving on to the YARN daemons, i.e. ResourceManager and NodeManager, and lastly explaining the JobHistoryServer.

NameNode: It is the master node responsible for storing the metadata of all files and directories. It knows which blocks make up a file and where those blocks are located in the cluster.

DataNode: It is the slave node that contains the actual data.

Secondary NameNode: It periodically merges the changes (edit log) with the FsImage (Filesystem Image) present in the NameNode. It stores the merged FsImage in persistent storage, which can be used in case of NameNode failure.

ResourceManager: It is the central authority that manages resources and schedules applications running on top of YARN.

NodeManager: It runs on slave machines, and is responsible for launching the application’s containers (where applications execute their part), monitoring their resource usage (CPU, memory, disk, network) and reporting these to the ResourceManager.

JobHistoryServer: It maintains information about MapReduce jobs after the Application Master terminates.
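The NameNode/DataNode split described above can be illustrated with a minimal sketch in plain Python (hypothetical file and node names, not the Hadoop API): the NameNode holds only metadata mapping files to blocks and blocks to DataNodes, while the DataNodes hold the actual block contents.

```python
# NameNode metadata (hypothetical data): which blocks make up each file...
file_to_blocks = {
    "/logs/app.log": ["blk_1", "blk_2"],
}
# ...and on which DataNodes each block's replicas live.
block_locations = {
    "blk_1": ["datanode-1", "datanode-3"],
    "blk_2": ["datanode-2", "datanode-3"],
}

def locate(path):
    """Resolve a file path to the DataNodes holding each of its blocks.

    This is all the NameNode needs to answer a read request; the file
    contents themselves are only ever stored on the DataNodes."""
    return {blk: block_locations[blk] for blk in file_to_blocks[path]}

print(locate("/logs/app.log"))
```

A client reading `/logs/app.log` would first ask the NameNode for this mapping, then fetch each block directly from one of its DataNodes.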

5) Name the most common Input Formats defined in Hadoop. Which one is the default?


The three most common Input Formats defined in Hadoop are:

  • TextInputFormat
  • KeyValueInputFormat
  • SequenceFileInputFormat

TextInputFormat is the Hadoop default.
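As a rough illustration of what TextInputFormat produces (a plain-Python sketch, not the actual Hadoop classes): each record's key is the byte offset of the line within the file, and the value is the line's contents.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line) pairs, roughly as TextInputFormat does:
    the key is where the line starts in the file, the value is the line."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n")
        offset += len(line)

records = list(text_input_records(b"first line\nsecond line\n"))
# → [(0, b"first line"), (11, b"second line")]
```

The byte-offset keys are why mappers over text input usually ignore the key and only look at the value.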

6) How are Hadoop and Big Data interrelated?


Big Data is the collection and analysis of large, complex data sets; this is exactly what Hadoop is built for.

Apache Hadoop is an open-source framework used for storing, processing, and interpreting complex unstructured data sets so that businesses can obtain insights and predictive analysis.

The main components of Hadoop are:

  • MapReduce – A programming model which processes massive datasets in parallel
  • HDFS– A Java-based distributed file system used for data storage
  • YARN – A framework that handles resources and requests from assigned applications.

7) Explain the three running modes of Hadoop.

Hadoop runs in three modes:

Standalone Mode

This is Hadoop's default mode, and it uses the local file system for both input and output operations. This mode is mainly used for debugging and does not support the use of HDFS.

Pseudo-Distributed Mode (Single-Node Cluster)

In this mode, the user configures all three configuration files (core-site.xml, hdfs-site.xml and mapred-site.xml). Here, both the master and slave nodes are the same, since all daemons run on a single node.

Fully Distributed Mode (Multi-Node Cluster)

This mode is the production phase of Hadoop, where data is distributed across several nodes of a Hadoop cluster, with separate machines acting as master and slaves.
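For example, pseudo-distributed mode is typically selected through the configuration files. A minimal core-site.xml and hdfs-site.xml might look like this (the port number is illustrative; `fs.defaultFS` and `dfs.replication` are real Hadoop properties):

```xml
<!-- core-site.xml: point the default file system at a local HDFS NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: one replica is enough on a single node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Leaving `fs.defaultFS` unset (it defaults to the local file system) is what gives you standalone mode.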


8) What are the most common Input Formats in Hadoop?

There are three common input formats in Hadoop:

  • Text Input Format: the default input format in Hadoop; files are broken into lines
  • Key-Value Input Format: used for plain text files where each line is split into a key and a value
  • Sequence File Input Format: used for reading Hadoop sequence files (binary key/value files)

9) How does speculative execution work in Hadoop?


If a task appears to be running slowly, the JobTracker makes different TaskTrackers process the same input speculatively. When a task completes, it announces this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were still executing speculatively, Hadoop tells their TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.
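The first-copy-wins rule can be sketched as follows (plain Python with made-up task timings, not Hadoop internals): duplicate attempts of the same task run, the output of the fastest one is kept, and the rest are discarded.

```python
def pick_definitive(attempts):
    """Given duplicate attempts of one task as (tracker, finish_time, output)
    tuples, keep the first to finish and discard the rest, as speculative
    execution does."""
    winner = min(attempts, key=lambda a: a[1])
    discarded = [a[0] for a in attempts if a is not winner]
    return winner, discarded

# Two TaskTrackers speculatively run the same map task (made-up timings).
attempts = [("tracker-1", 42.0, "out-a"), ("tracker-2", 17.5, "out-b")]
winner, discarded = pick_definitive(attempts)
# winner is tracker-2's attempt; tracker-1 is told to abandon its copy
```

This is why speculative execution helps with stragglers but wastes cluster resources if misconfigured: every duplicate attempt that loses the race did real work that gets thrown away.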

10) Explain what the JobTracker in Hadoop is. What actions does it perform?


In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.

The JobTracker performs the following actions:

  • Client applications submit jobs to the JobTracker.
  • The JobTracker communicates with the NameNode to determine the data location.
  • The JobTracker locates TaskTracker nodes near the data or with available slots.
  • It submits the work to the chosen TaskTracker nodes.
  • The JobTracker monitors the TaskTracker nodes.
  • When a task fails, the JobTracker is notified and decides how to proceed.
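The scheduling step above, where the JobTracker prefers TaskTrackers near the data, can be sketched like this (plain Python with a hypothetical cluster; node names and slot counts are made up):

```python
def choose_tracker(block_hosts, free_slots):
    """Prefer a TaskTracker that already holds the data block (data locality);
    otherwise fall back to any tracker with a free slot."""
    # First pass: a node that hosts the block AND has a free slot.
    for host in block_hosts:
        if free_slots.get(host, 0) > 0:
            return host
    # Second pass: any node with a free slot (data must travel over the network).
    for host, slots in free_slots.items():
        if slots > 0:
            return host
    return None  # cluster is fully busy

# Hypothetical cluster: the input block lives on node-2 and node-3.
free_slots = {"node-1": 2, "node-2": 0, "node-3": 1}
print(choose_tracker(["node-2", "node-3"], free_slots))  # → node-3 (local and free)
```

Moving the computation to the data rather than the data to the computation is the core scheduling idea behind this preference.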


This topic was modified 2 years ago by HDFS Tutorial Team