
Scenario Based Hadoop Interview Questions and Answers [Mega List]

If you have ever appeared for a Hadoop interview, you must have come across many scenario-based questions.

Here I have compiled a list of Hadoop scenario-based interview questions and tried to answer all of these real-time interview questions. You can use them to prepare for your next Hadoop interview.

I would also love to hear about your experience and the questions you were asked in your interview. Do share those Hadoop interview questions in the comment box, and I will add them to this post. Let’s make it the one-stop destination for Hadoop interview questions and answers.

Let’s start with some major Hadoop interview questions and answers. I have covered questions from almost every part of the ecosystem: Hive, Pig, Sqoop, HBase, etc.

1. What are the differences between the -copyFromLocal and -put commands?

Ans: Basically, both -put and -copyFromLocal fulfill a similar purpose, but there are some differences. First, see what each command does-

-put: copies files from a source to a destination in HDFS

-copyFromLocal: copies files only from the local file system to HDFS

As you can see, -put can do what -copyFromLocal does, but the reverse is not true. So the main difference between -copyFromLocal and -put is that in -copyFromLocal, the source has to be the local file system, which is not mandatory for the -put command.

Uses of these commands-

hadoop fs -copyFromLocal <localsrc> URI

hadoop fs -put <localsrc> … <destination>

2. What are the differences between the -copyToLocal and -get commands?

Ans: The answer is similar to the one I explained in the above question, just in the opposite direction- here the pair is -copyToLocal and -get.

So in the -copyToLocal command, the destination has to be the local file system, which is not mandatory for the -get command.

3. What is the default block size in Hadoop and can it be increased?

Ans: The default block size in Hadoop 1 is 64 MB, while in Hadoop 2 it is 128 MB.

It can be increased as per your requirements. You can check Hadoop Terminology for more details.

In fact, changing the block size is very easy. For the whole cluster, you set dfs.blocksize (dfs.block.size in Hadoop 1) in hdfs-site.xml. You can also override it for a single file at write time; use the below command to change the block size while putting a file into Hadoop.

hadoop fs -D dfs.blocksize=134217728 -put local_name remote_location

Just put the size you want for a block, in bytes, in place of the value shown (134217728 bytes = 128 MB).
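Block size matters because each block typically becomes one input split, and each split gets its own map task. A minimal sketch (plain Python with my own naming, not a Hadoop API) of how block size determines the number of blocks a file occupies:

```python
import math

# Illustrative helper: each full or partial block of the file
# occupies one HDFS block.
def num_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 1 GB file with the Hadoop 2 default of 128 MB per block:
print(num_blocks(1024 ** 3))                    # 8
# The same file with the Hadoop 1 default of 64 MB per block:
print(num_blocks(1024 ** 3, 64 * 1024 * 1024))  # 16
```

Doubling the block size halves the number of blocks (and map tasks), which is why workloads with very large files often raise it.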

4. How to import RDBMS table in Hadoop using Sqoop when the table doesn’t have a primary key column?

Ans: Usually, we import an RDBMS table into Hadoop using Sqoop import when it has a primary key column. If it doesn’t have a primary key column, Sqoop will give you the below error-

ERROR tool.ImportTool: Error during import: No primary key could be found for table <table_name>. Please specify one with --split-by or perform a sequential import with '-m 1'

Here is what to do when you don’t have a primary key column in the RDBMS table and you still want to import it using Sqoop.

If your table doesn’t have a primary key column, you need to specify the -m 1 option (a sequential import with a single mapper), or you have to provide the --split-by argument with some column name.

Here are the scripts which you can use to import an RDBMS table in Hadoop using Sqoop when you don’t have a primary key column.

sqoop import \
--connect jdbc:mysql://localhost/dbname \
--username root \
--password root \
--table user \
--target-dir /user/root/user_data \
--columns "first_name, last_name, created_date" \
-m 1


or

sqoop import \
--connect jdbc:mysql://localhost/dbname \
--username root \
--password root \
--table user \
--target-dir /user/root/user_data \
--columns "first_name, last_name, created_date" \
--split-by created_date
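To see what --split-by buys you, here is a rough sketch of the idea behind Sqoop’s default numeric splitter: the [min, max] range of the split column is divided into one contiguous interval per mapper, and each mapper imports the rows falling in its interval. The function below is purely illustrative; the names are mine, not Sqoop’s API.

```python
def compute_splits(min_val, max_val, num_mappers):
    # Divide [min_val, max_val] into num_mappers contiguous ranges.
    step = (max_val - min_val) / num_mappers
    splits, lo = [], min_val
    for i in range(num_mappers):
        hi = max_val if i == num_mappers - 1 else min_val + step * (i + 1)
        splits.append((lo, hi))
        lo = hi
    return splits

# e.g. ids 1..100 shared across 4 mappers:
print(compute_splits(1, 100, 4))
# [(1, 25.75), (25.75, 50.5), (50.5, 75.25), (75.25, 100)]
```

This is also why --split-by works best on a column whose values are evenly distributed; a skewed column gives some mappers far more rows than others.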

5. What is CBO in Hive?

Ans: CBO stands for cost-based optimization, a concept that applies to any database or tool where query optimization is used.

So it is similar to what you call Hive query optimization. Broadly, a cost-based optimizer will-

  • Parse and validate the query
  • Generate possible execution plans
  • Assign a cost to each logically equivalent plan (and pick the cheapest)

You can also check Hortonworks technical sheet on this for more details.

6. Can we use LIKE operator in Hive?

Yes, Hive supports the LIKE operator, but it doesn’t support multi-value LIKE queries like the one below-

SELECT * FROM user_table WHERE first_name LIKE ANY ('root~%', 'user~%');


So you can easily use the LIKE operator in Hive whenever you require it. And when you need a multi-value LIKE, break it into several single LIKE conditions so that it works in Hive.
E.g.:

WHERE table2.product LIKE concat('%', table1.brand, '%')
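As a quick illustration of what a single-value LIKE checks, here is a small Python sketch (sql_like is a hypothetical helper for illustration, not a Hive function) that evaluates a LIKE pattern and shows a multi-value LIKE ANY rewritten as single LIKEs combined with OR:

```python
import re

def sql_like(value, pattern):
    # SQL LIKE semantics: % matches any run of characters,
    # _ matches exactly one character; everything else is literal.
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value) is not None

# LIKE ANY ('root%', 'user%') broken into single LIKEs ORed together:
name = "root_admin"
print(sql_like(name, "root%") or sql_like(name, "user%"))  # True
```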

7. Can you use the IN/EXISTS operator in Hive?

No, older versions of Hive don’t support the IN or EXISTS operators (subqueries in the WHERE clause). Instead, you can use a LEFT SEMI JOIN, which performs the same operation that IN does in SQL.

So if you have the below query in SQL-

SELECT a.key, a.value
FROM a
WHERE a.key IN
(SELECT b.key
FROM b);


Then the suitable query for the same in Hive can be-

SELECT a.key, a.value
FROM a LEFT SEMI JOIN b ON (a.key = b.key);


Both will fulfill the same purpose.
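You can convince yourself the two are equivalent with a small simulation in plain Python over hypothetical data:

```python
# IN-subquery semantics versus LEFT SEMI JOIN semantics on toy tables.
a = [{"key": 1, "value": "x"}, {"key": 2, "value": "y"}, {"key": 3, "value": "z"}]
b = [{"key": 2}, {"key": 3}, {"key": 3}]  # note the duplicate key

# WHERE a.key IN (SELECT b.key FROM b)
b_keys = {row["key"] for row in b}
in_result = [row for row in a if row["key"] in b_keys]

# a LEFT SEMI JOIN b ON (a.key = b.key): keep each left row that has
# at least one match on the right, emitting it once.
semi_result = [row for row in a if any(row["key"] == r["key"] for r in b)]

print(in_result == semi_result)  # True
```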

8. What are the differences between INNER JOIN and LEFT SEMI JOIN?

Ans: Left semi join in Hive is used in place of the IN operator (as IN is not supported in older Hive). Now coming to the differences: an inner join returns the matching data from both tables based on the join condition, and can emit one output row per match, while a left semi join returns each matching record from the left-hand table at most once, with only the left table’s columns.
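A small sketch with hypothetical data makes the difference concrete- with duplicate keys on the right-hand side, an inner join emits one row per match, while a left semi join emits each left row at most once and only the left table’s columns:

```python
# Toy data: key 2 matches two rows on the right-hand side.
left = [{"key": 1, "val": "a"}, {"key": 2, "val": "b"}]
right = [{"key": 2, "ref": "x"}, {"key": 2, "ref": "y"}]

# INNER JOIN: one output row per matching pair, columns from both tables.
inner = [{**l, **r} for l in left for r in right if l["key"] == r["key"]]

# LEFT SEMI JOIN: each left row at most once, left-table columns only.
semi = [l for l in left if any(l["key"] == r["key"] for r in right)]

print(len(inner))  # 2
print(len(semi))   # 1
```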

9. What are the differences between External and Internal (Managed) Tables in Hive?

Ans: As we know, there are two kinds of tables in Hive- internal (managed) and external tables. In an internal table (the default), data is stored at the default Hive warehouse location, while in an external table you can specify the location.

The major differences between the internal and external tables are-

External Table:

  • Stores its files on HDFS at the location you specify, and if you delete the table, the files still remain on the HDFS server. As an example, if you create an external table called “table_test” in Hive using HiveQL and link the table to file “file”, then deleting “table_test” from Hive will not delete “file” from HDFS.
  • Metadata is maintained on the master node, and deleting an external table from Hive only deletes the metadata, not the data/files.
  • External table files are accessible to anyone who has access to the HDFS file structure, so security needs to be managed at the HDFS file/folder level.

Internal Table:

  • It is the default table type in Hive. Data is stored in a directory based on the hive.metastore.warehouse.dir setting; by default internal tables are stored under “/user/hive/warehouse”, and you can change this by updating the location in the config file.
  • Internal table file security is controlled solely via Hive, typically at the schema level (it varies from organisation to organisation).
  • Deleting the table deletes the metadata and data from the master node and HDFS respectively.

Hive may have internal or external tables; this is a choice that affects how data is loaded, controlled, and managed.

10. When to use external and internal tables in Hive?

Use EXTERNAL tables when:

  • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
  • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  • Hive should not own the data and control settings, dirs, etc.; you may have another program or process that will do those things.
  • You are not creating a table based on an existing table (AS SELECT).

Use INTERNAL tables when:

  • The data is temporary
  • You want Hive to completely manage the lifecycle of the table and data

11. We have a Hive table partitioned on the country column, with 10 partitions, and data for only one country. If we copy the data manually into the directories of the other 9 partitions, will it be reflected when we run a query?

Ans: This is really a good question. Because the data has been placed in the partition directories manually, the Hive metastore doesn’t know about those partitions, so the data won’t be directly available. You first have to register the partitions in the metastore, for example with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION.

Data is available directly for all partitions only when you load it through Hive commands, which register the partitions automatically, and not manually.

12. Where the Mapper’s Intermediate data will be stored?

Ans: The mapper output (which is intermediate data) is stored on the local file system (not in HDFS) of each mapper node. This is a temporary directory location that can be set up in the configuration file by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

13. What is Partition and Combiner in MapReduce?

The partitioner and combiner are two steps of a MapReduce job that are executed after the map phase and before the reduce phase. Here are the details of the partitioner and combiner in MapReduce.

Combiner: The combiner works like a mini-reducer on the map side, taking its input from the map phase. It performs a local reduce function on the mapper output before it is distributed further. Once the combiner (if configured) has executed, the output is passed to the reduce phase.

Partition: The partitioner comes into the picture when you use more than one reducer. It decides which reducer is responsible for a particular key.

It takes the input from the map phase (or the combiner, if used) and then sends each record to the responsible reducer based on its key. The number of partitions is equal to the number of reducers.
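The flow described above can be sketched in plain Python (a toy word count; the names and data are illustrative, not Hadoop’s API):

```python
import zlib
from collections import Counter

NUM_REDUCERS = 2  # one list below stands in for each reducer

def mapper(line):
    # Map phase: emit (word, 1) for every word.
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # Mini-reducer on the map side: sum counts locally before the shuffle.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def partitioner(key):
    # Decide which reducer owns this key; a stable hash stands in for
    # Hadoop's default HashPartitioner.
    return zlib.crc32(key.encode()) % NUM_REDUCERS

lines = ["big data big cluster", "big data"]  # two map tasks' input
shuffled = [[] for _ in range(NUM_REDUCERS)]
for line in lines:
    for word, n in combiner(mapper(line)):  # combine per map task
        shuffled[partitioner(word)].append((word, n))

# Reduce phase: each reducer sums the counts routed to it.
totals = Counter()
for bucket in shuffled:
    for word, n in bucket:
        totals[word] += n
print(sorted(totals.items()))  # [('big', 3), ('cluster', 1), ('data', 2)]
```

Note how the combiner shrinks the shuffled data (“big” is pre-summed within each map task), while the partitioner guarantees that all counts for a given word land on the same reducer.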

Both run on the map side: the map output is partitioned and sorted, and the combiner (if configured) aggregates each partition’s data locally before it is shuffled to the reducers.

[Diagram omitted: MapReduce data flow with and without a combiner; images were from Yahoo’s Hadoop tutorial]

 
