Capgemini Hadoop Interview Questions and Answers

Here are some Capgemini Hadoop interview questions and answers from a recent interview. One of our readers shared the questions with us, and we compiled the answers for your reference.

Prepare these questions well before your interview. They are not specific to Capgemini; other IT companies ask similar questions.

1. What is partition and combiner in MapReduce?

Partitioning and combining are two steps of a MapReduce job that run after the map phase and before the reduce phase. Here are the details.

Combiner: A combiner works like a mini reducer on the map side. It takes the mapper output and applies a local reduce function to it before the data is sent across the network, which cuts down the volume of data shuffled to the reducers. The combiner is optional, and Hadoop may run it zero, one, or several times, so the job must produce the same result with or without it.

Partition: The partitioner comes into the picture when you use more than one reducer. It decides which reducer is responsible for a particular key.

It takes the output of the mapper (or of the combiner, if one is used) and routes each record to the responsible reducer based on the key. By default Hadoop uses a hash of the key for this. The number of partitions is equal to the number of reducers.

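So in this simplified picture, the combiner comes first and then the partitioner. Treat this as a conceptual ordering: in Hadoop's implementation the partition of each record is computed as the mapper emits it, and the combiner then runs on each partition's data when the map output is spilled to disk. The original article illustrated the flow with two diagrams from the Yahoo Hadoop tutorial:

[Figure: MapReduce data flow when a combiner is used]

[Figure: MapReduce data flow when a combiner is not used]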
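The flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop code: the function names, the two-reducer setup, and the byte-sum hash are assumptions made for the example (Hadoop's default partitioner uses the key's hash code).

```python
from collections import defaultdict

NUM_REDUCERS = 2  # assumed reducer count for this illustration


def map_phase(line):
    """Emit (word, 1) for every word in the input line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Local 'mini reduce': sum the counts per key on one mapper's output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick the reducer for a key.

    Assumption: a simple deterministic byte-sum hash stands in for
    the key's hash code that Hadoop's default partitioner would use.
    """
    return sum(key.encode("utf-8")) % num_reducers


def run_job(splits):
    """Run map -> combine -> partition -> reduce over a list of input splits."""
    reducer_inputs = defaultdict(list)
    for split in splits:
        mapped = [pair for line in split for pair in map_phase(line)]
        # The combiner shrinks each mapper's output before it is shuffled
        for key, value in combine(mapped):
            reducer_inputs[partition(key)].append((key, value))

    # Reduce: each reducer sums the partial counts it received
    results = {}
    for reducer_id, pairs in reducer_inputs.items():
        totals = defaultdict(int)
        for key, value in pairs:
            totals[key] += value
        results[reducer_id] = dict(totals)
    return results


if __name__ == "__main__":
    print(run_job([["a b a"], ["b b c"]]))
```

Every occurrence of a key ends up on the same reducer, and the combiner only changes how many records travel there, not the final counts.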

2. What is HCatalog?

HCatalog is a table and storage management layer for Hadoop. It enables reading and writing data in any format for which a Hive SerDe (serializer/deserializer) exists. By default, HCatalog supports the RCFile, CSV, JSON, and SequenceFile formats. For custom formats, the user needs to provide the InputFormat, OutputFormat, and SerDe.

It is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog also provides read and write interfaces for Pig and MapReduce and uses the Hive CLI for issuing commands.

In short, HCatalog opens up the Hive metastore to other tools in the ecosystem. Each tool has its own view of the data in HDFS: Pig treats it as a set of files, while Hive treats it as a set of tables. HCatalog gives them a shared table abstraction, so users do not need to know where or in what format the data is stored.

Example:
Let’s say we have a table (employee) in Hive with the following details-

ID Name
1 John
2 Chris
3 Peter
4 Lisa

Now if you want to load this table in Pig, you have to start Pig with the pig -useHCatalog option.

The following statement then loads the table in Pig-

A = LOAD 'employee' USING org.apache.hcatalog.pig.HCatLoader();

In newer Hive releases the loader class is org.apache.hive.hcatalog.pig.HCatLoader.

HCatalog can also be accessed through WebHCat, which exposes a REST API.

3. What are Hadoop and Hadoop ecosystems?

This is a basic question, so a few lines are enough.

Hadoop is an open-source, Java-based framework used to store and process large data sets in a distributed environment.

Hadoop is a top-level project of the Apache Software Foundation. It runs on commodity hardware for its nodes (DataNodes), which keeps the cost of the system low.

Here are some of the frequently used components of the Hadoop ecosystem-

  • HDFS: To store the large sets of data
  • Hive: To process structured data
  • Pig: To process unstructured data
  • Oozie: Create workflow jobs
  • Flume: Get real-time data from other sources
  • MapReduce: Data analysis
  • HBase: NoSQL database used for record level operation
  • Sqoop: Import/Export data to and from RDBMS to Hadoop system
  • Kafka: For messaging

There are many other components in the Hadoop ecosystem, but the ones above are the most important and most widely used.

4. What is a view? Suppose we create a view on top of a table with two columns (empid, empname) and then add 100 rows to the table. Will the newly added rows be visible in the view?

A view is a virtual table defined by a query on one or more real tables.

Here is the syntax to create a view on a table-

CREATE VIEW view_name AS SELECT col1, col2 FROM tablename;

Then you can view the data of the view using the below command-

Select * from view_name;

If you later want to drop the view, just use the below statement-

Drop view view_name;

Now that you know what a view is and how to create one, let’s move to the second part of the question.

A view does not store data of its own. Hive saves only the view’s query and runs it against the underlying table each time the view is queried. So if you add 100 new rows to the table, a query on the view will return those 100 rows as well (provided they satisfy any WHERE condition in the view definition).
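The behaviour is easy to check with a small script. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption made so the example runs anywhere; the table and view names are made up for the illustration. Views behave the same way for newly added rows in both systems.

```python
import sqlite3


def view_row_counts():
    """Return the view's row count before and after adding 100 rows to the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (empid INTEGER, empname TEXT)")
    conn.execute("INSERT INTO emp VALUES (1, 'John')")

    # The view stores only this query, not a copy of the rows
    conn.execute("CREATE VIEW emp_view AS SELECT empid, empname FROM emp")
    before = conn.execute("SELECT COUNT(*) FROM emp_view").fetchone()[0]

    # Add 100 new rows to the base table only
    new_rows = [(i, f"emp{i}") for i in range(2, 102)]
    conn.executemany("INSERT INTO emp VALUES (?, ?)", new_rows)
    after = conn.execute("SELECT COUNT(*) FROM emp_view").fetchone()[0]

    conn.close()
    return before, after


if __name__ == "__main__":
    print(view_row_counts())
```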

5. Suppose we create a view on top of a table with two columns (empid, empname) and then add a third column (i.e. address) to the table. Can we see the newly added column in the view?

This follows from the previous question, but the answer is different. New rows show up in a view automatically; new columns do not.

In Hive, the schema of a view is fixed when the view is created. Later changes to the underlying table, such as adding a column, are not reflected in the view’s schema.

For example, if the view was created with just two column names (col1 and col2), adding a third column to the table will not affect the view. To expose the new column, redefine the view with ALTER VIEW view_name AS SELECT ... or drop and recreate it.

Hope that clarifies. Please comment for any further questions.

6. Is the delimiter mandatory in Hive? What is the default delimiter?

When you create a table in Hive, you can specify the delimiter to tell Hive the format of the data in the input file.

CREATE TABLE emp(
id int, name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

In the above Hive create table example, '\t' (tab) is the field delimiter.

Specifying a delimiter is not mandatory. If you leave out the ROW FORMAT clause, Hive uses its defaults.

The default record (line) delimiter in Hive is \n.

The default field delimiter is \001 (Ctrl-A). The defaults for collection items and map keys are \002 and \003.

When you do use ROW FORMAT DELIMITED, list the delimiters that terminate the fields and lines as shown above.
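To make the default delimiters concrete, here is a minimal Python sketch that splits a text file written with them. The function name and the sample data are assumptions for illustration; only the delimiter values (\001 for fields, \002 for collection items, \n for records) come from Hive's defaults.

```python
FIELD_DELIM = "\x01"       # \001 (Ctrl-A), default field terminator
COLLECTION_DELIM = "\x02"  # \002 (Ctrl-B), default collection item terminator
LINE_DELIM = "\n"          # default record terminator


def parse_default_delimited(text):
    """Split raw text into rows and fields using Hive's default delimiters."""
    rows = []
    for line in text.split(LINE_DELIM):
        if not line:
            continue  # skip the empty string after a trailing newline
        fields = []
        for field in line.split(FIELD_DELIM):
            # A field containing \002 is treated as a collection (array)
            if COLLECTION_DELIM in field:
                fields.append(field.split(COLLECTION_DELIM))
            else:
                fields.append(field)
        rows.append(fields)
    return rows


if __name__ == "__main__":
    sample = "1\x01John\x01java\x02hive\n2\x01Chris\x01pig\n"
    print(parse_default_delimited(sample))
```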


7. Is the mapper mandatory, or is the reducer mandatory?

The map phase is mandatory in a MapReduce job. If you do not set a mapper class in the driver, Hadoop uses an identity mapper (IdentityMapper in the old API), which passes each input record through unchanged.

The reducer is not mandatory. You can set the number of reducers to zero to run a map-only job. For example, an import through Sqoop runs only mappers and no reducers.

8. Is it possible to load data into Hive without first putting it in HDFS?

Yes, you can use files placed on the local file system. Another option is to copy another table’s data to create a new table, but in that case you are indirectly using data that is already in HDFS.

So the best option is to use the local inpath form of the load command, as below. Hive copies the local file into the table’s directory in HDFS for you.

Load data local inpath 'localfilepath' into table tbl_name;

9. Is it possible to get data into a Hive table without using INSERT or LOAD?

Yes, we can do so by creating a table from another existing table (CREATE TABLE AS SELECT). Here is the syntax for that-

CREATE TABLE tbl1 row format delimited fields terminated by '\t' AS select * from tbl2;
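The create-table-as-select pattern can be tried out quickly with Python's sqlite3 module, used here as a stand-in for Hive (an assumption so the sketch is runnable without a cluster). SQLite has no ROW FORMAT clause, so only the AS SELECT part is shown; the sample rows are made up.

```python
import sqlite3


def ctas_copy():
    """Create tbl1 from tbl2 with CREATE TABLE ... AS SELECT and return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl2 (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO tbl2 VALUES (?, ?)", [(1, "John"), (2, "Chris")])

    # No INSERT or LOAD on tbl1: it is created and filled in one statement
    conn.execute("CREATE TABLE tbl1 AS SELECT * FROM tbl2")
    rows = conn.execute("SELECT id, name FROM tbl1 ORDER BY id").fetchall()

    conn.close()
    return rows


if __name__ == "__main__":
    print(ctas_copy())
```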

You can also create an external table in Hive using HBase as below-

CREATE EXTERNAL TABLE hbase_hive_names(fields_names) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,id:id,name:fn,name:ln,age:age") TBLPROPERTIES("hbase.table.name" = "hbase_2_hive_names");

10. What is a high definition cluster?

“High definition cluster” here most likely refers to a High Availability cluster. You can check our detailed article Design Hadoop cluster for more information.

A High Availability (HA) Hadoop cluster is a group of systems that acts like a single system and provides high uptime. In Hadoop, this mainly means running a standby NameNode alongside the active one, so that the NameNode is no longer a single point of failure.

An HA cluster is usually used for load balancing, backup, failover, and disaster recovery (DR) purposes.

11. What is Flume? Do you have any knowledge of Flume?

Explain a bit about Flume and why it is used.

Flume is a reliable, distributed system for collecting and aggregating large amounts of streaming data, such as logs and events, and moving it into HDFS. Flume is used in the Hadoop ecosystem to bring data from sources like Twitter, Facebook, and other sites into Hadoop systems.

A common use of Flume is to gather social media data for further analysis. For example, in sentiment analysis we use Flume to collect the data.

12. Where is the data stored in HDFS?

The DataNodes store HDFS data blocks on their local file systems, in the directories specified by the dfs.datanode.data.dir property. If you set this property yourself, the /dfs/data suffix that you see in the default value will not be appended.

You can find this property in hdfs-site.xml. Remember that if you edit hdfs-site.xml, you need to restart the DataNode services for the changes to take effect.

You should also check HDFS file processing for the detailed operation.

13. The data is already stored in HDFS, so why do we need to load it into Hive again?

Actually, Hive does not store anything itself. HDFS stores the data that we put into a Hive table; Hive keeps only the table metadata in its metastore.

The main reason to load the data into a Hive table is so that Hive knows about it for further computation. For example, for a partitioned table Hive creates a directory for each partition and keeps the files related to that partition in it.

If you manually place a file in HDFS under a partition that is not registered in the metastore, queries will not see it. To avoid such situations, we load the data into the Hive table even when the files are already in an HDFS location.

14. What is a Hive partition, and have you used Hive partitions?

Partitioning is an important concept in Hive and one of the major factors in Hive performance tuning.

Let’s consider a scenario where you have a table holding the population of India. It will have more than a billion records, right?

Now if you have to find some records in that table, how much time will it take? Without partitions, Hive has to scan the whole table, which takes far longer than necessary.

To avoid this, Hive allows tables to be partitioned.

15. What is Hive Partition?

It is a way to divide a Hive table into related parts based on the values of partition columns such as city, date, country, etc.

Queries that filter on a partition column are fast because Hive reads only the matching partitions. Here is the syntax to create a partitioned table-

create table population_partition(
pID int,
Name string,
Address string
)partitioned by (state string)
row format delimited
fields terminated by '\t'
stored as textfile;

Remember that the partition column must not be repeated in the regular column list when creating the table.

Now if you want to load data into a partition, use the statement below. LOAD DATA needs a fixed value for the partition column; 'state_name' here is a placeholder-

load data inpath 'user/hive/population/state.csv'
overwrite into table population_partition partition (state='state_name');

There are two types of Hive partitions-

  • Static Partition
  • Dynamic Partition

Static partitioning is the default in Hive: you name the partition value yourself in each load or insert. With dynamic partitioning, Hive derives the partition from the values of the partition column in the inserted data, and you need to set some parameters to enable it. Here are the properties to configure-

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
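The reason partitioning speeds up queries is the directory layout: Hive keeps each partition in its own subdirectory named column=value, so a query that filters on the partition column reads only that directory. The Python sketch below models this idea. The warehouse path, function names, and sample rows are assumptions for illustration, and the "directories" are just dictionary keys.

```python
from collections import defaultdict

# Assumed table location; the real path depends on your warehouse settings
WAREHOUSE = "/user/hive/warehouse/population_partition"


def partition_dir(state):
    """Directory that holds the rows of one partition (column=value naming)."""
    return f"{WAREHOUSE}/state={state}"


def write_partitioned(rows):
    """Group (pid, name, address, state) rows into per-partition directories.

    The state value lives in the directory name, not in the stored row.
    """
    layout = defaultdict(list)
    for pid, name, address, state in rows:
        layout[partition_dir(state)].append((pid, name, address))
    return dict(layout)


def scan(layout, state):
    """Read only the directory of the requested partition (partition pruning)."""
    return layout.get(partition_dir(state), [])


if __name__ == "__main__":
    rows = [
        (1, "John", "addr1", "KA"),
        (2, "Chris", "addr2", "MH"),
        (3, "Peter", "addr3", "KA"),
    ]
    layout = write_partitioned(rows)
    print(sorted(layout))
    print(scan(layout, "KA"))
```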

16. What is the purpose of using the Hadoop in your project?

Here you can explain as per your project. A simple answer can be like below-

We are working for an ecommerce company, and we receive session-based log files from our client. Explain how the log files arrive and how you parse them.

Then explain how you use a tool like Hive or Pig to analyze those files and extract the required data, such as the products that users clicked most.

Then say you are using a tool like Tableau to present the data in graphical form to share with the client.

Remember this is just a typical case, and you may explain as per your project. Here are a few Hadoop use cases which you can consider while explaining- Big Data ECommerce case studies, Hadoop use cases in Education, Big Data use cases in banking and financial services.

17. Where is the production environment for your project?

Again the answer will depend on your project. You can say the location of your production server and then share the configurations etc.

18. What is the default path where Hive stores its data?

The default path where Hive stores data in HDFS is /user/hive/warehouse.

But you can change the default location as per your need in the Hive configuration file hive-site.xml.

Inside this directory, you will find the directories depending on the hive tables you have created.

You will find a section in hive-site.xml like below-

<property>
<name>hive.metastore.warehouse.dir</name>
<value>Your_Path_HERE</value>
<description>location of default database for the warehouse</description>
</property>

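Now put the path you want in place of “Your_Path_HERE”. Note that the path you provide should have read, write, and execute permissions for the users who run Hive.

For that you can run the commands below, adding your path as the final argument of each (mode 777 opens the directory to everyone, so tighten it where you can)-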

sudo chown -R user
sudo chmod -R 777

And then you should restart the Hadoop services using stop and start commands-

stop-all.sh
start-all.sh

19. Explain the syntax to create a Hive table and load data into it. Also, what are its mandatory items?

This is a relatively simple question, and you can refer to the syntax below. In a plain CREATE TABLE, only the table name and the column definitions are mandatory; the ROW FORMAT clause is optional, and for an external table you normally add LOCATION.
Hive Create table syntax (internal table)

Create table tblname (
Id int, name string..)
Row format delimited fields terminated by '\t';

Hive load data syntax (from local system)

Load data local inpath 'local_pth' into table tblname;

Hive load data syntax (from HDFS location)

Load data inpath 'hdfs_path' into table tblname;

Hive Create table syntax (external table)

Create external table Etblname (
Id int, name string..)
Row format delimited fields terminated by '\t'
Location 'hdfs_external_location';

What are the different tables in Hive?

There are mainly two types of Hive tables-

• External table
• Internal table

Difference between Hive External table and Hive Internal table

For the difference between Hive external and internal table, please check Q9 of Hadoop scenario based questions.

There is one more kind of table that we use in Hive: the skewed table.

Hive Skewed Table

A skewed table is a special type of table in which the values that appear very often (heavy skew) are split out into separate files, and the rest of the values go to other files.

Create Hive Skewed Table Syntax

create table T (c1 string, c2 string) skewed by (c1) on ('x1');

The skewed table is one of the important Hive query optimization techniques.

20. Explain Hive internal table vs. Hive external table. Suppose we have two tables (one internal and one external) and we drop both of them. If we then run select * from each table, can we get any output? What will the output be?

Please refer to the above question for the difference between an external table and an internal table.

As we know, if we drop an internal table, both the metadata and the data are lost. If we drop an external table, only the metadata is lost; the actual data stays in its HDFS location.

If you drop both tables and then query them, both queries fail with a “table not found” error, because neither table exists in the metastore any more. The external table’s files are still in HDFS, and you can query them again by recreating the external table over the same location.
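The difference can be modelled with a toy Python class that keeps table metadata and file data in two separate dictionaries. This is a simplified sketch, not Hive's implementation: the class name, method names, and paths are all made up for the illustration.

```python
class MiniHive:
    """Toy model: a metastore (table metadata) plus an HDFS-like file store."""

    def __init__(self):
        self.metastore = {}  # table name -> (table kind, data location)
        self.hdfs = {}       # data location -> rows stored there

    def create_table(self, name, kind, location, rows=None):
        """Register a table; optionally write rows to its location."""
        self.metastore[name] = (kind, location)
        if rows is not None:
            self.hdfs[location] = list(rows)

    def drop_table(self, name):
        """Remove the metadata; remove the data too for internal tables only."""
        kind, location = self.metastore.pop(name)
        if kind == "internal":
            self.hdfs.pop(location, None)

    def select_all(self, name):
        """Fail if the table is not in the metastore, else return its rows."""
        if name not in self.metastore:
            raise LookupError(f"Table not found: {name}")
        _, location = self.metastore[name]
        return self.hdfs.get(location, [])


if __name__ == "__main__":
    hive = MiniHive()
    hive.create_table("int_tbl", "internal", "/warehouse/int_tbl", [(1, "John")])
    hive.create_table("ext_tbl", "external", "/data/ext_tbl", [(2, "Chris")])
    hive.drop_table("int_tbl")
    hive.drop_table("ext_tbl")
    print(sorted(hive.hdfs))  # only the external table's location is left
```

Recreating ext_tbl over the same location without writing any rows makes the old data queryable again, which is the practical benefit of external tables.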

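21. A table has two columns (emp_name and emp_salary). How do you get the maximum salary without using the MAX keyword or UDFs?

There are several ways to do so, as below-

Using MIN keyword:

select MIN(-1 * col)*-1 as col from tableName;

Using self-join

The inner self-join finds every value that is smaller than some other value; the maximum is the only value that is not in that set. (Hive’s strict mode may reject the cross join.)

select distinct A.col
from tableName as A
where A.col not in
(select B.col from tableName as B, tableName as C
where B.col < C.col);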

Using Limit

SELECT column
FROM YOUR_TABLE
ORDER BY column DESC
LIMIT 1
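The three approaches can be checked against each other with a short script. The sketch below runs them on SQLite through Python's sqlite3 module, which is an assumption made so the example is runnable without Hive. The self-join is written as a NOT IN over a cross join, one possible way to express it, and the table contents are made up.

```python
import sqlite3

# Three ways to get the highest salary without calling MAX
QUERIES = {
    "min_trick": "SELECT MIN(-1 * emp_salary) * -1 FROM emp",
    "self_join": """
        SELECT DISTINCT a.emp_salary FROM emp AS a
        WHERE a.emp_salary NOT IN (
            SELECT b.emp_salary FROM emp AS b, emp AS c
            WHERE b.emp_salary < c.emp_salary)
    """,
    "order_limit": "SELECT emp_salary FROM emp ORDER BY emp_salary DESC LIMIT 1",
}


def max_salary_each_way(salaries):
    """Run every query on a small emp table and return the value each one finds."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (emp_name TEXT, emp_salary INTEGER)")
    rows = [(f"emp{i}", salary) for i, salary in enumerate(salaries)]
    conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)

    results = {name: conn.execute(sql).fetchone()[0] for name, sql in QUERIES.items()}
    conn.close()
    return results


if __name__ == "__main__":
    print(max_salary_each_way([3000, 9000, 5000, 9000]))
```

On a large table the ORDER BY ... LIMIT 1 and MIN variants are the practical choices; the self-join compares every row with every other row and is mainly of interest as an interview answer.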

22. How to create a Hive table in AVRO format?

Here is the syntax to create Hive table in Avro format-

CREATE EXTERNAL TABLE avro_hive_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/avro/applog_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs://localdomain/user/avro/schemas/activity.avsc');

These were the Capgemini Hadoop interview questions and answers from a recent Capgemini interview. The questions were collected from both rounds.

These Hadoop interview questions are not limited to Capgemini and can be asked in any interview, so prepare them well. You can also check our scenario-based Hadoop interview questions here.

Good luck, and if you have appeared for any Hadoop interview recently, do share your experience with us.
