PIG Interview Questions and Answers

This is the first section of our interview series, in which we will share Hadoop interview questions and answers. It covers the questions on Pig that are asked in Hadoop interviews.

This set of Pig interview questions and answers has been compiled from input provided by candidates in various Big Data interviews.

Note: You can help us by sharing, in the comments, questions you think may be asked or that you have faced yourself. You can also suggest answers, and we will include them in this set of Pig interview questions and answers.

We have tried to keep the answers short and precise so that the point of each one comes across clearly. So, let’s get started.

Here is the list of Pig questions that are asked in Hadoop interviews. We will keep updating this post, so if you are preparing for an interview, check back regularly. You can also subscribe to get notified.

1. What is PIG?

Pig is a top-level project of the Apache Software Foundation and part of the Hadoop ecosystem. It provides an engine for executing data flows in parallel on Hadoop.

It has a language called Pig Latin, which is used to express data flows and runs on top of MapReduce. Pig was initially developed at Yahoo and was later donated to Apache.

Pig Latin is procedural and can be extended with user-defined functions, which makes it a good fit for ETL (Extract, Transform, Load). Pig can also be used for ad-hoc data analysis.

Pig was developed around one philosophy: pigs can eat anything, live anywhere, and can be easily controlled and modified by the user.

2. When we write a = load …, what is ‘a’ called?

Here ‘a’ is called a relation in Pig.

3. What is BloomMapFile?

BloomMapFile is a class that extends the MapFile class. It is used in the HBase table format to provide a quick membership test for keys using dynamic Bloom filters.

4. What are the complex data types in PIG?

Map, Tuple, and Bag are the three complex data types in Pig. They are described below-

Map: A set of key-value pairs, written in []. The key is a chararray and the value can be of any Pig data type, so a map is often used for loosely structured data.

Tuple: An ordered set of fields, written in (). A tuple can have multiple fields, and each field can be of a different data type.

Bag: A collection of tuples, written in {}. The tuples in a bag can themselves contain maps, tuples, and bags.

For example, let’s consider this- {('Noida', '201301'), (['area'#'Sec 15', 'PIN'#201301])}

Here, in terms of Pig complex data types, { } marks a bag, ( ) marks a tuple, and [ ] marks a map.
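As a rough sketch, the three complex types can also be declared in a LOAD schema and then accessed field by field. The file name and all field names below are hypothetical and chosen only for illustration.

```pig
-- Hypothetical input file and field names, for illustration only.
-- address is a tuple, visits is a bag of tuples, details is a map.
places = LOAD 'places.txt' AS (
    name:chararray,
    address:tuple(city:chararray, pin:chararray),
    visits:bag{v:tuple(day:chararray, hits:int)},
    details:map[chararray]
);

-- Dot notation reaches into a tuple; # looks up a map value by key;
-- COUNT works on the bag held in each row.
summary = FOREACH places GENERATE
    name,
    address.city,
    details#'area',
    COUNT(visits) AS visit_count;
```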

5. What are the differences between PIG and MapReduce?

Below are some of the basic differences between PIG and MapReduce-

Pig | MapReduce
Apache Pig is a dataflow language. | MapReduce is a data processing paradigm.
It is a high-level language. | MapReduce is low level and rigid.
Performing a join operation in Apache Pig is pretty simple. | It is quite difficult to perform a join operation between datasets in MapReduce.
Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. | Exposure to Java is a must to work with MapReduce.
Apache Pig uses a multi-query approach, which reduces the length of the code to a great extent. | MapReduce requires almost 20 times as many lines to perform the same task.
There is no need for compilation. On execution, Apache Pig operators are converted internally into MapReduce jobs. | MapReduce jobs have a long compilation process.
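To make the point about joins concrete, here is a minimal sketch of an inner join in Pig Latin. The file names and schemas are assumptions made up for this example.

```pig
-- Hypothetical comma-separated inputs.
users  = LOAD 'users.csv'  USING PigStorage(',') AS (user_id:int, name:chararray);
orders = LOAD 'orders.csv' USING PigStorage(',') AS (order_id:int, user_id:int, amount:double);

-- Inner join on user_id; Pig plans the underlying MapReduce work itself.
joined = JOIN users BY user_id, orders BY user_id;

-- Fields from each side are disambiguated with the relation name.
report = FOREACH joined GENERATE users::name, orders::order_id, orders::amount;
DUMP report;
```

Writing the same join directly in MapReduce would mean hand-coding the mapper, the reducer, and the logic that tags records by source.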

6. What are the differences between PIG and SQL?

Here are some of the major differences between PIG and SQL-

Pig | SQL
Pig Latin is a procedural language. | SQL is a declarative language.
In Apache Pig, the schema is optional. We can load data without designing a schema (fields are then referenced by position as $0, $1, etc.). | Schema is mandatory in SQL.
The data model in Apache Pig is nested relational. | The data model used in SQL is flat relational.
Apache Pig provides limited opportunity for query optimization. | There is more opportunity for query optimization in SQL.
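The optional schema can be sketched as follows. The input file is hypothetical and assumed to be tab-delimited, which is the default for LOAD.

```pig
-- Hypothetical tab-delimited input; no schema is declared.
raw = LOAD 'visits.txt';

-- Without a schema, fields are addressed by position, starting at $0,
-- and default to the bytearray type.
pairs = FOREACH raw GENERATE $0, $1;

-- A cast can give a positional field a name and a type when needed.
typed = FOREACH raw GENERATE (chararray)$0 AS user, (int)$1 AS hits;
```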

7. What are the differences between PIG and HIVE?

This is a very common question in interviews. Here are some of the point-by-point differences between Pig and Hive.

Pig | Hive
Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. | Hive uses a language called HiveQL. It was originally created at Facebook.
Pig Latin is a data flow language. | HiveQL is a query processing language.
Pig Latin is a procedural language and it fits the pipeline paradigm. | HiveQL is a declarative language.
Apache Pig can handle structured, unstructured, and semi-structured data. | Hive is mostly for structured data.

8. How does PIG work?

Every time you write and run a Pig script, it is compiled into one or more MapReduce jobs, which run on the Hadoop cluster and read from and write to HDFS.
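A small sketch of this flow, with a hypothetical log file and field layout: the first statements only build up a plan, and the MapReduce jobs are launched when an output statement such as STORE or DUMP is reached.

```pig
-- Hypothetical space-delimited log file and schema.
logs   = LOAD 'access_log.txt' USING PigStorage(' ') AS (ip:chararray, url:chararray, status:int);
errors = FILTER logs BY status >= 500;
by_url = GROUP errors BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(errors) AS n;

-- Nothing has been executed so far. STORE triggers compilation of the
-- plan into MapReduce jobs and writes the result to the output directory.
STORE counts INTO 'error_counts';
```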

9. What are the different EVAL functions available in PIG?

Below are some of the EVAL functions available in PIG-

AVG
CONCAT
MAX
MIN
SUM
SIZE
COUNT
COUNT_STAR
DIFF
TOKENIZE
IsEmpty
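Several of these functions operate on bags, so they are typically used after a GROUP. The following is a minimal sketch with a made-up sales file and schema.

```pig
-- Hypothetical comma-separated input.
sales = LOAD 'sales.csv' USING PigStorage(',') AS (region:chararray, amount:double);

-- GROUP produces one row per region with a bag named sales.
by_region = GROUP sales BY region;

-- Aggregate functions are applied to the bag in each group.
stats = FOREACH by_region GENERATE
    group             AS region,
    COUNT(sales)      AS order_count,
    SUM(sales.amount) AS total,
    AVG(sales.amount) AS average,
    MIN(sales.amount) AS smallest,
    MAX(sales.amount) AS largest;
```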

10. What are different String functions available in PIG?

Below are some of the important PIG STRING functions available-

UPPER
LOWER
TRIM
SUBSTRING
INDEXOF
STRSPLIT
LAST_INDEX_OF
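As an illustration, the string functions can be applied inside a FOREACH. The input file and field name here are hypothetical.

```pig
-- Hypothetical input with one name per line.
people = LOAD 'people.txt' AS (full_name:chararray);

shaped = FOREACH people GENERATE
    UPPER(full_name)               AS upper_name,
    LOWER(full_name)               AS lower_name,
    TRIM(full_name)                AS trimmed,
    SUBSTRING(full_name, 0, 3)     AS first_three,   -- start index, stop index
    INDEXOF(full_name, ' ', 0)     AS first_space,   -- search from index 0
    LAST_INDEX_OF(full_name, ' ')  AS last_space,
    STRSPLIT(full_name, ' ', 2)    AS parts;         -- regex, max number of parts
```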

11. What is the use of foreach operation in Pig scripts?

Foreach applies a transformation to every tuple in a relation and generates a new relation from the results.

Eg. A = LOAD 'data' AS (f1, f2, f3);

B = FOREACH A GENERATE f1 + 5;

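12. What is Flatten and what does it do in PIG?

Sometimes data is nested inside a bag or a tuple, and Flatten is used to remove that level of nesting. It is a modifier that looks like a UDF in a script but is more powerful: it un-nests the bag or tuple and so changes the structure of the output. Flattening a tuple promotes its fields to the top level, and flattening a bag produces one output row per tuple in the bag.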
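A common use is splitting lines of text into words. In this sketch the input file name is an assumption.

```pig
-- Hypothetical text file, one line of text per record.
lines = LOAD 'input.txt' AS (line:chararray);

-- TOKENIZE returns a bag of words, so each row still holds one bag.
nested = FOREACH lines GENERATE TOKENIZE(line) AS words;

-- FLATTEN un-nests the bag, producing one row per word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
```

The relation nested has as many rows as there are lines, while words has as many rows as there are words.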

13. What are the different modes in which PIG can run? Explain them.

Pig can run in two modes, as below-

Local Mode: Runs on a single machine against the local file system and doesn’t need a Hadoop cluster or HDFS.
MapReduce Mode: Runs on a Hadoop cluster. Hadoop must be running, and the data is read from and written to HDFS.

14. What are the debugging tools used for Apache Pig scripts?

There are three main ways to debug a Pig script-

Describe: Prints the schema of a relation.
Eg. grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Now run describe on student-

grunt> describe student;

and you will get the below output-

student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }

Explain: Shows the logical, physical, and MapReduce execution plans.
Illustrate: Shows the step-by-step execution of each statement.
If you run illustrate on student, you will get output like the below-

grunt> illustrate student;

INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: student[1,10] C: R:
---------------------------------------------------------------------------------------------------
| student | id:int | firstname:chararray | lastname:chararray | phone:chararray | city:chararray |
---------------------------------------------------------------------------------------------------
|         | 002    | siddarth            | Battacharya        | 9848022338      | Kolkata        |
---------------------------------------------------------------------------------------------------

15. What are the different execution modes available in PIG?

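Below are the three different execution modes available in Pig-

Interactive Mode (also known as Grunt mode): Statements are entered one at a time in the Pig shell.
Batch Mode: Pig Latin statements are written in a script file and run in one go.
Embedded Mode: Pig is invoked from a host language such as Java, for example through the PigServer class.

Note: The Pig interactive shell is known as the Grunt shell. In it you can enter Pig Latin statements and also run file system commands against HDFS.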
