Word Count Example in Pig Latin

When you start with analytics, a word count is one of the first examples you will work through. Here I will walk through the word count example in Pig Latin.

Hive can also do a word count, but Pig is usually recommended for it because of built-in functions such as TOKENIZE and FLATTEN.

So let’s start with the Pig word count example in an easy way. I ran this example myself in Pig Latin for this post.

To start with the word count in Pig Latin, you need a file to count words in. I was recently working on some client data, so let me share that file for your reference.

It is a PDF file, so you first need to convert it into a text file, which you can easily do with any PDF-to-text converter.

Here are both the PDF and the text file for your reference. I recommend downloading both files before starting the word count example in Pig Latin.

Text file link
PDF File link

● Now open your command prompt and enter “pig” to open the Grunt shell.

Load the data from HDFS in PIG

● Once you are in the Grunt shell, define a relation that loads the data using the statement below.

A = LOAD '/user/cloudera/pig/Scalable Sentiment Classification for Big Data.txt' USING PigStorage('\t') AS (pdfdata:chararray);
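
One thing to watch with the statement above: PigStorage('\t') splits every line on tabs, and since the schema declares only one field, my understanding is that any text after a tab on a line would be dropped. If the converted text file might contain tabs, a hedged alternative is the built-in TextLoader, which reads each whole line into a single chararray. This is only a sketch; the relation name `lines` is hypothetical and is not used in the rest of this post.

```pig
-- Sketch only: 'lines' is a hypothetical relation name.
-- TextLoader reads each input line as one chararray and does not split on tabs.
lines = LOAD '/user/cloudera/pig/Scalable Sentiment Classification for Big Data.txt'
        USING TextLoader() AS (pdfdata:chararray);

-- Print the schema of the relation; this does not launch a MapReduce job.
DESCRIBE lines;
```

If you go this way, use `lines` wherever the following steps use `A`.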

● You can check how the data was loaded using the DUMP command, as below:

Dump A;

This may take some time because Pig runs a MapReduce job. Once the job completes, every record of ‘A’ is printed as a tuple in parentheses.

Convert Sentence into words

Each record in ‘A’ now holds one line of text, and we have to split it into words. For this we can use the TOKENIZE function of Pig Latin inside a FOREACH ... GENERATE statement.

TOKENIZE(pdfdata)

(or)

If the words are separated by a specific delimiter such as a space, we can pass it as the second argument:

TOKENIZE(pdfdata, ' ')

In this word count example we will use TOKENIZE(pdfdata, ' ') because the file is space separated.

TOKENIZE returns one bag per input line, with every word wrapped in its own tuple. For two lines from the sample file, the result looks roughly like this:

{(fusion),(in),(a),(cloud-enabled),(environment,”),(High),(Performance),(Semantic)}
{(Cloud),(Auditing,),(Springer),(Publishing,),(2013.)}

But for counting we need each word on a row of its own, something like this:

(fusion)
(in)
(a)
(cloud-enabled)
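
TOKENIZE has two forms, and which one you pick changes where the text is split. As far as I know, the single-argument form splits on a default set of delimiters (space, double quote, comma, parentheses, and asterisk), while the two-argument form splits only on the delimiter you give. Here is a minimal sketch to compare them, assuming the relation A loaded earlier; the relation names `default_split` and `space_split` are hypothetical.

```pig
-- Sketch only: 'default_split' and 'space_split' are hypothetical relation names.
-- Single-argument form: splits on Pig's default delimiter set.
default_split = FOREACH A GENERATE TOKENIZE(pdfdata) AS words;

-- Two-argument form: splits only on the given delimiter, here a single space,
-- so punctuation such as a trailing comma stays attached to the word.
space_split = FOREACH A GENERATE TOKENIZE(pdfdata, ' ') AS words;
```

Both relations hold one bag of words per input line, so either can be passed on to the next step.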

Convert column into rows

Here we have to turn each word into a row of its own. For this, Pig has the built-in FLATTEN operator.

eachrow = FOREACH A GENERATE FLATTEN(TOKENIZE(pdfdata,' ')) AS word;

Check the output using the DUMP command:

Dump eachrow;
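
Dumping the whole relation prints every word in the file, which can be a lot. A lighter way to check the result, sketched here under the assumption that a small sample is enough, is to look at the schema and then dump only a few rows. The relation name `preview` is hypothetical.

```pig
-- Sketch only: 'preview' is a hypothetical relation name.
-- Show the schema of the flattened relation; DESCRIBE does not launch a MapReduce job.
DESCRIBE eachrow;

-- Keep only ten rows so that DUMP prints a short sample.
preview = LIMIT eachrow 10;
DUMP preview;
```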

Group the words

We now have every word on its own row, and we have to group identical words together so that we can count them.
Use the command below for this purpose:

groupword = GROUP eachrow BY word;
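
Keep in mind that grouping compares the words exactly, so “Cloud” and “cloud” end up in different groups. If you would rather count them together, one option is to lower-case the words before grouping. This is only a sketch of that idea and not part of the original example; the relation names `normalized` and `groupword_ci` are hypothetical.

```pig
-- Sketch only: 'normalized' and 'groupword_ci' are hypothetical relation names.
-- Lower-case every word so that 'Cloud' and 'cloud' fall into the same group.
normalized = FOREACH eachrow GENERATE LOWER(word) AS word;

-- Group the lower-cased words exactly as before.
groupword_ci = GROUP normalized BY word;
```

If you use this variant, count with COUNT(normalized) in the next step instead of COUNT(eachrow).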

Generate count for word count in Pig

We are at the last step of this word count in Pig Latin. Here you just need to count the rows in each group from the previous step.
For this, use the query below:

pdfwordcount = FOREACH groupword GENERATE group, COUNT(eachrow);

Now check the output using the DUMP command:
DUMP pdfwordcount;

Each row of the output is a tuple holding a word and the number of times it occurs in the file.
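
To look at only the most frequent words, the counts can be sorted before printing. The following is a minimal sketch, assuming the `groupword` relation from the grouping step; the relation names `named`, `sorted`, and `top5` and the field names `word` and `total` are hypothetical.

```pig
-- Sketch only: 'named', 'sorted', and 'top5' are hypothetical relation names.
-- Give both columns a name so that later steps can refer to them.
named = FOREACH groupword GENERATE group AS word, COUNT(eachrow) AS total;

-- Sort by count, highest first, and keep the first five rows.
sorted = ORDER named BY total DESC;
top5 = LIMIT sorted 5;
DUMP top5;
```

Naming the fields with AS is a small design choice here: it lets ORDER refer to `total` instead of a positional reference such as $1.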

You are done. This was all about the word count example in Pig Latin. If you want to do further analysis on this file, such as checking the top 5 words, you can save the result.
You can use the command below to save the result in HDFS.

grunt> store pdfwordcount into '/user/cloudera/pig/wordcount/';

The output file will be inside the wordcount directory with the default file name part-r-00000. Make sure the directory you are saving to doesn’t already exist, otherwise Pig will throw an error.
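
If you run the example more than once, the existing output directory gets in the way. One way around it, sketched here on the assumption that the old results can safely be thrown away, is to remove the directory from the Grunt shell before storing. Be careful, because this deletes whatever is in that path.

```pig
-- Sketch only: removes any earlier output so that STORE can be run again.
-- rmf does not fail when the directory is missing.
rmf /user/cloudera/pig/wordcount

-- Write the counts as tab-separated text.
STORE pdfwordcount INTO '/user/cloudera/pig/wordcount' USING PigStorage('\t');

-- List the files written by the job.
fs -ls /user/cloudera/pig/wordcount
```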

I hope you enjoyed this article on word count in Pig Latin. Do try it yourself and let me know if you have any doubts.
