
Datasets for Hadoop Practice

In this tutorial, I am going to share a few free Hadoop data sources. You can download them and start practicing Hadoop right away.

I have compiled a list of the datasets available and shortlisted around 10 of them for Hadoop practice. Working with these datasets will give you realistic, hands-on experience with Hadoop and its ecosystem.

Top Hadoop Datasets for Practice

Here is the list of free Hadoop datasets for practice:

1. clearbits.net: It provides a quarterly full data dump of Stack Exchange. You can get around 10 GB of data from here, which makes it an ideal source of Hadoop practice data.
2. grouplens.org: A great collection of datasets for Hadoop practice. Check the site and download the available data for live examples.
3. Amazon: It's no secret that Amazon is among the market leaders when it comes to cloud, and AWS is used with Hadoop on a large scale. Amazon also provides a lot of datasets that you can download for Hadoop practice.
4. University of Waikato: This university provides quality datasets for machine learning.
5. ClueWeb09: 1 billion web pages collected between January and February 2009. 5 TB compressed.
6. Wikipedia: Yes! Wikipedia also provides datasets for Hadoop practice. You get fresh, real data to work with.
7. ICS: You will find a huge collection of 180+ datasets here.
8. LinkedData: You may find datasets in almost every category here.
9. AWS Public datasets: AWS officially provides example datasets here.
10. RDM: A list of a large number of free datasets for practice.

That was the list of datasets for Hadoop practice. All of them are free, so all you have to do is download the data and start using it in your Hadoop projects to practice with large volumes of data.
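Once a dataset is downloaded, it still has to be copied into HDFS before Hadoop jobs can read it. The following is a minimal sketch of that step, not a definitive procedure. The helper names `run` and `load_into_hdfs`, the `DRY_RUN` switch, and the example paths `./ratings.csv` and `/user/practice/movielens` are all hypothetical; only the `hdfs dfs -mkdir`, `-put`, and `-ls` subcommands come from Hadoop itself.

```shell
#!/bin/sh
# Sketch: copy an already-downloaded dataset file into HDFS.
# run, load_into_hdfs, DRY_RUN and the paths below are hypothetical names.

# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# load_into_hdfs LOCAL_FILE HDFS_DIR
load_into_hdfs() {
    local_file="$1"
    hdfs_dir="$2"
    run hdfs dfs -mkdir -p "$hdfs_dir"            # create the target directory if missing
    run hdfs dfs -put "$local_file" "$hdfs_dir/"  # upload the local file
    run hdfs dfs -ls "$hdfs_dir"                  # list the directory to confirm the upload
}

# Example call with placeholder paths; prints three commands in dry-run mode.
load_into_hdfs ./ratings.csv /user/practice/movielens
```

Review the printed commands first, then set `DRY_RUN=0` to actually run them against your cluster.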

Also, if you have Hadoop installed on your PC, you can generate practice data directly in HDFS with the example programs that ship with it:

• hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar randomwriter /random-data: generates 10 GB of data per node under the /random-data folder in HDFS.
• hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar randomtextwriter /random-text-data: generates 10 GB of textual data per node under the /random-text-data folder in HDFS.
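The jar path in the commands above depends on how Hadoop was installed, so it may not exist on your machine. Here is a small sketch that looks for the examples jar and prints the matching generator command. It assumes that, besides the path used above, the jar may live under `$HADOOP_HOME/share/hadoop/mapreduce/` with a name like `hadoop-mapreduce-examples-<version>.jar`, which is common on newer installations but should be checked on yours. The helper names `find_examples_jar` and `generator_cmd` are hypothetical.

```shell
#!/bin/sh
# Sketch: find the Hadoop examples jar and print a data-generator command.
# find_examples_jar and generator_cmd are hypothetical helper names.

# Print the first examples jar that exists; return 1 if none is found.
find_examples_jar() {
    for jar in \
        /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar \
        "${HADOOP_HOME:-/nonexistent}"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
    do
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    return 1
}

# generator_cmd PROGRAM HDFS_DIR
# Print (do not run) the command line for the given generator program.
generator_cmd() {
    jar="$(find_examples_jar)" || {
        echo "examples jar not found; check HADOOP_HOME" >&2
        return 1
    }
    echo "hadoop jar $jar $1 $2"
}
```

For example, `generator_cmd randomtextwriter /random-text-data` prints the full command for your installation. The command is printed rather than executed on purpose, since each run writes about 10 GB per node.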

Which dataset do you use for Hadoop practice?
