This formula to calculate HDFS node storage is important both for practical Hadoop work and for Hadoop interviews. So let's get started: how do you calculate HDFS node storage?
Here is the formula to find the HDFS storage required while building a Hadoop cluster. When you are starting a Hadoop project from scratch, this is one of the things you will have to consider. Usually the Hadoop architect specifies the HDFS storage required, but you can also estimate it yourself using the formula below.
Formula to calculate HDFS node storage (H) required

H = (C × R × S) / (1 − i)

Here are the meanings of the abbreviations used above:
- H: denotes the HDFS node storage required.
- C: the compression ratio, which depends entirely on the type of compression used and the nature of the data. For example, for LZOP it is 0.8. When no compression is used, C is 1.
- R: the replication factor, which is 3 by default in a production cluster. You may increase or decrease it depending on your needs and the criticality of your data.
- S: This is the important part. S denotes the initial amount of data you need to move to Hadoop, combining both the historical and the incremental data you have. Along with the initial data, you also have to account for incremental growth and the data produced by MapReduce jobs. For example, say you currently have 100 TB of data, expect it to grow by 10 TB over the next 3-6 months, and estimate that MapReduce jobs will contribute around 10% more. Overall, you should then use at least 100 + 10 + 10 = 120 TB as the initial data size.
- i: the intermediate data factor, usually 1/4 to 1/3. It accounts for Hadoop's intermediate working space, used to store the intermediate results of tools like Hive and Pig. It is recommended for production applications, and Cloudera suggests reserving 25% for intermediate results.
- 1.2: a 20% headroom factor, i.e., plan for 120% of the computed size. This leaves room for the HDFS underlying systems. So if you have 120 TB of HDFS storage, you are advised to keep only about 100 TB of data on it.
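The formula can be sketched as a small Python helper (the function name and defaults are illustrative, not part of any Hadoop tooling):

```python
def hdfs_storage(S, C=1.0, R=3, i=0.25):
    """Estimate raw HDFS storage needed for an initial data size S (in TB).

    C: compression ratio (1.0 = no compression; e.g. 0.8 for LZOP)
    R: replication factor (3 by default in a production cluster)
    i: intermediate data factor (Cloudera suggests 0.25)
    """
    return (C * R * S) / (1 - i)
```

With the defaults and S = 120 TB, `hdfs_storage(120)` gives 480.0 TB.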
Based on the above assumptions, let's calculate how much HDFS storage you would require for a Hadoop cluster.
Find HDFS Node Storage
Considering C = 1, R = 3, and i = 0.25, the HDFS node storage size will be:
H = (1 × 3 × S) / (1 − 0.25) = 4S
That means, with the default values, Hadoop Storage should be at least 4 times the initial data size.
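Plugging in the worked numbers from above, again as an illustrative Python sketch (the variable names are just for readability):

```python
# Defaults from the article, with S = 120 TB of initial data
C, R, S, i = 1.0, 3, 120, 0.25

# Raw HDFS storage required: 4x the initial data size
H = (C * R * S) / (1 - i)   # 480.0 TB

# Keep ~20% headroom for HDFS housekeeping: the data you
# should actually store on that much raw capacity
usable = H / 1.2            # 400.0 TB
```

So a 120 TB initial data set calls for roughly 480 TB of raw HDFS capacity, of which about 400 TB should actually hold data.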
Hope this gives you an idea of how to find the HDFS storage for Hadoop applications. This is how to calculate HDFS node storage easily.
Have any doubts? Comment below!