
HDFS Federation vs High Availability [An Ultimate Guide]

HDFS federation vs high availability: whichever of the two we are talking about, both arrived with Hadoop 2.

In our HDFS tutorial we discussed the single point of failure in Hadoop 1 at length and the need for high availability. Hadoop 2 implemented it.

If you compare Hadoop 1 with Hadoop 2, you can clearly see how much risk was involved in working with Hadoop 1.

With Hadoop 2, the namenode is no longer a single point of failure, because we now have multiple namenode options. Two new features came into the picture:

  • HDFS federation and
  • High Availability

Both give you more than one namenode, each in a different fashion, and that is what we are going to talk about in this HDFS federation vs high availability comparison.

But before looking into the difference between HDFS federation and high availability, let’s go through some details of both.

HDFS Federation

HDFS federation enhances the existing architecture of HDFS. With Hadoop 1, the entire cluster used a single namespace, and a single namenode managed that namespace. If that namenode failed, the whole Hadoop cluster went down.

The cluster stayed unavailable until that namenode came back, which cost both resources and time.

HDFS federation in Hadoop 2 overcomes this limitation by allowing more than one namenode, and therefore more than one namespace.

If you look at the Hadoop 1 HDFS architecture, you can clearly see how the namenode was the single point of failure for the Hadoop cluster.

HDFS Federation Architecture

Now look at the HDFS federation architecture.

There are multiple namenodes (NN1, NN2…NNn), and each one has its own namespace (NS1, NS2…NSn).

In this way, HDFS federation addresses the single point of failure of the namenode: each namenode manages only its own namespace, so a failure affects that namespace alone. At the bottom, the datanodes act as shared storage for all the namenodes, the same as before.

  • Block Pool- A block pool is the set of blocks that belong to a single namespace. HDFS federation has one block pool per namespace, and each block pool is managed independently.
  • Namespace Volume- A namespace together with its block pool is a namespace volume. An HDFS federation contains many namespace volumes, and all of them work independently.

When we delete a namenode/namespace, the corresponding block pool on the datanodes is deleted as well.
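To make the relationship between namespaces, block pools and datanodes concrete, here is a minimal toy model in Python. It is only a sketch: the class names, the pool IDs (BP-1, BP-2) and the block IDs are made up for illustration and are not HDFS APIs. It shows that datanodes store blocks for every namespace, grouped by block pool, and that removing one namespace volume drops only its own pool.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy datanode: stores blocks for every namespace, grouped by block pool."""
    name: str
    # block pool ID -> set of block IDs held on this datanode
    pools: dict = field(default_factory=dict)

    def store(self, pool_id, block_id):
        self.pools.setdefault(pool_id, set()).add(block_id)

    def drop_pool(self, pool_id):
        self.pools.pop(pool_id, None)


@dataclass
class NamespaceVolume:
    """A namespace plus its block pool, owned by exactly one namenode."""
    namenode: str
    pool_id: str
    files: dict = field(default_factory=dict)  # path -> list of block IDs

    def write(self, path, block_ids, datanodes):
        # For simplicity, every block is stored on every datanode.
        self.files[path] = list(block_ids)
        for dn in datanodes:
            for block_id in block_ids:
                dn.store(self.pool_id, block_id)


def remove_volume(volume, datanodes):
    """Deleting a namespace also deletes its block pool on every datanode."""
    volume.files.clear()
    for dn in datanodes:
        dn.drop_pool(volume.pool_id)


datanodes = [DataNode("dn1"), DataNode("dn2")]
ns1 = NamespaceVolume(namenode="NN1", pool_id="BP-1")
ns2 = NamespaceVolume(namenode="NN2", pool_id="BP-2")

ns1.write("/sales/a.csv", ["blk_1", "blk_2"], datanodes)
ns2.write("/logs/b.log", ["blk_3"], datanodes)

remove_volume(ns1, datanodes)
print(datanodes[0].pools)  # only the BP-2 pool is left
```

After NS1 is removed, the blocks of NS2 are untouched on both datanodes, which is the independence the two bullet points describe.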

HDFS federation benefits

Here are some of the leading advantages of HDFS federation-

  • Isolation- With Hadoop 1 there was no isolation in multi-user scenarios. HDFS federation overcomes this: different applications can be isolated in different namespaces, each served by its own namenode.
  • Namespace scalability- With HDFS federation, the filesystem namespace scales horizontally by adding more namenodes.
  • Enhanced performance- Adding more namenodes raises overall read and write throughput, because requests are spread across several namenodes instead of one.
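Scaling the namespace out is largely a matter of registering one more namenode in the configuration. The sketch below builds the relevant hdfs-site.xml properties as a Python dictionary. The property names `dfs.nameservices` and `dfs.namenode.rpc-address.<nameservice>` are the ones Hadoop uses for federation; the nameservice IDs (ns1, ns2, ns3), the hostnames and the RPC port 8020 are assumptions chosen for illustration.

```python
def federation_properties(namenode_hosts, rpc_port=8020):
    """Return hdfs-site.xml style properties for independent namenodes.

    namenode_hosts maps a nameservice ID to the host running its namenode.
    The hosts and the port are illustrative assumptions.
    """
    props = {"dfs.nameservices": ",".join(namenode_hosts)}
    for ns_id, host in namenode_hosts.items():
        props[f"dfs.namenode.rpc-address.{ns_id}"] = f"{host}:{rpc_port}"
    return props


cluster = {"ns1": "nn1.example.com", "ns2": "nn2.example.com"}
props = federation_properties(cluster)

# Namespace scalability: register a third, independent namenode.
cluster["ns3"] = "nn3.example.com"
scaled = federation_properties(cluster)

for name, value in scaled.items():
    print(f"{name} = {value}")
```

Each namenode gets its own RPC address, so an application pointed at ns3 never talks to the namenodes of ns1 or ns2.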

High Availability

High availability simply means running more than one instance of a service or product, so that if one instance goes down, a backup is available to take over.

The same applies to Hadoop. With Hadoop 2 and above, we normally have two namenodes in an active-passive arrangement. At any time, the active namenode serves the cluster, while the passive (standby) namenode serves no clients and only keeps its metadata up to date. When the active namenode goes down, the standby takes over (automatically, if automatic failover is configured) to keep the cluster up and running. This is how high availability is provided for the Hadoop cluster.

So, for high availability (HA), you need at least two separate machines configured as namenodes: one active and one standby.
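The active-passive behaviour can be sketched with a small simulation. This is a toy model, not Hadoop code: the classes are hypothetical, and the shared edit log is represented by a plain in-memory list, which is an assumption made to keep the example short. The point it illustrates is that the standby replays the edits written by the active namenode, so it can take over with the same metadata.

```python
class NameNode:
    def __init__(self, name, state):
        self.name = name
        self.state = state      # "active", "standby" or "down"
        self.namespace = {}     # path -> size, stands in for the metadata
        self.applied = 0        # number of shared edits already applied

    def apply(self, edits):
        """Replay every edit this namenode has not seen yet."""
        for op, path, size in edits[self.applied:]:
            if op == "create":
                self.namespace[path] = size
            elif op == "delete":
                self.namespace.pop(path, None)
        self.applied = len(edits)


class HaPair:
    def __init__(self):
        self.edits = []  # shared edit log (assumption: a simple list)
        self.active = NameNode("nn1", "active")
        self.standby = NameNode("nn2", "standby")

    def write(self, op, path, size=0):
        # Only the active namenode accepts changes from clients.
        self.edits.append((op, path, size))
        self.active.apply(self.edits)

    def failover(self):
        failed = self.active
        failed.state = "down"
        self.standby.apply(self.edits)   # catch up before serving
        self.standby.state = "active"
        # The failed node would rejoin as the standby once repaired.
        self.active, self.standby = self.standby, failed


pair = HaPair()
pair.write("create", "/data/a", 128)
pair.write("create", "/data/b", 64)
pair.write("delete", "/data/a")

pair.failover()
print(pair.active.name, pair.active.namespace)  # nn2 {'/data/b': 64}
```

Note that both namenodes describe the same namespace here. That is the key contrast with federation, where each namenode owns a different namespace.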

HDFS federation vs high availability- the conclusion

So far we have discussed both HDFS federation and high availability in some detail. Let’s wrap up with the key differences between the two.

The major difference between HDFS federation and high availability is that in HDFS federation the namenodes are not related to each other. All namenodes share the datanodes as common storage, but each namenode has its own dedicated namespace and block pool. In this way, HDFS federation provides fault tolerance.

Here fault tolerance means that if one namenode goes down, the data of the other namenodes is not affected. In short, HDFS federation means multiple namenodes that are independent of each other.
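This kind of fault isolation can be illustrated with another small sketch. Everything in it is hypothetical: the mapping from path prefixes to namenodes is a made-up routing table for the example, and the class is not part of any Hadoop API. It shows that when NN1 is down, paths served by NN2 are still reachable, while paths under NN1 are not.

```python
class NameNodeDown(Exception):
    """Raised when the namenode that owns a path is unavailable."""


class FederatedCluster:
    def __init__(self):
        # Hypothetical routing table: path prefix -> owning namenode.
        self.mounts = {"/sales": "NN1", "/logs": "NN2"}
        # Each namenode knows only the files of its own namespace.
        self.files = {"NN1": {"/sales/q1.csv"}, "NN2": {"/logs/app.log"}}
        self.down = set()

    def namenode_for(self, path):
        for prefix, namenode in self.mounts.items():
            if path == prefix or path.startswith(prefix + "/"):
                return namenode
        raise KeyError(f"no namespace serves {path}")

    def exists(self, path):
        namenode = self.namenode_for(path)
        if namenode in self.down:
            raise NameNodeDown(namenode)
        return path in self.files[namenode]


cluster = FederatedCluster()
cluster.down.add("NN1")  # simulate a failure of NN1

print(cluster.exists("/logs/app.log"))  # NN2 still answers

try:
    cluster.exists("/sales/q1.csv")
except NameNodeDown as err:
    print(f"namespace unavailable, namenode {err} is down")
```

The /sales namespace stays unavailable until NN1 returns, because no other namenode holds a copy of its metadata. Federation limits the damage of a failure, but it does not replace a failed namenode.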

In the case of HA, both the active and the standby namenode run all the time. The active namenode serves the cluster, while the standby serves no clients and keeps updating its metadata to stay current. Once the active namenode goes down, the standby takes its place with the most recent metadata it has.

So, in the case of HDFS HA, you must have two separate machines: the active namenode is configured on the first and the standby namenode on the second.
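As a rough illustration of what "two separate machines" means in configuration terms, the sketch below generates the core hdfs-site.xml entries for one HA nameservice. The property names are the standard Hadoop HA ones; the nameservice ID (mycluster), the namenode IDs (nn1, nn2), the hostnames and the port 8020 are assumptions for the example. A real setup also needs shared edits storage and, for automatic failover, ZooKeeper settings, which are left out here.

```python
from xml.sax.saxutils import escape


def ha_properties(nameservice, hosts, rpc_port=8020):
    """Build hdfs-site.xml properties for one HA nameservice.

    hosts maps a namenode ID (e.g. "nn1") to the machine it runs on.
    """
    if len(hosts) < 2:
        raise ValueError("HA needs at least two namenodes")
    if len(set(hosts.values())) != len(hosts):
        raise ValueError("each namenode should run on its own machine")
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(hosts),
    }
    for nn_id, host in hosts.items():
        key = f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"
        props[key] = f"{host}:{rpc_port}"
    # Lets clients find whichever namenode is currently active.
    props[f"dfs.client.failover.proxy.provider.{nameservice}"] = (
        "org.apache.hadoop.hdfs.server.namenode.ha."
        "ConfiguredFailoverProxyProvider"
    )
    return props


def to_xml(props):
    """Render a property dictionary in Hadoop's configuration XML layout."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


ha_props = ha_properties(
    "mycluster",
    {"nn1": "host1.example.com", "nn2": "host2.example.com"},
)
xml_text = to_xml(ha_props)
print(xml_text)
```

Unlike the federation case, both namenodes here sit under a single nameservice, so clients address mycluster rather than an individual machine.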


Summary

These were the differences between HDFS federation and high availability. Both are part of Hadoop 2, and both keep the cluster from going down completely when one namenode fails: federation limits the failure to a single namespace, while HA lets a standby namenode take over.

