[Solved] Difference Between HDFS and NAS
Hello All, I want to know the difference between HDFS and NAS. I am preparing some important topics related to Hadoop for my upcoming interview. Can anyone tell me which one is better in terms of cost and data storage, HDFS or NAS?
A distributed file system is mainly designed to hold a large amount of data and provide access to this data to many clients distributed across a network. There are a number of distributed file systems that solve this problem in different ways. The oldest and still very popular one is NFS (Network File System); NAS appliances typically expose their storage over NFS, so the comparison below applies to NAS as well.
But NFS has several limitations as a distributed file system.
1. The files reside on a single machine.
2. It provides no reliability guarantees if that machine goes down, and it can only store as much data as fits on that one machine.
3. Finally, as all the data is stored on a single machine, all the clients must go to this machine to retrieve their data. This can overload the server when it must handle a large number of clients. Clients must also copy the data to their local machines before they can operate on it.
To overcome these drawbacks, HDFS (Hadoop Distributed File System) was designed.
1. HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
2. HDFS should store data reliably: if individual machines in the cluster malfunction, the data should still be available.
3. HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster.
4. HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
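To make point 1 concrete, here is a rough back-of-the-envelope sketch of how HDFS splits a file into blocks and replicates them across machines. The 128 MB block size and replication factor of 3 are the HDFS defaults (configurable via `dfs.blocksize` and `dfs.replication`); the 1 GB file size is just an example:

```python
import math

# HDFS defaults (configurable via dfs.blocksize and dfs.replication)
BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    """Return (num_blocks, raw_storage_mb) for a file of the given size.

    Each file is split into fixed-size blocks, and every block is stored
    REPLICATION times on different DataNodes, so the raw cluster storage
    consumed is a multiple of the logical file size.
    """
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return num_blocks, file_size_mb * REPLICATION

blocks, raw_mb = hdfs_footprint(1024)  # a 1 GB file
print(blocks, raw_mb)  # 8 blocks, 3072 MB of raw cluster storage
```

This also hints at the cost trade-off the question asks about: with the default replication factor, HDFS consumes roughly three times the logical data size in raw disk, but that disk can be cheap commodity hardware rather than a dedicated NAS appliance.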
However, HDFS also has some limitations.
1. HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
2. Data is written to HDFS once and then read many times; updates to files after they have been closed are not supported.
3. Due to the large size of files, and the sequential nature of reads, the system does not provide a mechanism for local caching of data.
4. Individual machines are assumed to fail on a frequent basis, both permanently and intermittently. The cluster must be able to withstand the complete failure of several machines, possibly many happening at the same time.
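Point 4 is exactly why the replication described earlier matters: with each block stored on several distinct machines, losing a few machines does not lose any data. The toy simulation below illustrates the idea (a deliberately simplified model of replica placement, not how the NameNode actually chooses DataNodes):

```python
import random

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block's replicas to distinct nodes (simplified placement)."""
    return {b: random.sample(nodes, replication) for b in range(num_blocks)}

def available(placement, failed):
    """A block survives as long as at least one replica is on a live node."""
    return all(any(n not in failed for n in replicas)
               for replicas in placement.values())

random.seed(0)
nodes = [f"dn{i}" for i in range(10)]
placement = place_replicas(100, nodes)

# Fail any two of the ten DataNodes: with 3 replicas on distinct nodes,
# every block is guaranteed to keep at least one live copy.
print(available(placement, failed={"dn0", "dn1"}))  # True
```

With a replication factor of 3, data is only lost if all three nodes holding a given block's replicas fail before HDFS re-replicates the block elsewhere, which is what lets the cluster ride out frequent individual-machine failures.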