Learn the Basics of HDFS in Hadoop
HDFS is the distributed file system used by Apache Hadoop. Hadoop uses HDFS to store very large amounts of data, up to petabytes. HDFS distributes this data across different machines in a clustered architecture; because the data is spread over multiple machines, it stays highly available while it is being processed. HDFS also runs on low-cost commodity hardware.
How is data stored in HDFS?
Hadoop runs on groups of machines called clusters and processes huge amounts of data across them.
Each cluster has a set of nodes called Data Nodes, and each node holds a set of blocks. The data in these blocks is duplicated across different nodes in the cluster. This duplication is what we call replication, just as in the database world. Because the same data is available on different nodes or machines, you do not lose it when one of them goes down.
The number of nodes to which the data is replicated is configurable in Hadoop. To coordinate all these nodes there is a Name Node, which keeps in sync with the Data Nodes in the cluster to track their health, and which also stores metadata about the nodes as well as the blocks.
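As a rough illustration, the replication count is driven by the dfs.replication property, which is normally set cluster-wide in hdfs-site.xml but can also be set per client or per file. The snippet below is a minimal sketch using the standard Hadoop Java client; the path and values are illustrative only, not taken from this post's setup.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask for 3 copies of each block (the usual HDFS default).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Replication can also be changed for an existing file.
        // "/pdf/page1" is only an illustrative path.
        fs.setReplication(new Path("/pdf/page1"), (short) 3);
        fs.close();
    }
}
```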
If the data grows rapidly, we can add nodes without taking the whole system down or losing data.
In networking terms, this is what we call a scalable system.
The system also guards against data loss while new machines are being added to the existing ones, or after a machine has joined the cluster. Since a cluster has many nodes, if one node fails, Hadoop handles the failure without losing data and keeps serving the required work as expected.
HDFS stores data in files, and those files ultimately sit on the file system of the underlying operating system.
HDFS is suitable for storing large amounts of data, on the order of terabytes to petabytes, which is then processed with MapReduce for OLAP-style workloads.
Let us take a scenario where you have a 12-page PDF file that you want to store in HDFS. Assume that each page maps to one block; in a real system the split would be different.
The Name Node holds three items for the file (filename, number of blocks, block ids); a small code sketch after this list shows how a client can query them.
- Filename: the page of the PDF file stored under the HDFS file system, for example /pdf/page1
- Number of blocks: the count of blocks in which this file is stored in HDFS
- Block ids: the references the Name Node keeps to the blocks of the file
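Here is a minimal sketch, using the standard Hadoop FileSystem API, of how a client can ask the Name Node for this metadata. The path /pdf/page1 is just the example file from this post, and the code is only an illustration, not the exact setup described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path page = new Path("/pdf/page1");

        // File-level metadata kept by the Name Node: name, size, replication.
        FileStatus status = fs.getFileStatus(page);
        System.out.println("File: " + status.getPath());
        System.out.println("Replication: " + status.getReplication());

        // Block-level metadata: which Data Nodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " is stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```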
The Data Nodes hold the page data in different blocks. Page 1 is replicated into three blocks with ids 1, 5, and 6, and these blocks sit on different machines. Here is a summary of the pages stored on different nodes of the cluster:
- page1: 3, 6
- page2: 3, 2
- page3: 3, 1
Because of this replication, data is not lost even when a Data Node goes down. The Name Node and Data Nodes communicate with each other over TCP.
HDFS itself is written entirely in Java. Data stored in HDFS can be managed using shell commands as well as the Java HDFS APIs provided by Apache. The commands run on top of the underlying operating system and in turn call the Java APIs to interact with the file system.
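For example, a file can be copied in with the standard shell command hdfs dfs -put and read back with hdfs dfs -cat, or the same thing can be done through the Java API. The snippet below is a minimal sketch of the Java route; the path /pdf/page1 and the file contents are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/pdf/page1");

        // Write a small file; HDFS splits it into blocks and replicates them.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("contents of page 1".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same FileSystem abstraction.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buffer = new byte[1024];
            int read = in.read(buffer);
            System.out.println(new String(buffer, 0, read, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```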
I hope this gave you a small glimpse into the ocean that is HDFS.
Please leave a comment if you like this post.