Administering Hadoop

Namenode directory structure :-
——————————————

A newly formatted namenode creates the following directory structure:

${dfs.name.dir}/current/VERSION
                       /edits
                       /fsimage
                       /fstime

On my machine:

[root@myhostname current]# pwd
/data/2/hadoop/tmp/dfs/name/current
[root@myhostname current]#
[root@myhostname current]# ll -lhtr
total 16K
-rw-r--r-- 1 root root 110 Jul 22 07:21 fsimage
-rw-r--r-- 1 root root   4 Jul 22 07:21 edits
-rw-r--r-- 1 root root   8 Jul 22 07:21 fstime
-rw-r--r-- 1 root root 101 Jul 22 07:21 VERSION
[root@myhostname current]#
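
The path above is simply where dfs.name.dir resolves on my machine (judging by the path, hadoop.tmp.dir is set to /data/2/hadoop/tmp, and dfs.name.dir defaults to ${hadoop.tmp.dir}/dfs/name). To place the namenode metadata somewhere explicit, dfs.name.dir is set in hdfs-site.xml; the path below is only an illustration:

<property>
  <name>dfs.name.dir</name>
  <value>/data/2/hadoop/dfs/name</value>
</property>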

The VERSION file is a Java properties file that contains information about the version of HDFS that is running. Here are the contents of a typical file:

[root@myhostname current]# cat VERSION
#Tue Jul 22 07:21:16 EDT 2014
namespaceID=1968183180
cTime=0
storageType=NAME_NODE
layoutVersion=-41
[root@myhostname current]#

The layoutVersion is a negative integer that defines the version of HDFS’s persistent data structures. This version number has no relation to the release number of the Hadoop distribution. Whenever the layout changes, the version number is decremented (for example, the version after −18 is −19). When this happens, HDFS needs to be upgraded, since a newer namenode (or datanode) will not operate if its storage layout is an older version.
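
As a sketch of what that upgrade looks like on a Hadoop 1.x cluster (run as the HDFS user; the exact start script may differ with your installation):

# start HDFS with the new software, converting the on-disk layout
start-dfs.sh -upgrade

# check how far the upgrade has progressed
hadoop dfsadmin -upgradeProgress status

# once you are happy with the upgraded filesystem, make it permanent
hadoop dfsadmin -finalizeUpgrade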

The namespaceID is a unique identifier for the filesystem, which is created when the filesystem is first formatted. The namenode uses it to identify new datanodes, since they will not know the namespaceID until they have registered with the namenode.
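
A quick sanity check on a running cluster is to compare the namespaceID recorded by the namenode with the one recorded by a datanode (it lives in ${dfs.data.dir}/current/VERSION); the paths below assume the same default layout under hadoop.tmp.dir as on my machine:

# on the namenode
grep namespaceID /data/2/hadoop/tmp/dfs/name/current/VERSION

# on a datanode
grep namespaceID /data/2/hadoop/tmp/dfs/data/current/VERSION

If the two values disagree, the datanode will refuse to join the filesystem.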

The cTime property marks the creation time of the namenode’s storage. For newly formatted storage, the value is always zero, but it is updated to a timestamp whenever the filesystem is upgraded.

The storageType indicates that this storage directory contains data structures for a namenode.

The other files in the namenode’s storage directory are edits, fsimage, and fstime. These are all binary files, which use Hadoop Writable objects as their serialization format.

The filesystem image and edit log
——————————————

When a filesystem client performs a write operation (such as creating or moving a file), it is first recorded in the edit log. The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified. The in-memory metadata is used to serve read requests.
The edit log is flushed and synced after every write before a success code is returned to the client. For namenodes that write to multiple directories, the write must be flushed and synced to every copy before returning successfully. This ensures that no operation is lost due to machine failure.
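
This is also why dfs.name.dir is normally configured with more than one directory, typically one on a local disk and one on a remote NFS mount, so the metadata survives the loss of either. A minimal hdfs-site.xml sketch (the mount points here are made up for illustration):

<property>
  <name>dfs.name.dir</name>
  <value>/data/2/hadoop/dfs/name,/mnt/nfs/hadoop/dfs/name</value>
</property>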

The fsimage file is a persistent checkpoint of the filesystem metadata. However, it is not updated for every filesystem write operation, since writing out the fsimage file, which can grow to be gigabytes in size, would be very slow. This does not compromise resilience, however, because if the namenode fails, then the latest state of its metadata
can be reconstructed by loading the fsimage from disk into memory, then applying each of the operations in the edit log. In fact, this is precisely what the namenode does when it starts up.
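
While the namenode is starting up and replaying its edit log, the filesystem is in safe mode and is read-only. Two commands that are handy here (for example, in a startup script that needs to wait for the replay to finish):

# report whether the namenode is currently in safe mode
hadoop dfsadmin -safemode get

# block until the namenode has left safe mode
hadoop dfsadmin -safemode wait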

As described, the edits file would grow without bound. Though this state of affairs would have no impact on the system while the namenode is running, if the namenode were restarted, it would take a long time to apply each of the operations in its (very long) edit log. During this time, the filesystem would be offline, which is generally
undesirable.

The solution is to run the secondary namenode, whose purpose is to produce checkpoints of the primary’s in-memory filesystem metadata. The checkpointing process proceeds as follows:

1. The secondary asks the primary to roll its edits file, so new edits go to a new file.
2. The secondary retrieves fsimage and edits from the primary (using HTTP GET).
3. The secondary loads fsimage into memory, applies each operation from edits, then creates a new consolidated fsimage file.
4. The secondary sends the new fsimage back to the primary (using HTTP POST).
5. The primary replaces the old fsimage with the new one from the secondary, and the old edits file with the new one it started in step 1. It also updates the fstime file to record the time that the checkpoint was taken.

At the end of the process, the primary has an up-to-date fsimage file and a shorter edits file (it is not necessarily empty, as it may have received some edits while the checkpoint was being taken). It is possible for an administrator to run this process manually while the namenode is in safe mode, using the hadoop dfsadmin -saveNamespace command.
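
As a sketch, a manual checkpoint on a Hadoop 1.x cluster is run as the HDFS superuser:

# the namespace can only be saved while the namenode is in safe mode
hadoop dfsadmin -safemode enter

# write the in-memory metadata out as a new fsimage and reset the edit log
hadoop dfsadmin -saveNamespace

# return the filesystem to normal (writable) operation
hadoop dfsadmin -safemode leave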

This procedure makes it clear why the secondary has similar memory requirements to the primary (since it loads the fsimage into memory), which is the reason that the secondary needs a dedicated machine on large clusters.
