Hadoop Archives & MapReduce

Overview
--------
Hadoop archives are special-format archives. A Hadoop archive maps to a file system directory and always has a *.har extension. A Hadoop archive directory contains metadata (in the form of _index and _masterindex files) and data (part-*) files. The _index file contains the names of the files that are part of the archive and their locations within the part files.
How to Create an Archive
-------------------------
Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>
-archiveName is the name of the archive you would like to create, for example dbversity.har; the name should have a *.har extension. The -p (parent) argument specifies the path relative to which the files are archived. An example would be:
-p /dbversity/bar a/b/c e/f/g
Here /dbversity/bar is the parent path, and a/b/c and e/f/g are paths relative to the parent. Note that it is a MapReduce job that creates the archive, so you need a MapReduce cluster to run this. See the later sections for a detailed example.
If you just want to archive a single directory /dbversity/bar, you can simply use:
hadoop archive -archiveName zoo.har -p /dbversity/bar /outputdir
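If you then list the new archive directory itself (that is, without the har:// scheme), you should see the metadata and data files described in the overview. A sketch, assuming the zoo.har example above:
hadoop fs -ls /outputdir/zoo.har
# expect _index, _masterindex, and one or more part-* data files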
How to Look Up Files in Archives
--------------------------------
The archive exposes itself as a file system layer, so all the fs shell commands work on archives, just with a different URI. Also note that archives are immutable: renames, deletes, and creates return an error. The URI for Hadoop archives is
har://scheme-hostname:port/archivepath/fileinarchive
If no scheme is provided, the underlying filesystem is assumed. In that case the URI would look like
har:///archivepath/fileinarchive
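For example, a sketch of both URI forms (the namenode host, port, and file name here are purely illustrative, not taken from a real cluster):
hadoop fs -cat har://hdfs-namenode.example.com:8020/user/zoo/dbversity.har/dir1/a.txt
hadoop fs -cat har:///user/zoo/dbversity.har/dir1/a.txt
# Writes fail, because archives are immutable:
hadoop fs -rmr har:///user/zoo/dbversity.har/dir1/a.txt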
Archives Examples
-----------------
Creating an Archive
-------------------
hadoop archive -archiveName dbversity.har -p /user/hadoop dir1 dir2 /user/zoo
The above example creates an archive using /user/hadoop as the relative archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be archived in the file system directory /user/zoo/dbversity.har. Archiving does not delete the input files. If you want to delete the input files after creating the archive (to reduce namespace usage), you have to do it yourself.
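For instance, a minimal cleanup sketch for the example above; it is prudent to confirm the archive is readable before removing the originals:
# sanity-check that the archive is readable first
hadoop fs -lsr har:///user/zoo/dbversity.har
# then reclaim the namespace by removing the originals
hadoop fs -rmr /user/hadoop/dir1 /user/hadoop/dir2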
Looking Up Files
----------------
Looking up files in Hadoop archives is as easy as doing an ls on the filesystem. After you have archived the directories /user/hadoop/dir1 and /user/hadoop/dir2 as in the example above, you can see all the files in the archive by running:
hadoop dfs -lsr har:///user/zoo/dbversity.har/
To understand the significance of the -p argument, let's go through the above example again. If you just do an ls (not lsr) on the Hadoop archive using
hadoop dfs -ls har:///user/zoo/dbversity.har
The output should be:
har:///user/zoo/dbversity.har/dir1
har:///user/zoo/dbversity.har/dir2
As you can recall, the archive was created with the following command:
hadoop archive -archiveName dbversity.har -p /user/hadoop dir1 dir2 /user/zoo
If we were to change the command to:
hadoop archive -archiveName dbversity.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo
then an ls on the Hadoop archive using
hadoop dfs -ls har:///user/zoo/dbversity.har
would give you
har:///user/zoo/dbversity.har/hadoop/dir1
har:///user/zoo/dbversity.har/hadoop/dir2
Notice that the archived files have been archived relative to /user/ rather than /user/hadoop.
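Because the archive is exposed through the normal file system layer, copying files back out of it (unarchiving) is just a copy. A sketch, with /user/zoo/newdir as an illustrative destination:
# sequentially:
hadoop fs -cp har:///user/zoo/dbversity.har/dir1 /user/zoo/newdir
# or in parallel, as a MapReduce copy:
hadoop distcp har:///user/zoo/dbversity.har/dir1 /user/zoo/newdir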
Hadoop Archives and MapReduce
-----------------------------
Using Hadoop archives in MapReduce is as easy as specifying a different input filesystem than the default. If you have a Hadoop archive stored in HDFS at /user/zoo/dbversity.har, then to use it as MapReduce input all you need to do is specify the input directory as har:///user/zoo/dbversity.har. Since a Hadoop archive is exposed as a file system, MapReduce can use all the logical input files in the archive as input.
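As a sketch, running the bundled word count example over the archive might look like the following; the examples jar name and output path are illustrative and vary by installation:
hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount har:///user/zoo/dbversity.har/dir1 /user/zoo/wordcount_out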
Illustration
------------
[root@dbversity.com 6]# hadoop fs -mkdir /tmp/archive
[root@dbversity.com 6]# hadoop fs -mkdir /tmp/archive_op
[root@dbversity.com 6]# hadoop fs -cp /tmp/*.txt /tmp/archive
[root@dbversity.com 6]# hadoop fs -ls /tmp/archive
Found 6 items
-rwxr-xr-x 1 root root 38 2014-11-07 14:53 /tmp/archive/input.txt
drwxr-xr-x - root root 4096 2014-11-07 14:53 /tmp/archive/output.txt
drwxr-xr-x - root root 4096 2014-11-07 14:53 /tmp/archive/output_new.txt
drwxr-xr-x - root root 4096 2014-11-07 14:53 /tmp/archive/output_new12.txt
-rwxr-xr-x 1 root root 139715 2014-11-07 14:53 /tmp/archive/test.txt
-rwxr-xr-x 1 root root 38 2014-11-07 14:53 /tmp/archive/wc.txt
[root@dbversity.com 6]# 
[root@dbversity.com 6]# hadoop archive -archiveName mongo_hadoop_archive.har -p /tmp archive/* archive_op
14/11/07 14:54:24 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/07 14:54:24 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/07 14:54:24 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
14/11/07 14:54:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/11/07 14:54:25 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/11/07 14:54:25 INFO mapred.JobClient: Running job: job_local142539620_0001
14/11/07 14:54:25 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
14/11/07 14:54:25 INFO mapred.LocalJobRunner: Waiting for map tasks
14/11/07 14:54:25 INFO mapred.LocalJobRunner: Starting task: attempt_local142539620_0001_m_000000_0
14/11/07 14:54:25 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/11/07 14:54:25 INFO util.ProcessTree: setsid exited with exit code 0
14/11/07 14:54:25 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@57fa2a8a
14/11/07 14:54:25 INFO mapred.MapTask: Processing split: file:/tmp/hadoop-root/mapred/staging/root1969359586/.staging/har_p3wens/_har_src_files:0+758
14/11/07 14:54:25 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as group name and BYTES_READ as counter name instead
14/11/07 14:54:25 INFO mapred.MapTask: numReduceTasks: 1
14/11/07 14:54:25 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/11/07 14:54:25 INFO mapred.MapTask: io.sort.mb = 100
14/11/07 14:54:26 INFO mapred.JobClient: map 0% reduce 0%
14/11/07 14:54:26 INFO mapred.MapTask: data buffer = 79691776/99614720
14/11/07 14:54:26 INFO mapred.MapTask: record buffer = 262144/327680
14/11/07 14:54:26 INFO mapred.MapTask: Starting flush of map output
14/11/07 14:54:26 INFO mapred.MapTask: Finished spill 0
14/11/07 14:54:26 INFO mapred.Task: Task:attempt_local142539620_0001_m_000000_0 is done. And is in the process of commiting
14/11/07 14:54:26 INFO mapred.LocalJobRunner: 
14/11/07 14:54:26 INFO mapred.Task: Task attempt_local142539620_0001_m_000000_0 is allowed to commit now
14/11/07 14:54:26 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local142539620_0001_m_000000_0' to file:/tmp/archive_opmongo_hadoop_archive.har
14/11/07 14:54:26 INFO mapred.LocalJobRunner: Copying file file:/tmp/archive/wc.txt to archive.
14/11/07 14:54:26 INFO mapred.Task: Task 'attempt_local142539620_0001_m_000000_0' done.
14/11/07 14:54:26 INFO mapred.LocalJobRunner: Finishing task: attempt_local142539620_0001_m_000000_0
14/11/07 14:54:26 INFO mapred.LocalJobRunner: Map task executor complete.
14/11/07 14:54:26 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/11/07 14:54:26 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1310be98
14/11/07 14:54:26 INFO mapred.LocalJobRunner: 
14/11/07 14:54:26 INFO mapred.Merger: Merging 1 sorted segments
14/11/07 14:54:26 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 510 bytes
14/11/07 14:54:26 INFO mapred.LocalJobRunner: 
14/11/07 14:54:26 INFO mapred.Task: Task:attempt_local142539620_0001_r_000000_0 is done. And is in the process of commiting
14/11/07 14:54:26 INFO mapred.LocalJobRunner: 
14/11/07 14:54:26 INFO mapred.Task: Task attempt_local142539620_0001_r_000000_0 is allowed to commit now
14/11/07 14:54:26 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local142539620_0001_r_000000_0' to file:/tmp/archive_opmongo_hadoop_archive.har
14/11/07 14:54:26 INFO mapred.LocalJobRunner: reduce > reduce
14/11/07 14:54:26 INFO mapred.Task: Task 'attempt_local142539620_0001_r_000000_0' done.
14/11/07 14:54:27 INFO mapred.JobClient: map 100% reduce 100%
14/11/07 14:54:27 INFO mapred.JobClient: Job complete: job_local142539620_0001
14/11/07 14:54:27 INFO mapred.JobClient: Counters: 21
14/11/07 14:54:27 INFO mapred.JobClient: File System Counters
14/11/07 14:54:27 INFO mapred.JobClient: FILE: Number of bytes read=636230
14/11/07 14:54:27 INFO mapred.JobClient: FILE: Number of bytes written=820075
14/11/07 14:54:27 INFO mapred.JobClient: FILE: Number of read operations=0
14/11/07 14:54:27 INFO mapred.JobClient: FILE: Number of large read operations=0
14/11/07 14:54:27 INFO mapred.JobClient: FILE: Number of write operations=0
14/11/07 14:54:27 INFO mapred.JobClient: Map-Reduce Framework
14/11/07 14:54:27 INFO mapred.JobClient: Map input records=10
14/11/07 14:54:27 INFO mapred.JobClient: Map output records=10
14/11/07 14:54:27 INFO mapred.JobClient: Map output bytes=488
14/11/07 14:54:27 INFO mapred.JobClient: Input split bytes=139
14/11/07 14:54:27 INFO mapred.JobClient: Combine input records=0
14/11/07 14:54:27 INFO mapred.JobClient: Combine output records=0
14/11/07 14:54:27 INFO mapred.JobClient: Reduce input groups=10
14/11/07 14:54:27 INFO mapred.JobClient: Reduce shuffle bytes=0
14/11/07 14:54:27 INFO mapred.JobClient: Reduce input records=10
14/11/07 14:54:27 INFO mapred.JobClient: Reduce output records=0
14/11/07 14:54:27 INFO mapred.JobClient: Spilled Records=20
14/11/07 14:54:27 INFO mapred.JobClient: CPU time spent (ms)=0
14/11/07 14:54:27 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/11/07 14:54:27 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/11/07 14:54:27 INFO mapred.JobClient: Total committed heap usage (bytes)=567279616
14/11/07 14:54:27 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
14/11/07 14:54:27 INFO mapred.JobClient: BYTES_READ=652
[root@dbversity.com 6]# 
[root@dbversity.com 6]# 
[root@dbversity.com 6]# hadoop fs -ls /tmp/archive_op/
total 148K
-rwxr-xr-x 1 root root 137K Nov 7 14:54 part-0
-rwxr-xr-x 1 root root 23 Nov 7 14:54 _masterindex
-rwxr-xr-x 1 root root 448 Nov 7 14:54 _index
-rwxr-xr-x 1 root root 0 Nov 7 14:54 _SUCCESS
[root@dbversity.com 6]#
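To double-check the result, you can read the new archive back through the har:// layer. A sketch; the exact path depends on where the .har directory landed in your run:
hadoop fs -lsr har:///tmp/archive_op/mongo_hadoop_archive.har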
