MongoDB Connector for Hadoop
Purpose
The MongoDB Connector for Hadoop is a library that allows MongoDB (or backup files in its data format, BSON) to be used as an input source or output destination for Hadoop MapReduce jobs. It is designed to provide greater flexibility and performance, and to make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem.
Features
- Can create data splits to read from standalone, replica set, or sharded configurations
- Source data can be filtered with queries using the MongoDB query language
- Supports Hadoop Streaming, so job code can be written in any language (Python, Ruby, and Node.js are currently supported)
- Can read data from MongoDB backup files residing on S3, HDFS, or local filesystems
- Can write data out in .bson format, which can then be imported into any MongoDB database with mongorestore
- Works with BSON/MongoDB documents in other Hadoop tools such as Pig and Hive.
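To make the first two features concrete, here is a minimal job-driver sketch (not the connector's official sample; it assumes mongo-hadoop 1.3.x and the mongo Java driver on the classpath, and a hypothetical output collection test.ufo_out):

import com.mongodb.BasicDBObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: read documents straight from a live mongod, filtered server-side,
// and write job output back to another collection.
public class MongoJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Input collection; the connector computes splits from it
        // (standalone, replica set, or sharded cluster).
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27010/test.ufo");
        // Push a MongoDB query down to the server so only matching docs are read.
        MongoConfigUtil.setQuery(conf, new BasicDBObject("x", 1));
        // Output collection for the reduce phase (hypothetical name).
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost:27010/test.ufo_out");

        Job job = new Job(conf, "mongo-hadoop sketch");
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // ... set mapper, reducer, and output key/value classes here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}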
More details: https://github.com/mongodb/mongo-hadoop/blob/master/README.md
Workout:
[root@dbversity.com 6]# mongo -port 27010
MongoDB shell version: 2.4.11
connecting to: 127.0.0.1:27010/test
Server has startup warnings:
Fri Nov 7 08:48:49.460 [initandlisten]
Fri Nov 7 08:48:49.460 [initandlisten] ** WARNING: You are running on a NUMA machine.
Fri Nov 7 08:48:49.460 [initandlisten] ** We suggest launching mongod like this to avoid performance problems:
Fri Nov 7 08:48:49.460 [initandlisten] ** numactl --interleave=all mongod [other options]
Fri Nov 7 08:48:49.460 [initandlisten]
> db.ufo.find()
{ "_id" : ObjectId("545ccdf75d123a85b95ae576"), "x" : 1 }
{ "_id" : ObjectId("545ccdf95d123a85b95ae577"), "x" : 1 }
{ "_id" : ObjectId("545ccdfa5d123a85b95ae578"), "x" : 1 }
{ "_id" : ObjectId("545ccdfd5d123a85b95ae579"), "x" : 2 }
{ "_id" : 1, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 1 }
{ "_id" : 2, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 2 }
{ "_id" : 3, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 3 }
{ "_id" : 4, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 4 }
{ "_id" : 5, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 5 }
{ "_id" : 6, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 6 }
{ "_id" : 7, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 7 }
{ "_id" : 8, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 8 }
{ "_id" : 9, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 9 }
{ "_id" : 10, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 10 }
{ "_id" : 11, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 11 }
{ "_id" : 12, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 12 }
{ "_id" : 13, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 13 }
{ "_id" : 14, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 14 }
{ "_id" : 15, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 15 }
{ "_id" : 16, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 16 }
Type "it" for more
> bye
[root@dbversity.com 6]# mongodump -port 27010 --out /opt/mongodb/ufo_07Nov14
connected to: 127.0.0.1:27010
Fri Nov 7 10:35:45.514 all dbs
Fri Nov 7 10:35:45.515 DATABASE: test to /opt/mongodb/ufo_07Nov14/test
Fri Nov 7 10:35:45.517 test.system.indexes to /opt/mongodb/ufo_07Nov14/test/system.indexes.bson
Fri Nov 7 10:35:45.517 1 objects
Fri Nov 7 10:35:45.517 test.ufo to /opt/mongodb/ufo_07Nov14/test/ufo.bson
Fri Nov 7 10:35:45.528 104 objects
Fri Nov 7 10:35:45.528 Metadata for test.ufo to /opt/mongodb/ufo_07Nov14/test/ufo.metadata.json
[root@dbversity.com 6]#
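The ufo.bson file produced by mongodump can be fed directly to MapReduce. The run below uses the stock aggregatewordcount example, but a custom mapper over the dump might look like the following sketch (hypothetical class; assumes mongo-hadoop's BSONFileInputFormat, whose record key type may vary by connector version):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

// Sketch: count occurrences of each "title" value in ufo.bson.
// Assumes the job is configured with
//   job.setInputFormatClass(com.mongodb.hadoop.BSONFileInputFormat.class);
// which hands each stored document to the mapper as a BSONObject.
public class TitleCountMapper
        extends Mapper<NullWritable, BSONObject, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(NullWritable key, BSONObject doc, Context ctx)
            throws IOException, InterruptedException {
        Object title = doc.get("title");   // absent on the {"x": ...} docs above
        if (title != null) {
            ctx.write(new Text(title.toString()), ONE);
        }
    }
}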
[root@dbversity.com 6]# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.2.1.jar aggregatewordcount -libjars mongo-2.10.1.jar,mongo-hadoop-core-1.3.0.jar,mongo-hadoop-core_cdh3u3-1.0.0.jar /opt/mongodb/ufo_07Nov14/test/ufo.bson /tmp/bson 1 textinputformat -Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat -Dmongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
14/11/07 10:37:23 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/07 10:37:23 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/07 10:37:23 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/11/07 10:37:23 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-2.10.1.jar in /tmp/hadoop-root/mapred/local/archive/-772008609334989096_-760476233_2141679274/file/data/6-work--5818031940198828218 with rwxr-xr-x
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-2.10.1.jar as /tmp/hadoop-root/mapred/local/archive/-772008609334989096_-760476233_2141679274/file/data/6/mongo-2.10.1.jar
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-2.10.1.jar as /tmp/hadoop-root/mapred/local/archive/-772008609334989096_-760476233_2141679274/file/data/6/mongo-2.10.1.jar
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-hadoop-core-1.3.0.jar in /tmp/hadoop-root/mapred/local/archive/8460599837868318078_514438651_996280568/file/data/6-work--8704548277110338599 with rwxr-xr-x
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core-1.3.0.jar as /tmp/hadoop-root/mapred/local/archive/8460599837868318078_514438651_996280568/file/data/6/mongo-hadoop-core-1.3.0.jar
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core-1.3.0.jar as /tmp/hadoop-root/mapred/local/archive/8460599837868318078_514438651_996280568/file/data/6/mongo-hadoop-core-1.3.0.jar
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-hadoop-core_cdh3u3-1.0.0.jar in /tmp/hadoop-root/mapred/local/archive/-2578132968065611085_-932616765_2141682274/file/data/6-work-2037870863955662248 with rwxr-xr-x
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar as /tmp/hadoop-root/mapred/local/archive/-2578132968065611085_-932616765_2141682274/file/data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar as /tmp/hadoop-root/mapred/local/archive/-2578132968065611085_-932616765_2141682274/file/data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar
14/11/07 10:37:24 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/11/07 10:37:24 INFO mapred.JobClient: Running job: job_local1130297672_0001
14/11/07 10:37:24 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
14/11/07 10:37:24 INFO mapred.LocalJobRunner: Waiting for map tasks
14/11/07 10:37:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1130297672_0001_m_000000_0
14/11/07 10:37:24 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/11/07 10:37:24 INFO util.ProcessTree: setsid exited with exit code 0
14/11/07 10:37:24 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@67de50fc
14/11/07 10:37:24 INFO mapred.MapTask: Processing split: file:/opt/mongodb/ufo_07Nov14/test/ufo.bson:0+12032
14/11/07 10:37:24 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as group name and BYTES_READ as counter name instead
14/11/07 10:37:24 INFO mapred.MapTask: numReduceTasks: 1
14/11/07 10:37:24 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/11/07 10:37:24 INFO mapred.MapTask: io.sort.mb = 100
14/11/07 10:37:25 INFO mapred.MapTask: data buffer = 79691776/99614720
14/11/07 10:37:25 INFO mapred.MapTask: record buffer = 262144/327680
14/11/07 10:37:25 INFO mapred.MapTask: Starting flush of map output
14/11/07 10:37:25 INFO mapred.MapTask: Finished spill 0
14/11/07 10:37:25 INFO mapred.Task: Task:attempt_local1130297672_0001_m_000000_0 is done. And is in the process of commiting
14/11/07 10:37:25 INFO mapred.LocalJobRunner: file:/opt/mongodb/ufo_07Nov14/test/ufo.bson:0+12032
14/11/07 10:37:25 INFO mapred.Task: Task 'attempt_local1130297672_0001_m_000000_0' done.
14/11/07 10:37:25 INFO mapred.LocalJobRunner: Finishing task: attempt_local1130297672_0001_m_000000_0
14/11/07 10:37:25 INFO mapred.LocalJobRunner: Map task executor complete.
14/11/07 10:37:25 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/11/07 10:37:25 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3b779d81
14/11/07 10:37:25 INFO mapred.LocalJobRunner:
14/11/07 10:37:25 INFO mapred.Merger: Merging 1 sorted segments
14/11/07 10:37:25 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 6876 bytes
14/11/07 10:37:25 INFO mapred.LocalJobRunner:
14/11/07 10:37:25 INFO mapred.JobClient: map 100% reduce 0%
14/11/07 10:37:25 INFO mapred.Task: Task:attempt_local1130297672_0001_r_000000_0 is done. And is in the process of commiting
14/11/07 10:37:25 INFO mapred.LocalJobRunner:
14/11/07 10:37:25 INFO mapred.Task: Task attempt_local1130297672_0001_r_000000_0 is allowed to commit now
14/11/07 10:37:25 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local1130297672_0001_r_000000_0' to file:/tmp/bson
14/11/07 10:37:25 INFO mapred.LocalJobRunner: reduce > reduce
14/11/07 10:37:25 INFO mapred.Task: Task 'attempt_local1130297672_0001_r_000000_0' done.
14/11/07 10:37:26 INFO mapred.JobClient: map 100% reduce 100%
14/11/07 10:37:26 INFO mapred.JobClient: Job complete: job_local1130297672_0001
14/11/07 10:37:26 INFO mapred.JobClient: Counters: 21
14/11/07 10:37:26 INFO mapred.JobClient:   File System Counters
14/11/07 10:37:26 INFO mapred.JobClient:     FILE: Number of bytes read=1394308
14/11/07 10:37:26 INFO mapred.JobClient:     FILE: Number of bytes written=1579137
14/11/07 10:37:26 INFO mapred.JobClient:     FILE: Number of read operations=0
14/11/07 10:37:26 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/11/07 10:37:26 INFO mapred.JobClient:     FILE: Number of write operations=0
14/11/07 10:37:26 INFO mapred.JobClient:   Map-Reduce Framework
14/11/07 10:37:26 INFO mapred.JobClient:     Map input records=1
14/11/07 10:37:26 INFO mapred.JobClient:     Map output records=1303
14/11/07 10:37:26 INFO mapred.JobClient:     Map output bytes=31765
14/11/07 10:37:26 INFO mapred.JobClient:     Input split bytes=96
14/11/07 10:37:26 INFO mapred.JobClient:     Combine input records=1303
14/11/07 10:37:26 INFO mapred.JobClient:     Combine output records=115
14/11/07 10:37:26 INFO mapred.JobClient:     Reduce input groups=115
14/11/07 10:37:26 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/11/07 10:37:26 INFO mapred.JobClient:     Reduce input records=115
14/11/07 10:37:26 INFO mapred.JobClient:     Reduce output records=115
14/11/07 10:37:26 INFO mapred.JobClient:     Spilled Records=230
14/11/07 10:37:26 INFO mapred.JobClient:     CPU time spent (ms)=0
14/11/07 10:37:26 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
14/11/07 10:37:26 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
14/11/07 10:37:26 INFO mapred.JobClient:     Total committed heap usage (bytes)=567279616
14/11/07 10:37:26 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
14/11/07 10:37:26 INFO mapred.JobClient:     BYTES_READ=12032
[root@dbversity.com 6]#
[root@dbversity.com 6]# hadoop fs -ls /tmp/bson/
Found 2 items
-rwxr-xr-x 1 root root 0 2014-11-07 10:37 /tmp/bson/_SUCCESS
-rwxr-xr-x 1 root root 5147 2014-11-07 10:37 /tmp/bson/part-00000
[root@dbversity.com 6]#
Next, the same example is run against /opt/mongodb/data/test.0, which is a raw mongod datafile rather than a .bson dump, so expect binary noise in the output:
[root@dbversity.com 6]# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.2.1.jar aggregatewordcount -libjars mongo-2.10.1.jar,mongo-hadoop-core-1.3.0.jar,mongo-hadoop-core_cdh3u3-1.0.0.jar /opt/mongodb/data/test.0 /tmp/out_put 1 textinputformat -Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
14/11/07 08:57:24 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/07 08:57:24 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/07 08:57:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/11/07 08:57:24 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-2.10.1.jar in /tmp/hadoop-root/mapred/local/archive/-4327487395782633845_-760476233_2141679274/file/data/6-work--3917532071502946546 with rwxr-xr-x
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-2.10.1.jar as /tmp/hadoop-root/mapred/local/archive/-4327487395782633845_-760476233_2141679274/file/data/6/mongo-2.10.1.jar
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-2.10.1.jar as /tmp/hadoop-root/mapred/local/archive/-4327487395782633845_-760476233_2141679274/file/data/6/mongo-2.10.1.jar
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-hadoop-core-1.3.0.jar in /tmp/hadoop-root/mapred/local/archive/911567305902902555_514438651_996280568/file/data/6-work--8831276191346637198 with rwxr-xr-x
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core-1.3.0.jar as /tmp/hadoop-root/mapred/local/archive/911567305902902555_514438651_996280568/file/data/6/mongo-hadoop-core-1.3.0.jar
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core-1.3.0.jar as /tmp/hadoop-root/mapred/local/archive/911567305902902555_514438651_996280568/file/data/6/mongo-hadoop-core-1.3.0.jar
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-hadoop-core_cdh3u3-1.0.0.jar in /tmp/hadoop-root/mapred/local/archive/-7049463448163596451_-932616765_2141682274/file/data/6-work-8023342493179971215 with rwxr-xr-x
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar as /tmp/hadoop-root/mapred/local/archive/-7049463448163596451_-932616765_2141682274/file/data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar as /tmp/hadoop-root/mapred/local/archive/-7049463448163596451_-932616765_2141682274/file/data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar
14/11/07 08:57:24 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/11/07 08:57:24 INFO mapred.JobClient: Running job: job_local1459833774_0001
14/11/07 08:57:24 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
14/11/07 08:57:24 INFO mapred.LocalJobRunner: Waiting for map tasks
14/11/07 08:57:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1459833774_0001_m_000000_0
14/11/07 08:57:24 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/11/07 08:57:25 INFO util.ProcessTree: setsid exited with exit code 0
14/11/07 08:57:25 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@7e09c25f
14/11/07 08:57:25 INFO mapred.MapTask: Processing split: file:/opt/mongodb/data/test.0:0+16777216
14/11/07 08:57:25 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as group name and BYTES_READ as counter name instead
14/11/07 08:57:25 INFO mapred.MapTask: numReduceTasks: 1
14/11/07 08:57:25 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/11/07 08:57:25 INFO mapred.MapTask: io.sort.mb = 100
14/11/07 08:57:25 INFO mapred.JobClient: map 0% reduce 0%
14/11/07 08:57:26 INFO mapred.MapTask: data buffer = 79691776/99614720
14/11/07 08:57:26 INFO mapred.MapTask: record buffer = 262144/327680
14/11/07 08:57:28 INFO mapred.MapTask: Starting flush of map output
14/11/07 08:57:29 INFO mapred.MapTask: Finished spill 0
14/11/07 08:57:29 INFO mapred.Task: Task:attempt_local1459833774_0001_m_000000_0 is done. And is in the process of commiting
14/11/07 08:57:29 INFO mapred.LocalJobRunner: file:/opt/mongodb/data/test.0:0+16777216
14/11/07 08:57:29 INFO mapred.Task: Task 'attempt_local1459833774_0001_m_000000_0' done.
14/11/07 08:57:29 INFO mapred.LocalJobRunner: Finishing task: attempt_local1459833774_0001_m_000000_0
14/11/07 08:57:29 INFO mapred.LocalJobRunner: Map task executor complete.
14/11/07 08:57:29 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/11/07 08:57:29 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6ebbb7df
14/11/07 08:57:29 INFO mapred.LocalJobRunner:
14/11/07 08:57:29 INFO mapred.Merger: Merging 1 sorted segments
14/11/07 08:57:29 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16777885 bytes
14/11/07 08:57:29 INFO mapred.LocalJobRunner:
14/11/07 08:57:29 INFO mapred.JobClient: map 100% reduce 0%
14/11/07 08:57:30 INFO mapred.Task: Task:attempt_local1459833774_0001_r_000000_0 is done. And is in the process of commiting
14/11/07 08:57:30 INFO mapred.LocalJobRunner:
14/11/07 08:57:30 INFO mapred.Task: Task attempt_local1459833774_0001_r_000000_0 is allowed to commit now
14/11/07 08:57:30 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local1459833774_0001_r_000000_0' to file:/tmp/out_put
14/11/07 08:57:30 INFO mapred.LocalJobRunner: reduce > reduce
14/11/07 08:57:30 INFO mapred.Task: Task 'attempt_local1459833774_0001_r_000000_0' done.
14/11/07 08:57:31 INFO mapred.JobClient: map 100% reduce 100%
14/11/07 08:57:31 INFO mapred.JobClient: Job complete: job_local1459833774_0001
14/11/07 08:57:31 INFO mapred.JobClient: Counters: 21
14/11/07 08:57:31 INFO mapred.JobClient:   File System Counters
14/11/07 08:57:31 INFO mapred.JobClient:     FILE: Number of bytes read=51695661
14/11/07 08:57:31 INFO mapred.JobClient:     FILE: Number of bytes written=52024591
14/11/07 08:57:31 INFO mapred.JobClient:     FILE: Number of read operations=0
14/11/07 08:57:31 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/11/07 08:57:31 INFO mapred.JobClient:     FILE: Number of write operations=0
14/11/07 08:57:31 INFO mapred.JobClient:   Map-Reduce Framework
14/11/07 08:57:31 INFO mapred.JobClient:     Map input records=1
14/11/07 08:57:31 INFO mapred.JobClient:     Map output records=16
14/11/07 08:57:31 INFO mapred.JobClient:     Map output bytes=16777839
14/11/07 08:57:31 INFO mapred.JobClient:     Input split bytes=82
14/11/07 08:57:31 INFO mapred.JobClient:     Combine input records=16
14/11/07 08:57:31 INFO mapred.JobClient:     Combine output records=16
14/11/07 08:57:31 INFO mapred.JobClient:     Reduce input groups=16
14/11/07 08:57:31 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/11/07 08:57:31 INFO mapred.JobClient:     Reduce input records=16
14/11/07 08:57:31 INFO mapred.JobClient:     Reduce output records=16
14/11/07 08:57:31 INFO mapred.JobClient:     Spilled Records=32
14/11/07 08:57:31 INFO mapred.JobClient:     CPU time spent (ms)=0
14/11/07 08:57:31 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
14/11/07 08:57:31 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
14/11/07 08:57:31 INFO mapred.JobClient:     Total committed heap usage (bytes)=1574961152
14/11/07 08:57:31 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
14/11/07 08:57:31 INFO mapred.JobClient:     BYTES_READ=16777216
[root@dbversity.com 6]#
[root@dbversity.com 6]# hadoop fs -cat /tmp/out_put/part-00000
(binary output elided: the counted "words" here are raw bytes from the mongod datafile test.0, so part-00000 prints as unreadable garbage mixed with namespace strings like test.ufo and test.system.indexes)
Since the output is binary BSON-style data, we cannot read it as plain text. I'll post a more elaborate version shortly, stay tuned!
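For well-formed .bson files such as the mongodump output above, MongoDB ships the bsondump utility to render them as JSON; alternatively, a few lines of Java against the driver's BSON classes will do. A minimal sketch (hypothetical class name; assumes the mongo-java-driver 2.x jar, the same mongo-2.10.1.jar passed via -libjars, on the classpath):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.bson.BSONDecoder;
import org.bson.BSONObject;
import org.bson.BasicBSONDecoder;

// Sketch: print every document in a .bson file (e.g. ufo.bson) as text.
public class BsonDump {
    public static void main(String[] args) throws IOException {
        BSONDecoder decoder = new BasicBSONDecoder();
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            // Crude end-of-file check; adequate for local files.
            while (in.available() > 0) {
                BSONObject doc = decoder.readObject(in);
                System.out.println(doc);
            }
        }
    }
}

For example, running it as "java BsonDump /opt/mongodb/ufo_07Nov14/test/ufo.bson" should print the 104 dumped documents one per line.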