MongoDB Connector for Hadoop
Purpose
The MongoDB Connector for Hadoop is a library that allows MongoDB (or backup files in its data format, BSON) to be used as an input source or output destination for Hadoop MapReduce jobs. It is designed to provide greater flexibility and performance, and to make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem.
Features
- Can create data splits to read from standalone, replica set, or sharded configurations
- Source data can be filtered with queries using the MongoDB query language
- Supports Hadoop Streaming, so job code can be written in any language (Python, Ruby, and Node.js are currently supported)
- Can read data from MongoDB backup files residing on S3, HDFS, or local filesystems
- Can write data out in .bson format, which can then be imported into any MongoDB database with mongorestore
- Works with BSON/MongoDB documents in other Hadoop tools such as Pig and Hive.
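To make the first two features concrete, here is a minimal job-driver sketch (not the connector's official sample; it assumes mongo-hadoop 1.3.x and the mongo Java driver on the classpath, and a hypothetical output collection test.ufo_out):

import com.mongodb.BasicDBObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: read documents straight from a live mongod, filtered server-side,
// and write job output back to another collection.
public class MongoJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Input collection; the connector computes splits from it
        // (standalone, replica set, or sharded cluster).
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27010/test.ufo");
        // Push a MongoDB query down to the server so only matching docs are read.
        MongoConfigUtil.setQuery(conf, new BasicDBObject("x", 1));
        // Output collection for the reduce phase (hypothetical name).
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost:27010/test.ufo_out");

        Job job = new Job(conf, "mongo-hadoop sketch");
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // ... set mapper, reducer, and output key/value classes here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}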
More details: https://github.com/mongodb/mongo-hadoop/blob/master/README.md
Workout:
[root@dbversity.com 6]# mongo -port 27010
MongoDB shell version: 2.4.11
connecting to: 127.0.0.1:27010/test
Server has startup warnings:
Fri Nov 7 08:48:49.460 [initandlisten]
Fri Nov 7 08:48:49.460 [initandlisten] ** WARNING: You are running on a NUMA machine.
Fri Nov 7 08:48:49.460 [initandlisten] ** We suggest launching mongod like this to avoid performance problems:
Fri Nov 7 08:48:49.460 [initandlisten] ** numactl --interleave=all mongod [other options]
Fri Nov 7 08:48:49.460 [initandlisten]
> db.ufo.find()
{ "_id" : ObjectId("545ccdf75d123a85b95ae576"), "x" : 1 }
{ "_id" : ObjectId("545ccdf95d123a85b95ae577"), "x" : 1 }
{ "_id" : ObjectId("545ccdfa5d123a85b95ae578"), "x" : 1 }
{ "_id" : ObjectId("545ccdfd5d123a85b95ae579"), "x" : 2 }
{ "_id" : 1, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 1 }
{ "_id" : 2, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 2 }
{ "_id" : 3, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 3 }
{ "_id" : 4, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 4 }
{ "_id" : 5, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 5 }
{ "_id" : 6, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 6 }
{ "_id" : 7, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 7 }
{ "_id" : 8, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 8 }
{ "_id" : 9, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 9 }
{ "_id" : 10, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 10 }
{ "_id" : 11, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 11 }
{ "_id" : 12, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 12 }
{ "_id" : 13, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 13 }
{ "_id" : 14, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 14 }
{ "_id" : 15, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 15 }
{ "_id" : 16, "title" : "How do I create manual workload i.e., Bulk inserts to Collection ", " Iteration no:" : 16 }
Type "it" for more
> bye
[root@dbversity.com 6]# mongodump -port 27010 --out /opt/mongodb/ufo_07Nov14
connected to: 127.0.0.1:27010
Fri Nov 7 10:35:45.514 all dbs
Fri Nov 7 10:35:45.515 DATABASE: test to /opt/mongodb/ufo_07Nov14/test
Fri Nov 7 10:35:45.517 test.system.indexes to /opt/mongodb/ufo_07Nov14/test/system.indexes.bson
Fri Nov 7 10:35:45.517 1 objects
Fri Nov 7 10:35:45.517 test.ufo to /opt/mongodb/ufo_07Nov14/test/ufo.bson
Fri Nov 7 10:35:45.528 104 objects
Fri Nov 7 10:35:45.528 Metadata for test.ufo to /opt/mongodb/ufo_07Nov14/test/ufo.metadata.json
[root@dbversity.com 6]#
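The ufo.bson file produced by mongodump can be fed directly to MapReduce. The run below uses the stock aggregatewordcount example, but a custom mapper over the dump might look like the following sketch (hypothetical class; assumes mongo-hadoop's BSONFileInputFormat, whose record key type may vary by connector version):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

// Sketch: count occurrences of each "title" value in ufo.bson.
// Assumes the job is configured with
//   job.setInputFormatClass(com.mongodb.hadoop.BSONFileInputFormat.class);
// which hands each stored document to the mapper as a BSONObject.
public class TitleCountMapper
        extends Mapper<NullWritable, BSONObject, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(NullWritable key, BSONObject doc, Context ctx)
            throws IOException, InterruptedException {
        Object title = doc.get("title");   // absent on the {"x": ...} docs above
        if (title != null) {
            ctx.write(new Text(title.toString()), ONE);
        }
    }
}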
[root@dbversity.com 6]# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.2.1.jar aggregatewordcount -libjars mongo-2.10.1.jar,mongo-hadoop-core-1.3.0.jar,mongo-hadoop-core_cdh3u3-1.0.0.jar /opt/mongodb/ufo_07Nov14/test/ufo.bson /tmp/bson 1 textinputformat -Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat -Dmongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
14/11/07 10:37:23 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/07 10:37:23 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/07 10:37:23 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/11/07 10:37:23 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-2.10.1.jar in /tmp/hadoop-root/mapred/local/archive/-772008609334989096_-760476233_2141679274/file/data/6-work--5818031940198828218 with rwxr-xr-x
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-2.10.1.jar as /tmp/hadoop-root/mapred/local/archive/-772008609334989096_-760476233_2141679274/file/data/6/mongo-2.10.1.jar
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-2.10.1.jar as /tmp/hadoop-root/mapred/local/archive/-772008609334989096_-760476233_2141679274/file/data/6/mongo-2.10.1.jar
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-hadoop-core-1.3.0.jar in /tmp/hadoop-root/mapred/local/archive/8460599837868318078_514438651_996280568/file/data/6-work--8704548277110338599 with rwxr-xr-x
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core-1.3.0.jar as /tmp/hadoop-root/mapred/local/archive/8460599837868318078_514438651_996280568/file/data/6/mongo-hadoop-core-1.3.0.jar
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core-1.3.0.jar as /tmp/hadoop-root/mapred/local/archive/8460599837868318078_514438651_996280568/file/data/6/mongo-hadoop-core-1.3.0.jar
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-hadoop-core_cdh3u3-1.0.0.jar in /tmp/hadoop-root/mapred/local/archive/-2578132968065611085_-932616765_2141682274/file/data/6-work-2037870863955662248 with rwxr-xr-x
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar as /tmp/hadoop-root/mapred/local/archive/-2578132968065611085_-932616765_2141682274/file/data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar
14/11/07 10:37:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar as /tmp/hadoop-root/mapred/local/archive/-2578132968065611085_-932616765_2141682274/file/data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar
14/11/07 10:37:24 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/11/07 10:37:24 INFO mapred.JobClient: Running job: job_local1130297672_0001
14/11/07 10:37:24 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
14/11/07 10:37:24 INFO mapred.LocalJobRunner: Waiting for map tasks
14/11/07 10:37:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1130297672_0001_m_000000_0
14/11/07 10:37:24 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/11/07 10:37:24 INFO util.ProcessTree: setsid exited with exit code 0
14/11/07 10:37:24 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@67de50fc
14/11/07 10:37:24 INFO mapred.MapTask: Processing split: file:/opt/mongodb/ufo_07Nov14/test/ufo.bson:0+12032
14/11/07 10:37:24 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as group name and BYTES_READ as counter name instead
14/11/07 10:37:24 INFO mapred.MapTask: numReduceTasks: 1
14/11/07 10:37:24 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/11/07 10:37:24 INFO mapred.MapTask: io.sort.mb = 100
14/11/07 10:37:25 INFO mapred.MapTask: data buffer = 79691776/99614720
14/11/07 10:37:25 INFO mapred.MapTask: record buffer = 262144/327680
14/11/07 10:37:25 INFO mapred.MapTask: Starting flush of map output
14/11/07 10:37:25 INFO mapred.MapTask: Finished spill 0
14/11/07 10:37:25 INFO mapred.Task: Task:attempt_local1130297672_0001_m_000000_0 is done. And is in the process of commiting
14/11/07 10:37:25 INFO mapred.LocalJobRunner: file:/opt/mongodb/ufo_07Nov14/test/ufo.bson:0+12032
14/11/07 10:37:25 INFO mapred.Task: Task 'attempt_local1130297672_0001_m_000000_0' done.
14/11/07 10:37:25 INFO mapred.LocalJobRunner: Finishing task: attempt_local1130297672_0001_m_000000_0
14/11/07 10:37:25 INFO mapred.LocalJobRunner: Map task executor complete.
14/11/07 10:37:25 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/11/07 10:37:25 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3b779d81
14/11/07 10:37:25 INFO mapred.LocalJobRunner:
14/11/07 10:37:25 INFO mapred.Merger: Merging 1 sorted segments
14/11/07 10:37:25 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 6876 bytes
14/11/07 10:37:25 INFO mapred.LocalJobRunner:
14/11/07 10:37:25 INFO mapred.JobClient: map 100% reduce 0%
14/11/07 10:37:25 INFO mapred.Task: Task:attempt_local1130297672_0001_r_000000_0 is done. And is in the process of commiting
14/11/07 10:37:25 INFO mapred.LocalJobRunner:
14/11/07 10:37:25 INFO mapred.Task: Task attempt_local1130297672_0001_r_000000_0 is allowed to commit now
14/11/07 10:37:25 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local1130297672_0001_r_000000_0' to file:/tmp/bson
14/11/07 10:37:25 INFO mapred.LocalJobRunner: reduce > reduce
14/11/07 10:37:25 INFO mapred.Task: Task 'attempt_local1130297672_0001_r_000000_0' done.
14/11/07 10:37:26 INFO mapred.JobClient: map 100% reduce 100%
14/11/07 10:37:26 INFO mapred.JobClient: Job complete: job_local1130297672_0001
14/11/07 10:37:26 INFO mapred.JobClient: Counters: 21
14/11/07 10:37:26 INFO mapred.JobClient:   File System Counters
14/11/07 10:37:26 INFO mapred.JobClient:     FILE: Number of bytes read=1394308
14/11/07 10:37:26 INFO mapred.JobClient:     FILE: Number of bytes written=1579137
14/11/07 10:37:26 INFO mapred.JobClient:     FILE: Number of read operations=0
14/11/07 10:37:26 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/11/07 10:37:26 INFO mapred.JobClient:     FILE: Number of write operations=0
14/11/07 10:37:26 INFO mapred.JobClient:   Map-Reduce Framework
14/11/07 10:37:26 INFO mapred.JobClient:     Map input records=1
14/11/07 10:37:26 INFO mapred.JobClient:     Map output records=1303
14/11/07 10:37:26 INFO mapred.JobClient:     Map output bytes=31765
14/11/07 10:37:26 INFO mapred.JobClient:     Input split bytes=96
14/11/07 10:37:26 INFO mapred.JobClient:     Combine input records=1303
14/11/07 10:37:26 INFO mapred.JobClient:     Combine output records=115
14/11/07 10:37:26 INFO mapred.JobClient:     Reduce input groups=115
14/11/07 10:37:26 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/11/07 10:37:26 INFO mapred.JobClient:     Reduce input records=115
14/11/07 10:37:26 INFO mapred.JobClient:     Reduce output records=115
14/11/07 10:37:26 INFO mapred.JobClient:     Spilled Records=230
14/11/07 10:37:26 INFO mapred.JobClient:     CPU time spent (ms)=0
14/11/07 10:37:26 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
14/11/07 10:37:26 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
14/11/07 10:37:26 INFO mapred.JobClient:     Total committed heap usage (bytes)=567279616
14/11/07 10:37:26 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
14/11/07 10:37:26 INFO mapred.JobClient:     BYTES_READ=12032
[root@dbversity.com 6]#
[root@dbversity.com 6]# hadoop fs -ls /tmp/bson/
Found 2 items
-rwxr-xr-x 1 root root 0 2014-11-07 10:37 /tmp/bson/_SUCCESS
-rwxr-xr-x 1 root root 5147 2014-11-07 10:37 /tmp/bson/part-00000
[root@dbversity.com 6]#
Next, the same example is run against /opt/mongodb/data/test.0, which is a raw mongod datafile rather than a .bson dump, so expect binary noise in the output:
[root@dbversity.com 6]# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.2.1.jar aggregatewordcount -libjars mongo-2.10.1.jar,mongo-hadoop-core-1.3.0.jar,mongo-hadoop-core_cdh3u3-1.0.0.jar /opt/mongodb/data/test.0 /tmp/out_put 1 textinputformat -Dmongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
14/11/07 08:57:24 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id
14/11/07 08:57:24 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/11/07 08:57:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/11/07 08:57:24 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-2.10.1.jar in /tmp/hadoop-root/mapred/local/archive/-4327487395782633845_-760476233_2141679274/file/data/6-work--3917532071502946546 with rwxr-xr-x
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-2.10.1.jar as /tmp/hadoop-root/mapred/local/archive/-4327487395782633845_-760476233_2141679274/file/data/6/mongo-2.10.1.jar
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-2.10.1.jar as /tmp/hadoop-root/mapred/local/archive/-4327487395782633845_-760476233_2141679274/file/data/6/mongo-2.10.1.jar
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-hadoop-core-1.3.0.jar in /tmp/hadoop-root/mapred/local/archive/911567305902902555_514438651_996280568/file/data/6-work--8831276191346637198 with rwxr-xr-x
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core-1.3.0.jar as /tmp/hadoop-root/mapred/local/archive/911567305902902555_514438651_996280568/file/data/6/mongo-hadoop-core-1.3.0.jar
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core-1.3.0.jar as /tmp/hadoop-root/mapred/local/archive/911567305902902555_514438651_996280568/file/data/6/mongo-hadoop-core-1.3.0.jar
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Creating mongo-hadoop-core_cdh3u3-1.0.0.jar in /tmp/hadoop-root/mapred/local/archive/-7049463448163596451_-932616765_2141682274/file/data/6-work-8023342493179971215 with rwxr-xr-x
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar as /tmp/hadoop-root/mapred/local/archive/-7049463448163596451_-932616765_2141682274/file/data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar
14/11/07 08:57:24 INFO filecache.TrackerDistributedCacheManager: Cached file:///data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar as /tmp/hadoop-root/mapred/local/archive/-7049463448163596451_-932616765_2141682274/file/data/6/mongo-hadoop-core_cdh3u3-1.0.0.jar
14/11/07 08:57:24 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/11/07 08:57:24 INFO mapred.JobClient: Running job: job_local1459833774_0001
14/11/07 08:57:24 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
14/11/07 08:57:24 INFO mapred.LocalJobRunner: Waiting for map tasks
14/11/07 08:57:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1459833774_0001_m_000000_0
14/11/07 08:57:24 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/11/07 08:57:25 INFO util.ProcessTree: setsid exited with exit code 0
14/11/07 08:57:25 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@7e09c25f
14/11/07 08:57:25 INFO mapred.MapTask: Processing split: file:/opt/mongodb/data/test.0:0+16777216
14/11/07 08:57:25 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as group name and BYTES_READ as counter name instead
14/11/07 08:57:25 INFO mapred.MapTask: numReduceTasks: 1
14/11/07 08:57:25 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/11/07 08:57:25 INFO mapred.MapTask: io.sort.mb = 100
14/11/07 08:57:25 INFO mapred.JobClient: map 0% reduce 0%
14/11/07 08:57:26 INFO mapred.MapTask: data buffer = 79691776/99614720
14/11/07 08:57:26 INFO mapred.MapTask: record buffer = 262144/327680
14/11/07 08:57:28 INFO mapred.MapTask: Starting flush of map output
14/11/07 08:57:29 INFO mapred.MapTask: Finished spill 0
14/11/07 08:57:29 INFO mapred.Task: Task:attempt_local1459833774_0001_m_000000_0 is done. And is in the process of commiting
14/11/07 08:57:29 INFO mapred.LocalJobRunner: file:/opt/mongodb/data/test.0:0+16777216
14/11/07 08:57:29 INFO mapred.Task: Task 'attempt_local1459833774_0001_m_000000_0' done.
14/11/07 08:57:29 INFO mapred.LocalJobRunner: Finishing task: attempt_local1459833774_0001_m_000000_0
14/11/07 08:57:29 INFO mapred.LocalJobRunner: Map task executor complete.
14/11/07 08:57:29 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/11/07 08:57:29 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6ebbb7df
14/11/07 08:57:29 INFO mapred.LocalJobRunner:
14/11/07 08:57:29 INFO mapred.Merger: Merging 1 sorted segments
14/11/07 08:57:29 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16777885 bytes
14/11/07 08:57:29 INFO mapred.LocalJobRunner:
14/11/07 08:57:29 INFO mapred.JobClient: map 100% reduce 0%
14/11/07 08:57:30 INFO mapred.Task: Task:attempt_local1459833774_0001_r_000000_0 is done. And is in the process of commiting
14/11/07 08:57:30 INFO mapred.LocalJobRunner:
14/11/07 08:57:30 INFO mapred.Task: Task attempt_local1459833774_0001_r_000000_0 is allowed to commit now
14/11/07 08:57:30 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local1459833774_0001_r_000000_0' to file:/tmp/out_put
14/11/07 08:57:30 INFO mapred.LocalJobRunner: reduce > reduce
14/11/07 08:57:30 INFO mapred.Task: Task 'attempt_local1459833774_0001_r_000000_0' done.
14/11/07 08:57:31 INFO mapred.JobClient: map 100% reduce 100%
14/11/07 08:57:31 INFO mapred.JobClient: Job complete: job_local1459833774_0001
14/11/07 08:57:31 INFO mapred.JobClient: Counters: 21
14/11/07 08:57:31 INFO mapred.JobClient:   File System Counters
14/11/07 08:57:31 INFO mapred.JobClient:     FILE: Number of bytes read=51695661
14/11/07 08:57:31 INFO mapred.JobClient:     FILE: Number of bytes written=52024591
14/11/07 08:57:31 INFO mapred.JobClient:     FILE: Number of read operations=0
14/11/07 08:57:31 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/11/07 08:57:31 INFO mapred.JobClient:     FILE: Number of write operations=0
14/11/07 08:57:31 INFO mapred.JobClient:   Map-Reduce Framework
14/11/07 08:57:31 INFO mapred.JobClient:     Map input records=1
14/11/07 08:57:31 INFO mapred.JobClient:     Map output records=16
14/11/07 08:57:31 INFO mapred.JobClient:     Map output bytes=16777839
14/11/07 08:57:31 INFO mapred.JobClient:     Input split bytes=82
14/11/07 08:57:31 INFO mapred.JobClient:     Combine input records=16
14/11/07 08:57:31 INFO mapred.JobClient:     Combine output records=16
14/11/07 08:57:31 INFO mapred.JobClient:     Reduce input groups=16
14/11/07 08:57:31 INFO mapred.JobClient:     Reduce shuffle bytes=0
14/11/07 08:57:31 INFO mapred.JobClient:     Reduce input records=16
14/11/07 08:57:31 INFO mapred.JobClient:     Reduce output records=16
14/11/07 08:57:31 INFO mapred.JobClient:     Spilled Records=32
14/11/07 08:57:31 INFO mapred.JobClient:     CPU time spent (ms)=0
14/11/07 08:57:31 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
14/11/07 08:57:31 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
14/11/07 08:57:31 INFO mapred.JobClient:     Total committed heap usage (bytes)=1574961152
14/11/07 08:57:31 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
14/11/07 08:57:31 INFO mapred.JobClient:     BYTES_READ=16777216
[root@dbversity.com 6]#
[root@dbversity.com 6]# hadoop fs -cat /tmp/out_put/part-00000
(binary output elided: the counted "words" here are raw bytes from the mongod datafile test.0, so part-00000 prints as unreadable garbage mixed with namespace strings like test.ufo and test.system.indexes)
Since the output is binary BSON-style data, we cannot read it as plain text. I'll post a more elaborate version shortly, stay tuned!
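For well-formed .bson files such as the mongodump output above, MongoDB ships the bsondump utility to render them as JSON; alternatively, a few lines of Java against the driver's BSON classes will do. A minimal sketch (hypothetical class name; assumes the mongo-java-driver 2.x jar, the same mongo-2.10.1.jar passed via -libjars, on the classpath):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.bson.BSONDecoder;
import org.bson.BSONObject;
import org.bson.BasicBSONDecoder;

// Sketch: print every document in a .bson file (e.g. ufo.bson) as text.
public class BsonDump {
    public static void main(String[] args) throws IOException {
        BSONDecoder decoder = new BasicBSONDecoder();
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            // Crude end-of-file check; adequate for local files.
            while (in.available() > 0) {
                BSONObject doc = decoder.readObject(in);
                System.out.println(doc);
            }
        }
    }
}

For example, running it as "java BsonDump /opt/mongodb/ufo_07Nov14/test/ufo.bson" should print the 104 dumped documents one per line.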