MongoDB Aggregation Concepts

October 20, 2014 · by dbversity · in MongoDB

MongoDB provides the three approaches to aggregation, each with its own strengths and purposes for a given situation. This section describes these approaches and also describes behaviors and limitations specific to each approach. See also the chart that compares the approaches.

Aggregation Pipeline
The aggregation pipeline is a framework for performing aggregation tasks, modeled on the concept of data processing pipelines. Using this framework, MongoDB passes the documents of a single collection through a pipeline. The pipeline transforms the documents into aggregated results, and is accessed through the aggregate database command.

Map-Reduce
Map-reduce is a generic multi-phase data aggregation modality for processing quantities of data. MongoDB provides map-reduce with the mapReduce database command.

Single Purpose Aggregation Operations
MongoDB provides a collection of specific data aggregation operations to support a number of common data aggregation functions. These operations include returning counts of documents, distinct values of a field, and simple grouping operations.

Aggregation Mechanics
Details internal optimization operations, limits, support for sharded collections, and concurrency concerns.

MongoDB – Aggregation Framework Examples.

[root@hostname hpadmin]# /opt/mongodb/bin/mongo --port 57017
MongoDB shell version: 2.4.5
connecting to: 127.0.0.1:57017/test
>
> show dbs
local 0.078125GB
>
> use cmpnydata
switched to db cmpnydata
>
>
> db.aggr_test.insert({"author":"mark", "status":"active", "post": "this is excellent and awesome and useful.", tags:["excellent", "awesome", "useful"], "views":10})
> db.aggr_test.insert({"author":"rosie", "status":"active", "post": "this is spicy and useful.", tags:["spicy", "useful"], "views":30})
> db.aggr_test.insert({"author":"lee", "status":"active", "post": "mongodb rocks and awesome.", tags:["mongodb", "rocks", "awesome"], "views":35})
> db.aggr_test.insert({"author":"mike", "status":"inactive", "post": "mongodb aggregation with hadoop is awesome", tags:["mongodb", "aggregation", "hadoop", "awesome"], "views":50})
> db.aggr_test.insert({"author":"stanley", "status":"active", "post": "this is tremendous and awesome and useful.", tags:["tremendous", "awesome", "useful"], "views":5})
>
> db.aggr_test.find()
{ "_id" : ObjectId("53c641ae85cb44b15b60b366"), "author" : "mark", "status" : "active", "post" : "this is excellent and awesome and useful.", "tags" : [ "excellent", "awesome", "useful" ], "views" : 10 }
{ "_id" : ObjectId("53c641b685cb44b15b60b367"), "author" : "rosie", "status" : "active", "post" : "this is spicy and useful.", "tags" : [ "spicy", "useful" ], "views" : 30 }
{ "_id" : ObjectId("53c641b685cb44b15b60b368"), "author" : "lee", "status" : "active", "post" : "mongodb rocks and awesome.", "tags" : [ "mongodb", "rocks", "awesome" ], "views" : 35 }
{ "_id" : ObjectId("53c641b685cb44b15b60b369"), "author" : "mike", "status" : "inactive", "post" : "mongodb aggregation with hadoop is awesome", "tags" : [ "mongodb", "aggregation", "hadoop", "awesome" ], "views" : 50 }
{ "_id" : ObjectId("53c641b885cb44b15b60b36a"), "author" : "stanley", "status" : "active", "post" : "this is tremendous and awesome and useful.", "tags" : [ "tremendous", "awesome", "useful" ], "views" : 5 }
>
> db.aggr_test.find().pretty()
{
"_id" : ObjectId("53c641ae85cb44b15b60b366"),
"author" : "mark",
"status" : "active",
"post" : "this is excellent and awesome and useful.",
"tags" : [
 "excellent",
 "awesome",
 "useful"
],
"views" : 10
}
{
"_id" : ObjectId("53c641b685cb44b15b60b367"),
"author" : "rosie",
"status" : "active",
"post" : "this is spicy and useful.",
"tags" : [
 "spicy",
 "useful"
],
"views" : 30
}
{
"_id" : ObjectId("53c641b685cb44b15b60b368"),
"author" : "lee",
"status" : "active",
"post" : "mongodb rocks and awesome.",
"tags" : [
 "mongodb",
 "rocks",
 "awesome"
],
"views" : 35
}
{
"_id" : ObjectId("53c641b685cb44b15b60b369"),
"author" : "mike",
"status" : "inactive",
"post" : "mongodb aggregation with hadoop is awesome",
"tags" : [
 "mongodb",
 "aggregation",
 "hadoop",
 "awesome"
],
"views" : 50
}
{
"_id" : ObjectId("53c641b885cb44b15b60b36a"),
"author" : "stanley",
"status" : "active",
"post" : "this is tremendous and awesome and useful.",
"tags" : [
 "tremendous",
"awesome",
 "useful"
],
"views" : 5
}
> db.aggr_test.find()
{ "_id" : ObjectId("53c641ae85cb44b15b60b366"), "author" : "mark", "status" : "active", "post" : "this is excellent and awesome and useful.", "tags" : [ "excellent", "awesome", "useful" ], "views" : 10 }
{ "_id" : ObjectId("53c641b685cb44b15b60b367"), "author" : "rosie", "status" : "active", "post" : "this is spicy and useful.", "tags" : [ "spicy", "useful" ], "views" : 30 }
{ "_id" : ObjectId("53c641b685cb44b15b60b368"), "author" : "lee", "status" : "active", "post" : "mongodb rocks and awesome.", "tags" : [ "mongodb", "rocks", "awesome" ], "views" : 35 }
{ "_id" : ObjectId("53c641b685cb44b15b60b369"), "author" : "mike", "status" : "inactive", "post" : "mongodb aggregation with hadoop is awesome", "tags" : [ "mongodb", "aggregation", "hadoop", "awesome" ], "views" : 50 }
{ "_id" : ObjectId("53c641b885cb44b15b60b36a"), "author" : "stanley", "status" : "active", "post" : "this is tremendous and awesome and useful.", "tags" : [ "tremendous", "awesome", "useful" ], "views" : 5 }
>
>
>
> show dbs
cmpnydata 0.203125GB
local 0.078125GB
>
> use cmpnydata
switched to db cmpnydata
>
> show collections
aggr_test
system.indexes
>
 
> // SQL Syntax for getting author wise Post & view counts
>
> //SELECT
> //author,
> //count(*) as postCnt,
> //sum(views) as viewCnt
> //FROM <table_name>
> //GROUP BY author;
>
>
> // MongoDB Query for getting Author wise Post & view Counts
>
> db.aggr_test.aggregate(
... 
... {$group :
... {_id: "$author",
... postCnt: {$sum :1 },
... viewCnt: {$sum : "$views" }
... }
... }
... )

O/p :-

{
"result" : [
 {
 "_id" : "stanley",
 "postCnt" : 1,
 "viewCnt" : 5
 },
 {
 "_id" : "mike",
 "postCnt" : 1,
 "viewCnt" : 50
 },
{
 "_id" : "lee",
 "postCnt" : 1,
 "viewCnt" : 35
 },
 {
 "_id" : "rosie",
 "postCnt" : 1,
 "viewCnt" : 30
 },
 {
 "_id" : "mark",
 "postCnt" : 1,
 "viewCnt" : 10
 }
],
"ok" : 1
}
>
 
 
> //SQL query to get authors, postCnt, viewCnt for the tags which are having viewCnt >=10 (by the active authors) in descending order with display of top 2 records.
 
> //SELECT
> //
> //tag,
> //GROUP_CONCAT(author),
> //SUM(post) postCnt,
> //SUM(views) as viewCnt
> //
> //FROM
> //
> //agg_test
> //
> //WHERE status='active' GROUP BY tags HAVING viewCnt >= 10 ORDER BY viewCnt DESC LIMIT 2;
 
MongoDB Query for the same
 
>
> db.aggr_test.aggregate(
...
... {$match : {status: "active"}},
... {$project :
...
... { author : 1,
... tags : 1,
... views : 1
... }
... },
...
... {$unwind : "$tags" },
... {$group :
... {_id: "$tags",
... author: {$addToSet : "$author" },
... postCnt: {$sum :1 },
... viewCnt: {$sum : "$views" }
... }
... },
...
... {$match : {viewCnt : {$gte : 10 }}},
... {$sort : {viewCnt : -1}},
... {$limit : 2}
...
... )

O/p :-

{
"result" : [
 {
 "_id" : "awesome",
 "author" : [
"stanley",
"lee",
"mark"
 ],
 "postCnt" : 3,
 "viewCnt" : 50
 },
 {
 "_id" : "useful",
 "author" : [
"stanley",
"rosie",
"mark"
 ],
 "postCnt" : 3,
 "viewCnt" : 45
 }
],
"ok" : 1
}
>

Another example :
If you have 2 dates value_date1 and value_date2 stored in your documents and you want to return all documents where the the difference is a given number of days.

The aggregation below calculated the difference (in milliseconds) between $a and $b and then filters out only differences of 4 days (plus or minus 2 seconds)
 
 
db.compare.aggregate(
 [
 {
 $project:{
 a:"$a",
 b:"$b",
 difference:{$let:{vars:{diff:{$subtract:["$a","$b"]}},in:"$$diff"}}}
 },
 {
 $match:{difference:{$lte:4*24*60*60*1000+2000,$gte:4*24*60*60*1000-2000}}
 }
 ])

Leave a Reply

← MapReduce Job [hadoop]

MongoDB High Availability Min configuration →