[MongoDB]: Storing Comments
This document outlines the basic patterns for storing user-submitted comments in a content management system (CMS).
Overview
MongoDB provides a number of different approaches for storing data like user comments on content from a CMS. There is no single correct implementation, but there are a number of common approaches and known considerations for each approach. This case study explores the implementation details and trade-offs of each option. The three basic patterns are:
Store each comment in its own document.
This approach provides the greatest flexibility at the expense of some additional application-level complexity.
These implementations make it possible to display comments in chronological or threaded order, and place no restrictions on the number of comments attached to a specific object.
Embed all comments in the “parent” document.
This approach provides the greatest possible performance for displaying comments at the expense of flexibility: the structure of the comments in the document controls the display format.
NOTE
Because of the BSON document size limit, a document, including the original content and all of its comments, cannot grow beyond 16 megabytes.
Use a hybrid design, which stores comments separately from the “parent” but aggregates them into a small number of documents, each containing many comments.
Also consider that comments can be threaded, where each comment is a reply either to the “parent” item or to another comment. Threading carries certain architectural requirements discussed below.
One Document per Comment
Schema
If you store each comment in its own document, the documents in your comments collection would have the following structure:
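As a sketch, the per-comment document can be written as a Python dictionary. The string placeholders stand in for `bson.ObjectId` values and the timestamp is an arbitrary example; neither comes from a real collection.

```python
from datetime import datetime, timezone

# Placeholder identifiers: a real application would store bson.ObjectId values.
comment = {
    "_id": "<ObjectId>",
    "discussion_id": "<ObjectId>",
    "slug": "34db",
    "posted": datetime(2012, 2, 8, 12, 21, 8, tzinfo=timezone.utc),
    "author": {"id": "<ObjectId>", "name": "Rick"},
    "text": "This is so bogus ... ",
}
```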
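This form is only suitable for displaying comments in chronological order. Comments store:
the discussion_id field that references the discussion parent,
a URL-compatible slug identifier,
a posted timestamp,
an author sub-document that contains a reference to a user’s profile in the id field and their name in the name field, and
the full text of the comment.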
To support threaded comments, you might use a slightly different structure like the following:
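A sketch of the threaded variant as a Python dictionary, under the same assumptions as before: the identifier strings are placeholders for `bson.ObjectId` values and the timestamp is illustrative.

```python
from datetime import datetime, timezone

# Placeholder identifiers: a real application would store bson.ObjectId values.
threaded_comment = {
    "_id": "<ObjectId>",
    "discussion_id": "<ObjectId>",
    "parent_id": "<ObjectId>",
    # Path of slugs from the top-level comment down to this one.
    "slug": "34db/8bda",
    # Same path, with each slug prefixed by its posting time.
    "full_slug": "2012.02.08.12.21.08:34db/2012.02.09.22.19.16:8bda",
    "posted": datetime(2012, 2, 9, 22, 19, 16, tzinfo=timezone.utc),
    "author": {"id": "<ObjectId>", "name": "Rick"},
    "text": "This is so bogus ... ",
}
```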
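This structure:
adds a parent_id field that stores the contents of the _id field of the parent comment,
modifies the slug field to hold a path composed of the parent’s slug and this comment’s unique slug, and
adds a full_slug field that combines the slugs and time information to make it easier to sort documents in a threaded discussion by date.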
WARNING
MongoDB versions before 4.2 limit each index key to 1024 bytes. This includes all field data, the field name, and the namespace (i.e. database name and collection name). This may become an issue when you create an index on the full_slug field to support sorting, because full_slug grows with the depth of the thread.
Operations
This section contains an overview of common operations for interacting with comments represented using a schema where each comment is its own document.
All examples in this document use the Python programming language and the PyMongo driver for MongoDB, but you can implement this system using any language you choose. To load the required libraries, import the bson and pymongo modules at the interactive Python shell.
Post a New Comment
To post a new comment in a chronologically ordered (i.e. without threading) system, use the following insert() operation:
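A minimal sketch of that insert, assuming `comments` is a PyMongo collection. It calls `insert_one()`, the current PyMongo name for the legacy `insert()` method; the `generate_pseudorandom_slug()` body shown here is only one illustrative way to produce a short slug.

```python
import secrets
from datetime import datetime, timezone


def generate_pseudorandom_slug(nbytes=2):
    """Return a short hex slug such as '34db' (illustrative implementation)."""
    return secrets.token_hex(nbytes)


def post_comment(comments, discussion_id, author_info, comment_text):
    """Insert one unthreaded comment and return the inserted document."""
    comment = {
        "discussion_id": discussion_id,
        "slug": generate_pseudorandom_slug(),
        "posted": datetime.now(timezone.utc),
        "author": author_info,
        "text": comment_text,
    }
    comments.insert_one(comment)  # insert() in legacy PyMongo releases
    return comment
```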
To insert a comment for a system with threaded comments, you must generate the slug path and full_slug at insert. See the following operation:
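The following sketch builds both slug fields before inserting, again assuming `comments` is a PyMongo collection. Storing `parent_id` follows the threaded schema above; the `LookupError` for a missing parent is an added assumption.

```python
import secrets
from datetime import datetime, timezone


def post_threaded_comment(comments, discussion_id, author_info, comment_text,
                          parent_slug=None):
    """Insert a comment, extending the parent's slug path when replying."""
    posted = datetime.now(timezone.utc)
    # Generate the unique portions of slug and full_slug.
    slug_part = secrets.token_hex(2)
    full_slug_part = posted.strftime("%Y.%m.%d.%H.%M.%S") + ":" + slug_part
    parent_id = None
    if parent_slug:
        # Load the parent comment and append to its paths.
        parent = comments.find_one(
            {"discussion_id": discussion_id, "slug": parent_slug})
        if parent is None:
            raise LookupError("no comment with slug %r" % parent_slug)
        slug = parent["slug"] + "/" + slug_part
        full_slug = parent["full_slug"] + "/" + full_slug_part
        parent_id = parent["_id"]
    else:
        slug, full_slug = slug_part, full_slug_part
    comment = {
        "discussion_id": discussion_id,
        "parent_id": parent_id,
        "slug": slug,
        "full_slug": full_slug,
        "posted": posted,
        "author": author_info,
        "text": comment_text,
    }
    comments.insert_one(comment)
    return comment
```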
View Paginated Comments
To view comments that are not threaded, select all comments participating in a discussion and sort by the posted field.
Because the full_slug field contains both hierarchical information (via the path) and chronological information, you can use a simple sort on the full_slug field to retrieve a threaded view:
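Both views can be sketched with one helper that differs only in the sort field. The function name and the `threaded` flag are illustrative; `comments` is assumed to be a PyMongo collection.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING


def find_comment_page(comments, discussion_id, page_num, page_size,
                      threaded=False):
    """Return a cursor over one page of a discussion's comments."""
    # full_slug orders by thread path and time; posted orders by time only.
    sort_field = "full_slug" if threaded else "posted"
    cursor = comments.find({"discussion_id": discussion_id})
    cursor = cursor.sort(sort_field, ASCENDING)
    return cursor.skip(page_num * page_size).limit(page_size)
```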
SEE ALSO
cursor.limit, cursor.skip, and cursor.sort
Indexing
To support the above queries efficiently, maintain two compound indexes, on:
(discussion_id, posted) and
(discussion_id, full_slug)
Issue the following operation at the interactive Python shell.
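A sketch of the index creation, assuming `comments` is a PyMongo collection. It uses `create_index()`, which current PyMongo releases provide in place of the older `ensure_index()`.

```python
def create_pagination_indexes(comments):
    """Create the compound indexes behind the chronological and threaded views."""
    comments.create_index([("discussion_id", 1), ("posted", 1)])
    comments.create_index([("discussion_id", 1), ("full_slug", 1)])
```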
NOTE
Ensure that you always sort by the final element in a compound index to maximize the performance of these queries.
Retrieve Comments via Direct Links
Queries
To directly retrieve a comment, without needing to page through all comments, you can select by the slug field.
You can retrieve a “sub-discussion,” or a comment and all of its descendants recursively, by performing a regular expression prefix query on the full_slug field:
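The two lookups might look like the following sketch, assuming `comments` is a PyMongo collection. The prefix is passed through `re.escape()` because full_slug contains dots, which are regular expression metacharacters; the function names are illustrative.

```python
import re


def find_comment_by_slug(comments, discussion_id, comment_slug):
    """Fetch a single comment directly by its slug."""
    return comments.find_one(
        {"discussion_id": discussion_id, "slug": comment_slug})


def subdiscussion_query(discussion_id, parent_full_slug):
    """Build a query matching a comment and all of its descendants."""
    return {
        "discussion_id": discussion_id,
        # Anchored prefix match on the escaped full_slug.
        "full_slug": {"$regex": "^" + re.escape(parent_full_slug)},
    }


def find_subdiscussion(comments, discussion_id, parent_full_slug):
    """Return the sub-discussion in threaded order."""
    query = subdiscussion_query(discussion_id, parent_full_slug)
    return comments.find(query).sort("full_slug", 1)
```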
Indexing
Since you have already created an index on { discussion_id: 1, full_slug: 1 } to support retrieving sub-discussions, you only need to add an index on { discussion_id: 1, slug: 1 } to support direct retrieval by slug. Use the following operation in the Python shell:
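A one-call sketch, assuming `comments` is a PyMongo collection and using `create_index()` from current PyMongo releases:

```python
def create_slug_index(comments):
    """Index used by direct lookups on (discussion_id, slug)."""
    comments.create_index([("discussion_id", 1), ("slug", 1)])
```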
Embedding All Comments
This design embeds the entire discussion of a comment thread inside the topic document. In this example, the “topic” document holds all of the content for whatever item you’re managing.
Schema
Consider the following prototype topic document:
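As a sketch, the topic document can be written as a Python dictionary. The identifier strings are placeholders for `bson.ObjectId` values, the timestamp is illustrative, and the topic's own fields are elided.

```python
from datetime import datetime, timezone

topic = {
    "_id": "<ObjectId>",
    # ... other topic fields elided ...
    "comments": [
        {
            "posted": datetime(2012, 2, 8, 12, 21, 8, tzinfo=timezone.utc),
            "author": {"id": "<ObjectId>", "name": "Rick"},
            "text": "This is so bogus ... ",
        },
        # ... further comments, appended in chronological order ...
    ],
}
```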
This structure is only suitable for a chronological display of all comments because it embeds comments in chronological order. Each document in the comments array contains the comment’s date, author, and text.
NOTE
Since you’re storing the comments in sorted order, there is no need to maintain per-comment slugs.
To support threading using this design, you would need to embed comments within comments, using a structure that resembles the following:
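A sketch of the nested form, with placeholder identifiers and illustrative reply text. The `count_comments()` helper is not part of the schema; it is included only to show how an application walks the tree.

```python
from datetime import datetime, timezone

topic = {
    "_id": "<ObjectId>",
    # ... other topic fields elided ...
    "replies": [
        {
            "posted": datetime(2012, 2, 8, 12, 21, 8, tzinfo=timezone.utc),
            "author": {"id": "<ObjectId>", "name": "Rick"},
            "text": "This is so bogus ... ",
            # Replies to this comment, each with its own replies array.
            "replies": [
                {
                    "posted": datetime(2012, 2, 9, 22, 19, 16, tzinfo=timezone.utc),
                    "author": {"id": "<ObjectId>", "name": "Rick"},
                    "text": "(illustrative reply)",
                    "replies": [],
                },
            ],
        },
    ],
}


def count_comments(replies):
    """Count comments at every depth of a replies array."""
    return sum(1 + count_comments(r.get("replies", [])) for r in replies)
```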
Here, the replies field in each comment holds the sub-comments, which can in turn hold sub-comments.
NOTE
In the embedded document design, you give up some flexibility regarding display format, because it is difficult to display comments except as you store them in MongoDB.
If, in the future, you want to switch from chronological to threaded or from threaded to chronological, this design would make that migration quite expensive.
WARNING
Remember that BSON documents have a 16 megabyte size limit. If popular discussions grow larger than 16 megabytes, additional document growth will fail.
Additionally, when MongoDB documents grow significantly after creation, you will experience greater storage fragmentation and degraded update performance while MongoDB migrates documents internally.
Operations
This section contains an overview of common operations for interacting with comments represented using a schema that embeds all comments in the document of the “parent” or topic content.
NOTE
For all operations below, there is no need for any new indexes since all the operations work within a single document. Because you would retrieve these documents by the _id field, you can rely on the index that MongoDB creates automatically.
Post a New Comment
To post a new comment in a chronologically ordered (i.e. unthreaded) system, you need the following update():
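A sketch of that update, assuming a PyMongo collection of topic documents (called `topics` here, a hypothetical name) looked up by `_id`. It uses `update_one()`, the current PyMongo equivalent of the legacy `update()` method.

```python
from datetime import datetime, timezone


def post_embedded_comment(topics, topic_id, author_info, comment_text):
    """Append a comment to the end of the topic's comments array."""
    comment = {
        "posted": datetime.now(timezone.utc),
        "author": author_info,
        "text": comment_text,
    }
    topics.update_one({"_id": topic_id}, {"$push": {"comments": comment}})
    return comment
```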
The $push operator inserts comments into the comments array in correct chronological order. For threaded discussions, the update() operation is more complex. To reply to a comment, the following code assumes that it can retrieve the ‘path’ as a list of positions, for the parent comment:
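A sketch of the threaded reply, where `path` is the list of array positions leading to the parent comment and `topics` is a hypothetical PyMongo collection of topic documents:

```python
def replies_field(path):
    """Map [0, 2] to 'replies.0.replies.2.replies'; an empty path to 'replies'."""
    return ".".join(["replies.%d" % position for position in path] + ["replies"])


def post_embedded_reply(topics, topic_id, path, comment):
    """Push a comment onto the replies array of the comment found at `path`."""
    str_path = replies_field(path)
    topics.update_one({"_id": topic_id}, {"$push": {str_path: comment}})
    return str_path
```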
This constructs a field name of the form replies.0.replies.2... as str_path and then uses this value with the $push operator to insert the new comment into the parent comment’s replies array.
View Paginated Comments
To view the comments in a non-threaded design, you must use the $slice operator:
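A sketch of paging with a `$slice` projection, which takes a `[skip, limit]` pair. The `topics` collection name is hypothetical.

```python
def find_embedded_page(topics, topic_id, page_num, page_size):
    """Return one page of a topic's embedded comments."""
    projection = {"comments": {"$slice": [page_num * page_size, page_size]}}
    topic = topics.find_one({"_id": topic_id}, projection)
    return topic["comments"] if topic else []
```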
To return paginated comments for the threaded design, you must retrieve the whole document and paginate the comments within the application:
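A sketch of the application-side pagination, assuming the topic document has already been fetched in full. It walks the replies tree depth-first and slices the resulting sequence.

```python
from itertools import islice


def iter_comments(node):
    """Yield every comment under `node` in depth-first (threaded) order."""
    for reply in node.get("replies", []):
        yield reply
        yield from iter_comments(reply)


def paginate_thread(topic, page_num, page_size):
    """Return one page of comments from an already loaded topic document."""
    start = page_num * page_size
    return list(islice(iter_comments(topic), start, start + page_size))
```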
Retrieve a Comment via Direct Links
Instead of retrieving comments via slugs as above, the following example retrieves comments using their position in the comment list or tree.
For chronological (i.e. non-threaded) comments, just use the $slice operator to extract a comment, as follows:
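A sketch that extracts the comment at a given position by slicing a single element. The `topics` collection name is hypothetical.

```python
def find_comment_at(topics, topic_id, position):
    """Return the comment at `position` in the comments array, or None."""
    projection = {"comments": {"$slice": [position, 1]}}
    topic = topics.find_one({"_id": topic_id}, projection)
    found = topic["comments"] if topic else []
    return found[0] if found else None
```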
For threaded comments, you must find the correct path through the tree in your application, as follows:
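A sketch of the tree walk, assuming the topic document is already loaded and `path` is a list of positions in successive replies arrays:

```python
def comment_at_path(topic, path):
    """Follow a list of array positions down the replies tree."""
    node = topic
    for position in path:
        node = node["replies"][position]
    return node
```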
NOTE
Since parent comments embed child replies, this operation actually retrieves the entire sub-discussion for the comment you queried for.
SEE ALSO
find_one().
Hybrid Schema Design
Schema
In the “hybrid approach” you will store comments in “buckets” that hold about 100 comments. Consider the following example bucket document:
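A sketch of one bucket document as a Python dictionary. The identifier strings are placeholders for `bson.ObjectId` values, the timestamp is illustrative, and only one of the 42 comments is written out.

```python
from datetime import datetime, timezone

bucket = {
    "_id": "<ObjectId>",
    "discussion_id": "<ObjectId>",
    "bucket": 1,   # bucket (page) number within the discussion
    "count": 42,   # number of comments in this bucket
    "comments": [
        {
            "slug": "34db",
            "posted": datetime(2012, 2, 8, 12, 21, 8, tzinfo=timezone.utc),
            "author": {"id": "<ObjectId>", "name": "Rick"},
            "text": "This is so bogus ... ",
        },
        # ... remaining comments elided ...
    ],
}
```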
Each document maintains bucket and count fields that hold metadata about the bucket (the bucket number and the comment count), in addition to the comments array that holds the comments themselves.
NOTE
Using a hybrid format makes storing threaded comments complex, and this specific configuration is not covered in this document.
Also, 100 comments is a soft limit for the number of comments per bucket. This value is arbitrary: choose a value small enough to keep each document well under the 16 MB BSON document size limit, but large enough to ensure that most comment threads fit in a single document. In some situations the number of comments per document can exceed 100, but this does not affect the correctness of the pattern.
Operations
This section contains a number of common operations that you may use when building a CMS using this hybrid storage model, in which each “bucket” document holds about 100 comments.
All examples in this document use the Python programming language and the PyMongo driver for MongoDB, but you can implement this system using any language you choose.
Post a New Comment
Updating
In order to post a new comment, you need to $push the comment onto the last bucket and $inc that bucket’s comment count. Consider the following example that queries on the basis of a discussion_id field:
The find_and_modify() operation is an upsert: if MongoDB cannot find a document with the correct bucket number, find_and_modify() creates it and initializes the new document with appropriate values for count and comments.
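A sketch using `find_one_and_update()`, which current PyMongo releases provide in place of `find_and_modify()`. By default it returns the document as it was before the update, or `None` when the upsert creates it, so the new count is derived from that. The `buckets` collection and the `bucket_num` argument are assumptions.

```python
def push_comment(buckets, discussion_id, bucket_num, comment):
    """Append a comment to a bucket and return that bucket's new count."""
    before = buckets.find_one_and_update(
        {"discussion_id": discussion_id, "bucket": bucket_num},
        {"$inc": {"count": 1}, "$push": {"comments": comment}},
        projection={"count": 1},
        upsert=True,  # create the bucket if it does not exist yet
    )
    # `before` is the pre-update document, or None for a new bucket.
    return (before["count"] if before else 0) + 1
```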
To limit the number of comments per bucket to roughly 100, you will need to create new pages as they become necessary. Add the following logic to support this:
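A sketch of that logic. It assumes the page count lives in a `num_buckets` field on a discussion document in a `discussions` collection; both names are hypothetical, since the text does not say where the number of pages is stored.

```python
BUCKET_SIZE = 100  # soft limit on comments per bucket


def open_new_bucket_if_full(discussions, discussion, new_count):
    """Advance the discussion to a new bucket once the current one is full."""
    if new_count <= BUCKET_SIZE:
        return False
    result = discussions.update_one(
        # Match on the last known page count so that only one process wins.
        {"_id": discussion["_id"], "num_buckets": discussion["num_buckets"]},
        {"$inc": {"num_buckets": 1}},
    )
    return result.modified_count == 1
```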
This update() operation includes the last known number of pages in the query to prevent a race condition in which the number of pages increments twice, which would result in a nearly or totally empty document. If another process has already incremented the number of pages, the update above does nothing.
Indexing
To support the find_and_modify() and update() operations, maintain a compound index on (discussion_id, bucket) in the comment_buckets collection, by issuing the following operation at the Python/PyMongo console:
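A one-call sketch, assuming `buckets` is the `comment_buckets` PyMongo collection and using `create_index()` from current PyMongo releases:

```python
def create_bucket_index(buckets):
    """Compound index used to locate a discussion's bucket by number."""
    buckets.create_index([("discussion_id", 1), ("bucket", 1)])
```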
View Paginated Comments
The following function presents an example of paginating comments into pages of fixed size. This works by iterating through all comments in a discussion, and keeping a counter of both how many comments it has skipped, and how many it has returned.
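A sketch of such a function, assuming `buckets` is the `comment_buckets` PyMongo collection. It returns a list of `(bucket number, comments)` pairs and stops once `limit` comments have been collected.

```python
def find_comments(buckets, discussion_id, skip, limit):
    """Collect up to `limit` comments after skipping `skip`, bucket by bucket."""
    result = []
    # Find this discussion's buckets in bucket order.
    cursor = buckets.find({"discussion_id": discussion_id}, {"bucket": 1})
    for bucket in cursor.sort("bucket"):
        # Ask each bucket for the slice that is still needed.
        page = buckets.find_one(
            {"discussion_id": discussion_id, "bucket": bucket["bucket"]},
            {"count": 1, "comments": {"$slice": [skip, limit]}},
        )
        result.append((bucket["bucket"], page["comments"]))
        # Carry the remaining skip and limit over to the next bucket.
        skip = max(0, skip - page["count"])
        limit -= len(page["comments"])
        if limit == 0:
            break
    return result
```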
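Here, the $slice operator pulls comments out of each bucket, but only once the skip requirement is satisfied. For example, suppose you have 4 buckets holding 100, 102, 101, and 22 comments respectively, and you wish to retrieve comments where skip=300 and limit=50. The algorithm proceeds as follows:
Skip   Limit   Discussion
300    50      {$slice: [300, 50]} matches nothing in bucket #1; subtract bucket #1’s count from skip and continue.
200    50      {$slice: [200, 50]} matches nothing in bucket #2; subtract bucket #2’s count from skip and continue.
98     50      {$slice: [98, 50]} matches 3 comments in bucket #3; subtract bucket #3’s count from skip (saturating at 0), subtract 3 from limit, and continue.
0      47      {$slice: [0, 47]} matches all 22 comments in bucket #4; subtract 22 from limit and continue.
0      25      There are no more buckets; terminate loop.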
NOTE
Since you already have an index on (discussion_id, bucket) in your comment_buckets collection, MongoDB can satisfy these queries efficiently.
Retrieve a Comment via Direct Links
Query
To retrieve a comment directly without paging through all preceding pages of commentary, use the slug to find the correct bucket, and then use application logic to find the correct comment:
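A sketch of the two steps, assuming `buckets` is the `comment_buckets` PyMongo collection. The query finds the bucket that contains the slug, and the loop picks the matching comment out of it.

```python
def find_comment_in_buckets(buckets, discussion_id, comment_slug):
    """Locate the bucket holding a slug, then pick out the matching comment."""
    bucket = buckets.find_one(
        {"discussion_id": discussion_id, "comments.slug": comment_slug},
        {"comments": 1},
    )
    if bucket is None:
        return None
    for comment in bucket["comments"]:
        if comment["slug"] == comment_slug:
            return comment
    return None
```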
Indexing
To perform this query efficiently you’ll need a new index on the discussion_id and comments.slug fields (i.e. { discussion_id: 1, comments.slug: 1 }). Create this index using the following operation in the Python/PyMongo console:
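A one-call sketch, assuming `buckets` is the `comment_buckets` PyMongo collection and using `create_index()` from current PyMongo releases:

```python
def create_direct_link_index(buckets):
    """Index used to find the bucket that contains a given comment slug."""
    buckets.create_index([("discussion_id", 1), ("comments.slug", 1)])
```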
Sharding
For all of the architectures discussed above, you will want the discussion_id field to participate in the shard key if you need to shard your application.
For applications that use the “one document per comment” approach, consider using slug (or full_slug, in the case of threaded comments) fields in the shard key to allow the mongos instances to route requests by slug. Issue the following operation at the Python/PyMongo console:
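A sketch of the sharding command, assuming `client` is a PyMongo `MongoClient` connected to a mongos and that sharding is already enabled for the database. The `cms.comments` namespace is a hypothetical example.

```python
def comment_shard_key(threaded=True):
    """Shard key on discussion_id plus full_slug (threaded) or slug."""
    return {"discussion_id": 1, "full_slug" if threaded else "slug": 1}


def shard_comments(client, namespace="cms.comments", threaded=True):
    """Run shardCollection against the admin database."""
    return client.admin.command(
        "shardCollection", namespace, key=comment_shard_key(threaded))
```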
On success, the command returns a document that names the sharded collection and reports an ok value of 1.
In the case of comments that are fully embedded in parent content documents, the determination of the shard key is outside the scope of this document.
For hybrid documents, use the bucket number in the shard key along with the discussion_id to allow MongoDB to split popular discussions between pages while grouping discussions on the same shard. Issue the following operation at the Python/PyMongo console:
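A sketch for the hybrid collection, under the same assumptions: `client` is a PyMongo `MongoClient` connected to a mongos, and `cms.comment_buckets` is a hypothetical namespace.

```python
def shard_comment_buckets(client, namespace="cms.comment_buckets"):
    """Shard the bucket collection on (discussion_id, bucket)."""
    key = {"discussion_id": 1, "bucket": 1}
    return client.admin.command("shardCollection", namespace, key=key)
```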