[MongoDB]: Storing Comments

This document outlines the basic patterns for storing user-submitted comments in a content management system (CMS).

Overview

MongoDB provides a number of different approaches for storing data like user comments on content from a CMS. There is no single correct implementation, but there are a number of common approaches and known considerations for each. This case study explores the implementation details and trade-offs of each option. The three basic patterns are:

  1. Store each comment in its own document.

    This approach provides the greatest flexibility at the expense of some additional application-level complexity.

    These implementations make it possible to display comments in chronological or threaded order, and place no restrictions on the number of comments attached to a specific object.

  2. Embed all comments in the “parent” document.

    This approach provides the greatest possible performance for displaying comments at the expense of flexibility: the structure of the comments in the document controls the display format.

    NOTE

    Because of the limit on document size, documents, including the original content and all comments, cannot grow beyond 16 megabytes.

  3. Use a hybrid design that stores comments separately from the “parent” but aggregates them into a small number of documents, each containing many comments.

Also consider that comments can be threaded, where each comment is a reply either to the “parent” item or to another comment. Threading carries certain architectural requirements, discussed below.

One Document per Comment

Schema

If you store each comment in its own document, the documents in your comments collection would have the following structure:

{
    _id: ObjectId(...),
    discussion_id: ObjectId(...),
    slug: '34db',
    posted: ISODateTime(...),
    author: {
              id: ObjectId(...),
              name: 'Rick'
             },
    text: 'This is so bogus ... '
}

This form is only suitable for displaying comments in chronological order. Comments store:

  • the discussion_id field that references the discussion parent,
  • a URL-compatible slug identifier,
  • a posted timestamp,
  • an author sub-document that contains a reference to a user’s profile in the id field and their name in the name field, and
  • the full text of the comment.

To support threaded comments, you might use a slightly different structure like the following:

{
    _id: ObjectId(...),
    discussion_id: ObjectId(...),
    parent_id: ObjectId(...),
    slug: '34db/8bda',
    full_slug: '2012.02.08.12.21.08:34db/2012.02.09.22.19.16:8bda',
    posted: ISODateTime(...),
    author: {
              id: ObjectId(...),
              name: 'Rick'
             },
    text: 'This is so bogus ... '
}

This structure:

  • adds a parent_id field that stores the contents of the _id field of the parent comment,
  • modifies the slug field to hold a path composed of the parent’s slug and this comment’s unique slug, and
  • adds a full_slug field that combines the slugs and time information to make it easier to sort documents in a threaded discussion by date.

WARNING

MongoDB can only index keys of up to 1024 bytes. This limit includes all field data, the field name, and the namespace (i.e. database name and collection name). This may become an issue when you create an index on the full_slug field to support sorting, because full_slug grows with every level of nesting.
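
As a rough illustration of how quickly full_slug grows, the following sketch computes its size at a given nesting depth. It assumes the timestamp format and 4-character slugs shown in the schema above. The function names and the notion of a byte "budget" are illustrative assumptions, not part of MongoDB or PyMongo, and the usable budget is smaller than the raw limit because of the other overhead listed in the warning.

```python
TIMESTAMP_LEN = len('2012.02.08.12.21.08')  # 19 characters per timestamp


def full_slug_bytes(depth, slug_len=4):
    """Size in bytes of an ASCII full_slug for a comment `depth` levels deep."""
    per_level = TIMESTAMP_LEN + 1 + slug_len   # timestamp + ':' + slug
    return depth * per_level + (depth - 1)     # plus one '/' between levels


def max_depth(budget_bytes, slug_len=4):
    """Deepest nesting whose full_slug still fits in `budget_bytes`."""
    per_level = TIMESTAMP_LEN + 1 + slug_len
    return (budget_bytes + 1) // (per_level + 1)


example = '2012.02.08.12.21.08:34db/2012.02.09.22.19.16:8bda'
print(full_slug_bytes(2) == len(example))  # True
print(max_depth(1024))                     # 41
```

Under these assumptions a thread can nest a few dozen levels before full_slug alone reaches the limit, so very deep threads are the case to watch.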

Operations

This section contains an overview of common operations for interacting with comments represented using a schema where each comment is its own document.

All examples in this document use the Python programming language and the PyMongo driver for MongoDB, but you can implement this system using any language you choose. Issue the following commands at the interactive Python shell to load the required libraries:

>>> import bson
>>> import pymongo
>>> from datetime import datetime

Post a New Comment

To post a new comment in a chronologically ordered (i.e. without threading) system, use the following insert() operation:

slug = generate_pseudorandom_slug()
db.comments.insert({
    'discussion_id': discussion_id,
    'slug': slug,
    'posted': datetime.utcnow(),
    'author': author_info,
    'text': comment_text })
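
The generate_pseudorandom_slug() helper is not defined in this document. A minimal sketch might look like the following, assuming a 4-character slug drawn from lowercase letters and digits (as in the '34db' example). The length and alphabet are assumptions, and short random slugs can collide, so a real system would need to handle duplicates within a discussion.

```python
import random
import string

SLUG_ALPHABET = string.ascii_lowercase + string.digits


def generate_pseudorandom_slug(length=4):
    """Return a short random identifier that is safe to use in a URL."""
    return ''.join(random.choice(SLUG_ALPHABET) for _ in range(length))


slug = generate_pseudorandom_slug()
print(slug)  # the value differs on every call
```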

To insert a comment for a system with threaded comments, you must generate the slug path and full_slug at insert. See the following operation:

posted = datetime.utcnow()

# generate the unique portions of the slug and full_slug
slug_part = generate_pseudorandom_slug()
full_slug_part = posted.strftime('%Y.%m.%d.%H.%M.%S') + ':' + slug_part
# load the parent comment (if any)
if parent_slug:
    parent = db.comments.find_one(
        {'discussion_id': discussion_id, 'slug': parent_slug })
    slug = parent['slug'] + '/' + slug_part
    full_slug = parent['full_slug'] + '/' + full_slug_part
else:
    slug = slug_part
    full_slug = full_slug_part

# actually insert the comment
db.comments.insert({
    'discussion_id': discussion_id,
    'slug': slug,
    'full_slug': full_slug,
    'posted': posted,
    'author': author_info,
    'text': comment_text })
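
To see the slug and full_slug construction in isolation, the following sketch factors it into a pure function that needs no database. The name build_slugs and its parameters are hypothetical. Here parent stands for the parent comment document loaded by find_one(), or None for a top-level comment.

```python
from datetime import datetime


def build_slugs(posted, slug_part, parent=None):
    """Return (slug, full_slug) for a new comment."""
    full_slug_part = posted.strftime('%Y.%m.%d.%H.%M.%S') + ':' + slug_part
    if parent is None:
        # top-level comment: the path is just this comment's own part
        return slug_part, full_slug_part
    # reply: extend the parent's paths with this comment's parts
    return (parent['slug'] + '/' + slug_part,
            parent['full_slug'] + '/' + full_slug_part)


parent = {'slug': '34db', 'full_slug': '2012.02.08.12.21.08:34db'}
slug, full_slug = build_slugs(datetime(2012, 2, 9, 22, 19, 16), '8bda', parent)
print(slug)       # 34db/8bda
print(full_slug)  # 2012.02.08.12.21.08:34db/2012.02.09.22.19.16:8bda
```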

View Paginated Comments

To view comments that are not threaded, select all comments participating in a discussion and sort by the posted field. For example:

cursor = db.comments.find({'discussion_id': discussion_id})
cursor = cursor.sort('posted')
cursor = cursor.skip(page_num * page_size)
cursor = cursor.limit(page_size)

Because the full_slug field contains both hierarchical information (via the path) and chronological information, you can use a simple sort on the full_slug field to retrieve a threaded view:

cursor = db.comments.find({'discussion_id': discussion_id})
cursor = cursor.sort('full_slug')
cursor = cursor.skip(page_num * page_size)
cursor = cursor.limit(page_size)
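
The following sketch shows why a plain sort on full_slug yields threaded order. It sorts a few sample values in memory, on the assumption that Python's string ordering matches MongoDB's default string comparison for these ASCII values. The sample slugs are made up for illustration.

```python
full_slugs = [
    '2012.02.08.13.00.00:77aa',
    '2012.02.08.12.21.08:34db/2012.02.09.22.19.16:8bda',
    '2012.02.08.12.21.08:34db',
    '2012.02.08.12.21.08:34db/2012.02.08.14.00.00:19cf',
]

# A parent's full_slug is a prefix of its replies, so it sorts first;
# siblings then sort by their timestamps.
threaded = sorted(full_slugs)

for full_slug in threaded:
    depth = full_slug.count('/')          # nesting level, for indentation
    slug = full_slug.rsplit(':', 1)[-1]   # this comment's own slug part
    print('    ' * depth + slug)
# 34db
#     19cf
#     8bda
# 77aa
```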

Indexing

To support the above queries efficiently, maintain two compound indexes, on:

  1. (discussion_id, posted) and
  2. (discussion_id, full_slug)

Issue the following operation at the interactive Python shell.

>>> db.comments.ensure_index([
...    ('discussion_id', 1), ('posted', 1)])
>>> db.comments.ensure_index([
...    ('discussion_id', 1), ('full_slug', 1)])

NOTE

Ensure that you always sort by the final element in a compound index to maximize the performance of these queries.

Embedding All Comments

This design embeds the entire discussion of a comment thread inside the topic document. In this example, the “topic” document holds all the content for whatever you’re managing.

Schema

Consider the following prototype topic document:

{
    _id: ObjectId(...),
    ... lots of topic data ...
    comments: [
        { posted: ISODateTime(...),
          author: { id: ObjectId(...), name: 'Rick' },
          text: 'This is so bogus ... ' },
       ... ]
}

This structure is only suitable for a chronological display of all comments because it embeds comments in chronological order. Each document in the comments array contains the comment’s date, author, and text.

NOTE

Since you’re storing the comments in sorted order, there is no need to maintain per-comment slugs.

To support threading using this design, you would need to embed comments within comments, using a structure that resembles the following:

{
    _id: ObjectId(...),
    ... lots of topic data ...
    replies: [
        { posted: ISODateTime(...),
          author: { id: ObjectId(...), name: 'Rick' },
          text: 'This is so bogus ... ',
          replies: [
              { author: { ... }, ... },
              ... ] },
        ... ]
}

Here, the replies field in each comment holds the sub-comments, which can in turn hold sub-comments.

NOTE

In the embedded document design, you give up some flexibility regarding display format, because it is difficult to display comments except as you store them in MongoDB.

If, in the future, you want to switch from chronological to threaded or from threaded to chronological, this design would make that migration quite expensive.

WARNING

Remember that BSON documents have a 16 megabyte size limit. If popular discussions grow larger than 16 megabytes, additional document growth will fail.

Additionally, when MongoDB documents grow significantly after creation, you will experience greater storage fragmentation and degraded update performance while MongoDB migrates documents internally.

Operations

This section contains an overview of common operations for interacting with comments represented using a schema that embeds all comments in the document of the “parent” or topic content.

NOTE

For all operations below, there is no need for any new indexes since all the operations work within a single document. Because you would retrieve these documents by the _id field, you can rely on the index that MongoDB creates automatically.

Post a new comment

To post a new comment in a chronologically ordered (i.e. unthreaded) system, you need the following update():

db.discussion.update(
    { 'discussion_id': discussion_id },
    { '$push': { 'comments': {
        'posted': datetime.utcnow(),
        'author': author_info,
        'text': comment_text } } } )

The $push operator inserts comments into the comments array in correct chronological order. For threaded discussions, the update() operation is more complex. To reply to a comment, the following code assumes that it can retrieve the “path” to the parent comment as a list of positions:

if path != []:
    str_path = '.'.join('replies.%d' % part for part in path)
    str_path += '.replies'
else:
    str_path = 'replies'
db.discussion.update(
    { 'discussion_id': discussion_id },
    { '$push': {
        str_path: {
            'posted': datetime.utcnow(),
            'author': author_info,
            'text': comment_text } } } )

This constructs a field name of the form replies.0.replies.2... as str_path and then uses this value with the $push operator to insert the new comment into the parent comment’s replies array.
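
The path construction can be tried on its own, without a database. The sketch below wraps the same logic in a function. The name replies_path is hypothetical, and path is assumed to be a list of integer array positions, as in the code above.

```python
def replies_path(path):
    """Build the dotted field name of the replies array for the comment at `path`."""
    if path:
        # e.g. [0, 2] -> 'replies.0.replies.2' and then the parent's own 'replies'
        return '.'.join('replies.%d' % part for part in path) + '.replies'
    # an empty path means a top-level comment on the topic itself
    return 'replies'


print(replies_path([]))      # replies
print(replies_path([0, 2]))  # replies.0.replies.2.replies
```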

View Paginated Comments

To view the comments in a non-threaded design, you must use the $slice operator:

discussion = db.discussion.find_one(
    {'discussion_id': discussion_id},
    { ... some fields relevant to your page from the root discussion ...,
      'comments': { '$slice': [ page_num * page_size, page_size ] }
    })

To return paginated comments for the threaded design, you must retrieve the whole document and paginate the comments within the application:

import itertools

discussion = db.discussion.find_one({'discussion_id': discussion_id})

def iter_comments(obj):
    # walk the reply tree depth-first; leaf comments may have no 'replies' field
    for reply in obj.get('replies', []):
        yield reply
        for subreply in iter_comments(reply):
            yield subreply

paginated_comments = itertools.islice(
    iter_comments(discussion),
    page_size * page_num,
    page_size * (page_num + 1))
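
As a self-contained illustration of this application-side pagination, the sketch below runs the same depth-first walk over an in-memory document. The sample discussion and the page_of_comments helper are made up for illustration. It uses itertools.islice, the standard-library function for slicing an iterator, and tolerates comments that have no replies field yet.

```python
import itertools


def iter_comments(obj):
    """Yield every comment under obj, depth-first, in display order."""
    for reply in obj.get('replies', []):
        yield reply
        for subreply in iter_comments(reply):
            yield subreply


def page_of_comments(discussion, page_num, page_size):
    """Return one page of the flattened thread as a list."""
    return list(itertools.islice(
        iter_comments(discussion),
        page_size * page_num,
        page_size * (page_num + 1)))


discussion = {'replies': [
    {'text': 'a', 'replies': [
        {'text': 'a1'},
        {'text': 'a2', 'replies': [{'text': 'a2x'}]}]},
    {'text': 'b'}]}

pages = [[c['text'] for c in page_of_comments(discussion, n, 2)]
         for n in range(3)]
print(pages)  # [['a', 'a1'], ['a2', 'a2x'], ['b']]
```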

Hybrid Schema Design

Schema

In the “hybrid approach” you will store comments in “buckets” that hold about 100 comments. Consider the following example bucket document:

{
    _id: ObjectId(...),
    discussion_id: ObjectId(...),
    bucket: 1,
    count: 42,
    comments: [ {
        slug: '34db',
        posted: ISODateTime(...),
        author: { id: ObjectId(...), name: 'Rick' },
        text: 'This is so bogus ... ' },
    ... ]
}

Each document maintains metadata about the bucket in the bucket field (the bucket number) and the count field (the comment count), in addition to the comments array that holds the comments themselves.

NOTE

Using a hybrid format makes storing threaded comments complex, and this specific configuration is not covered in this document.

Also, 100 comments is a soft limit for the number of comments per bucket. This value is arbitrary: choose a value small enough to keep documents well below the 16MB BSON document size limit, but large enough to ensure that most comment threads will fit in a single document. In some situations the number of comments per document can exceed 100, but this does not affect the correctness of the pattern.

Operations

This section contains a number of common operations that you may use when building a CMS using this hybrid storage model with “buckets” that each hold about 100 comments.

All examples in this document use the Python programming language and the PyMongo driver for MongoDB, but you can implement this system using any language you choose.

Post a New Comment

Updating

In order to post a new comment, you need to $push the comment onto the last bucket and $inc that bucket’s comment count. Consider the following example that queries on the basis of a discussion_id field:

bucket = db.comment_buckets.find_and_modify(
    { 'discussion_id': discussion['_id'],
      'bucket': discussion['num_buckets'] },
    { '$inc': { 'count': 1 },
      '$push': {
          'comments': { 'slug': slug, ... } } },
    fields={'count':1},
    upsert=True,
    new=True )

The find_and_modify() operation is an upsert: if MongoDB cannot find a document with the correct bucket number, find_and_modify() will create it and initialize the new document with appropriate values for count and comments.

To limit the number of comments per bucket to roughly 100, you will need to create new buckets as they become necessary. Add the following logic to support this:

if bucket['count'] > 100:
    db.discussion.update(
        { 'discussion_id': discussion['_id'],
          'num_buckets': discussion['num_buckets'] },
        { '$inc': { 'num_buckets': 1 } } )

This update() operation includes the last known number of buckets in the query to prevent a race condition where the number of buckets increments twice, which would result in a nearly or totally empty bucket. If another process has already incremented the number of buckets, the update above does nothing.
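
To make the soft limit concrete, the following sketch models the post-and-roll-over logic with plain Python dictionaries instead of a database. The post_comment function, the in-memory buckets dictionary, and the starting value of num_buckets are all assumptions for illustration. The sketch is single-threaded, so it does not exercise the race condition itself.

```python
def post_comment(discussion, buckets, comment, bucket_size=100):
    """In-memory stand-in for the find_and_modify() and update() pair."""
    number = discussion['num_buckets']
    # "upsert": create the bucket the first time it is needed
    bucket = buckets.setdefault(number, {'count': 0, 'comments': []})
    bucket['count'] += 1                # mirrors $inc on count
    bucket['comments'].append(comment)  # mirrors $push on comments
    # mirror the guarded update: advance only if num_buckets is unchanged
    if bucket['count'] > bucket_size and discussion['num_buckets'] == number:
        discussion['num_buckets'] += 1


discussion = {'num_buckets': 1}  # assumption: bucket numbering starts at 1
buckets = {}
for i in range(250):
    post_comment(discussion, buckets, {'slug': 'c%d' % i})

counts = [buckets[n]['count'] for n in sorted(buckets)]
print(counts)  # [101, 101, 48]
```

Because the roll-over happens only after the count exceeds the limit, each full bucket ends up holding slightly more than 100 comments, which is why 100 is a soft limit.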

Indexing

To support the find_and_modify() and update() operations, maintain a compound index on (discussion_id, bucket) in the comment_buckets collection, by issuing the following operation at the Python/PyMongo console:

>>> db.comment_buckets.ensure_index([
...    ('discussion_id', 1), ('bucket', 1)])

View Paginated Comments

The following function presents an example of paginating comments into pages of fixed size. This works by iterating through all comments in a discussion, and keeping a counter of both how many comments it has skipped, and how many it has returned.

def find_comments(discussion_id, skip, limit):
    result = []

    # Find this discussion's comment buckets
    buckets = db.comment_buckets.find(
        { 'discussion_id': discussion_id },
        { 'bucket': 1 })
    buckets = buckets.sort('bucket')

    # Iterate through those buckets, making a query obtaining comments for each
    for bucket in buckets:
        page_query = db.comment_buckets.find_one(
            { 'discussion_id': discussion_id, 'bucket': bucket['bucket'] },
            { 'count': 1, 'comments': { '$slice': [ skip, limit ] }})
        result.append((bucket['bucket'], page_query['comments']))
        skip = max(0, skip - page_query['count'])
        limit -= len(page_query['comments'])
        if limit == 0: break

    return result

Here, the $slice operator pulls out comments from each bucket, but only when this satisfies the skip requirement. For example, suppose you have 4 buckets with 100, 102, 101, and 22 comments, respectively, and you wish to retrieve comments where skip=300 and limit=50. The function proceeds as follows:

Skip  Limit  Discussion
300   50     {$slice: [ 300, 50 ] } matches nothing in bucket #1; subtract bucket #1’s count from skip and continue.
200   50     {$slice: [ 200, 50 ] } matches nothing in bucket #2; subtract bucket #2’s count from skip and continue.
98    50     {$slice: [ 98, 50 ] } matches 3 comments in bucket #3; subtract bucket #3’s count from skip (saturating at 0), subtract 3 from limit, and continue.
0     47     {$slice: [ 0, 47 ] } matches all 22 comments in bucket #4; subtract 22 from limit and continue.
0     25     There are no more buckets; terminate loop.
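
The walk through the buckets can be checked without a database. The sketch below is an in-memory analogue of find_comments() that emulates the $slice projection with Python list slicing. The function names and the generated sample comments are assumptions for illustration.

```python
def slice_comments(comments, skip, limit):
    """Emulate the projection {'$slice': [skip, limit]} on a Python list."""
    return comments[skip:skip + limit]


def find_comments_in_memory(buckets, skip, limit):
    """Mirror find_comments(), with buckets given as lists of comments."""
    result = []
    for number, comments in enumerate(buckets, start=1):
        page = slice_comments(comments, skip, limit)
        result.append((number, page))
        skip = max(0, skip - len(comments))  # saturate at 0
        limit -= len(page)
        if limit == 0:
            break
    return result


sizes = [100, 102, 101, 22]
buckets = [['b%d-c%d' % (n, i) for i in range(size)]
           for n, size in enumerate(sizes, start=1)]

pages = find_comments_in_memory(buckets, skip=300, limit=50)
counts = [(number, len(page)) for number, page in pages]
print(counts)  # [(1, 0), (2, 0), (3, 3), (4, 22)]
```

With these bucket sizes the request returns 25 comments in total: the last 3 from bucket #3 and all 22 from bucket #4.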

NOTE

Since you already have an index on (discussion_id, bucket) in your comment_buckets collection, MongoDB can satisfy these queries efficiently.

Retrieve a Comment via Direct Links

Query

To retrieve a comment directly without paging through all preceding pages of commentary, use the slug to find the correct bucket, and then use application logic to find the correct comment:

bucket = db.comment_buckets.find_one(
    { 'discussion_id': discussion_id,
      'comments.slug': comment_slug},
    { 'comments': 1 })
for comment in bucket['comments']:
    if comment['slug'] == comment_slug:
        break

Indexing

To perform this query efficiently you’ll need a new index on the discussion_id and comments.slug fields (i.e. { discussion_id: 1, 'comments.slug': 1 }). Create this index using the following operation in the Python/PyMongo console:

>>> db.comment_buckets.ensure_index([
...    ('discussion_id', 1), ('comments.slug', 1)])

Sharding

For all of the architectures discussed above, you will want the discussion_id field to participate in the shard key if you need to shard your application.

For applications that use the “one document per comment” approach, consider using slug (or full_slug, in the case of threaded comments) fields in the shard key to allow the mongos instances to route requests by slug. Issue the following operation at the Python/PyMongo console:

>>> db.command('shardCollection', 'comments',
...     key={ 'discussion_id' : 1, 'full_slug': 1 })

This will return the following response:

{ "collectionsharded" : "comments", "ok" : 1 }

In the case of comments that are fully embedded in parent content documents, the determination of the shard key is outside of the scope of this document.

For hybrid documents, use the bucket number in the shard key along with the discussion_id to allow MongoDB to split popular discussions between buckets while grouping discussions on the same shard. Issue the following operation at the Python/PyMongo console:

>>> db.command('shardCollection', 'comment_buckets',
...     key={ 'discussion_id' : 1, 'bucket': 1 })
{ "collectionsharded" : "comment_buckets", "ok" : 1 }
