Knowing MongoDB opLog

Replication gives you hot backups, read scaling, and all sorts of other goodness. If you know how it works you can get a lot more out of it, from how it should be configured to what you should monitor to using it directly in your applications. So, how does it work?

MongoDB’s replication is actually very simple: the master keeps a collection that describes writes and the slaves query that collection. This collection is called the oplog (short for “operation log”).

The oplog

Each write (insert, update, or delete) creates a document in the oplog collection, so long as replication is enabled (MongoDB won’t bother keeping an oplog if replication isn’t on). So, to see the oplog in action, start by running the database with the –replSet option:

$ ./mongod –replSet funWithOplogs

Now, when you do operations, you’ll be able to see them in the oplog. Let’s start out by initializing out replica set:

> rs.initiate()

Now if we query the oplog you’ll see this operation:

> use local

switched to db local

> db.oplog.rs.find()

{

“ts” : { “t” : 1286821527000, “i” : 1 },

“h” : NumberLong(0),

“op” : “n”,

“ns” : “”,

“o” : { “msg” : “initiating set” }

}

This is just an informational message for the slave, it isn’t a “real” operation. Breaking this down, it contains the following fields:

  • ts: the time this operation occurred.
  • h: a unique ID for this operation. Each operation will have a different value in this field.
  • op: the write operation that should be applied to the slave. n indicates a no-op, this is just an informational message.
  • ns: the database and collection affected by this operation. Since this is a no-op, this field is left blank.
  • o: the actual document representing the op. Since this is a no-op, this field is pretty useless.

To see some real oplog messages, we’ll need to do some writes. Let’s do a few simple ones in the shell:

> use test

switched to db test

> db.foo.insert({x:1})

> db.foo.update({x:1}, {$set : {y:1}})

> db.foo.update({x:2}, {$set : {y:1}}, true)

> db.foo.remove({x:1})

Now look at the oplog:

> use local

switched to db local

> db.oplog.rs.find()

{ “ts” : { “t” : 1286821527000, “i” : 1 }, “h” : NumberLong(0), “op” : “n”, “ns” : “”, “o” : { “msg” : “initiating set” } }

{ “ts” : { “t” : 1286821977000, “i” : 1 }, “h” : NumberLong(“1722870850266333201”), “op” : “i”, “ns” : “test.foo”, “o” : { “_id” : ObjectId(“4cb35859007cc1f4f9f7f85d”), “x” : 1 }}

{ “ts” : { “t” : 1286821984000, “i” : 1 }, “h” : NumberLong(“1633487572904743924”), “op” : “u”, “ns” : “test.foo”, “o2” : { “_id” : ObjectId(“4cb35859007cc1f4f9f7f85d”) }, “o” : {“$set” : { “y” : 1 } } }

{ “ts” : { “t” : 1286821993000, “i” : 1 }, “h” : NumberLong(“5491114356580488109”), “op” : “i”, “ns” : “test.foo”, “o” : { “_id” : ObjectId(“4cb3586928ce78a2245fbd57”), “x” : 2,”y” : 1 } }

{ “ts” : { “t” : 1286821996000, “i” : 1 }, “h” : NumberLong(“243223472855067144”), “op” : “d”, “ns” : “test.foo”, “b” : true, “o” : { “_id” : ObjectId(“4cb35859007cc1f4f9f7f85d”)} }

You can see that each operation has a ns now: “test.foo”. There are also three operations represented (the op field), corresponding to the three types of writes mentioned earlier: i for inserts, u for updates, and d for deletes.

The o field now contains the document to insert or the criteria to update and remove. Notice that, for the update, there are two o fields (o and o2). o2 give the update criteria and o gives the modifications (equivalent to update()‘s second argument).

Using this information

MongoDB doesn’t yet have triggers, but applications could hook into this collection if they’re interested in doing something every time a document is deleted (or updated, or inserted, etc.) Part three of this series will elaborate on this idea.
Our application could do billions of writes and the oplog has to record them all, but we don’t want our entire disk consumed by the oplog. To prevent this, MongoDB makes the oplog a fixed-size, or capped, collection (the oplog is actually the reason capped collections were invented).

When you start up the database for the first time, you’ll see a line that looks like:

Mon Oct 11 14:25:21 [initandlisten] creating replication oplog of size: 47MB... (use --oplogSize to change)

Your oplog is automatically allocated to be a fraction of your disk space. As the message suggests, you may want to customize it as you get to know your application.

Tip :  you should make sure you start up arbiter processes with --oplogSize 1, so that the arbiter doesn’t preallocate an oplog. There’s no harm in it doing so, but it’s a waste of space as the arbiter will never use it.

Implications of using a capped collection

The oplog is a fixed size so it will eventually fill up. At this point, it’ll start overwriting the oldest entries, like a circular queue.

It’s usually fine to overwrite the oldest operations because the slaves have already copied and applied them. Once everyone has an operation there’s no need to keep it around. However, sometimes a slave will fall very far behind and “fall off” the end of the oplog: the latest operation it knows about is before the earliest operation in the master’s oplog.

oplog time ->
<..............................>
   ^         ^    ^        ^
   |         |    |        |
      slave         master

If this occurs, the slave will start giving error messages about needing to be resynced. It can’t catch up to the master from the oplog anymore: it might miss operations between the last oplog entry it has and the master’s oldest oplog entry. It needs a full resync at this point.

Resyncing

On a resync or an initial sync, the slave will make a note of the master’s current oplog time and call thecopyDatabase command on all of the master’s databases. Once all of the master’s databases have been copied over, the slave makes a note of the time. Then it applies all of the oplog operations from the time the copy started up until the end of the copy.

Once it has completed the copy and run through the operations that happened during the copy, it is considered resynced. It can now begin replicating normally again. If so many writes occur during the resync that the slave’s oplog cannot hold them all, you’ll end up in the “need to resync” state again. If this occurs, you need to allocate a larger oplog and try again (or try it at a time when the system is has less traffic).

DIY triggers

MongoDB has a type of query that behaves like the tail -f command: it shows you new data as it’s written to a collection. This is great for the oplog, where you want to see new records as they pop up and don’t want to query over and over.

If you want this type of ongoing query, MongoDB returns a tailable cursor. When this cursor gets to the end of the result set it will hang around and wait for more elements to be added to the collection. As they’re added, the cursor will return them. If no elements are added for a while, the cursor will time out and the client has to requery if they want more results.

Using your knowledge of the oplog’s format, you can use a tailable cursor to do a long pull for activities in a certain collection, of a certain type, at a certain time… almost any criteria you can imagine.

Using the oplog for crash recovery

Suppose your database goes down, but you have a fairly recent backup. You could put a backup into production, but it’ll be a bit behind. You can bring it up-to-date using your oplog entries.

If you use the trigger mechanism (described above) to capture the entire oplog and send it to a non-capped collection on another server, you can then use an oplog replayer to play the oplog over your dump, bringing it as up-to-date as possible.

Pick a time pre-dump and start replaying the oplog from there. It’s okay if you’re not sure exactly when the dump was taken because the oplog is idempotent: you can apply it to your data as many times as you want and your data will end up the same.

Also, warning: I haven’t tried out the oplog replayer I linked to, it’s just the first one I found. There are a few different ones out there and they’re pretty easy to write.

Creating non-replicated collections

The local database contains data that is local to a given server: it won’t be replicated anywhere. This is one reason why it holds all of the replication info.

local isn’t reserved for replication stuff: you can put your own data there, too. If you do a write in the local database and then check the oplog, you’ll notice that there’s no record of the write. The oplog doesn’t track changes to the local database, since they won’t be replicated.

And now I’m all replicated out. If you’re interested in learning more about replication, check out the core documentation on it. There’s also core documentation on tailable cursors and language-specific instructions in the driver documentation.

  • Ask Question