[MongoDB]: Different Ways to Optimize Your Cluster with Tag-Aware Sharding
MongoDB supports partitioning data using either a user-defined shard key index (composed of one or more fields present in all documents) or a hashed index on a single field (which must also be present in all documents). A lesser-known feature is tag-aware sharding, which allows you to create (and adjust) policies that affect where documents are stored, based on user-defined tags and tag ranges.
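Concretely, these policies are built from two shell helpers: one that tags a shard, and one that binds a shard key range to a tag. A minimal sketch, with placeholder shard, namespace, and field names:

```javascript
// Tag a shard, then bind a range of the shard key to that tag.
// The balancer keeps chunks in that range on shards carrying the tag.
sh.addShardTag("shard0000", "tagA")
sh.addTagRange("mydb.mycollection", { myShardKey: MinKey }, { myShardKey: 0 }, "tagA")
```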
Tag-aware sharding is a very versatile feature: it can be used to maximize the performance of your cluster, optimize storage, and implement data center awareness in your application. In this post, we’ll go through a few use cases for tag-aware sharding.
Optimizing Physical Resources
In some cases, an application needs to store data that is not frequently accessed. Financial institutions, for example, are often required to retain transaction details or claims for a number of years to comply with government regulation. Keeping several years’ worth of data on high-performance hardware can be very expensive, so a common approach is to use periodic batch jobs to move older data to more cost-effective archive storage. That is an operational headache for most applications: there are a number of failure points when moving data from one machine to another, and a lot of tedious application logic is needed to execute these operations. With MongoDB’s tag-aware sharding, you can instead assign tags to different storage tiers and associate a shard key range with each tag; documents then move from one shard to the next during normal balancing operations. To keep the right documents moving to the right tier over time, you could write a script that updates the tag ranges once a month. This approach keeps all the data in one database, so your application code doesn’t need to change as data moves between storage tiers. You can read an in-depth overview of this implementation in our blog post on tiered storage.
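A minimal sketch of the setup in the mongo shell, assuming hypothetical shard names, a finance.transactions collection, and a createdAt field used as the shard key:

```javascript
// Tag the shards by storage tier (shard names are placeholders)
sh.addShardTag("shard0000", "recent")    // fast, SSD-backed hardware
sh.addShardTag("shard0001", "archive")   // cheaper, high-capacity hardware

// Shard the collection on a date-based key so key ranges map to document age.
// (A real deployment would usually compound this with another field.)
sh.enableSharding("finance")
sh.shardCollection("finance.transactions", { createdAt: 1 })

// Everything older than the cutoff belongs on the archive tier
var cutoff = ISODate("2014-01-01")
sh.addTagRange("finance.transactions", { createdAt: MinKey }, { createdAt: cutoff }, "archive")
sh.addTagRange("finance.transactions", { createdAt: cutoff }, { createdAt: MaxKey }, "recent")

// A monthly job would drop these ranges and re-create them with a new cutoff;
// the balancer then migrates the affected chunks during normal operation.
```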
Location-Based Separation of Data
If you have a large, global user base, it is often preferable to store specific user information in localized data centers. In this case, you can use shard tags to keep documents in specific data centers. There is a great tutorial on this in the [MongoDB Manual](http://docs.mongodb.org/manual/data-center-awareness/).
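A minimal sketch of the setup, assuming hypothetical shard names, an app.users collection, and a { region: 1, customerId: 1 } shard key:

```javascript
// Tag each shard with the data center (or region) that hosts it
sh.addShardTag("shard0000", "EU")
sh.addShardTag("shard0001", "US")

// Use region as the prefix of the shard key
sh.enableSharding("app")
sh.shardCollection("app.users", { region: 1, customerId: 1 })

// Pin each region's documents to shards in the matching data center
sh.addTagRange("app.users", { region: "EU", customerId: MinKey },
                            { region: "EU", customerId: MaxKey }, "EU")
sh.addTagRange("app.users", { region: "US", customerId: MinKey },
                            { region: "US", customerId: MaxKey }, "US")
```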
While this is a classic use case for tag-aware sharding, there are a few things to look out for when implementing this approach yourself. First, remember that shard keys are immutable: you cannot change the value of a document’s shard key in place. If a user changes their location and that location is part of your shard key, you will need to delete and re-insert the document to update it. See more about shard keys in our blog post here.
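Continuing the hypothetical app.users example above, the move would look roughly like this; note that the two steps are not atomic, so your application has to handle a failure in between:

```javascript
// A user moves from the EU to the US; region is part of the shard key,
// so the document has to be removed and re-inserted with the new value.
var doc = db.users.findOne({ region: "EU", customerId: 12345 })
db.users.remove({ region: "EU", customerId: 12345 })
doc.region = "US"          // new shard key value
db.users.insert(doc)
```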
Second, you should include the shard key (or at least its prefix) with every operation to avoid scattering the query to all shards. This may seem counter-intuitive if your shard key is “region, customerId”, but it is necessary to ensure that only the shards holding a particular value of the “region” field are sent the query.
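With the same hypothetical shard key, the difference looks like this:

```javascript
// Targeted: mongos routes this only to the shards tagged for the EU ranges
db.users.find({ region: "EU", customerId: 12345 })

// Scatter-gather: without the region prefix, every shard receives the query
db.users.find({ customerId: 12345 })
```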
If you’re interested in learning more about this example, take a look at our new white paper on multi-data center deployments, which discusses sharding along with other configurations for distributing databases across regions.
Balancing Unsharded Collections Across a Sharded Cluster
Sharding can help with many dimensions of scalability, including read and write scalability. With tag-aware sharding, you can take this one step further by balancing collections across shards to distribute read and write load. This is useful when you have a large number of medium-sized collections on a single shard, none of them big enough to warrant sharding on its own, that together make your workload unbalanced. Instead of splitting all of these medium-sized collections across all shards, you can distribute them “round-robin” style, with one or several collections living on each shard. Some examples of how to implement that are here.
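One way to approximate this, sketched below with hypothetical shard and collection names, is to shard each collection on a trivial key and then tag its entire key range to a single shard. The collection is never actually split across shards, but the balancer keeps it on the shard you choose:

```javascript
// One tag per destination shard
sh.addShardTag("shard0000", "collectionsA")
sh.addShardTag("shard0001", "collectionsB")

sh.enableSharding("platform")

// Each collection is nominally sharded, but its full range is tagged to one shard,
// so all of its chunks stay together on that shard.
sh.shardCollection("platform.tenant_a_events", { _id: 1 })
sh.addTagRange("platform.tenant_a_events", { _id: MinKey }, { _id: MaxKey }, "collectionsA")

sh.shardCollection("platform.tenant_b_events", { _id: 1 })
sh.addTagRange("platform.tenant_b_events", { _id: MinKey }, { _id: MaxKey }, "collectionsB")
```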
In the wild, we have seen this implemented for a “platform as a service” (i.e. an application development platform). If you have many tenants (or individual apps) and none of them is large enough to need sharding, there may still be too many for them all to live on a single replica set. Having each tenant’s application on an individual shard helps distribute the load by placing individual collections on specific shards or a subset of shards. You would not have to partition any of the collections (which you would do if you sharded them), but they would live on individual shards, enabling you to balance your read and write load.
Targeting Data to Individual Servers
Balancing collections can also be used to optimize physical resources. In some cases, a collection has heavy indexes that take up a lot of memory. Rather than spreading that collection across machines, you can target it to a shard with more RAM, separating it from other collections to reduce contention for limited physical resources.
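The same tag-range pattern applies; a brief sketch, assuming a hypothetical shard0002 provisioned with extra RAM:

```javascript
// Keep the index-heavy collection on the memory-rich shard
sh.addShardTag("shard0002", "himem")
sh.shardCollection("app.search_index", { _id: 1 })
sh.addTagRange("app.search_index", { _id: MinKey }, { _id: MaxKey }, "himem")
```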
There are a number of ways to incorporate tag-aware sharding into your infrastructure to increase read and write efficiency, make better use of your hardware, and improve the overall performance of your application. If you want to learn more:
- See how Bugsnag implemented tag-aware sharding to optimize their read and write load on specific collections
- Read more about the Tiered Storage model
- Learn more about multi-data center deployments in our free whitepaper