While flexible schema is how most people become familiar with MongoDB, it’s also one of the best databases (maybe even the best when it comes to everyday applications) for handling very, very large data sets. While the justification of this argument calls for a whole article in itself (I hope I can find time for it someday!), the general idea is that SQL-based solutions don’t support sharding, and building it on your sucks hard. In this article, we’ll look at one of the essential techniques for horizontal database scaling: sharding, for MongoDB, and recommend some best practices for the same. However, I feel it’s better to begin with the basics of sharding, because many people who are looking to scale MongoDB may not be very familiar with it. If you are aware of sharding, however, feel free to skim through the next section.
Sharding Basics
You may have noticed the use of the word “horizontal” in the last paragraph from the previous section. Without launching into another massive detour, I want to bring this point up quickly. Scaling is considering to be of two types: you either get a more powerful machine with higher storage capacity (vertical), or you connect several smaller computers and form a collection (horizontal). Now, given that even the best servers at present do not have more than 256 GB of RAM or 16 TB of hard disk, you hit a brick wall soon when trying to scale vertically (or “scale up,” as the terminology goes). However, you can connect as many single machines together (at least theoretically) and bypass this limitation easily. Of course, the challenge now is to coordinate among all these machines.
Database Sharding
The term “sharding” generally applies to databases, the idea being that a single machine can never be enough to hold all the data. When sharding, the database is “broken up” into separate chunks that reside on different machines. A simple example might be: suppose a business has machines that can store up to 2 million customer data items. Now, the business is reaching that break-point and will likely surpass 2.5 million users soon. So, they decide to break their database up into two:
And magically, the system capacity is now doubled! Well, if only life was that simple! 🙂
Challenges in database sharding
As soon as you were thinking a little deeper about sharding, some nefarious challenges rear their ugly head. No primary keys As soon as you step out of a single database, primary keys lose their meaning. As an example, if your primary keys are set to auto-increment and you move half of the data to another database, you’ll now have two different data items for each primary key. No foreign keys Since there’s no support in databases to point to entities outside the current database (well, even a different database on the same machine is not supported, so forget about a database on a different machine), the concept of foreign keys goes for a toss as well. Suddenly, the database becomes “dumb,” and data integrity is your problem. Weird data errors If a single machine goes out, the end-user can be shown an “Oops, something broke!” page, which will no doubt annoy, but life will be on track after some time. Now consider what happens in a sharded database. Suppose the sharded database in our earlier example is a banking database, and one customer is sending money to another. Let’s also suppose the first customer data lives in the first shard, while the second customer’s data lives in the second shard (you see where I’m going with this?!). If the machine containing the second shard fails, can you imagine what state the system will be in? Where will be the transaction money go? What will the first user see? What will the second user see? What will they both see when the shards are back online? Transaction management Let’s also consider the ever-critical case of transaction management. This time, suppose that the system is working 100% fine. Now, two people (A and B) make a payment to a third one (C). It’s very likely that both the transactions will read the account balance of C simultaneously, causing this confusion:
C’s account balance = $100. A’s transaction reads C’s balance: $100. B’s transaction reads C’s balance: $100. A’s transaction adds $50 and updates balance: $100 + 50 = $150. B’s transaction adds $50 and updates balance: $100 + 50 = $150.
Damn! $50 just disappeared into thin air! Traditional SQL systems save you from this by providing built-in transaction management, but as soon as you step out of a single machine, you’re toast. Point being, with such systems, it’s easy to run into data corruption issues from which it’s impossible to recover. Pulling your hair won’t help, either! 🙂
MongoDB Sharding
For software architects, the excitement about MongoDB wasn’t so much in its flexible schema, as in its built-in sharding support. With just a few simple rules and machines connected, you were ready to run a sharded MongoDB cluster in no time. The image below shows how this looks in a typical web app deployment. The best part about MongoDB sharding is that even balancing of shards is automatic. That is, if you have five shards and two of them are near-empty, you can tell MongoDB to rebalance things in a way that all shards are equally full. As a developer or administrator, you don’t need to worry much, as MongoDB behind the scenes does most of the heavy lifting. The same goes for the partial failure of nodes; if you have replica sets correctly configured and running on your cluster, partial outages won’t affect system uptime. The entire explanation would get rather brief, so I’ll close this section by saying that MongoDB has several built-in tools for sharding, replication, and recovery, making it very easy for developers to build large-scale applications. If you want a more comprehensive guide to MongoDB’s sharding capabilities, the official docs are the place to be. Check out this practical guide to implement Sharding. You may also be interested in this complete developer’s guide.
MongoDB Sharding Best Practices
While MongoDB “just works” out of the box for sharding, it doesn’t mean we can rest on our laurels. Sharding can make or break your project forever, depending on how well or poorly it was done. Moreover, there are many small details to account for, failing which, it’s not uncommon to see projects collapse. The intent is not to scare you, but to highlight the need for planning and to be extremely careful even with small decisions. The Sharding Key inevitably controls sharding in MongoDB, so it’s ideal that we begin our survey with that.
High cardinality
Cardinality means the amount of variation. For instance, a collection of a favorite country of 1 million people will have low variations (there are only so many countries in the world!), whereas a collection of their email addresses will have (perfectly) high cardinality. Why does it matter? Suppose you pick a naive scheme that shards data based on a user’s first name.
Here we have a rather simple arrangement; the incoming document is scanned for username, and depending on where the first letter lies in the English alphabet, it lands into one of the three shards. Similarly, searching for a document is easy: the details for “Peter,” for example, will be in the second shard for sure. It all sounds good, but the point is, we don’t control the names of the incoming document users. What if we only get names in the B to F range most of the time? If so, we’ll have what’s called a “jumbo” chunk in shard1: most of the system data will be crowded there, effectively turning the setup into a single database system. The Cure? Choose a key with high cardinality — for instance, the email address of the users, or you can even go for a compound shard key, which is a combination of multiple fields.
Monotonically Changing
A common mistake in MongoDB sharding is to use monotonically increasing (or auto-increasing, if you will) keys as the shard key. Generally, the primary key of the document is used. The idea here is well-meaning, namely, as new documents keep being created, they will fall evenly into one of the shards available. Unfortunately, such a configuration is a classic mistake. This is so because if the shard key is always increasing, after a point, data will start accumulating in the high-value side of the shards, causing an imbalance in the system. As you can see in the image, once we’re past the 20 range, all documents start getting collecting in Chunk C, causing a monolith there. The solution is to go for a hashed sharding key scheme, which creates a sharding key by hashing one of the provided fields and using that to determine the chunk. A hashed shard key looks like this: . . . and can be created in the Mongo client shell by using:
Shard Early
One of the most useful advice direct from the trenches is to shard early, even if you end up with a small, two-chunk cluster. Once data has crossed 500 GB or something, sharding becomes a messy process in MongoDB, and you should be ready for nasty surprises. Besides, the rebalancing process consumes very high amounts of network bandwidth, which can choke the system if you’re not careful. Not everyone is pro-sharding, though. As an interesting example (the learning is really in the comments), see this nice Percona blog.
Running the balancer
Another good idea is to monitor your traffic patterns and run the shard balancer only at low-traffic times. As I already mentioned, rebalancing itself takes considerable bandwidth, which could quickly bring the whole system to a crawl. Remember, imbalanced shards are not a cause for immediate panic. Just let the normal usage persist, wait for low-traffic opportunities, and let the balancer do the rest! Here’s how you might accomplish this (assuming you have low traffic from 3 am to 5 am):
Conclusion
Sharding and scaling any database is a tricky undertaking, but thankfully MongoDB makes it more manageable than other popular databases out there. There was indeed a time when MongoDB was not the right choice for any project (thanks to its several critical issues and default behaviors), but those are long gone. Along with sharding, rebalancing, auto-compression, aggregate-level distributed lock, and many such features, MongoDB has come miles ahead is today the software architect’s first choice. I hope this article was able to shed some light on what sharding is in MongoDB, and what the developer must take care of when going for scale. Next, get familiar with popular MongoDB commands.