Software Development

MongoDB Sharding Guide

This article is part of our Academy Course titled MongoDB – A Scalable NoSQL DB.

In this course, you will get introduced to MongoDB. You will learn how to install it and how to operate it via its shell. Moreover, you will learn how to programmatically access it via Java and how to leverage Map Reduce with it. Finally, more advanced concepts like sharding and replication will be explained. Check it out here!

1. Introduction

Sharding is a technique used to split large amount of data across multiple server instances. Nowadays, data volumes are growing exponentially and a single physical server often is not able to store and manage such a mass. Sharding helps in solving such issues by dividing the whole data set into smaller parts and distributing them across large number of servers (or shards). Collectively, all shards make up a whole data set or, in terms of MongoDB , a single logical database.

2. Configuration

MongoDB supports sharding out of the box using sharded clusters configurations: config servers (three for production deployments), one or more shards (replica sets for production deployments), and one or more query routing processes.

  • shards store the data: for high availability and data consistency reason, each shard should be a replica set (we will talk more about replication in Part 5. MongoDB Replication Guide)
  • query routers (mongos processes) forward client applications and direct operations to the appropriate shard or shards (a sharded cluster can contain more than one query router to balance the load)
  • config servers store the cluster’s metadata: mapping of the cluster’s data set to the shards (the query routers use this metadata to target operations to specific shards), for high availability the production sharded cluster should have 3 config servers but for testing purposes a cluster with a single config server may be deployed

MongoDB shards (partitions) data on a per-collection basis. The sharded collections in a sharded cluster should be accessed through the query routers (mongos processes) only. Direct connection to a shard gives access to only a fraction of the cluster’s data. Additionally, every database has a so called primary shard that holds all the unsharded collections in that database.

Config servers are special mongod process instances that store the metadata for a single sharded cluster. Each cluster must have its own config servers. Config servers use a two-phase commit protocol to ensure consistency and reliability. All config servers must be available to deploy a sharded cluster or to make any changes to cluster metadata. Clusters become inoperable without the cluster metadata.

3. Sharding (Partitioning) Schemes

MongoDB distributes data across shards at the collection level by partitioning a collection’s data by the shard key (for more details, please refer to official documentation).

Shard key is either an indexed simple or compound field that exists in every document in the collection. Using either range-based partitioning or hash-based partitioning, MongoDB splits the shard key values into chunks and distributes those chunks evenly across the shards. To do that, MongoDB uses two background processes:

  • splitting: a background process that keeps chunks from growing too large. When a particular chunk grows beyond a specified chunk size (by default, 64 Mb), MongoDB splits the chunk in a half. Inserts and updates of the sharded collection may triggers splits.
  • balancer: a background process that manages chunk migrations. The balancer runs on every query router in a cluster. When the distribution of a sharded collection in a cluster is uneven, the balancer moves chunks from the shard that has the largest number of chunks to the shard with the lowest number of chunks until balance is reached.

3.1. Range-based sharding (partitioning)

In range-base sharding (partitioning), MongoDB splits the data set into ranges determined by the shard key values. For example, if shard key is numeric field, MongoDB partitions the whole field’s value range into smaller, non-overlapping intervals (chunks).

Although range-based sharding (partitioning) supports more efficient range queries, it may result into uneven data distribution.

3.2. Hash-based sharding (partitioning)

In hash-based sharding (partitioning), MongoDB computes the hash value of the shard key and then uses these hashes to split the data between shards. Hash-based partitioning ensures an even distribution of data but leads to inefficient range queries.

4. Adding / Removing Shards

MongoDB allows adding and removing shards to/from running shared cluster. When new shards are added, it breaks the balance since the new shards have no chunks. While MongoDB begins migrating data to the new shards immediately, it can take a while before the cluster restores the balance.

Accordingly, when shards are being removed from the cluster, the balancer migrates all chunks from those shards to other shards. When migration is done, the shards can be safely removed.

5. Configuring Sharded Cluster

Now that we understand the MongoDB sharding basics, we are going to deploy a small sharded cluster from scratch using simplified (test-like) configuration without replica sets and single config server instance.

The deployment of sharded cluster begins with running mongod server process in a config server mode using the --configsvr command line argument. Each config server requires its own data directory to store a complete the cluster’s metadata and that should be provided using --dbpath command line argument (which we have already seen in Part 1. MongoDB Installation – How to install MongoDB). That being said, let us perform the following steps:

  1. Create a data folder (we need to do that only once): mkdir configdb
  2. Run MongoDB server: bin/mongod --configsvr --dbpath configdb

Picture 1. Config server has been started successfully.
Picture 1. Config server has been started successfully.

By default, config servers listen on a port 27019. All config servers should be up and running before starting up a sharded cluster. In our deployment, the single config server is run on a host with name ubuntu.

Next step is to run the instances of mongos processes. Those are lightweight and require only knowing the location of config servers. For that, the command line argument --configdb is being used followed by comma-separated config server names (in our case, only single config server on host ubuntu):

bin/mongos --configdb ubuntu

Picture 2. Single mongos instance has been started successfully and connected to config server.
Picture 2. Single mongos instance has been started successfully and connected to config server.

The default port mongos process listens to is 27017, the same as a usual MongoDB server.

The final step is to add some regular MongoDB server instances (shards) into the sharded cluster. Les us start two MongoDB server instances on different ports, 27000 and 27001 respectively.

    1. Run first MongoDB server instance (shard) on port 27000
mkdir data-shard1
bin/mongod --dbpath data-shard1 --port 27000
    1. Run second MongoDB server instance (shard) on port 27001
mkdir data-shard2
bin/mongod --dbpath data-shard2 --port 27001

Once the standalone MongoDB server instances are up and running, it is a time to add them into sharded cluster with the help of MongoDB shell. Up to now, we have used MongoDB shell to connect to standalone MongoDB servers. In sharded cluster, the mongos processesare the entry points for all clients and as such we are going to connect to the one we have just run (running on default port 27017): bin/mongo --port 27017 --host ubuntu

MongoDB shell provides a useful command sh.addShard() for adding shards (please refer to Shell Sharding Command Helpers for more details). Let us issue this command against the two standalone MongoDB server instances we have run before:

sh.addShard( "ubuntu:27000" ) 
sh.addShard( "ubuntu:27001" )

Let us check the current sharded cluster configuration by issuing another very helpful MongoDB shell command sh.status().

At this point the sharded cluster infrastructure is configured and we can move on with sharding database collections.

Shard tags

MongoDB allows to associate tags to specific ranges of a shard key with a specific shard or subset of shards. A single shard may have multiple tags, and multiple shards may also have the same tag. But any given shard key range may only have one assigned tag. The overlapping of the ranges is not allowed as well as tagging the same range more than once.

For example, let us assign tags to each of the shards in our sharded cluster by using sh.addShardTag() command.

sh.addShardTag( "shard0001", "bestsellers" )
sh.addShardTag( "shard0000", "others" )

To find all shards associated with a particular tag, the command against config database should be issued. For example, to find all shards tagged as “bestsellers” the following commands should be typed in MongoDB shell:

use config
db.shards.find( { tags: "bestsellers" } )

Picture 6. Finding all shards tagged as "bestsellers".
Picture 6. Finding all shards tagged as “bestsellers”.

Another collection in config database named tags contains all tags definitions and may be queried the regular way for all available tags.

Respectively, tags could be removed from the shard using the sh.removeShardTag()command. For example:

sh.removeShardTag( "shard0000", "others" )

6. Sharding databases and collections

As we already know, MongoDB performs sharding on collection level. But before a collection could be sharded, the sharding must be enabled for the collection’s database. Enabling sharding for a database does not redistribute data but makes it possible to shard the collections in that database.

For demonstration purposes, we are going to reuse the bookstore example from Part 3. MongoDB and Java Tutorial and make the books collection shardable. Let us reconnect to mongos instance by providing additionally the database name bin/mongo --port 27017 --host ubuntu bookstore(or just issue the command use bookstorein the existing MongoDB shell session) and insert couple of documents into the books collection.

db.books.insert( { 
    "title" : "MongoDB: The Definitive Guide", 
    "published" : "2013-05-23", 
    "categories" : [ "Databases", "NoSQL", "Programming" ], 
    "publisher" : { "name" : "O'Reilly" } 
} )

db.books.insert( { 
    "title" : "MongoDB Applied Design Patterns", 
    "published" : "2013-03-19", 
	"categories" : [ "Databases", "NoSQL", "Patterns", "Programming" ], 
	"publisher" : { "name" : "O'Reilly" } 
} )

db.books.insert( { 
    "title" : "MongoDB in Action", 
	"published" : "2011-12-16", 
	"categories" : [ "Databases", "NoSQL", "Programming" ], 
	"publisher" : { "name" : "Manning" } 
} )

db.books.insert( { 
    "title" : "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence", 
	"published" : "2012-08-18", 
	"categories" : [ "Databases", "NoSQL" ], 
	"publisher" : { "name" : "Addison Wesley" } 
} )

As we mentioned before, sharding is not yet enabled nor for bookstore database nor for books collection so the whole dataset ends up on primary shard (as we mentioned in Introduction section). Let us enable sharding for bookstore database by issuing the command sh.enableSharding() in MongoDB shell.

Picture 7. Enable sharding for bookstore database.
Picture 7. Enable sharding for bookstore database.

We are getting very close to have our sharded cluster to actually do some work and to start sharding real collections. Each collection should have a shard key (please refer to Sharding (Partitioning) Schemes section) in order to be partitioned across multiple shards. Let us define a hash-based sharding key (please refer to Hash-based sharding (partitioning) section) for books collection based on document _id field.

db.books.ensureIndex( { "_id": "hashed" } )

With that, let us tell MongoDB to shard the books collection using sh.shardCollection() command (please see Sharding commands and command helpers for more details).

sh.shardCollection( "bookstore.books", { "_id": "hashed" } )

Picture 8. Enable sharding for books collection.
Picture 8. Enable sharding for books collection.

And with that, the sharded cluster is up and running! In the next section we are going to look closely on the commands available specifically to manage sharded deployments.

7. Sharding commands and command helpers

MongoDB shell provides a command helpers and sh context variable to simplify sharding management and deployment.

CommandlistShards
DescriptionOutputs a list of configured shards. It should be run in context of admin database.
ExampleIn MongoDB shell, let us issue the command:

db.adminCommand( { listShards: 1 } )

Referencehttp://docs.mongodb.org/manual/reference/command/listShards/

listShards

 

CommandshardConnPoolStats
DescriptionReports statistics on the pooled and cached connections in the sharded connection pool.
ExampleIn MongoDB shell, let us issue the command:

db.runCommand( { shardConnPoolStats: 1 } )

04.SHARDCONNPOOLSTATS

Referencehttp://docs.mongodb.org/manual/reference/command/shardConnPoolStats/

shardConnPoolStats

 

CommandconnPoolStats
DescriptionReports statistics regarding the number of open connections to the current database instance, including client connections and server-to-server connections for replication and clustering.
ExampleIn MongoDB shell, let us issue the command:

db.runCommand( { connPoolStats: 1 } )

04.CONNPOOLSTATS

Referencehttp://docs.mongodb.org/manual/reference/command/connPoolStats/

connPoolStats

 

Commandisdbgrid
DescriptionThis command verifies that a process MongoDB shell is connected to is a mongos.
ExampleIn MongoDB shell, let us issue the command:

db.runCommand( { isdbgrid: 1 } )

04.ISDBGRID

Referencehttp://docs.mongodb.org/manual/reference/command/isdbgrid/

isdbgrid

 

CommandflushRouterConfig
DescriptionClears the current sharded cluster information cached by a mongos instance and reloads all sharded cluster metadata from the config database. It should be issued against mongos process instance and run in context of admin database.
ExampleIn MongoDB shell, let us issue the command:

db.adminCommand( { flushRouterConfig: 1 } )

04.FLUSHROUTERCONFIG

Referencehttp://docs.mongodb.org/manual/reference/command/flushRouterConfig/

flushRouterConfig

 

CommandshardingState
DescriptionReports if the current instance of MongoDB server is a member of a sharded cluster. It should be run in context of admin database.
ExampleLet us connect to any of the standalone MongoDB server instances using MongoDB shell:

bin/mongo --port 27001

And issue the command:

db.adminCommand( { shardingState: 1 } )

04.SHARDINGSTATE

Referencehttp://docs.mongodb.org/manual/reference/command/shardingState/

shardingState

 

CommandmovePrimary
Parameters
{ 
    movePrimary : <database>, 
    to : <shard name> 
}
DescriptionThe command reassigns the database’s primary shard, which holds all unsharded collections in the database, to another shard in the sharded cluster. It should be issued against mongos process instance and run in context of admin database.
ExampleIn MongoDB shell, let us issue the command:

db.adminCommand
( { movePrimary : "test", to : "shard0001" } )

04.MOVEPRIMARY

Referencehttp://docs.mongodb.org/manual/reference/command/movePrimary/

movePrimary

 

Commandsh.help()
DescriptionOutputs the brief description for all sharding-related shell functions.
ExampleIn MongoDB shell, let us issue the command:

sh.help()

04.SH.HELP

Referencehttp://docs.mongodb.org/manual/reference/method/sh.help/

sh.help()

 

CommandaddShard
Parameters
{ 
    addShard: <hostname><:port>, 
    maxSize: <size>, 
    name: <shard_name> 
}
Wrappersh.addShard(<host>)
DescriptionThe command adds either a database instance or a replica set to a sharded cluster. It should be issues against mongos process instance.
ExampleSee please Configuring Sharded Cluster section.
Referencehttp://docs.mongodb.org/manual/reference/command/addShard/

http://docs.mongodb.org/manual/reference/method/sh.addShard/

addShard

 

Commanddb.printShardingStatus()

sh.status()

DescriptionThe command outputs a formatted report of the sharding configuration and the information regarding existing chunks in a sharded cluster. The default behavior suppresses the detailed chunk information if the total number of chunks is greater than or equal to 20.
ExampleSee please Configuring Sharded Cluster section.
Referencehttp://docs.mongodb.org/manual/reference/method/sh.status/

db.printShardingStatus()

sh.status()

 

CommandenableSharding
Parameters
{ 
    enableSharding: <database> 
}
Wrappersh.enableSharding(database)
DescriptionEnables sharding on the specified database. It does not automatically shard any collections but makes it possible to begin sharding collections using sh.shardCollection() command.
ExampleSee please Sharding databases and collections section.
Referencehttp://docs.mongodb.org/manual/reference/command/enableSharding/

http://docs.mongodb.org/manual/reference/method/sh.enableSharding/

enableSharding

 

CommandshardCollection
Parameters
{ 
    shardCollection: <database>.<collection>, 
    key: <shardkey> 
}
Wrappersh.shardCollection(namespace, key, unique)
DescriptionThe command enables sharding for a collection and allows MongoDB to begin distributing data among shards. Before issuing this command, the sharding should be enabled on database level using sh.enableSharding() command.
ExampleSee please Sharding databases and collections section.
Referencehttp://docs.mongodb.org/manual/reference/command/shardCollection/

http://docs.mongodb.org/manual/reference/method/sh.shardCollection/

shardCollection

 

Commandsh.getBalancerHost()
DescriptionThe command returns the name of a mongos responsible for the balancer process.
ExampleIn MongoDB shell, let us issue the command:

sh.getBalancerHost()

04.SH.GETBALANCERHOST

Referencehttp://docs.mongodb.org/manual/reference/method/sh.getBalancerHost/

sh.getBalancerHost()

 

Commandsh.getBalancerState()
DescriptionThe command returns true when the balancer is enabled and false if the balancer is disabled.
ExampleIn MongoDB shell, let us issue the command:

sh.getBalancerState()

04.SH.GETBALANCERSTATE

Referencehttp://docs.mongodb.org/manual/reference/method/sh.getBalancerState/

sh.getBalancerState()

 

Commandsh.isBalancerRunning()
DescriptionThe command returns true if the balancer process is currently running and migrating chunks and false if the balancer process is not running.
ExampleIn MongoDB shell, let us issue the command:

sh.isBalancerRunning()

04.SH.ISBALANCERRUNNING

Referencehttp://docs.mongodb.org/manual/reference/method/sh.isBalancerRunning/

sh.isBalancerRunning()

 

CommandsetBalancerState (true | false)
DescriptionThe command enables or disables the balancer in the sharded cluster.
ExampleIn MongoDB shell, let us issue the command:

sh.setBalancerState(false)

04.SH.SETBALANCERSTATE

Referencehttp://docs.mongodb.org/manual/reference/method/sh.disableBalancing/

setBalancerState (true | false)

 

Commandsh.startBalancer(timeout, interval)
DescriptionThe command enables the balancer in a sharded cluster and waits for balancing to start.
ExampleIn MongoDB shell, let us issue the command:

sh.startBalancer(5000, 100)

04.SH.STARTBALANCER

Referencehttp://docs.mongodb.org/manual/reference/method/sh.startBalancer/

sh.startBalancer(timeout, interval)

 

Commandsh.stopBalancer(timeout, interval)
DescriptionThe command disables the balancer in a sharded cluster and waits for balancing to finish.
ExampleIn MongoDB shell, let us issue the command:

sh.stopBalancer(5000, 100)

04.SH.STOPBALANCER

Referencehttp://docs.mongodb.org/manual/reference/method/sh.stopBalancer/

sh.stopBalancer(timeout, interval)

 

Commanddb.<collection>.getShardDistribution()
DescriptionOutputs the data distribution statistics for a sharded collection in the sharded cluster.
ExampleIn MongoDB shell, let us issue the commands:

use bookstore
db.books.getShardDistribution()

04.DB.COLLECTION.GETSHARDDISTRIBUTION

Referencehttp://docs.mongodb.org/manual/reference/method/db.collection.getShardDistribution/

db..getShardDistribution()

 

CommandremoveShard
Parameters
{
    removeShard: <shard name>
}
DescriptionThe command removes a shard from a sharded cluster. Upon run, MongoDB moves the shard’s chunks to other shards in the cluster and only than removes the shard. It should be run in context of admin database.
ExampleIn MongoDB shell, let us issue the command:

db.adminCommand( { removeShard: "shard0002" } )

04.REMOVESHARD

Referencehttp://docs.mongodb.org/manual/reference/command/removeShard/

removeShard

 

CommandmergeChunks
Parameters
{ 
    mergeChunks : <database>.<collection> ,
    bounds : [ 
        { <shardKeyField>: <minFieldValue> },
        { <shardKeyField>: <maxFieldValue> } 
    ]
}
DescriptionThe command combines two contiguous chunk ranges the same shard into a single chunk. At least one of chunk must not have any documents. The command should be issued against mongos instance and run in context of admin database.
Referencehttp://docs.mongodb.org/manual/reference/command/mergeChunks/

mergeChunks

 

Commandsplit
Parameters
{ 
    split: <database>.<collection>,
    <find|middle|bounds> 
}
Wrappersh.splitAt(namespace, query)

sh.splitFind(namespace, query)

DescriptionThe command manually splits a chunk in a sharded cluster into two chunks. The command should be run in context of admin database.
ExampleIn MongoDB shell, let us issue the command:

db.adminCommand( { split: “bookstore.books”, find: { “title”: “MongoDB in action” } } )

04.SPLIT

Alternatively, let us run the same command using command wrapper:

sh.splitFind( “bookstore.books”, { “title”: “MongoDB in action” } )

Referencehttp://docs.mongodb.org/manual/reference/command/split/

http://docs.mongodb.org/manual/reference/method/sh.splitAt/

http://docs.mongodb.org/manual/reference/method/sh.splitFind/

split

 

Commandsh.moveChunk(namespace, query, shard name)
DescriptionThe command moves the chunk that contains the document specified by the query to another shard with name shard name.
ExampleIn MongoDB shell, let us issue the commands:

use bookstore
sh.moveChunk(   "bookstore.books", 
{ "title": "MongoDB in action" }, "shard0001" )
db.books.getShardDistribution()

04.SH.MOVECHUNKS

Referencehttp://docs.mongodb.org/manual/reference/method/sh.moveChunk/

sh.moveChunk(namespace, query, shard name)

 

Commandsh.addShardTag(shard, tag)
DescriptionThe command associates a shard with a tag or identifier. MongoDB uses these identifiers to direct chunks that fall within a tagged range to specific shards.
ExampleSee please Shard tags section.
Referencehttp://docs.mongodb.org/manual/reference/method/sh.addShardTag/

sh.addShardTag(shard, tag)

 

Commandsh.addTagRange(namespace, minimum, maximum, tag)
DescriptionThe command attaches a range of shard key values to a shard tag created previously using the sh.addShardTag() command.
ExampleSee please Shard tags section.
Referencehttp://docs.mongodb.org/manual/reference/method/sh.addTagRange/

sh.addTagRange(namespace, minimum, maximum, tag)

 

Commandsh.removeShardTag(shard, tag)
DescriptionThe command removes the association between a tag and a shard. The command should be issued against mongos instance
ExampleSee please Shard tags section.
Referencehttp://docs.mongodb.org/manual/reference/method/sh.removeShardTag/

sh.removeShardTag(shard, tag)

 

8. What’s next

In this part we have covered basics of MongoDB sharding (partitioning) capabilities. For more complete and comprehensive details please refer to official documentation. In the next section we are going to take a look on MongoDB replication.

Andrey Redko

Andriy is a well-grounded software developer with more then 12 years of practical experience using Java/EE, C#/.NET, C++, Groovy, Ruby, functional programming (Scala), databases (MySQL, PostgreSQL, Oracle) and NoSQL solutions (MongoDB, Redis).
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button