This post covers a brief introduction into database sharding including a bit of theory and a sample project. As a database server I selected RavenDB, primarily because it is an impressive project by itself and I just wanted to play about with it. The concept however can be applied to any other server either relational or non-relational.
The problem of high load
It is common that some projects generate much of attention and as result may receive high load: large number of users, huge network traffic, billions of bank transactions etc. Each software operation has its cost obviously, but the hardware resources have very defined limitations. Therefore there will be a point where hardware can no longer cope with numerous requests.
The easiest approach to deal with high load is to do vertical scaling, i.e. add more hardware: memory, processing power, disk space. This may make things run faster for a while. Unfortunately, if the load continue to increase, vertical scaling is not the cure and more scalable solution is needed.
Vertical scaling assumes that there is one very powerful server, the beast. In contrast, horizontal scaling promotes usage of many inexpensive machines. The load distribution is the key concept of horizontal scaling, and database sharding is one of the techniques of horizontal load balancing.
Sharding can be defined as a procedure of breaking large databases into smaller independent ones. The idea is to denormalize data and split a highly loaded database into several independent chunks. Denormalization allows us to make chunks of a database independent, thus a single query can hit only one database shard. From the other side, denormalization introduces data duplication, as same information may need to be inserted into several shards to make data consistent. You see there is always a balance between consistency and scalability.
For more information on database sharding concepts see an article on code futures.
To demonstrate sharding technique with NoSQL database, I created a very simplified model of Twitter. In my model there are users and tweets. Each user has a dynamic array of tweets, so he/she can add or remove text messages. Assume that in order to provide consistently fast user experience one database can handle 10 users at most. This means I should store 10 users per database. All my users have an id which is a global counter, based on this counter I can store users with id 1 - 10 on a database shard 1, users 11 - 20 I store on shard 2 and so on. In this sample I will have 3 shards and provide support for 30 users.
This is a step-by-step guide to get the sample working
- Running servers. To start Raven DB servers, you can run
This shall open three console application, one for each RavenDB server
- Running client
- Either open solution in Visual Studio 2010 and run the project
- Run the binary client directly from
- Take a look at the RavenDB servers to see which request goes to which server.
Users are stored in the following shards:
Users with id 1 - 10 are stored in shard "user_1_10"
Users with id 11 - 20 are stored in shard "user_10_20"
Users with id 21 - 30 are stored in shard "user_20_30"
RavenDB API notes
To load all users from all shards, RavenDB provides same API as if there is only one database, thanks to interface-oriented design. Developer doesn't need to know the actual location of each user (be it shard 1 or shard N). Internally RavenDB understands that IDocumentStore is a ShardedDocumentStore and picks up the implementation of IShardSelectionStrategy. In this sample to resolve the location of the user, RavenDB checks user's id, which is uniquely mapped to a shard.
If I want to retrieve data for defined user, I will use exactly the same code as if there were no shards. The whole idea is to make the client side code unaware where the item is stored. Simplicity is beauty.
Sample output (trimmed)
Client first generates sample data using FizzWare.NBuilder. It then saves it into 3 RavenDB shards. Lastly, client queries the shards and displays all users and all tweets of the one randomly selected user.
Subscribes count: 1
Tweets count: 24
Subscribes count: 28
Tweets count: 25
Subscribes count: 4
Tweets count: 9
Tweets by FirstName2 LastName2
dolore magna et et dolor amet elit lorem lorem amet dolor magna consectetur et a
met consectetur tempor et
Expected output of the three shards follow below
Note of advice
- Get all records is a killer. Client side code as well as server code must be safe by default. For example RavenDB server by design returns 1024 objects at most and client side only allows 128 objects. If user needs more data, a paged version shall be used (Take and Skip). After all any search engine doesn't return you all the results for a query, so why should the user be stormed with billions of objects?
- IO operations are expensive. If possible no interactions shall be made with hard drives. For this one can
- use caching
- hold all database in-memory
- RavenDB is an actively developed project, every so often braking changes are introduced so this sample project may not work with versions later than 800.
Download this project
This project is making use of several open source projects. Corresponding acknowledgement and licences are provided with the download.
From this web site:
RavenDbSharding.zip (15.22 mb) [Downloads: 200]
Or get it from github:
If you enjoyed this post, make sure you subscribe to my RSS feed!