Introduction
Enterprises often start with a single server setup. As the enterprise grows, so do the data and the number of requests for that data. A single server setup makes it difficult to manage the increasing load, which creates the need to scale out.
One solution would be to replicate to read servers: the writes go to the master and the reads go to the slaves.
Although this setup handles the increased read load, it still cannot handle the increasing write load. Trying to handle this by adding another master just compounds the problem: the write load must be repeated on every master (more work for the application, and each master is just as busy as if there were just one).
Thus we need to tackle the problem in a different way. Instead of focusing on our setup, we shift focus to our data. By splitting the data we can correspondingly distribute the load on the data.
Fabric sharding allows the rows of a MySQL table to be distributed over multiple database servers or physical locations.
MySQL Fabric Sharding - Basic Architecture
The basic architecture consists of:
- Fabric Aware Connector
- Fabric Server
- Server Groups, which store the shard data
MySQL Fabric Aware Connector
The connector contacts the Fabric server to find the location of a particular shard key. The connector also implements caching to avoid repeated round trips to the Fabric server to fetch the sharding information. By building this functionality into the connector, the architecture avoids the need for an extra hop to a proxy/routing node and so minimizes latency. In the future we may add the option of a standalone routing node to enable Fabric-unaware connectors to work with MySQL Fabric.
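As a concrete sketch, the snippet below uses the Fabric support in MySQL Connector/Python to open a Fabric-aware connection and route a write by shard key. The host, port, credentials, and the employees.Employee table are illustrative assumptions, not details of any real deployment:

    import mysql.connector
    from mysql.connector import fabric

    # Connect through the Fabric-aware connector. The connector asks the
    # Fabric server where each shard lives and caches the answer locally,
    # so repeated statements for known shard keys avoid extra round trips.
    conn = mysql.connector.connect(
        fabric={"host": "fabric.example.com", "port": 32274,
                "username": "admin", "password": "adminpass"},
        user="appuser", password="apppass", autocommit=True
    )

    # Name the sharded table and supply the shard key; the connector
    # resolves the key to the owning server group and routes the write there.
    conn.set_property(tables=["employees.Employee"], key=1234,
                      mode=fabric.MODE_READWRITE)

    cur = conn.cursor()
    cur.execute(
        "INSERT INTO employees.Employee (EmpID, EmpName, EmpAddress, EmpPhone) "
        "VALUES (1234, 'Jane Doe', '1 Main St', '555-0100')")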
MySQL Fabric Server
The Fabric server:
- Stores the sharding information
  - Range Based Sharding
  - Hash Based Sharding
- Handles Re-Sharding
  - Shard Moving
  - Shard Splitting
Server Groups - Storing the shard data
Each server group represents an HA configuration of MySQL Servers, with one of the servers acting as master and the others acting as slaves.
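From the application side, the access mode decides which server in the owning group handles a statement. The sketch below, again using MySQL Connector/Python's Fabric support with illustrative connection details and table names, shows a read that may be served by a slave in the shard's group:

    import mysql.connector
    from mysql.connector import fabric

    conn = mysql.connector.connect(
        fabric={"host": "fabric.example.com", "port": 32274,
                "username": "admin", "password": "adminpass"},
        user="appuser", password="apppass", autocommit=True
    )

    # MODE_READONLY lets the connector serve the read from a slave in the
    # server group that owns this shard key; MODE_READWRITE would force
    # the statement to that group's master instead.
    conn.set_property(tables=["employees.Employee"], key=1234,
                      mode=fabric.MODE_READONLY)
    cur = conn.cursor()
    cur.execute("SELECT EmpName, EmpPhone FROM employees.Employee WHERE EmpID = 1234")
    print(cur.fetchone())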
Sharding Schemes
The data can be partitioned using either a RANGE based or a HASH based sharding scheme. In the current version, the application must provide the sharding key, but this is something that we aim to make more transparent in the future.
Range Based Sharding
The Range based sharding scheme partitions a given table based on user defined ranges of the values of a column chosen as the sharding key.
For example, take the following Employee table schema:

EmpID | EmpName | EmpAddress | EmpPhone
The table can be partitioned on EmpID. Let us assume that EmpID ranges between 0 – 10000. We can create five partitions with lower_bounds = 0, 201, 401, 601, 801. The partition with lower_bound = 0 stores all the rows with EmpID between 0 – 200, and similarly the partition with lower_bound = 201 stores all the rows with EmpID between 201 – 400. The last partition, with lower_bound = 801, stores all the remaining rows (EmpID 801 – 10000).
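To make the range lookup concrete, here is a small Python sketch of the mapping (purely an illustration, not Fabric code; the group names are hypothetical) that resolves an EmpID to one of the five partitions above:

    import bisect

    # Lower bounds from the example above; each shard lives in a
    # (hypothetical) server group.
    LOWER_BOUNDS = [0, 201, 401, 601, 801]
    SHARD_GROUPS = ["GROUPID-1", "GROUPID-2", "GROUPID-3", "GROUPID-4", "GROUPID-5"]

    def shard_for(emp_id):
        """Return the server group whose range contains emp_id (0 <= emp_id <= 10000)."""
        # The row belongs to the last shard whose lower bound it reaches.
        index = bisect.bisect_right(LOWER_BOUNDS, emp_id) - 1
        return SHARD_GROUPS[index]

    print(shard_for(150))   # GROUPID-1 (range 0 - 200)
    print(shard_for(201))   # GROUPID-2 (range 201 - 400)
    print(shard_for(950))   # GROUPID-5 (range 801 - 10000)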
Hash Based Sharding
The Hash based sharding scheme uses the md5 cryptographic hash and builds on the underlying RANGE based sharding scheme. The lower_bounds for the underlying RANGE based sharding scheme are generated by applying the cryptographic hash to the group ID of the particular shard.
To decide whether a given row of a table belongs in a particular shard, we apply the md5 algorithm to the value of the column chosen as the shard key and compare the result with the lower_bound of the shard definition.
Hence, in the case of the Employee table schema, assume that we have a Fabric group with ID GROUPID-1. We set lower_bound = md5("GROUPID-1"). Now when we want to insert a row into this shard, we check that md5(EmpID for the row) >= md5("GROUPID-1") (and below the next shard's lower_bound, if one exists) before inserting.
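The following Python sketch illustrates that lookup. It is not the actual Fabric implementation: the group IDs are hypothetical, and details such as where a key that hashes below every lower_bound lands are assumptions of the sketch:

    import hashlib

    def md5_hex(value):
        """Uppercase hex MD5 digest, used for lower bounds and shard keys alike."""
        return hashlib.md5(str(value).encode("utf-8")).hexdigest().upper()

    # Hypothetical server groups; each shard's lower_bound is the MD5 hash
    # of its group ID, which turns HASH sharding into RANGE sharding over
    # the hash space.
    groups = ["GROUPID-1", "GROUPID-2", "GROUPID-3"]
    shards = sorted((md5_hex(g), g) for g in groups)   # (lower_bound, group ID)

    def shard_for(emp_id):
        """Pick the shard with the largest lower_bound <= md5(emp_id)."""
        key_hash = md5_hex(emp_id)
        # Assumption of this sketch: keys hashing below every lower_bound
        # wrap around to the shard with the highest lower_bound.
        chosen = shards[-1][1]
        for lower_bound, group in shards:
            if lower_bound <= key_hash:
                chosen = group
        return chosen

    print(shard_for(42))
    print(shard_for(4711))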
Handling Global Operations
In order to handle operations that need to be executed on all the shards, e.g. schema updates or updates on global tables, Fabric introduces the concept of a global group.
The Global Group stores the schema (IMPORTANT: not the data) of the tables that are sharded. It also stores, in full (IMPORTANT: including the data), the tables that need to be present on all the shards (e.g. a Pincode table that might need to participate in joins across all the shards).
When we need to fire a global operation, for example a schema update on a sharded table, we run this on the global group. The update automatically gets propagated across all the shards.
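A sketch of what this can look like from the application side, assuming MySQL Connector/Python's Fabric support and its global scope property (the connection details and table are illustrative): the DDL is directed at the global group, and Fabric then propagates it to every shard.

    import mysql.connector
    from mysql.connector import fabric

    conn = mysql.connector.connect(
        fabric={"host": "fabric.example.com", "port": 32274,
                "username": "admin", "password": "adminpass"},
        user="appuser", password="apppass", autocommit=True
    )

    # Address the global group instead of an individual shard; the schema
    # update runs there and is propagated to all the shards.
    conn.set_property(tables=["employees.Employee"],
                      scope=fabric.SCOPE_GLOBAL,
                      mode=fabric.MODE_READWRITE)

    cur = conn.cursor()
    cur.execute("ALTER TABLE employees.Employee ADD COLUMN EmpEmail VARCHAR(128)")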
Moving the Shard
When a given shard is overloaded, we may decide to move the shard into a new Fabric group (for example, the new group could be composed of servers running on more powerful hardware). While the shard move takes place we would like to ensure minimal impact on ongoing transactions.
Fabric allows moving a shard registered with the Fabric system into another Fabric group seamlessly.
Splitting the Shard
When the data in a shard grows larger than the size optimal for good performance, we can split the data in this shard across two server groups. While the shard split takes place we would like to ensure minimal impact on ongoing transactions. Fabric allows a shard registered with the Fabric system to be split into another Fabric group seamlessly.
Find a use case for starting with a single server (single shard setup) and scaling gradually as the application and its data grow here: http://vnwrites.blogspot.in/2013/09/mysqlfabric-sharding-example.html