
I have a table with very high insert, update, and read rates. On average, about 100 rows are inserted or updated per second, and about 1,000 SELECTs run per second.

The table has about 100 million tuples. It is a relationship table, so it only has about 5 fields. Three of the fields contain keys, so they are indexed. All the fields are integers.

I am thinking of sharding the data; however, that adds a lot of complexity, though it does offer speed. The other alternative is to switch to InnoDB.

The database runs on a RAID 1 of 256 GB SSDs, with 32 GB of 1600 MHz RAM and an i7-3770K overclocked to 4 GHz.

The database freezes constantly at peak times, when load can be as high as 200 rows being inserted or updated and 2,500 SELECTs per second.

Could you guys please point me toward what I should do?

Vish

1 Answer


Sharding is usually a good idea for distributing table size. Load problems should generally be addressed with a replicated data environment. In your case, your problems are a) a huge table, b) table-level locking, and c) crappy hardware.

InnoDB

If you can use one of the keys on your table as a primary key, InnoDB might be a good way to go, since it gives you row-level locking, which may keep your queries from waiting on each other. A good test would be to replicate your table to a test server, run all your queries against it, and see what the performance benefit is. InnoDB has higher resource consumption than MyISAM, so keep that in mind.

Hardware

I'm sorry bud, but your hardware is crap for the performance you need. Twitter does 34 writes per second at 2.6k QPS. You can't be doing Twitter's volume and think a beefed up gaming desktop is going to cut it. Buy a $15k Dell with some SSD drives and you'll be able to burst 100k QPS. You're in the big times now. It's time to ditch the start-up gear and get yourself a nice server. You do not want to shard. It will be cheaper to upgrade your hardware, and frankly, you need to.

Sharding

Sharding is awesome for splitting up large tables. And that's it.

Let me be clear about the bad. Developing a sharded architecture sucks. You want to do everything possible to not shard. Upgrade hardware, buy multiple servers and set up replication, optimize your code, but for the love of God, do not shard. You are way below the performance line for sharding. When you're pushing a sustained 30k+ QPS, then we can talk sharding. Until that day, NO.

You can buy a medium-range server ($30k Dell PowerEdge) with 5 TB of Fusion-io on 16 cores and 256 GB of RAM, and it'll take you all the way to 200k QPS.

But if you refuse to listen to me and are going to shard anyway, then here's what you need to do.

Rule 1: Stay on the Same Shard (i.e., Pick a Partition Rule)

Once you shard, you do not want to be accessing data across multiple shards. You need to pick a partition rule that keeps each query on a single shard as much as possible. Distributing a query (the last rule below) is incredibly painful in distributed data environments.
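A partition rule can be as simple as a modulo over one of your three key columns. A minimal Python sketch, assuming the routing key is a `user_id` integer (the column name and shard count are made-up examples, not anything from your schema):

```python
# Hypothetical partition rule: route each row by one key column so that
# all rows for a given key always land on the same shard.
N_SHARDS = 8

def shard_for(user_id: int) -> int:
    """Map a routing key to a shard index; deterministic, so the same
    key always hits the same shard."""
    return user_id % N_SHARDS
```

Because the mapping is deterministic, any query filtered on that key touches exactly one shard, which is the whole point of Rule 1.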

Rule 2: Build a Shard Map and Replicate it

Your code will need to be able to reach every shard. Create a shard map, based on your partition rule, that lets your code know where to go to get the data it needs.
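A minimal sketch of a shard map in Python, with made-up host names; in practice the map lives in shared config (or a small lookup table) replicated to every app server so they all agree on where each shard lives:

```python
# Hypothetical shard map: partition-rule output -> DSN of the server
# holding that shard. Host names here are invented for illustration.
SHARD_MAP = {
    0: "mysql://db-shard-0.internal/reltable",
    1: "mysql://db-shard-1.internal/reltable",
    2: "mysql://db-shard-2.internal/reltable",
    3: "mysql://db-shard-3.internal/reltable",
}

def dsn_for(user_id: int) -> str:
    """Combine the partition rule with the map: key -> server address."""
    return SHARD_MAP[user_id % len(SHARD_MAP)]
```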

Rule 3: Write a Query Wrapper for your Shards

You do not want to manually decide which shard to go to at every call site. Write a wrapper that does it for you. You will thank yourself down the road when you're writing code.
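A hedged Python sketch of such a wrapper; `connect` here is a stand-in for whatever driver you actually use (the names and call shape are assumptions for illustration, not a real API):

```python
# Hypothetical query wrapper: callers pass the routing key and the SQL;
# the wrapper picks the shard, so no call site ever hard-codes one.

def run_on_shard(connect, shard_map, user_id, sql, params=()):
    """Route a query to the shard that owns `user_id` and return rows."""
    dsn = shard_map[user_id % len(shard_map)]
    conn = connect(dsn)          # stand-in for your real DB driver
    try:
        return conn.execute(sql, params)
    finally:
        conn.close()
```

The payoff is that when you rebalance or add shards later, only the map changes; application code keeps calling the wrapper.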

Rule 4: Auto-balance

You'll eventually need to balance your shards to keep performance optimal. Plan for this beforehand, and write your code with the intention that you'll have some cron job which balances your shards for you.
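One way to sketch the check that cron job would run, assuming you can cheaply get a per-shard row count (the 25% tolerance is an arbitrary placeholder, not a recommendation):

```python
# Hypothetical balance check: flag shards whose row count drifts more
# than `tolerance` from the mean, so the cron job knows which shards
# need rows migrated between them.

def shards_to_rebalance(row_counts, tolerance=0.25):
    """Return shard ids whose size is more than tolerance*mean away
    from the mean shard size."""
    mean = sum(row_counts.values()) / len(row_counts)
    return sorted(
        sid for sid, n in row_counts.items()
        if abs(n - mean) > tolerance * mean
    )
```

The actual migration (moving rows and updating the shard map atomically) is the hard part; this only decides *when* to do it.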

Rule 5: Support Distributed Queries

You will inevitably need to break Rule 1. When that happens, you'll need a query wrapper that can pull data from multiple shards and aggregate it in one place. The more shards you have, the more likely this will need to be multi-threaded. In my shop, we call this a distributed query (i.e., a query that runs on multiple shards).
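A rough Python sketch of the aggregation step, assuming each shard already returns its rows sorted on the same key; `fetch` is a stand-in for "run this SQL on that shard", and in practice you'd call it from a thread pool rather than a loop:

```python
import heapq

# Hypothetical distributed query: fan the query out to every shard,
# merge the per-shard sorted results, and apply the LIMIT globally.

def distributed_query(fetch, shard_dsns, sql, key, limit):
    """Run `sql` on each shard, merge sorted results, keep `limit` rows."""
    per_shard = [fetch(dsn, sql) for dsn in shard_dsns]  # thread pool IRL
    merged = heapq.merge(*per_shard, key=key)  # k-way merge, no full re-sort
    return [row for _, row in zip(range(limit), merged)]
```

This is the easy case (sort + limit over pre-sorted shard results); GROUP BY, JOINs across shards, and failure handling are where the real pain lives.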

Bad News: There is no off-the-shelf code for doing distributed queries and aggregating the results. Apache Hadoop tries, but it's terrible. So is HiveDB. A good query distributor is hard to architect, hard to write, and hard to optimize. This is a problem billion-dollar-a-year companies deal with. I shit you not: if you come up with a good wrapper for distributing queries across shards that supports sorting + limit clauses and scales well, you could be a millionaire overnight. Selling it for $300,000? You'd have a line outside your door a mile long.

My point here is that sharding is hard and expensive. It takes a lot of work, and you want to do everything humanly possible to avoid it. If you must, follow the rules.

tazer84
  • First of all, I cannot appreciate enough the answer you have written for me. I really like it. We are on the brink of upgrading; however, I wanted to know what can be done straight away. I must admit that the hardware is shit. This is the only table on this server. Do you think that if we replicated this master to 5 slaves and distributed the read load to them, it would make much difference? Is it possible and plausible? Also, if we don't shard but upgrade the hardware and still have trouble with the data, would InnoDB scale? Could there be a point where we would even need to shard InnoDB? – Vish Dec 15 '12 at 06:13
  • @user658911 If you distribute the read load across multiple slaves, yes, you'll definitely see a performance improvement, but because of MyISAM, your inserts will still lock your reads out of the table. Then there's also the problem of rebuilding indexes every time you insert/update one row. What you really want is row-level locking (InnoDB) or some kind of batching for your inserts. – tazer84 Dec 15 '12 at 20:08
  • @user658911 (continued). The easiest thing would be to move to replication + InnoDB. If you need to stick with MyISAM, lag your slaves a few hours behind master (so heavy inserts go in at off-peak hours on the slave) or implement some kind of caching to do your inserts in bulk. So maybe have the inserts write to a non-replicated copy of your table on master, then have a cron job that every hour moves the inserts from the non-replicated table to the replicated one. The cron job can disable indexes, batch-insert into the replicated table, rebuild indexes, and still have (relatively) real-time data. – tazer84 Dec 15 '12 at 20:11
  • The only issue I am worried about with InnoDB is the number of reads I require. How slow is InnoDB compared to MyISAM when it comes to reads? Also, I have heard some good things about Cassandra. Do you think it could be a possible solution in this case? – Vish Dec 15 '12 at 23:32
  • @user658911 InnoDB is actually better than MyISAM for reads. The (extremely knowledgeable) guys at Percona ran a benchmark and found that InnoDB was the way to go for SELECTs. http://www.mysqlperformanceblog.com/2007/01/08/innodb-vs-myisam-vs-falcon-benchmarks-part-1/ – tazer84 Dec 16 '12 at 05:32