11

it seems that it is a common practice to divide the data of one table into many databases, many tables to improve performance, i can understand the many databases part, because more databases provides more CPUS, more memories, more IO capacity. but many tables? why not just use mysql partitions http://dev.mysql.com/doc/refman/5.1/en/partitioning.html?

update: i dont mean normalization. i mean divide a table of N records into e.g. 10 tables each of the small table have N/10 records

update2: thanks @Johan for the clarification of sharding and partition, especially point out the hot property of the data.

The small question @Johan did not answer is: for a simple example, lets say we have a user table, it has a userid column(bigint). I think it is easier to use mysql-partition to divide the table into partitions based on userid automatically, there seems no benifit to divide the table into small tables manually(based on userid), am i right?

James.Xu
  • 8,249
  • 5
  • 25
  • 36
  • Where are you seeing this done? I don't think what you're talking about (subdividing tables based on rowcount) is a common practice at all. – Scott C Wilson May 31 '11 at 13:28
  • Not off topic, it's would be better as a CW, but the why of partitioning is definitely programming related. Many people ask `how do I partition` (on-topic) and them the answer is infallibly `don't do it because you don't need to` indicating that people have forgotten to ask (themselves/others) `(why) should I partition?` For this reason it's good to have a discussion question on this subject where SO's can see the pros and cons of partitioning. – Johan May 31 '11 at 13:29
  • Yes, exactly right: manually partitioning tables is madness because you'll also have to manually put them back together with a `union` or a `join` if you want to query both datasets. If you use a partitioning function MySQL does all the work for you. This means partitioning is transparent to your application and your code does not break. win-win. – Johan Jun 01 '11 at 23:06

2 Answers2

31

I think you have a few terms mixed up here.

All your data goes into one database (aka schema). In a database you can have tables.

e.g.

table employee
   id integer
   name varchar
   address varchar
   country varchar

table office
   id integer
   employee_id integer
   address varchar

Inside tables you have fields (id, name, address) aka columns. And tables have one or more rows.
An example for table employee:

id  name        address           country
----------------------------------------------------
1   John        1 Regent Street   UK
2   James       24 Jump Street    China
3   Darth Vader 1 Death Star      Bestine, Tatooine

So much for the basics.

Why partitioning
Now suppose that we have lots and lots of people (rows) in our database.
Remember this a galactic database, so we have 100 billion records.
If we want to search trough this fast it's nice if we can do this in parallel.
So we partition the table (say by country) and then we can have x servers looking in 1 country each.
Partitioning across servers is called sharding.

Or we can partition e.g. historical data by year, so we don't have to go through all the data just to get the recent news. We only have to go through the partition for this year. This is called partitioning.

What's the big difference between sharding can just partitioning?

Sharding
In sharding you anticipate that all your data is relevant, and equally likely to be queried. (e.g. google can expect all their data to be queried; archiving part of their data is useless for them).
In this case you want lots of machines to look though your data in parallel, where each machine does part of the work.
So you give each machine a different partition (shard) of the data and give all the machines the same query. When the results come out you UNION them all together and output the result.

Basic partitioning
In basic partitioning part of your data is hot and part is not. A typical case is historical data, the new data is hot, the old data hardly gets touched.
For this use case it is pointless to put the old data in separate servers. Those machines will just wait and wait and do nothing because nobody cares about the old data except some auditors who look at it once a year.
So you partition that data by year and the server will automatically archive the old partitions so your queries will only look at one (maybe 2) years of data and be much faster.

Do I need partitioning?
You only do partitioning when you have lots and lots of data, because it complicates your setup.
Unless you have more than a million records you don't have to consider partitioning.*)
If you have more than a 100 million records, you should definitely consider it.*)

For more info see: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
and: http://blog.mayflower.de/archives/353-Is-MySQL-partitioning-useful-for-very-big-real-life-problems.html
See also wiki: http://en.wikipedia.org/wiki/Partition_%28database%29


*) These are just my personal heuristics YMMV.

Johan
  • 74,508
  • 24
  • 191
  • 319
  • 2
    thank you, you helped me better understanding the sharding and partitioning, especially make me put whether the data is **hot** into consideration. And i have read through the article: http://blog.mayflower.de/archives/353-Is-MySQL-partitioning-useful-for-very-big-real-life-problems.html, it mentions some limitations of the mysql-partition. In my opinion, for a simple example, lets say we have a user table, it is easier to use mysql-partition to divide the table into partitions based on user_id, rather than divide the table into small tables manually. -- because mysql do all the thing – James.Xu May 31 '11 at 14:19
  • 2
    I don't even care about databases and I found this interesting – Mike Robinson May 31 '11 at 14:30
0

Data is split into smaller tables to 'normalize it'. This is a very interesting concept. You may read more on it here.

http://en.wikipedia.org/wiki/User:Jaseemabid/Books/Database_normalisation

A quick example.

Assume a small phonebook app, allowing people to have multiple numbers.

One way of design would be like this

  • Name |Number
  • A | 123
  • A | 95467
  • B | 179

The problem with this is that when we have to update the name of A and if we dont update all , it will cause confusion. So we can split this into two tables like this.

  • Unique ID | name
  • 1 | A
  • 2 | B

  • Unique ID | number

  • 1 | 123
  • 1 | 95467
  • 2 | 179

This will solve the issue. constrains can be handled in an awesome manner using "foreign keys" , please read abt it to understand the whole concept properly.

Hope you get it :)

Jaseem
  • 2,236
  • 6
  • 28
  • 35
  • thank you, but you misunderstand me, i dont mean normalization. i mean divide a table of **N** records into e.g. **10** tables each of the small table have **N/10** records. – James.Xu May 31 '11 at 13:24
  • 1
    partitioning **is not** normalization, please read: http://en.wikipedia.org/wiki/Partition_%28database%29 and: http://en.wikipedia.org/wiki/Database_normalization – Johan May 31 '11 at 13:43