
I need to store around 7.8 billion records in a MySQL table. The table is both read and write intensive. I have to sustain an insertion rate of at least 0.02 billion records per hour, while a search on the table should not take more than 10 seconds. We have a UI from which users can search based on different column properties.

The most commonly run queries look like:

  1. select * from mytable where prop1='sip:+100008521149' and prop2='asdsa' order by event_timestamp desc limit 10;

  2. select * from mytable where prop1='sip:+100008521149' order by event_timestamp desc limit 10;

  3. select * from mytable where prop2='asdsa' order by event_timestamp desc limit 10;

Currently there are 2 indexes on the table:

1- idx_1(prop1,event_timestamp)
2- idx_2(prop2,event_timestamp)

InnoDB settings are as follows:

    innodb_buffer_pool_size = 70G
    innodb_log_file_size = 4G
    innodb_io_capacity=2000
    innodb_io_capacity_max=6000
    innodb_lru_scan_depth=2000
    innodb_flush_log_at_trx_commit=2
    innodb_log_buffer_size=16M
    innodb_thread_concurrency = 0
    innodb_read_io_threads = 64
    innodb_write_io_threads = 64
    innodb_autoinc_lock_mode = 2
    bulk_insert_buffer_size=33554432
    query_cache_type=1
    query_cache_size=64M
    innodb_flush_neighbors=0
    expire_logs_days=10
    max-connections=500
    long_query_time = 5
    read_buffer_size=16M
    sort_buffer_size=16M
    read_rnd_buffer_size=16M
    innodb_doublewrite = 0
    innodb_flush_method=O_DIRECT

Machine's RAM size is 99 GB.

Once started, the system was fast, but performance dropped a lot once the record count reached 0.22 billion. Even though we are using LOAD DATA INFILE, the insertion rate became very slow. Search was pretty fast as long as it used indexed columns. It looks like the buffer pool is not big enough.

I have a few questions:

  1. Is it possible to support this kind of data volume with this config?

  2. What should be the ideal and practical buffer pool size for 7 billion records?

  3. DATA+INDEX size is already close to 150 GB with only 0.22 billion records. It looks like I would need TBs of RAM.
  4. We are thinking of a master/slave config so that each server is dominant for either reads or writes.
  5. Is there any better way to design this solution?
  6. Adding more indexes makes the UI search faster, but adding even a single index reduces the insertion speed many fold.

UPDATE 1:

Q- The table is much bigger than RAM, correct? The buffer_pool cannot be made big enough -- it must be smaller than ram, or else performance suffers.

A- RAM size is 100 GB, the buffer pool is 70 GB. Yes, the data size is much bigger than the RAM.

Q- Please provide SHOW CREATE TABLE; there are several issues I need to investigate. (datatypes, field sizes, etc)

A- All the fields are string type; we have used varchar(127) for all of them. The PK is an auto-generated id bigint(20).

Q- How many records in the LOAD DATA INFILE? Do you LOAD directly into the table? How often is the LOAD?

A- 100,000 records per file. Multiple threads are loading data from CSV files into the DB. During the initial migration we have to load continuously until we reach 0.65 billion records. After that, the frequency will drop to roughly one load every 15 minutes.
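A minimal sketch of what each per-file load might look like (the file path, delimiters, and column list below are assumptions for illustration, not from the actual setup):

    -- hypothetical example; path, delimiters and column list are placeholders
    LOAD DATA INFILE '/data/batch_0001.csv'
    INTO TABLE mytable
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    (prop1, prop2, event_timestamp);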

Q- Master+Slave: Keep in mind that all writes are performed on the Slave, too. If you have lots of reads, then more than one Slave would spread the reads around, thereby getting some scaling.

A- We are testing currently with MASTER/SLAVE approach.

We made MASTER with MYISAM and no indexes. MASTER will be used for inserts. SLAVE having INNODB and with 2 indexes. The search will be performed on it. Both are different machines and not sharing RAM or CPU. The application is on the third machine.

Q- Do you have spinning drives? Or SSDs?

A- How do I check that?

Q- Your rows seem to be quite large. Are there TEXTs or BLOBs? If so, SELECT * may be a serious performance burden.

A- Yes, rows have 50 columns, but data is present in only around 15-20 of them. We can't reduce the size of the datatypes, as every field can hold any amount of alphanumeric data. All are TEXTs, no BLOBs.

Rajiv Pandey
Mudit bhaintwal
    I believe you want to have a `(prop1,prop2,event_timestamp)` index for that query. – zerkms Jul 07 '16 at 05:25
  • can you please post me the output from **SELECT * FROM mytable PROCEDURE ANALYSE();** – Bernd Buffen Jul 07 '16 at 05:27
  • @zerkms: Please see updates in question on Indexes. – Mudit bhaintwal Jul 07 '16 at 05:37
  • @Bernd Buffen: query still running for past 40 minutes. – Mudit bhaintwal Jul 07 '16 at 06:13
  • @Mudit - can you kill it and run it again with a LIMIT 10000 after the table name? – Bernd Buffen Jul 07 '16 at 06:30
  • Since you do a `ORDER BY event_timestamp LIMIT` I would try to reverse the order of the fields in the indexes: `idx_X(event_timestamp,propX)` - not sure it makes a difference (based on MySQL being clever), but worth a try with so many rows... (if it's not in production, restart the server to be sure nothing remains from previous attempts/tests) – Déjà vu Jul 07 '16 at 06:41
  • @ringø , that would likely not work. Such an index would be very hard to use in the OP's WHEREs. Index keys should be first those used in WHERE clauses, then (to get a covering index) those used in the SELECT. – LSerni Jul 07 '16 at 07:05
  • @BerndBuffen - All the suggested data types are ENUMs of the values inside the table. – Mudit bhaintwal Jul 07 '16 at 07:38
  • 7.8 billion rows today, and you're adding 0.2 billion rows for each (work) day?!? That's one billion new rows each work week. 50 billion rows in a year. – jarlh Jul 07 '16 at 07:48
  • @ringø - Reversing the order will hurt, not help. MySQL is happy to go `DESC` (in most situations - including the current one). – Rick James Jul 08 '16 at 21:57
  • Do you really need `SELECT *`? Or can you limit the return to some subset of the rows? From that, we may find improvement in _vertical_ partitioning. – Rick James Jul 08 '16 at 22:10
  • @RickJames Keep `DESC`, but reverse the order of fields, ie `event_timestamp,propX`! – Déjà vu Jul 08 '16 at 23:17

3 Answers


Turn off the query cache: It must purge all entries in the QC each time an INSERT occurs -- that's 5555 times a second!

query_cache_type = 0
query_cache_size = 0

The first query needs INDEX(prop1, prop2, event_timestamp). (prop1 and prop2 can be swapped.)
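That is, something like the following (idx_3 is just a placeholder name):

    ALTER TABLE mytable ADD INDEX idx_3 (prop1, prop2, event_timestamp);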

With that added index, each of the three queries will touch no more than 10 rows in the index, and do no more than 10 random(?) fetches into the data. At worst, that is only about 11 disk hits. And @Bernd's 'lazy eval' will not make it any better.

The table is much bigger than RAM, correct? The buffer_pool cannot be made big enough -- it must be smaller than ram, or else performance suffers.

Please provide SHOW CREATE TABLE; there are several issues I need to investigate. (datatypes, field sizes, etc)

How many records in the LOAD DATA INFILE? Do you LOAD directly into the table? How often is the LOAD?

Master+Slave: Keep in mind that all writes are performed on the Slave, too. If you have lots of reads, then more than one Slave would spread the reads around, thereby getting some scaling.
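A sketch of pointing a slave at the master; every value below (host, credentials, binlog coordinates) is a placeholder to be replaced with the real ones:

    -- run on each slave; all values here are placeholders
    CHANGE MASTER TO
      MASTER_HOST = 'master-host',
      MASTER_USER = 'repl',
      MASTER_PASSWORD = 'repl-password',
      MASTER_LOG_FILE = 'mysql-bin.000001',
      MASTER_LOG_POS = 4;
    START SLAVE;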

Do you have spinning drives? Or SSDs?

Your rows seem to be quite large. Are there TEXTs or BLOBs? If so, SELECT * may be a serious performance burden.

Rick James

I achieved this requirement by replacing the MySQL DB with Elasticsearch. It looks like a perfect fit for a fast insertion rate and extremely fast search. Moreover, the full-text capabilities of Lucene make it a perfect tool. The best part of ES is that it has very low hardware requirements: it scales horizontally rather than vertically.

Mudit bhaintwal

This is not the answer, but I can't format it in a comment.

Can you please try this to see if it is faster? This way MySQL has to sort only the id (primary key), not the whole rows:

SELECT m.*
FROM (
  SELECT id, event_timestamp
  FROM mytable
  WHERE
      prop1='sip:+100008521149'
    AND
      prop2='asdsa'
  ORDER BY event_timestamp DESC
  LIMIT 10
) AS r
JOIN mytable m ON m.id = r.id
ORDER BY r.event_timestamp DESC;
Bernd Buffen
  • But, in absence of explicit `order by`, the ordering is undefined. – Sergio Tulentsev Jul 07 '16 at 07:04
  • @Sergio Tulentsev - correct! You can add it again at the end of the query; then only 10 rows have to be sorted. – Bernd Buffen Jul 07 '16 at 07:10
  • 1
    This needs to be benchmarked. I have a strong feeling that mysql will use the index for sorting here (meaning, this trick is not needed and is likely slower) – Sergio Tulentsev Jul 07 '16 at 07:16
  • @Sergio Tulentsev - the first benefit is that MySQL can hold many more rows in RAM for sorting if it only has to sort the id and not the whole row. But you're right, it must be benchmarked. – Bernd Buffen Jul 07 '16 at 07:19
  • Nah, you didn't understand my comment on index. When sorting by index, rows are not read yet. – Sergio Tulentsev Jul 07 '16 at 07:23
  • @Bernd Buffen - As I said, the queries are not the same every time. Users can select different criteria from the UI for the search. Also, queries are fired from the application using Hibernate, not manually. – Mudit bhaintwal Jul 07 '16 at 07:37
  • This "lazy eval" will help with the existing 2 indexes, but won't help if `INDEX(prop1, prop2, event_timestamp)` is added. (It will hurt because of the tmp table and extra sort.) – Rick James Jul 08 '16 at 22:08