
I'm working with history tables that store every change made to the "real" tables.

When retrieving the data for a list of timestamps, the performance is horrible.

Here is a simplified version of my tables:

CREATE TABLE `changes` (
  `ts` datetime DEFAULT NULL
) ENGINE=InnoDB;

CREATE TABLE `history` (
  `h_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `start_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `end_ts` timestamp NULL DEFAULT NULL,
  `pk` int(11) DEFAULT NULL,
  `note` mediumtext,
  PRIMARY KEY (`h_id`),
  KEY `history_pk_ts_tsev_IDX` (`pk`,`start_ts`,`end_ts`) USING BTREE
) ENGINE=InnoDB;

And this is the query I'm running:

SELECT * FROM `changes` AS `c`
JOIN `history` AS `h`
ON (`h`.`pk` = 9999
    AND `c`.`ts` >= `h`.`start_ts`
    AND `c`.`ts` < IFNULL(`h`.`end_ts`, `c`.`ts` + 1)
   )
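
Since `c`.`ts` + 1 is always greater than `c`.`ts` itself, the IFNULL simply makes open-ended rows (those whose `end_ts` is still NULL) match every timestamp; for non-NULL `ts` the ON clause is equivalent to:

ON (`h`.`pk` = 9999
    AND `c`.`ts` >= `h`.`start_ts`
    AND (`h`.`end_ts` IS NULL OR `c`.`ts` < `h`.`end_ts`)
   )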

With 2,500 rows in `changes` and 55,000 rows in `history`, the query takes about 8 seconds to return the first row, and about 2 minutes to return them all.

These tables will grow very rapidly, soon storing millions of rows.

This is the EXPLAIN result:

|id |select_type |table |partitions |type |possible_keys          |key |key_len |ref |rows  |filtered|Extra                                              |
|---|------------|------|-----------|-----|-----------------------|----|--------|----|------|--------|---------------------------------------------------|
|1  |SIMPLE      |c     |           |ALL  |                       |    |        |    |2448  |100     |                                                   |
|1  |SIMPLE      |h     |           |ALL  |history_pk_ts_tsev_IDX |    |        |    |54227 |16.66   |Using where; Using join buffer (Block Nested Loop) |

I've tried to force the index:

SELECT * FROM `changes` AS `c`
JOIN `history` AS `h` FORCE INDEX (history_pk_ts_tsev_IDX)
ON (`h`.`pk` = 2476
    AND `c`.`ts` >= `h`.`start_ts`
    AND `c`.`ts` < IFNULL(`h`.`end_ts`, `c`.`ts` + 1)
   )

but now the query takes about 10 seconds for the first row. Again, the EXPLAIN:

|id |select_type |table |partitions |type |possible_keys          |key                    |key_len |ref   |rows  |filtered|Extra                 |
|---|------------|------|-----------|-----|-----------------------|-----------------------|--------|------|------|--------|----------------------|
|1  |SIMPLE      |c     |           |ALL  |                       |                       |        |      |2448  |100     |                      |
|1  |SIMPLE      |h     |           |ref  |history_pk_ts_tsev_IDX |history_pk_ts_tsev_IDX |5       |const |27113 |33.33   |Using index condition |

I've tried creating indexes in every order and combination, with no luck (illustrative DDL follows the list):

  • (pk, start_ts, end_ts)
  • (start_ts, end_ts)
  • (end_ts)
  • (end_ts, start_ts)
  • ...
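
For illustration, the attempts looked like this (the index names here are made up):

ALTER TABLE `history` ADD KEY `history_start_end_IDX` (`start_ts`, `end_ts`);
ALTER TABLE `history` ADD KEY `history_end_IDX` (`end_ts`);
ALTER TABLE `history` ADD KEY `history_end_start_IDX` (`end_ts`, `start_ts`);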

I've tested on the biggest MySQL RDS instance available on AWS with the same results, so it's not a hardware problem.

I'm really lost. What am I missing?

Thanks.

  • Exploit a feature called PARTITION PRUNING by partitioning your table by the timestamp you want to query on. This comes with its own caveats. [a rough sketch of this follows the comments] – MontyPython Feb 26 '19 at 13:47
  • Thanks @MontyPython. Even if I use partitioning, the typical data volume I will be working with is about 10,000 records in `changes` and over 2,000,000 in `history`. – Angel Calderón Feb 26 '19 at 16:00
  • Are the time ranges in your `history` table overlapping (i.e. can a single `changes` timestamp fall into at most one range per `pk`, or into several)? Could you try `SELECT count(*) FROM ...` to remove from your measurement the time it might take to send 50k rows to your client? How many rows are there per (typical) `pk` (in other words: how many different `pk` values do you have in your 55k rows)? – Solarflare Feb 26 '19 at 17:03
  • Create an index on `changes`. You don't have an index there. :( [a one-line sketch follows the comments] – MontyPython Feb 26 '19 at 17:25
  • @Solarflare I forgot to say that the 8 seconds is just to get the first row. Using `select count(*)` the query takes about 2 minutes to execute. I can't guarantee that there isn't more than one record with the same timestamp, e.g. (date1, date2, pk1), (date2, date2, pk1), (date2, date2, pk1), (date2, date3, pk1), due to very fast changes. – Angel Calderón Feb 26 '19 at 18:01
  • @MontyPython the `changes` table was created just for the example. The real information is gathered from multiple tables with joins and unions; generating this data is not a problem, it takes a few milliseconds. I did try creating the index on `changes`, and it was worse: now it takes minutes. – Angel Calderón Feb 26 '19 at 18:04
  • Wow, 2 minutes is really a lot on such a small table; there seems to be something going horribly wrong. My question about the ranges was: can the ranges overlap, i.e. can there be `(date1, date3), (date2, date4)` (for the same pk), or is the next row always `>=` the previous end date, maybe excluding `null`? (If I understand you correctly, `(date1, date2), (date2, date3)` is possible?) Also: what is the `count(*)` you are getting for your sample data? And how many history rows do you have for pk = 9999 in total? – Solarflare Feb 26 '19 at 18:41
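
For reference, a rough sketch of what MontyPython's partitioning suggestion could look like. The yearly boundaries and the widened primary key are assumptions, not something from the thread: MySQL only permits UNIX_TIMESTAMP() as a partitioning expression on a TIMESTAMP column, and it requires the partitioning column to be part of every unique key, which is one of the caveats he alludes to.

CREATE TABLE `history` (
  `h_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `start_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `end_ts` timestamp NULL DEFAULT NULL,
  `pk` int(11) DEFAULT NULL,
  `note` mediumtext,
  -- the partitioning column must be in every unique key, hence the wider PK
  PRIMARY KEY (`h_id`, `start_ts`),
  KEY `history_pk_ts_tsev_IDX` (`pk`,`start_ts`,`end_ts`) USING BTREE
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(`start_ts`)) (
  PARTITION p2018 VALUES LESS THAN (UNIX_TIMESTAMP('2019-01-01 00:00:00')),
  PARTITION p2019 VALUES LESS THAN (UNIX_TIMESTAMP('2020-01-01 00:00:00')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

Queries that constrain `start_ts` can then skip whole partitions instead of scanning the full table.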
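
And the one-liner behind the "create an index on changes" comment, for completeness (the index name is made up):

ALTER TABLE `changes` ADD KEY `changes_ts_IDX` (`ts`);

As Angel notes above, adding it actually made things worse in his tests.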

1 Answer


Maybe you can try multithreading to read your table; the answer should be in another question.

omega478