
I have a table using the TokuDB engine in MariaDB 10.2

The table currently has about ~700 million measurements for ~2500 sensors.

I want to get all the distinct sensor names in the table.

There is an index on the sensor names.

Still, MariaDB seems to do a full table scan and needs ~5 minutes.

Anything I can do to speed it up?

The table is:

CREATE TABLE `sensor_data_single` (
`sensor_id` varchar(255) CHARACTER SET latin1 NOT NULL,
`timestamp` bigint(20) NOT NULL,
`value_bool` bit(1) DEFAULT NULL,
`value_boolarray` tinyblob DEFAULT NULL,
`value_bytearray` tinyblob DEFAULT NULL,
`value_date` date DEFAULT NULL,
`value_time` int(11) DEFAULT NULL,
`value_instant` datetime(6) DEFAULT NULL,
`value_double` double DEFAULT NULL,
`value_enum_code` int(11) DEFAULT NULL,
`value_enum_text` varchar(255) CHARACTER SET latin1 DEFAULT NULL,
`value_int` int(11) DEFAULT NULL,
`value_long` bigint(20) DEFAULT NULL,
`value_string` varchar(255) CHARACTER SET latin1 DEFAULT NULL,
`value_type` int(11) DEFAULT NULL,
PRIMARY KEY (`sensor_id`,`timestamp`),
KEY `sensor_data_ts_key` (`timestamp`) `CLUSTERING`=YES,
KEY `sensor_data_id_key` (`sensor_id`)
) ENGINE=TokuDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci `COMPRESSION`=TOKUDB_FAST

The query is simply:

select distinct sensor_id from sensor_data_single USE INDEX (sensor_data_id_key) order by sensor_id ASC;

Still, while the query runs, the State column in `SHOW PROCESSLIST` shows something like `Queried about 513770000 rows`.

The EXPLAIN says the index is used, but shouldn't this be much faster, considering there are only ~2500 distinct values?

+------+-------------+--------------+-------+---------------+--------------------+---------+------+-----------+-------------+
| id   | select_type | table        | type  | possible_keys | key                | key_len | ref  | rows      | Extra       |
+------+-------------+--------------+-------+---------------+--------------------+---------+------+-----------+-------------+
|    1 | SIMPLE      | sensordata0_ | index | NULL          | sensor_data_id_key | 257     | NULL | 764382458 | Using index |
+------+-------------+--------------+-------+---------------+--------------------+---------+------+-----------+-------------+

If I drop the index on sensor_id, the explain changes to:

+------+-------------+--------------+-------+---------------+---------+---------+------+--------+--------------------------+
| id   | select_type | table        | type  | possible_keys | key     | key_len | ref  | rows   | Extra                    |
+------+-------------+--------------+-------+---------------+---------+---------+------+--------+--------------------------+
|    1 | SIMPLE      | sensordata0_ | range | NULL          | PRIMARY | 257     | NULL | 175017 | Using index for group-by |
+------+-------------+--------------+-------+---------------+---------+---------+------+--------+--------------------------+

but performance doesn't change significantly.

Josef
  • What does the `EXPLAIN` output say? Are you suspecting a full table scan only because it takes so long? We're talking about 700 million rows. Maybe your server is a potato? And you might want to limit your data types. If your timestamp column is really a timestamp, int is enough; bigint is a waste. That alone would save you 2670 MB for the data alone. Multiply that, because it's also in the primary key, and the primary key is included in every other index. Do you really need tinyblob for a column named ...boolarray? – fancyPants Jun 30 '17 at 11:10
  • @fancyPants I am suspecting a full scan because `Queried about 513770000 rows` counts up to the full row count in `SHOW PROCESSLIST`. The timestamp has ns precision. tinyblob seems OK; what would you suggest? Only one `value_%` column is ever non-NULL anyway, depending on `value_type`. – Josef Jun 30 '17 at 11:18
  • There you have it, your server is a potato. ;-) The fact that there are just 2500 distinct values doesn't matter. Since your server doesn't know that, it still has to scan the index. Just curious, what does the EXPLAIN say when you don't use the index hint? Your index on sensor_id is not necessary; the primary key covers that already. – fancyPants Jun 30 '17 at 12:39
  • @fancyPants I added the EXPLAIN after dropping the index to the question. So your suggestion is to buy better hardware? – Josef Jul 03 '17 at 08:31
  • The latter EXPLAIN looks a bit better. It seems purchasing better hardware is worth considering. Of course, if this is your only problematic query, there are ways around it. For example, you could populate another table with the distinct values once and update it every once in a while. Or even have triggers on your original table that check whether the sensor_id already exists in the new table and insert it if not. This will of course have an impact on insert/delete operations on your original table. – fancyPants Jul 03 '17 at 08:42
  • You could of course also ask at http://dba.stackexchange.com Maybe there you'll get better advice. – fancyPants Jul 03 '17 at 08:43
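The summary-table-plus-trigger approach fancyPants suggests could be sketched roughly like this (the `sensor_ids` table and trigger name are illustrative, not from the original post, and this is untested against TokuDB):

```sql
-- Hypothetical summary table holding one row per distinct sensor
CREATE TABLE sensor_ids (
  sensor_id varchar(255) CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (sensor_id)
);

-- One-time population from the existing data (the slow scan, done once)
INSERT IGNORE INTO sensor_ids
  SELECT DISTINCT sensor_id FROM sensor_data_single;

-- Keep it current: INSERT IGNORE is a no-op for already-known sensors
CREATE TRIGGER sensor_data_single_ai
AFTER INSERT ON sensor_data_single
FOR EACH ROW
  INSERT IGNORE INTO sensor_ids (sensor_id) VALUES (NEW.sensor_id);

-- The distinct list then becomes a trivial ~2500-row lookup:
SELECT sensor_id FROM sensor_ids ORDER BY sensor_id ASC;
```

As noted in the comments, the trigger adds a small cost to every insert on `sensor_data_single`, and deletes would need separate handling if a sensor's rows can disappear entirely.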
