
I have a big table with over 3B rows and my query now takes 3 days to execute.

This is my query:

insert into frequencywords (word, frequency, filename)
select word, count(*) as frequency, max(filename)
from allwordstemp
group by word

Basically, my query groups the rows of allwordstemp by word. I want to know the filename when frequency = 1, which is why I added max(filename): filename has to appear inside an aggregate function such as MAX. I don't need the value of filename when frequency > 1. Both tables have single-column indexes on word and on filename.

allwordstemp table (filename is an id file):

CREATE TABLE `allwordstemp` (
  `word` varchar(45) DEFAULT NULL,
  `filename` int(11) DEFAULT NULL,
  KEY `idx_allwordstemp_word` (`word`),
  KEY `idx_allwordstemp_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

frequencywords table:

CREATE TABLE `frequencywords` (
  `word` varchar(45) DEFAULT NULL,
  `frequency` int(11) DEFAULT NULL,
  `filename` int(11) DEFAULT NULL,
  KEY `idx_frequencywords_word` (`word`),
  KEY `idx_frequencywords_frequency` (`frequency`),
  KEY `idx_frequencywords_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

EXPLAIN SELECT:

+----+-------------+--------------+------------+-------+---------------+-----------------------+---------+------+------------+----------+-------------+
| id | select_type | table        | partitions | type  | possible_keys | key                   | key_len | ref  | rows       | filtered | Extra       |
+----+-------------+--------------+------------+-------+---------------+-----------------------+---------+------+------------+----------+-------------+
|  1 | SIMPLE      | allwordstemp | NULL       | index | NULL          | idx_allwordstemp_word | 138     | NULL | 3487864881 |   100.00 | Using index |
+----+-------------+--------------+------------+-------+---------------+-----------------------+---------+------+------------+----------+-------------+

How can I make my query faster?

Mehdi Souregi
  • Why did you not include how you created the indexes and the table? – Luuk Apr 01 '22 at 08:50
  • Creating one index on the two fields `word` and `filename` would help, because MySQL can use that index to get the count; in your current query all records are read because the needed info is not found in the index used. – Luuk Apr 01 '22 at 09:06
  • Ok, thank you, I will add an index on the two fields. Is there anything else I can do to make it faster? I feel like the query is doing extra unnecessary work when frequency > 1 – Mehdi Souregi Apr 01 '22 at 09:10
  • The other option might be to determine the frequency first (like was suggested in the answer by Aayush), and after that get the filename. To determine the frequency, you already have the needed index `idx_allwordstemp_word` – Luuk Apr 01 '22 at 09:13
  • Please provide `EXPLAIN SELECT ...` – Rick James Apr 01 '22 at 16:51
  • I suspect the query will ignore any indexes; I await the `EXPLAIN` for confirmation. – Rick James Apr 01 '22 at 16:55
  • @RickJames sorry for the late reply, I have updated my question. – Mehdi Souregi Apr 02 '22 at 17:52
  • Which version of MySQL? – MatBailie Apr 02 '22 at 18:16
  • @MatBailie MySQL version : 5.7 – Mehdi Souregi Apr 02 '22 at 18:22
  • Depending on settings you are not forced to aggregate in MySQL, if you just select the filename without MAX() it will pick one arbitrarily. If you add a CASE expression to force it to NULL when not needed, it will make indexing on the new table easier; `CASE WHEN COUNT(*) = 1 THEN filename END`. Finally, inserting potentially billions of rows is always going to be slow, as it is both building indexes and having to update logs, etc, to facilitate rollback, etc; do it in batches (such as the first 1000 words, then the second 1000 words, etc). – MatBailie Apr 02 '22 at 18:26
  • How does this perform? `insert into frequencywords (word, frequency, filename) select word, count(*) as frequency, case when count(*) = 1 then filename end from allwordstemp where word like 'Z%' group by word` – MatBailie Apr 02 '22 at 18:39
  • @MatBailie I used the HAVING clause because I only need the rows having frequency = 1 for now. I am also not forced to aggregate, but the performance did not change when I used filename without MAX(); it's always ~300 read rows/s (I am checking the InnoDB status). However, if I run the query without using the filename at all (by using just the word column) I get ~70,000 read rows/s, and the query took 13 hours to finish. – Mehdi Souregi Apr 02 '22 at 18:40
  • Wait, your question says you don't need the FILENAME when `count(*)>1`? Now you don't need the ROW when `count(*)>1`? (You only need words that appear in one file, and so frequency is always going to be 1???) – MatBailie Apr 02 '22 at 18:42
  • @MatBailie I am trying to submit the first results of an experiment. The initial query is probably going to take too much time (it took 3 days and we had to stop it yesterday), which is why we (my tutor and I) thought today it would be better to start with the HAVING clause, send the results, and then see if we need to know more when frequency > 1. So for now I am adding a HAVING clause, and I've created the index (word, filename), which hasn't finished building yet (12 hours now) – Mehdi Souregi Apr 02 '22 at 18:50
  • What's the disk setup? Are logs and tables on the same disk(s)? What about the indexes? Is the temp table on the same disk as the target table? (You may be beating the shit out of your disk(s) now, by running these concurrently. Do you have anything monitoring the hardware?) – MatBailie Apr 02 '22 at 18:53
  • @MatBailie Same disk (HDD) unfortunately. The logs, the temp table, and the big table are on the same disk. (Do you have anything monitoring the hardware? the disk is located on a private remote server) – Mehdi Souregi Apr 02 '22 at 18:56
  • Wait for the index to build. Or choose when to give up *(sorting is unavoidably expensive and slow)*. Check for any rollbacks, etc., in progress; cancelling long-running actions isn't free, it takes time to undo. If desperate, force the server to reboot and allow the database instance to recover. Then, with or without the new index, remove the indexes from the target table, to avoid having to build them *(you may wish to copy to an indexed table later)*. Then run the insert in batches, such as by first letter; start with a small batch to test, such as `LIKE 'Z%'`. Avoid concurrent actions. – MatBailie Apr 02 '22 at 19:11
  • As for the index, I am afraid that if I stop it I lose everything; anyway, I will make the decision about it tomorrow morning. The target table is predicted to contain ~1.5M rows (I already ran the query without using the filename just to have an idea), so I don't think that deleting the indexes on that "small" target table will have a huge impact, but I will do it anyway. As for dividing the query into small batches, I also don't know about it; I will test it once the index build is finished. In any case, I will let you know what the best solution was for me. Thank you for your help! – Mehdi Souregi Apr 02 '22 at 19:29
  • That index will likely take at least as long as your original query. Sorry. – MatBailie Apr 02 '22 at 19:38
  • @MatBailie the index build is finished, it took 9 hours, I will now test the different options – Mehdi Souregi Apr 02 '22 at 22:01
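Putting the comment thread's suggestions together, one batch of the insert might look like the sketch below. The `LIKE 'a%'` batching predicate and the CASE expression that NULLs out filename for repeated words are both taken from the comments above; nothing here is verified against the OP's server.

```sql
-- One batch: only words starting with 'a'. Repeat with other leading
-- characters so each INSERT stays a small, quickly committed transaction.
insert into frequencywords (word, frequency, filename)
select word,
       count(*) as frequency,
       case when count(*) = 1 then max(filename) end as filename
from allwordstemp
where word like 'a%'
group by word
```

If, as decided later in the thread, only the frequency = 1 rows are needed, a `having count(*) = 1` clause can be appended instead of (or in addition to) the CASE expression.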

2 Answers


Your filter criterion appears to be on frequency, not on word or filename. So the first thing I'd do is index allwordstemp by frequency.

Then, assuming frequency is a whole number, I'd add a WHERE clause like so:

select word, max(filename)
from allwordstemp
where frequency = 1
group by word

The above will give you a list of all words that appear EXACTLY once in various file names.

I hope this helps, and all the best!

Aayush Kothari
  • but frequency is computed during the execution of the GROUP BY clause – Mehdi Souregi Apr 01 '22 at 08:55
  • @MehdiSouregi Ah, could you please help us with a sample dataset that outlines the schema of frequencywords as well as allwordstemp? – Aayush Kothari Apr 01 '22 at 09:00
  • Thank you @MehdiSouregi - If I understand the schema correctly, you appear to already have frequencies stored in the frequencywords table. Why aren't you querying that table instead of allwordstemp? – Aayush Kothari Apr 01 '22 at 09:22
  • frequencywords is empty. I am trying to fill it with the result of the words grouped from allwordstemp – Mehdi Souregi Apr 01 '22 at 09:25
  • @MehdiSouregi Gotcha! I feel you could gather help from the following post: https://stackoverflow.com/questions/5226202/word-frequencies-from-strings-in-postgres – Aayush Kothari Apr 01 '22 at 09:53
  • Once you have computed frequency, you need to move the filtering to `HAVING`. It won't perform much differently. – Rick James Apr 01 '22 at 16:54
  • @RickJames yes, I have used the HAVING clause to add only the rows having frequency = 1, but the performance did not change, as you said. I am now adding the index (word, filename) as Luuk suggested. It has been running for 1 day and is still not finished... – Mehdi Souregi Apr 02 '22 at 18:01
  • @MehdiSouregi - What version of MySQL? How much RAM? What is the value of `innodb_buffer_pool_size`? – Rick James Apr 02 '22 at 21:34
  • @RickJames MySQL version 5.7. About 18 GB of RAM. innodb_buffer_pool_size = 8589934592 – Mehdi Souregi Apr 02 '22 at 22:09

Plan A

That is strange. It is using an index that it should avoid. Try the "index hint" `IGNORE INDEX (idx_allwordstemp_word)`.
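Applied to the query from the question, the hint would look like this (a sketch only; whether the resulting table scan is actually faster needs to be confirmed with EXPLAIN on the OP's server):

```sql
-- Tell the optimizer not to use the single-column word index, so it
-- falls back to scanning the table rather than doing a full index scan
-- plus a per-row lookup for each filename value.
insert into frequencywords (word, frequency, filename)
select word, count(*) as frequency, max(filename)
from allwordstemp ignore index (idx_allwordstemp_word)
group by word
```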

How often do you run this query? Does the table always have billions of rows? Do the two columns need to be NULL?

Plan B:

If they could be NOT NULL, we could add an `id BIGINT AUTO_INCREMENT` column and change the indexes to

PRIMARY KEY(word, filename, id),
INDEX(id),
INDEX(filename)

Making such a change now might take days; I am suggesting this for next time.

If there is no need for indexing filename, then leave that out. (It is not useful in the current Select.)
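As a sketch, the rebuilt table for Plan B might look like the following (assuming both columns can be made NOT NULL; the index names are illustrative, and the `filename` key can be dropped per the note above):

```sql
CREATE TABLE `allwordstemp` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `word` varchar(45) NOT NULL,
  `filename` int(11) NOT NULL,
  -- Clustered PK: rows for the same word are stored adjacently,
  -- so the GROUP BY reads each word's rows in one contiguous range.
  PRIMARY KEY (`word`,`filename`,`id`),
  KEY `idx_id` (`id`),        -- satisfies InnoDB's AUTO_INCREMENT key requirement
  KEY `idx_filename` (`filename`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```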

Plan C

Replace KEY idx_allwordstemp_word (word) with

INDEX(word, filename)

(Again, this may take a long time.)
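Plan C could be done in a single ALTER, sketched below (the new index name is illustrative; on 3B rows expect this to run for many hours):

```sql
-- The composite index covers the whole query: GROUP BY walks `word`
-- in index order, and COUNT(*) and MAX(filename) are answered from
-- the same index entries without touching the table rows.
ALTER TABLE allwordstemp
  DROP INDEX idx_allwordstemp_word,
  ADD INDEX idx_allwordstemp_word_filename (word, filename);
```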

Bottom line

If this is a one-time task, use Plan A. If the columns need to be NULLable, use Plan C. Otherwise, Plan B is probably best.

Rick James
  • It's grouping by word, why should it ignore the index on word? – MatBailie Apr 02 '22 at 18:14
  • @MatBailie - That's great for the `GROUP BY`, but terrible for the `MAX(filename)`. If the index contains _both_ columns, `word` first, it can easily do both the `COUNT` and the `MAX`. Meanwhile, since you don't have an `ORDER BY`, it does not matter what order the results are in. – Rick James Apr 02 '22 at 21:32
  • I agree that using such a composite index would be better, but based on what the OP had in the question (with two single-column indexes only), using the index on `word` does appear to be the correct choice by the optimiser? I just don't get the sentence `It is using an index that it should avoid.` – MatBailie Apr 02 '22 at 22:20
  • @RickJames I went for Plan C because I don't have much time to test the different plans; however, I did not replace INDEX(word), I just created a new INDEX(word, filename). Now I can see ~200,000 reads/s instead of 300 reads/s. I ran the query; it should be finished in 5 hours. – Mehdi Souregi Apr 02 '22 at 22:25