
Sorry, this is a rewritten question because the original received no answer.

I changed rank() to row_number() because tied ranks are not needed.


I'm currently working on a query, but I'm not getting the speed I need, so I'm asking here.

What I want to find is 'the most frequent names in each category for each DataID'.

Data_Category { id: INT PRIMARY KEY AUTO_INCREMENT, DataId: INT, name: VARCHAR(200), category: VARCHAR(200) }

and duplicate rows can be inserted.

select dataId, name, category, count(*) as cnt
, row_number() over (partition by dataId, category order by count(*) desc) as "ranking"
from data_category
group by dataId, name, category

In the above query, there is a large difference in speed with and without the row_number() part.

Without row_number() it takes 0.0053 seconds; with it, about 0.5 seconds. The result is about 260,000 rows.

select *, row_number() over (partition by A.dataId, A.category order by cnt desc) as "ranking"
from(
  select dataId, name, category, count(*) as cnt
  from data_category
  group by dataId, name, category
) A

The above query runs at almost the same speed. The moment the aggregate is wrapped in an outer select, it slows down because of temporary table creation.

The table has about 400,000 rows.

The index is (dataId, category, name) .

I want to extract the 5 most frequent names in each category for each DataID, but I can't figure out the direction of the query.

And I will create a view using this query, so there is no additional WHERE clause. After that, I want to use it by joining it to other tables.

Even if you don't write a full query, a simple hint would be sincerely appreciated!

Output:

dataId  category  name     rank
1       cate1     name1_1  1
1       cate1     name1_2  2
1       cate1     name1_5  3
1       cate1     name1_3  4
1       cate2     name2_1  1
1       cate2     name2_5  2
1       cate2     name2_3  3
2       cate1     name3_1  1
2       cate3     name3_9  1
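
For reference, the rough shape I imagine for this output is a top-N-per-group query like the sketch below (assuming MySQL 8.0; the CTE names counted and ranked are placeholders), though it still sorts the full aggregate:

```sql
-- Sketch only; assumes MySQL 8.0+ window functions and CTEs.
WITH counted AS (
    -- one row per (dataId, category, name) with its occurrence count
    SELECT dataId, category, name, COUNT(*) AS cnt
    FROM data_category
    GROUP BY dataId, category, name
),
ranked AS (
    -- number names within each (dataId, category) by descending count
    SELECT dataId, category, name, cnt,
           ROW_NUMBER() OVER (PARTITION BY dataId, category
                              ORDER BY cnt DESC) AS ranking
    FROM counted
)
SELECT dataId, category, name, ranking
FROM ranked
WHERE ranking <= 5;
```

Filtering on ranking <= 5 only trims the result set; it does not remove the sort over the aggregated rows, so the temporary-table/filesort cost remains.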

Thank you.


Explain

#Query 1-2 : Without row_number()

id select_type table type key rows filtered extra
1 SIMPLE data_cateogory index IDX_39ea6497ea9ca40d9e783e 432344 100.00 Using index;

#Query 1-1 : With row_number()

id select_type table type key rows filtered extra
1 SIMPLE data_cateogory index IDX_39ea6497ea9ca40d9e783e 432344 100.00 Using index; Using temporary; Using filesort
    For performance questions it's useful if you include the output of an EXPLAIN. – P.Salmon Jun 14 '22 at 08:06
  • Have you read https://dev.mysql.com/doc/refman/8.0/en/group-by-optimization.html#:~:text=The%20most%20general%20way%20to,aggregate%20functions%20(if%20any). – P.Salmon Jun 14 '22 at 08:12
  • @P.Salmon Thanks for the comment. I added explain result. I'll read the links in other comments. –  Jun 14 '22 at 08:23
  • @P.Salmon It doesn't seem to be a problem with group by. It seems to slow down in the process of sorting by count(*). –  Jun 14 '22 at 08:24
  • The window function is applied over the grouped rowset, i.e. at a point where index usage is no longer available, and the aggregated value is used for sorting. In effect this is "double grouping". And the data size is approx. 200-400 MB, so an in-memory temp table is not applicable. I do not see a way to improve this query. – Akina Jun 14 '22 at 10:25
  • @Akina thank you. It is unfortunate that there is no way to improve it. Or is there any other way to get the rank quickly? –  Jun 15 '22 at 03:07
  • You do not use any filtering - so pre-calculation may be a solution. – Akina Jun 15 '22 at 04:09
  • @Akina What does `pre-calculation` mean? Is it a way to create a table and put the calculated values in advance? –  Jun 15 '22 at 04:13
  • This means that you create an additional table which stores the data obtained by your slow query. You may renew its data by triggers and/or recalculate it completely from an event procedure. When you need this data you simply query this table. Of course you need not precalculate everything - for example, COUNT() is precalculated whereas RANK() is calculated in the query (and it uses a suitable index which is present in this additional table). You lose disk space but win execution time. – Akina Jun 15 '22 at 04:17
  • @Akina This data is frequently inserted, updated, and deleted. Is it ok to use a trigger? –  Jun 15 '22 at 04:23
  • Depends on the type of data change. If the data is altered in small chunks (from single-row changes up to ~100-1000 rows per query), and this process is not time-critical, then trigger usage is safe. In any case this must be tested. – Akina Jun 15 '22 at 06:47
  • @Akina After creating the view, I want to use it for search. Speed is important to me. I will create a trigger and test it. –  Jun 15 '22 at 06:54
  • @Akina Test completed. Creating the trigger directly in mysql was faster than using @AfterInsert(), @BeforeInsert() etc. Thank you! –  Jun 15 '22 at 08:41
  • What is the sample input for that output? (I don't understand 'count the most names in each category by DataID'.) – Rick James Jun 20 '22 at 22:06
  • Please provide the indexes for the table -- `SHOW CREATE TABLE`. – Rick James Jun 20 '22 at 22:07
  • Please provide `EXPLAIN FORMAT=JSON SELECT...` -- this will give more details. – Rick James Jun 20 '22 at 22:09
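
A minimal sketch of the precalculation approach discussed in the comments above (the table name data_category_count and trigger name trg_dc_ai are made up, and analogous triggers would be needed for DELETE and UPDATE):

```sql
-- Sketch of the summary-table idea; object names are hypothetical.
CREATE TABLE data_category_count (
    dataId   INT          NOT NULL,
    category VARCHAR(200) NOT NULL,
    name     VARCHAR(200) NOT NULL,
    cnt      INT          NOT NULL DEFAULT 1,
    PRIMARY KEY (dataId, category, name),
    -- lets the rank query order by cnt within each (dataId, category) group
    KEY idx_group_cnt (dataId, category, cnt)
);

-- Keep the count current on insert; DELETE/UPDATE need analogous triggers.
CREATE TRIGGER trg_dc_ai AFTER INSERT ON data_category
FOR EACH ROW
    INSERT INTO data_category_count (dataId, category, name, cnt)
    VALUES (NEW.DataId, NEW.category, NEW.name, 1)
    ON DUPLICATE KEY UPDATE cnt = cnt + 1;
```

The ranking query then reads data_category_count instead of aggregating 400,000 rows each time, trading disk space and write overhead for read speed.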

1 Answer


Does this give you what you need?

SELECT DataID, category, COUNT(*)
    FROM t
    GROUP BY DataID, category;

Or, if name has duplicates, replace COUNT(*) with COUNT(DISTINCT name).
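
Spelled out, that variant would be:

```sql
SELECT DataID, category, COUNT(DISTINCT name)
    FROM t
    GROUP BY DataID, category;
```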

Rick James
  • Thank you for answer. But I have to count all duplicate data. And the sql you wrote could not get the data I wanted. –  Jun 24 '22 at 04:27
  • @KunAguero - Duplicate relative to DataID? To category? Other? Give the desired results. – Rick James Jun 24 '22 at 04:44
  • All data can be duplicated. I want to extract the 5 most frequent names in each category for each DataID, but I can't figure out the direction of the query. The desired results are tabulated above. –  Jun 24 '22 at 05:53
  • @KunAguero - That sounds more like a "groupwise max". See the tags. – Rick James Jun 24 '22 at 14:20