4

I have a large MySQL table, even when properly indexed it can take 1 second for each query (doesn't sound like much but it is run for thousands of servers). Right now, I have four queries going through to get 95th percentile inbound, 95th percentile outbound, and the sum of both.

Query 1: To get number of rows to get 95th percentile row

SELECT round(count(*)*.95 FROM traffic WHERE server_id = 1;

Query 2&3 To get 95th percentile

SELECT inbound FROM traffic WHERE server_id = 1 ORDER BY inbound ASC LIMIT {95th},1
SELECT outbound FROM traffic WHERE server_id = 1 ORDER BY outbound ASC LIMIT {95th},1

Query 4 Get sum of traffic

SELECT sum(inbound+outbound) FROM traffic WHERE server_id = 1; 

Can you think of any way I could combine these? I am challenged to think of a way since I need to get the 95th percentile, which is calculated by selecting a specific row based on the count. For example, if there are 10000 rows, then you order them ascending and select the 9500th row.

SheetJS
  • 22,470
  • 12
  • 65
  • 75
Devon Bessemer
  • 34,461
  • 9
  • 69
  • 95

2 Answers2

3

If you are willing to give up some precision, you can use estimate for row count rather than exact row count. If your database is using InnoDB, SELECT count(*) could be very slow. In other words:

  1. To get estimate, you could use SHOW TABLE STATUS command. It will be lightning fast, but not necessarily 100% accurate.

  2. Replace your statement:

    SELECT inbound FROM traffic WHERE server_id = 1 ORDER BY inbound ASC LIMIT {95th},1
    

    with

    SELECT inbound FROM traffic WHERE server_id = 1 ORDER BY inbound DESC LIMIT {5th},1
    

    Result should be identical, but about 20x faster. Just make sure to create compound index on (server_id, inbound).

  3. see 2.

  4. Leave this alone.

I expect that total time to get necessary numbers will be reduced to few milliseconds.

mvp
  • 111,019
  • 13
  • 122
  • 148
  • How is this combining any of the queries? – SheetJS Oct 25 '13 at 04:01
  • @Nirk: it is not. OP complained about performance, and was thinking that combining into one query would speed it up. I propose solution that makes it really fast, which is really what bothers OP. – mvp Oct 25 '13 at 04:03
  • Using a compound query for each inbound,outbound rather than just server_id seemed to increase index size incredibly and only shave off about 2% of time. Is there something I should be doing differently than just: ADD INDEX `compound` USING BTREE (`server_id` ASC, `inbound` ASC); – Devon Bessemer Oct 25 '13 at 04:09
  • @Devon: did you change it to use 5th percentile instead of 95th? Regarding indexing: you need to have 2 indexes: `(server_id ASC, inbound DESC)`, and `(server_id ASC, outbound DESC)` - ASC and DESC must match order used in query. Be sure to confirm that MySQL is using correct indexes with `EXPLAIN ANALYZE`. – mvp Oct 25 '13 at 04:15
  • I was still using ASC and 95th in the query for comparison of just the index. Nirk's suggestion seems to work pretty well. – Devon Bessemer Oct 25 '13 at 04:20
  • @Devon: changing to 5th will improve it by factor of 20 – mvp Oct 25 '13 at 04:22
2

As noted in http://planet.mysql.com/entry/?id=13588 :

SELECT
    SUBSTRING_INDEX(
            SUBSTRING_INDEX(
                GROUP_CONCAT( 
                    t.inbound
                    ORDER BY t.inbound
                    SEPARATOR ','
                )
            ,   ','
            ,   95/100 * COUNT(*) + 1
            )
        ,   ','  
        ,   -1  
        )                 AS `Inbound95`
    ,
    SUBSTRING_INDEX(
            SUBSTRING_INDEX(
                GROUP_CONCAT(  
                    t.outbound
                    ORDER BY t.outbound
                    SEPARATOR ','
                )
            ,   ','         
            ,   95/100 * COUNT(*) + 1 
            )
        ,   ','                       
        ,   -1                          
        )                 AS `Outbound95`
FROM   traffic AS t WHERE t.server_id = 1

will give you the two percentiles

NOTE: you may need to increase group_concat_max_len

SheetJS
  • 22,470
  • 12
  • 65
  • 75
  • Technically this gives answer in a single query. But, it is going to be damn slow – mvp Oct 25 '13 at 03:51
  • @mvp You can't combine the two 95th percentiles into one query in an efficient manner because the query sort orders are different (apples and oranges). – SheetJS Oct 25 '13 at 03:55
  • Well, even if you replace 95th percentile with 5th percentile (see how in my answer), it will make your query considerably faster. – mvp Oct 25 '13 at 04:00
  • This actually is much faster but it does not display the correct 95th percentile values unfortunately. I don't know enough about SUBSTRING_INDEX to analyze it though. – Devon Bessemer Oct 25 '13 at 04:24
  • @Devon is it off by 1? You may need to adjust by one index position – SheetJS Oct 25 '13 at 04:26
  • It was off by a lot. For example, the query I tested had 27573 rows, the 95th percentile row should have been the 26194 row when ordered ascending. The 26194 row was 1980, this query returned 628. It doesn't seem to work right as when I switched the order by adding a DESC after ORDER BY it increases the percentile which it should reduce if it is calculating the 95th percentile row on a DESC column. – Devon Bessemer Oct 25 '13 at 04:33
  • @Nirk I found that the 95/100 * COUNT(*) + 1 line seems like it is limited to 256 rows. I can literally replace it with 256 and get the same result. Replacing it with anything higher has no effect, so the count of 26194 does not work. – Devon Bessemer Oct 25 '13 at 04:41
  • @Devon I wonder if that line is giving a fractional value. What happens if you floor it: `FLOOR(95 / 100 * COUNT(*) + 1)` – SheetJS Oct 25 '13 at 04:45
  • 1
    @Nirk I found that the issue was related to group_concat_max_len. After raising that limit, it solved the issue. Thanks for your help! – Devon Bessemer Oct 25 '13 at 04:48
  • @Devon glad to have helped :) – SheetJS Oct 25 '13 at 04:49