
I've been trying to implement the solution here, with the added flavour of updating existing records. As an MRE, I'm looking to populate the sum_date_diff column of a table with, for each row, the sum of the differences between that row's date_time and the date_time of every previous row whose p1_id or p2_id matches the current row's p1_id. I have already filled out the expected result below:

+-----+------------+-------+-------+---------------+
| id_ | date_time  | p1_id | p2_id | sum_date_diff |
+-----+------------+-------+-------+---------------+
|   1 | 2000-01-01 |     1 |     2 | Null          |
|   2 | 2000-01-02 |     2 |     4 | 1             |
|   3 | 2000-01-04 |     1 |     3 | 3             |
|   4 | 2000-01-07 |     2 |     5 | 11            |
|   5 | 2000-01-15 |     2 |     3 | 35            |
|   6 | 2000-01-20 |     1 |     3 | 35            |
|   7 | 2000-01-31 |     1 |     3 | 68            |
+-----+------------+-------+-------+---------------+
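To make the rule concrete, the expected column can be reproduced with a brute-force sketch in Python (the function name is mine; the per-row rule is exactly the one stated above):

```python
from datetime import date

# The MRE rows: (id_, date_time, p1_id, p2_id)
rows = [
    (1, date(2000, 1, 1), 1, 2),
    (2, date(2000, 1, 2), 2, 4),
    (3, date(2000, 1, 4), 1, 3),
    (4, date(2000, 1, 7), 2, 5),
    (5, date(2000, 1, 15), 2, 3),
    (6, date(2000, 1, 20), 1, 3),
    (7, date(2000, 1, 31), 1, 3),
]

def sum_date_diff(rows):
    out = []
    for i, (_, d, p1, _) in enumerate(rows):
        # every earlier row where the current p1_id appeared as either player
        diffs = [(d - prev_d).days
                 for _, prev_d, prev_p1, prev_p2 in rows[:i]
                 if p1 in (prev_p1, prev_p2)]
        out.append(sum(diffs) if diffs else None)
    return out

print(sum_date_diff(rows))  # [None, 1, 3, 11, 35, 35, 68]
```

This is the same O(n²) work the SQL below does, which is why it gets painful at 1.5m rows.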

My query so far looks like:

UPDATE test.sum_date_diff AS sdd0
JOIN (
    SELECT sdd1.id_,
           SUM(DATEDIFF(sdd1.date_time, sq.date_time)) AS sum_date_diff
    FROM test.sum_date_diff AS sdd1
    LEFT OUTER JOIN (
        SELECT sdd2.date_time, sdd2.p1_id AS player_id
        FROM test.sum_date_diff AS sdd2
        UNION ALL
        SELECT sdd3.date_time, sdd3.p2_id AS player_id
        FROM test.sum_date_diff AS sdd3
    ) AS sq
        ON sq.date_time < sdd1.date_time
       AND sq.player_id = sdd1.p1_id
    GROUP BY sdd1.id_
) AS master_sq ON master_sq.id_ = sdd0.id_
SET sdd0.sum_date_diff = master_sq.sum_date_diff;

This works as shown here.

However, on a table of 1.5m records the query has been hanging for the last hour. Even when I add a WHERE clause at the bottom to restrict the update to a single record, it still hangs for 5+ minutes.

Here is the EXPLAIN statement for the query on the full table:

+----+-------------+---------------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+-------+---------+----------+--------------------------------------------+
| id | select_type |     table     | partitions | type  |                                                              possible_keys                                                              |                   key                   | key_len |  ref  |  rows   | filtered |                   Extra                    |
+----+-------------+---------------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+-------+---------+----------+--------------------------------------------+
|  1 | UPDATE      | sum_date_diff | NULL       | const | PRIMARY                                                                                                                                 | PRIMARY                                 | 4       | const |       1 |      100 | NULL                                       |
|  1 | PRIMARY     | <derived2>    | NULL       | ref   | <auto_key0>                                                                                                                             | <auto_key0>                             | 4       | const |      10 |      100 | NULL                                       |
|  2 | DERIVED     | sum_date_diff | NULL       | index | PRIMARY,ix__match_oc_history__date_time,ix__match_oc_history__p1_id,ix__match_oc_history__p2_id,ix__match_oc_history__date_time_players | ix__match_oc_history__date_time_players | 14      | NULL  | 1484288 |      100 | Using index; Using temporary               |
|  2 | DERIVED     | <derived3>    | NULL       | ALL   | NULL                                                                                                                                    | NULL                                    | NULL    | NULL  | 2968576 |      100 | Using where; Using join buffer (hash join) |
|  3 | DERIVED     | sum_date_diff | NULL       | index | NULL                                                                                                                                    | ix__match_oc_history__date_time_players | 14      | NULL  | 1484288 |      100 | Using index                                |
|  4 | UNION       | sum_date_diff | NULL       | index | NULL                                                                                                                                    | ix__match_oc_history__date_time_players | 14      | NULL  | 1484288 |      100 | Using index                                |
+----+-------------+---------------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+-------+---------+----------+--------------------------------------------+

Here is the CREATE TABLE statement:

CREATE TABLE `sum_date_diff` (
  `id_` int NOT NULL AUTO_INCREMENT,
  `date_time` datetime DEFAULT NULL,
  `p1_id` int NOT NULL,
  `p2_id` int NOT NULL,
  `sum_date_diff` int DEFAULT NULL,
  PRIMARY KEY (`id_`),
  KEY `ix__sum_date_diff__date_time` (`date_time`),
  KEY `ix__sum_date_diff__p1_id` (`p1_id`),
  KEY `ix__sum_date_diff__p2_id` (`p2_id`),
  KEY `ix__sum_date_diff__date_time_players` (`date_time`,`p1_id`,`p2_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1822120 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

MySQL version is 8.0.26, running on a 2016 MacBook Pro with macOS Monterey and 16 GB of RAM.

After reading around about boosting the RAM available to MySQL I've added the following to the standard my.cnf file:

innodb_buffer_pool_size = 8G
tmp_table_size=2G
max_heap_table_size=2G

I'm wondering if:

  1. I've done something wrong
  2. This is just a very slow task no matter what I do
  3. There is a faster method

I'm hoping someone could enlighten me!

Jossy
  • Subqueries aren't indexed. You may get an improvement by saving the results of the subquery to a temporary table, add an index to it, then join with the temporary table. – Barmar Dec 08 '21 at 21:17
  • What version of MySQL? – Rick James Dec 08 '21 at 22:29
  • @Barmar - Newer versions (including 8.0) of MySQL will _dynamically_ build an index for a _derived_ table. – Rick James Dec 08 '21 at 22:29
  • Hi @RickJames - 8.0.26 – Jossy Dec 08 '21 at 22:34
  • Please provide `EXPLAIN UPDATE ...` – Rick James Dec 08 '21 at 22:38
  • @Barmar - Note the `` – Rick James Dec 08 '21 at 22:39
  • I think a faster answer will involve "windowing" functions. – Rick James Dec 08 '21 at 22:42
  • These are dangerously high; leave them at defaults: tmp_table_size=2G max_heap_table_size=2G – Rick James Dec 08 '21 at 22:42
  • https://stackoverflow.com/questions/54670414/mysql-sum-and-min-after-each-operation https://stackoverflow.com/questions/58425089/is-it-possible-to-index-a-running-total-to-speed-up-window-functions https://dba.stackexchange.com/questions/268214/how-to-create-a-running-balance-column-based-on-account-number – Rick James Dec 08 '21 at 22:44
  • The [MySQL windowing functions doc](https://dev.mysql.com/doc/refman/8.0/en/window-functions-usage.html) has examples that are close to what you want. Windowing functions are what you need to lower execution time. – Bohemian Dec 08 '21 at 22:57
  • @RickJames (&Bohemian) - I've read through all the window function material and it's opened up my eyes as to what is possible :) However, I can't work out exactly how to integrate the DATE_DIFF function so as to calculate a sum of all the differences for previous rows. The examples given are much simpler and focus on things like rolling sums which just sum from a frame of previous rows. I need to try and integrate a function that is relative to the current row. Any direction you could give me? – Jossy Dec 09 '21 at 00:33
  • @Jossy - Would the SUM(diffs) be the same as DIFF(current_row, first_row)? – Rick James Dec 09 '21 at 01:51
  • @RickJames - I don't think it would. DIFF(current_row, first_row) would only give the difference between the current and first frame row. I need the sum of all the differences between the current row and each of the rows in the frame. If it helps to understand why I need this then the non-MRE version of the query applies a function to the `DATE_DIFF` for each row and then sums all the values. In this way I'm able to create a time weighted count of the number of rows. – Jossy Dec 09 '21 at 02:04
  • @Jossy - Would you like to provide a small dataset _and_ show what the query should return? – Rick James Dec 09 '21 at 02:06
  • @Jossy - For starters, the dates are not in a compatible format. For example, `15/01/2000` should be `2000/01/15`. – Rick James Dec 09 '21 at 02:42
  • @Jossy - So, some days are counted more than once. – Rick James Dec 09 '21 at 02:44
  • @RickJames - sorry, built it in Excel and copied across, will amend. For each row then I need the sum of the differences between the `date_time` of that row and the `date_time` of every preceding row (by `p1_id`). – Jossy Dec 09 '21 at 02:49
  • @Jossy - See `LAG()` or `LEAD()`. – Rick James Dec 09 '21 at 02:58
  • @RickJames - I started down that road but couldn't figure out how to work in the `SUM` I need as I can't wrap it around a window function. I've now got as far as `DATEDIFF(LAST_VALUE(date_time) OVER (PARTITION BY p1_id ORDER BY date_time RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING), date_time)` which gets me the difference between a row's `p1_id` `date_time` and the most recent `p1_id` `date_time` but this isn't what I need and I'm stumped :( – Jossy Dec 09 '21 at 03:18
  • @Jossy - What is `sum_date_diff` used for? – Rick James Dec 10 '21 at 17:55
  • Hey @RickJames - thought I'd exhausted your patience :) Thanks a million for sticking with this. `sum_date_diff` isn't used in my 'real' code. I actually apply different types of calculations to each one of the individual differences and then sum these together. An example in semi-pseudo code would be `SUM(POWER(constant, DATEDIFF(current row date, every previous row date)))`. This example gives me a time discounted count of rows. The other calculation variants typically use this discounted count in combo with other current row fields and their corresponding previous row fields. – Jossy Dec 10 '21 at 18:11
  • @Jossy - Might it be easier to use some function of "how long ago"? I like an exponential moving average, but it does not take into account "how long" between items. – Rick James Dec 10 '21 at 20:09
  • Thanks Rick. The challenge as I see it with an EMA is that it assumes you have data points equally spaced in time which mine aren't :( I was really hoping the solution might lie in SQL but from your answer it sounds like I'm going to need to shift everything into Pandas – Jossy Dec 10 '21 at 21:15
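For what it's worth, the stumbling block raised in the comments (you can't wrap SUM around a window function here) can be sidestepped for this particular MRE by algebra: the sum of DATEDIFF(d, d_i) over n previous rows equals n*d − SUM(d_i), and both n and SUM(d_i) are ordinary running window aggregates over the unpivoted player appearances. A sketch, using SQLite as a stand-in for MySQL 8 (julianday replaces DATEDIFF/TO_DAYS; table and column names are from the question; this assumes distinct date_time values per player):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sum_date_diff (id_ INTEGER PRIMARY KEY, date_time TEXT, p1_id INT, p2_id INT);
INSERT INTO sum_date_diff VALUES
 (1,'2000-01-01',1,2),(2,'2000-01-02',2,4),(3,'2000-01-04',1,3),
 (4,'2000-01-07',2,5),(5,'2000-01-15',2,3),(6,'2000-01-20',1,3),
 (7,'2000-01-31',1,3);
""")

query = """
WITH appearances AS (
    -- unpivot: one row per (player, match), as in the question's UNION ALL subquery
    SELECT id_, date_time, p1_id AS player_id, 1 AS is_p1 FROM sum_date_diff
    UNION ALL
    SELECT id_, date_time, p2_id, 0 FROM sum_date_diff
),
running AS (
    SELECT id_, is_p1, julianday(date_time) AS jd,
           COUNT(*) OVER (PARTITION BY player_id ORDER BY date_time
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS n_prev,
           SUM(julianday(date_time)) OVER (PARTITION BY player_id ORDER BY date_time
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS sum_prev
    FROM appearances
)
-- sum of (current - each previous) = n_prev * current - sum_prev;
-- an empty frame gives sum_prev = NULL, which propagates to NULL for row 1
SELECT id_, CAST(n_prev * jd - sum_prev AS INT) AS sum_date_diff
FROM running WHERE is_p1 = 1 ORDER BY id_
"""
print(conn.execute(query).fetchall())
# [(1, None), (2, 1), (3, 3), (4, 11), (5, 35), (6, 35), (7, 68)]
```

Note this identity only holds because the MRE sums the raw differences; a per-row transform like POWER(c, diff) (the "real" calculation described above) does not collapse this way.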

1 Answer


Whereas it is possible to do calculations like this in SQL, it is messy. If the number of rows is not in the millions, I would fetch the necessary columns into my application and do the arithmetic there. (Loops are easier and faster in PHP/Java/etc than in SQL.)

LEAD() and LAG() are possible, but they are not optimized well (or so is my experience). In an APP language, it is easy and efficient to look up things in arrays.

The SELECT can (easily and efficiently) do any filtering and sorting so that the app only receives the necessary data.
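A minimal sketch of this app-side approach in Python (all names are mine; it assumes rows arrive ordered by date_time). One pass with per-player running totals replaces the O(n²) self-join, again via sum(d − d_i) = n·d − sum(d_i):

```python
from datetime import date
from collections import defaultdict

def fill_sum_date_diff(rows):
    """rows: (id_, date_time, p1_id, p2_id) tuples, ordered by date_time.
    Returns {id_: sum_date_diff}."""
    count = defaultdict(int)    # appearances (as p1 or p2) seen so far, per player
    day_sum = defaultdict(int)  # sum of those appearance dates, as day ordinals
    result = {}
    for id_, d, p1, p2 in rows:
        o = d.toordinal()
        # sum of (current date - each previous appearance date) for p1
        result[id_] = count[p1] * o - day_sum[p1] if count[p1] else None
        for p in (p1, p2):      # the current row now counts as an appearance
            count[p] += 1
            day_sum[p] += o
    return result

rows = [
    (1, date(2000, 1, 1), 1, 2), (2, date(2000, 1, 2), 2, 4),
    (3, date(2000, 1, 4), 1, 3), (4, date(2000, 1, 7), 2, 5),
    (5, date(2000, 1, 15), 2, 3), (6, date(2000, 1, 20), 1, 3),
    (7, date(2000, 1, 31), 1, 3),
]
print(fill_sum_date_diff(rows))
```

For the asker's real calculation (a function of each individual difference, e.g. POWER(c, diff)), the shortcut identity no longer applies, so the app loop would instead keep a list of previous dates per player; it is still a single scan over the table rather than a join.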

Rick James