0

So, I have data that looks something like this

User_Object | filesize | created_date | deleted_date
row 1       | 40        | May 10       | Aug 20
row 2       | 10        | June 3       | Null
row 3       | 20        | Nov 8        | Null

I'm building statistics to record user data usage to graph based on time based datapoints. However, I'm having difficulty developing a query to take the sum for each row of all queries before it, but only for the rows that existed at the time of that row's creation. Before taking this step to incorporate deleted values, I had a simple naive query like this:

SELECT User_Object.id, User_Object.created, SUM(filesize) OVER (ORDER BY User_Object.created) AS sum_data_used
    FROM User_Object
    JOIN user ON User_Object.user_id = user.id
    WHERE user.id = $1

However, I want to alter this somehow so that there's a conditional for the the window function to only get the sum of any row created before this User Object when that row doesn't have a deleted date also before this User Object.

This incorrect syntax illustrates what I want to do:

SELECT User_Object.id, User_Object.created, 
        SUM(CASE WHEN NOT window_function_row.deleted
            OR window_function_row.deleted > User_Object.created
            THEN filesize ELSE 0)
        OVER (ORDER BY User_Object.created) AS sum_data_used
    FROM User_Object
    JOIN user ON User_Object.user_id = user.id
    WHERE user.id = $1

When this function runs on the data that I have, it should output something like

id      | created | sum_data_used|
1       | May 10  | 40
2       | June 3  | 50
3       | Nov 8   | 30
John Kulp
  • 123
  • 1
  • 12

1 Answers1

1

Something along these lines may work for you:

SELECT a.user_id
      ,MIN(a.created_date) AS created_date
      ,SUM(b.filesize) AS sum_data_used
  FROM user_object a
  JOIN user_object b ON (b.user_id <= a.user_id
                    AND COALESCE(b.deleted_date, a.created_date) >= a.created_date)
  GROUP BY a.user_id
  ORDER BY a.user_id

For each row, self-join, match id lower or equal, and with date overlap. It will be expensive because each row needs to look through the entire table to calculate the files size result. There is no cumulative operation taking place here. But I'm not sure there is a way that.

Example table definition:

create table user_object(user_id int, filesize int, created_date date, deleted_date date);

Data:

1;40;2016-05-10;2016-08-29
2;10;2016-06-03;<NULL>
3;20;2016-11-08;<NULL>

Result:

1;2016-05-10;40
2;2016-06-03;50
3;2016-11-08;30
Glenn
  • 8,932
  • 2
  • 41
  • 54
  • Thank you very much for the suggestion! It's unfortunate that this is going to end up being an expensive query, but that's why I wanted to run it by here before I did something awful with subqueries that had even worse than this n^2 complexity – John Kulp Jan 06 '17 at 22:28