
We are currently using a summary table that aggregates information for our users on an hourly basis in UTC. The problem is that this table has become too large and is slowing our system down immensely. We have applied all the recommended PostgreSQL tuning techniques and are still experiencing slowness.

Our idea was to start aggregating by day rather than by hour, but the problem is that we allow our customers to change their timezone, which means recalculating the data for that day.

Does anyone know of a way to store the daily summary but still keep the numbers and totals correct when a customer switches timezones?
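To illustrate what happens today (the table and column names below are just placeholders, not our real schema), a "daily" figure for a customer viewing EST (UTC-5) is derived from the hourly UTC table along these lines:

    -- Placeholder sketch: one user's total for 2010-08-02 as seen from EST (UTC-5)
    SELECT SUM(revenue)
    FROM   hourly_summary
    WHERE  user_id = 42
      AND  (hour_utc - INTERVAL '5 hours')::date = DATE '2010-08-02';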

Russ Bradberry
    Are we talking potentially all timezones on Earth? – MPelletier Aug 06 '10 at 22:32
    Strictly speaking for data modeling, you're losing timezone level of detail when you go to day granularity. But, you may be able to aggregate by timezone, especially if the answer to @MPelletier's question is "No". – bobs Aug 06 '10 at 22:37
  • @MPelletier we aggregate by hour right now, so we only support timezones that are "on the hour" – Russ Bradberry Aug 06 '10 at 22:39
  • @bobs There is no time zone information for the rows of data; it is all in UTC. The change comes when a user wants to see how much they made on a certain day, in a certain timezone. – Russ Bradberry Aug 06 '10 at 22:41
  • To further clarify: if they made $1 an hour, every hour (UTC), on the 1st, and then $2 an hour, every hour (UTC), on the 2nd and 3rd, then if they view the 2nd in EST it should add up to $20, and if they change their timezone to UTC it will add up to $24. – Russ Bradberry Aug 06 '10 at 22:44

4 Answers


Summarise the data in a table with a timeoffset column, and a "day" field (a date) that is the day for that particular summary line. Index on (timeoffset, day, other relevant fields), clustered if possible (presumably PostgreSQL has clustered indexes?), and all should be well.
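A rough sketch of what I mean, with made-up table and column names:

    -- Illustrative only: one summary row per user, per UTC offset, per local day.
    CREATE TABLE daily_summary (
        user_id    integer  NOT NULL,
        timeoffset smallint NOT NULL,  -- whole hours from UTC, e.g. -5 for EST
        day        date     NOT NULL,  -- the calendar day as seen at that offset
        revenue    numeric  NOT NULL DEFAULT 0,
        PRIMARY KEY (timeoffset, day, user_id)
    );

    -- A user's total for 2010-08-02 as seen from EST is then a single-row lookup:
    SELECT revenue
    FROM   daily_summary
    WHERE  timeoffset = -5
      AND  day = DATE '2010-08-02'
      AND  user_id = 42;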

Will A
    So, instead of 24 lines per day, one day would produce one line... times 24 time zones. I can't see substantial gain here. – MPelletier Aug 06 '10 at 22:47
  • I thought about this, but then I have to maintain 24 summary tables, which will also increase the possibility of discrepancies in reporting between timezones. – Russ Bradberry Aug 06 '10 at 22:53
    @MPelletier - the difference is that you don't need to aggregate the 24 lines for a day to produce a daily figure - you pull out the one summary line for that particular timeoffset / day - so you're doing 1/24 of the work - with proper indexing of course. – Will A Aug 06 '10 at 23:04
  • @Russ - you don't have 24 summary tables - just the one, but with a timeoffset column which indicates the # of hours from GMT, and the day column indicates the day for that particular time offset. You'd have 24 rows for the 24 time offsets (or more or fewer, depending on how many time zones you need to support). – Will A Aug 06 '10 at 23:05
  • The hourly table has over 10M rows already, which is why I believe the performance is degrading; not necessarily because of the number of rows it has to aggregate, but the total amount it has to filter. The indexes are growing too large, I believe. – Russ Bradberry Aug 06 '10 at 23:12
  • Also, my clustered index is on two other important columns, and with Postgres you only get one per table. – Russ Bradberry Aug 06 '10 at 23:13
  • @Russ Bradberry: Can you trick Postgres with views? Say you index the day, is there a gain for a timezone-specific view? Or the other way around: a nested query for days (indexed) inside a query for the timezone? – MPelletier Aug 07 '10 at 05:34
  • @Will A, this may be viable with a columnar DB solution. I've been looking into a couple, and while there is a hefty initial overhead, it may prove to be a good solution. – Russ Bradberry Aug 08 '10 at 03:22
  • @Russ - I don't think you need worry about your existing clustered index - if you add timeoffset and day to the start of that index you'll still reap the benefit from it *and* will only need to scan through the relevant rows - it's the "24 tables" approach in one easier-to-maintain table. :) Will be interested in hearing how this goes. – Will A Aug 08 '10 at 09:42

I'm assuming you've already gone through all the partitioning considerations, such as partitioning by user.

I can see several solutions to your problem, depending on the usage pattern.

  1. Aggregate data per day, per user's selected timezone. In the event of a timezone change, programmatically recalculate the aggregate for that user (a sketch follows this list). This is plausible if timezone changes are infrequent and if a certain delay in the data is acceptable when a user changes timezones.

  2. If you have relatively few measures, you may maintain 24 columns for each measure - each describing the daily aggregate for the measure in a different timezone.

  3. If timezone changes are frequent and there are numerous measures, it seems like 24 different aggregate tables would be the way to go.
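A sketch of what the recalculation in option 1 might look like, assuming the hourly UTC data is still available somewhere to recalculate from (hourly_summary, daily_summary and their columns are assumed names, not your schema):

    -- Hypothetical rebuild of one user's daily totals after they switch to UTC-5.
    DELETE FROM daily_summary WHERE user_id = 42;

    INSERT INTO daily_summary (user_id, timeoffset, day, revenue)
    SELECT user_id,
           -5                                    AS timeoffset,
           (hour_utc - INTERVAL '5 hours')::date AS day,
           SUM(revenue)                          AS revenue
    FROM   hourly_summary
    WHERE  user_id = 42
    GROUP  BY user_id, (hour_utc - INTERVAL '5 hours')::date;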

shmichael
  • The timezone changes are, in fact, relatively few. I could programmatically recalculate the measures based on the change, but the first change would have a significant delay. We have about 8 measures, so 24 columns per measure would not be a good idea. I'm beginning to think that 24 tables is the way to go. I looked into @Will A's solution and it may be viable with a columnar DB, but not with a DB that degrades with the number of rows. – Russ Bradberry Aug 08 '10 at 03:21
  • 192 integer columns isn't too bad, actually. And if you'll be using a columnar DB, I don't think you'll need any schema change whatsoever - at least not with the aforementioned problem in mind. – shmichael Aug 08 '10 at 12:50

I ran into this problem too. My solution: data with a date type uses the local timezone, while the other data with a datetime type uses UTC, because the statistics index is local. Another reason is that, for now, we only have local data.


I'm facing the same problem. I'm thinking about aggregating by date and time (hour by hour, in UTC). Then you can fetch the data accordingly for whatever time zone you want. Unfortunately this won't work if you need to support timezones with a 45/30/15-minute offset; in that case you could aggregate the data by 15 minutes. The right solution depends on the amount of data to aggregate.
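For example, a 15-minute (900-second) bucket key can be computed like this; the table and column names here are assumptions:

    -- Hypothetical 15-minute bucketing so whole-, half- and quarter-hour
    -- offsets can all be re-aggregated exactly.
    SELECT to_timestamp(floor(extract(epoch FROM event_time_utc) / 900) * 900)
             AS bucket_start_utc,
           SUM(revenue) AS revenue
    FROM   raw_events
    GROUP  BY 1;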

Jakub Pomykała