I have a table that contains analytical information, i.e. page views for each page:
field         type
------------  --------------------------------------
page_id       long
created_time  long (epoch UTC, rounded down to hour)
page_views    long
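For clarity, here is a minimal DDL sketch of that table; the BIGINT types and the primary key are my assumptions, and the actual definition may differ:

CREATE TABLE `page_stats` (
    `page_id`      BIGINT NOT NULL,
    `created_time` BIGINT NOT NULL,  -- epoch UTC, rounded down to the hour
    `page_views`   BIGINT NOT NULL,
    PRIMARY KEY (`page_id`, `created_time`)  -- assumed key; yours may differ
);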
I round the epoch down to the hour (e.g. 1398456553 ==> 1398456000), so the table holds aggregated information per hour.
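For reference, the rounding can be done with plain integer arithmetic; this is just a sketch, not necessarily the exact expression I use:

-- Round an epoch down to the start of its hour: 1398456553 -> 1398456000.
SELECT 1398456553 - (1398456553 % 3600) AS rounded_to_hour;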
When a client requests their data, we make the proper adjustments so they see the data in their local timezone.
If the client's local timezone is UTC, the query is simple:
SELECT
    FROM_UNIXTIME(st.`created_time`, '%Y-%m-%d') AS created_at,
    SUM(st.`page_views`) AS page_views
FROM `page_stats` st
WHERE st.`created_time`
    BETWEEN 1396310400 -- 01 Apr 2014 00:00:00 GMT
        AND 1397088000 -- 10 Apr 2014 00:00:00 GMT
GROUP BY created_at;
If the client's timezone is somewhere else (e.g. -03:00), the query requires a bit more manipulation to adjust the dates to the correct TZ:
SELECT
    DATE_FORMAT(CONVERT_TZ(FROM_UNIXTIME(st.`created_time`), '+00:00', '-03:00'), '%Y-%m-%d') AS created_at,
    SUM(st.`page_views`) AS page_views
FROM `page_stats` st
WHERE st.`created_time`
    BETWEEN 1396321200 -- 01 Apr 2014 03:00:00 GMT
        AND 1397098800 -- 10 Apr 2014 03:00:00 GMT
GROUP BY created_at;
This approach works just fine for small periods (< 30 days), but it scales poorly when the date range spans several months, both because of the number of rows that must be scanned and because of the per-row transformation required by functions like DATE_FORMAT.
The ideal data granularity is DAY, but I can't create a table aggregated by day because the daily rollup differs for each TZ.
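To make the issue concrete, here is a hypothetical daily rollup (the table and column names are illustrative only), together with the reason it cannot serve every timezone:

-- Hypothetical daily rollup; the day boundary must be fixed when the table is built.
CREATE TABLE `page_stats_daily` (
    `page_id`      BIGINT NOT NULL,
    `created_date` DATE   NOT NULL,  -- which TZ defines this day?
    `page_views`   BIGINT NOT NULL,
    PRIMARY KEY (`page_id`, `created_date`)
);
-- The hourly bucket 2014-04-01 01:00 UTC (1396314000) rolls up into
-- 2014-04-01 for a UTC client, but into 2014-03-31 for a -03:00 client
-- (local time 2014-03-31 22:00), so one pre-aggregated day bucket
-- cannot serve both timezones.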
What should be the proper way to model tables to provide TZ fidelity on large datasets?
It's worth noting that I can tolerate some error (< 2%) in this GROUP BY; perhaps some probabilistic data structure could help solve the problem, but I haven't figured out how yet.