2

Facing a query design issue and not sure about whether my approach to the problem is unnecessarily complicated:

I have a fact table:

       Column   |            Type             |                       Modifiers                       
------------+-----------------------------+-------------------------------------------------------
 id         | integer                     | not null default nextval('messages_id_seq'::regclass)
 type       | character varying(255)      | 
 ts         | numeric                     | 
 text       | text                        | 
 score      | double precision            | 
 user_id    | integer                     | 
 channel_id | integer                     | 
 time_id    | integer                     | 
 created_at | timestamp without time zone | 
 updated_at | timestamp without time zone | 

I am running some analytical queries against it currently, one of which (for example) would be:

  with intervals as (
  select 
    (select '09/27/2014'::date) + (n      || ' minutes')::interval start_time,
    (select '09/27/2014'::date) + ((n+60) || ' minutes')::interval end_time
      from generate_series(0, (24*60*7), 60 * 4) n
  )
  select 
    extract(epoch from i.start_time)::numeric * 1000 as ts, 
    extract(epoch from i.end_time)::numeric * 1000 as end_ts,
    sum(avg(messages.score)) over (order by i.start_time) as score

  from messages
  right join intervals i
    on messages.timestamp >= i.start_time and messages.timestamp < i.end_time

  where messages.timestamp between '09/27/2014' and '10/04/2014'

  group by i.start_time, i.end_time 
  order by i.start_time

As you guys can probably tell - this query computes the average of the "score" attribute for messages for a given time-bucket distribution and then alongside that computes a cumulative across the buckets (using the window).

What I am trying to do next is find the top 5 (for example) messages.text that are closest to the average for each bucket.

Right now, the only plan I have is to:

1) Join messages with the time-buckets
2) Compute a score - avg(score) over (partition by start_time) as deviation and save it against each record of the joined relation
3) Compute a rank() over (order by deviation) as rank
4) Select where rank between 1 and 5

The reason I have put this down imperatively in steps in because my first attempt at coming up with a design involved using a window function within a window function (rank() over (partition by start_time, order by score - avg(score) over (partition by start_time)) and I wasn't even going to attempt that to see if it would work.

Can I please get some advice about whether I'm headed in the right direction?

Slania
  • 91
  • 1
  • 7
  • Note: `generate_series()` works with timestamps, too. `generate_series('2014-09-27', '2014-10-04', '1 hour'::interval )` will probably do what you want. – wildplasser Oct 05 '14 at 10:40
  • Correction: That should have been `generate_series('2014-09-27 00:00:00', '2014-10-04 00:00:00', '1 hour'::interval )` – wildplasser Oct 05 '14 at 11:29
  • @wildplasser ah, yes, you're right - that's a good refactoring suggestion, I'll fix that! ^_^ – Slania Oct 05 '14 at 14:25

2 Answers2

0

The direction you should be heading (this is only my suggestion):

  1. Get the average score (over all records)
  2. Operation MINUS on (row score, avg(score))

-- This will leave you with values also positive and negative

  1. Use abs() on each operation from step 2, in the same computation
  2. Use rank() and order them approprietly
  3. WHERE rank BETWEEN 1 AND 5
Kamil Gosciminski
  • 16,547
  • 8
  • 49
  • 72
0

Whelp - here is what I have and seems to work:

Now open for criticism is the structuring of, performance optimizations of and redundancy in my query! ^_^ (minus generating the timeseries directly instead of all the twisted intervals math which I will fix eventually!)

with intervals as (
    select 
        (select '09/29/2014'::date) + (n      || ' minutes')::interval start_time,
        (select '09/29/2014'::date) + ((n+60) || ' minutes')::interval end_time
        from generate_series(0, (24*60*7), 60 * 4) n
), intervaled_messages as (
    select
        extract(epoch from i.start_time)::numeric * 1000 as ts, 
        extract(epoch from i.end_time)::numeric * 1000 as end_ts,
        abs(score - avg(score) over (partition by i.start_time)) as deviation
    from messages
    right join intervals i
        on messages.timestamp >= i.start_time and messages.timestamp < i.end_time
    where messages.timestamp between '09/29/2014' and '10/06/2014'
), ranked_messages as (
    select ts, end_ts, deviation, 
    rank() over (partition by ts order by deviation) as rank,
    row_number() over (partition by ts order by deviation) as row_number
    from intervaled_messages
)
select ts, end_ts, deviation, rank 
from ranked_messages 
where rank between 1 and 5
  and row_number between 1 and 5
order by ts;
Slania
  • 91
  • 1
  • 7