2

Consider the following tables

 =# \d users
 Column |         Type
--------+-----------------------
 id     | integer
 name   | character varying(32)

=# \d profiles
 Column  |  Type
---------+---------
 id      | integer
 user_id | integer

=# \d views
   Column   |            Type
------------+-----------------------------
 id         | integer
 profile_id | integer
 time       | timestamp without time zone

I need to find all users with an associated view in each month of a given date range. Currently I am doing the following:

with months as (
  select to_char(month, 'MM/YYYY') from generate_series('2014-07-01', '2014-09-01', INTERVAL '1 month') as month
)

select * from users
  join profiles on user_id = users.id
  join views on profile_id = profiles.id
    and to_char(views.time, 'MM/YYYY') in (select * from months)

I have setup a fiddle here.

Currently the results include the user Kyle who had no views in August and September. The correct result should only include the user Stan who had views in all the 3 months in the given range. How do we modify this query to return the desired result?

vyegorov
  • 21,787
  • 7
  • 59
  • 73
Alex
  • 1,618
  • 1
  • 17
  • 25

4 Answers4

2

Maybe this will be enough (I don't know Postgresql)

select u.id, u.name  from users u
  join profiles on user_id = users.id
  join views on profile_id = profiles.id
    and views.time between ? and ?
group by u.id, u.name
having count(distinct to_char(views.time, 'MM/YYYY')) = 3;
Multisync
  • 8,657
  • 1
  • 16
  • 20
  • @Alex instead of "3" use the difference between dates in months – Multisync Oct 25 '14 at 12:46
  • If you can produce a working query, it would be more acceptable. It need not be specific to postgres; write it any sql language. You may use the fiddle given in the description. It has support for other sql languages. – Alex Oct 26 '14 at 11:15
2

You seem to have an extended relational division, i.e. you're looking for users who had views in the given range only, although they might have views outside the range of interest also.

Along with GROUP BY, you can check this via EXCEPT construct. Basically, if you'll substract all months in your range with all the views within a given range, you should receive no rows:

WITH months(month) AS (
  SELECT DATE '2014-07-01' + m*INTERVAL'1mon'
    FROM generate_series(0,2) m
)
SELECT *
  FROM users    u
  JOIN profiles p ON p.user_id=u.id
  JOIN views    v ON v.profile_id=p.id
 WHERE 0 = (SELECT count(*) FROM (
    SELECT month FROM months
    EXCEPT ALL
    SELECT date_trunc('mon',time) FROM views
     WHERE date_trunc('mon',time) IN (SELECT * FROM months)
       AND profile_id=p.id) minus);

You can slightly simplify this construct via = ALL construct, as it will return true in the case when subquery returns no rows:

WITH months(month) AS (
  SELECT DATE '2014-07-01' + m*INTERVAL'1mon'
    FROM generate_series(0,2) m
)
SELECT *
  FROM users    u
  JOIN profiles p ON p.user_id=u.id
  JOIN views    v ON v.profile_id=p.id
 WHERE date_trunc('mon',time) = ALL (
    SELECT month FROM months
    EXCEPT ALL
    SELECT date_trunc('mon',time) FROM views
     WHERE date_trunc('mon',time) IN (SELECT * FROM months)
       AND profile_id=p.id);

A quote from the manual on ALL:

The result of ALL is "true" if all rows yield true (including the case where the subquery returns no rows).

Both my queries are effectively the same. The first one counts number of rows in the inner side and compares them to zero (and I agree, this is more obvious). The second one compares current views.time to all the results of the subqueries. This construct yields true only if all entries returned by the subquery equals to the views.time (of course, truncated to the month boundary). And, as quoted, this construct yields true also if subquery returns no rows.

And by intent, subquery should yield no rows, which indicates that all views happened within the desired time range.

Check on SQL Fiddle

vyegorov
  • 21,787
  • 7
  • 59
  • 73
  • Your answer seems to scale well when more rows are added. I do not understand the second example, so I am thinking of going with the first one. Will the second example give me the same results as the first one in all cases? – Alex Oct 26 '14 at 11:18
  • In the first example, I modified the last 3 lines like so: `SELECT date_trunc('mon',time) FROM views WHERE profile_id=p.id) minus);` and the query still worked. Is there any reason to keep that extra where condition? – Alex Oct 26 '14 at 11:27
  • @Alex, yes, constructs are the same. And you're right, it is enough to keep just the bit you've outlined. Perhaps, on a bigger tables, it might be useful, though, as gives more possibilities for index scans. – vyegorov Oct 26 '14 at 11:45
1
with months (month, month_count) as (
  select to_char(month, 'MM/YYYY'), 
         count(*) over ()
  from generate_series('2014-07-01', '2014-09-01', INTERVAL '1' month) as month
), counted as (
  select *, 
         count(*) over (partition by user_id) as cnt
  from users
    join profiles on user_id = users.id
    join views on profile_id = profiles.id
    join months on months.month = to_char(views.time, 'MM/YYYY')
) 
select * 
from counted
where cnt = month_count

The first CTE now includes the count of all months, the second CTE then counts the months for each user and is necessary because the window function can not be put directly into the where clause. The final select then eliminates the users which don't have all the months.

  • Dependence on count comparision is problematic since it will fail when there is more than 1 view in a single month. If we add another row to the views with `insert INTO views (profile_id, time) values (1, '2014-08-15')`, this query will fail. I like this approach and if it can be modified to handle multiple views in the same month, it would be great. – Alex Oct 26 '14 at 06:55
0

To be sure it's not only the count that matches, but also the contents you can do an array comparison after converting the month series, and the views to an array.

  • The first cte we know
  • The second cte creates a grouped series of users, and an array of their distinct months visited
  • The last query compares the two arrays

Example:

with months as (
    select array_agg(to_char(month, 'MM/YYYY')) m
    from generate_series('2014-07-01', '2014-09-01', INTERVAL '1 month') as month
), user_months as (
    select p.user_id, array_agg(distinct to_char(v.time, 'MM/YYYY')) m
    from views v
    inner join profiles p on p.id = v.profile_id
    group by p.user_id
)
select um.*
from user_months um
inner join months m on m.m = um.m;
atom.gregg
  • 987
  • 8
  • 14
  • This query gives me an aggregate result instead of the tabular results that other responses provide. I would prefer the non aggregrated results. – Alex Oct 26 '14 at 11:21