-1

I have a table that logs one row of data per user per day with a user_id column. I would like to select all days from n random users, but the table is incredibly large, so I'd like to avoid doing a self join.

Right now, I'm doing:

SELECT
    a.user_id, b.col1, b.col2, b.col3
FROM
    (
        SELECT DISTINCT
            user_id
        FROM
            pipeline.user_daily
        ORDER BY
            RANDOM() LIMIT 100) a
LEFT JOIN
    pipeline.user_daily b
ON
    a.user_id = b.user_id

Is there any way to do this with a window function or any other non self join?

On an unrelated note, the above inner table a doesn't actually work in vertica, so I use

SELECT
    user_id
FROM
    pipeline.user_daily
GROUP BY
    user_id
ORDER BY
    RANDOM() LIMIT 100

Any thoughts on why select distinct order by random doesn't work in Vertica?

jtanman
  • 654
  • 1
  • 4
  • 18
  • 1
    Don't you have a (presumably much smaller) user table? You can try to pick 100 random rows from it and join on these. Do you have an index on `user_daily.user_id`? If not, create one and try if that helps also. – sticky bit Sep 21 '18 at 19:38
  • If you are concerned about performance: what is the position of user_id in the order by of the projection that is being used? – minatverma Sep 22 '18 at 03:56

3 Answers3

0

I am not sure if this is considered a self Join but it should be optimized by SQL server that firstly the SELECT DISTINCT user_id query runs and then the outer one.

SELECT user_id, col1, col2, col3 FROM pipeline.user_daily
WHERE
user_id IN
(SELECT DISTINCT user_id
        FROM
            pipeline.user_daily
        ORDER BY
            RANDOM() LIMIT 100)
Bakri Bitar
  • 1,543
  • 18
  • 29
  • 1
    Thanks for the suggestion! So I looked it up, and according to this article they end up being the exact same query plan and usage statistics. Thanks for the idea though! https://explainextended.com/2009/06/16/in-vs-join-vs-exists/ – jtanman Sep 21 '18 at 20:01
0

Try using a WITH clause. I'm unfamiliar with Vertica but this works in Oracle.

  WITH a AS (
       SELECT DISTINCT
                user_id
            FROM
                pipeline.user_daily
            ORDER BY
                RANDOM() LIMIT 100 )
    SELECT
        a.user_id, b.col1, b.col2, b.col3
    FROM pipeline.user_daily b
    JOIN a on a.user_id = b.user_id
stickabee
  • 145
  • 8
  • 1
    Yeah so from my understanding WITH is just a text replacement, so it'd be the same query and computation time anyways. Thanks for the idea though! – jtanman Sep 21 '18 at 21:05
0

You can use row_number() to get distinct.

SELECT
    user_id, col1, col2, col3
FROM (
    SELECT
        user_id, col1, col2, col3, row_number() over (partition by user_id, col1, col2, col3 order by null) rn
    FROM 
        pipeline.user_daily
) sub
where rn = 1
Constantine
  • 1,356
  • 14
  • 19
  • I'm trying to get all rows of a limited number of distinct user_ids without doing a self join. For example, if the data has 1000 distinct users and each user has a variable number of rows (1-100), I would like all rows for 100 distinct random users. – jtanman Dec 12 '18 at 07:11