Select n distinct values from one column and all other data

Question

I have a table that logs one row of data per user per day with a user_id column. I would like to select all days from n random users, but the table is incredibly large, so I'd like to avoid doing a self join.

Right now, I'm doing:

SELECT
    a.user_id, b.col1, b.col2, b.col3
FROM
    (
        SELECT DISTINCT
            user_id
        FROM
            pipeline.user_daily
        ORDER BY
            RANDOM() LIMIT 100) a
LEFT JOIN
    pipeline.user_daily b
ON
    a.user_id = b.user_id

Is there any way to do this with a window function or any other non self join?

On an unrelated note, the above inner table a doesn't actually work in vertica, so I use

SELECT
    user_id
FROM
    pipeline.user_daily
GROUP BY
    user_id
ORDER BY
    RANDOM() LIMIT 100

Any thoughts on why select distinct order by random doesn't work in Vertica?

Don't you have a (presumably much smaller) user table? You can try to pick 100 random rows from it and join on these. Do you have an index on `user_daily.user_id`? If not, create one and try if that helps also. — sticky bit, Sep 21 '18 at 19:38
If you are concerned about performance: what is the position of user_id in the order by of the projection that is being used? — minatverma, Sep 22 '18 at 03:56

score 0 · Answer 1 · answered Sep 21 '18 at 19:45

0

I am not sure if this is considered a self Join but it should be optimized by SQL server that firstly the SELECT DISTINCT user_id query runs and then the outer one.

SELECT user_id, col1, col2, col3 FROM pipeline.user_daily
WHERE
user_id IN
(SELECT DISTINCT user_id
        FROM
            pipeline.user_daily
        ORDER BY
            RANDOM() LIMIT 100)

answered Sep 21 '18 at 19:45

Bakri Bitar

1,543
18
29

1

Thanks for the suggestion! So I looked it up, and according to this article they end up being the exact same query plan and usage statistics. Thanks for the idea though! https://explainextended.com/2009/06/16/in-vs-join-vs-exists/ – jtanman Sep 21 '18 at 20:01

score 0 · Answer 2 · answered Sep 21 '18 at 20:08

0

Try using a WITH clause. I'm unfamiliar with Vertica but this works in Oracle.

  WITH a AS (
       SELECT DISTINCT
                user_id
            FROM
                pipeline.user_daily
            ORDER BY
                RANDOM() LIMIT 100 )
    SELECT
        a.user_id, b.col1, b.col2, b.col3
    FROM pipeline.user_daily b
    JOIN a on a.user_id = b.user_id

answered Sep 21 '18 at 20:08

stickabee

145
8

1

Yeah so from my understanding WITH is just a text replacement, so it'd be the same query and computation time anyways. Thanks for the idea though! – jtanman Sep 21 '18 at 21:05

Constantine · Answer 3 · 2018-09-25T04:27:44.667

0

You can use row_number() to get distinct.

SELECT
    user_id, col1, col2, col3
FROM (
    SELECT
        user_id, col1, col2, col3, row_number() over (partition by user_id, col1, col2, col3 order by null) rn
    FROM 
        pipeline.user_daily
) sub
where rn = 1

edited Sep 25 '18 at 04:27

answered Sep 25 '18 at 04:02

Constantine

1,356
14
19

I'm trying to get all rows of a limited number of distinct user_ids without doing a self join. For example, if the data has 1000 distinct users and each user has a variable number of rows (1-100), I would like all rows for 100 distinct random users. – jtanman Dec 12 '18 at 07:11

Select n distinct values from one column and all other data

3 Answers3