Self join in big query is running very slowly, am I following best practices?

Question

I'm creating a table of the number of overlapping commentators between Reddit subreddits via the following self join:

SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;

Typical queries I have made on this data set in Big Query finish very quickly (< 1 min) but this query has run for over an hour and still not finished. This data has 54,504,410 rows and 22 columns.

Am I missing obvious speed ups that I should be implementing to make this query run quickly? Thanks!

Mikhail Berlyant · Accepted Answer · 2016-11-12T21:38:10.633

try below

SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit

It does two things
First, it pre-aggregates data to avoid redundant joining
Second, it eliminates "potential outliers" - those authors who have just one post for subreddit. Of course, second item depends on your use case. But most likely it should be ok and thus resolving performance issue. if still slower than you expected - increase threshold to 2 or greater

Follow up: ... 22,545,850,104 ... seems incorrect ... Should it be SUM(t1.cnt+t2.cnt)?

Sure it is incorrect, but it is exactly what you would get if you were able to run your query in question!
And my hope was that you will be able to catch this!
So, I am glad that fixing “performance” issue – opened your eyes on issue with logic in your original query!

So, yes, obviously 22,545,850,104 is incorrect number.
So, instead of

    SUM(t1.cnt*t2.cnt) as NumOverlaps

you should use simple

    SUM(1) as NumOverlaps as NumOverlaps

This will give you result that would be equivalent of using

    EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps

in your original query

So, try below now:

SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit

One question - when I run this code (for example with cnt > 5) it returns that the subreddits AskReddit and leagueoflegends have the most overlaps with 22,545,850,104 joint commentators (authors with 5 comments in each subreddit). This seems incorrect as there are only 54,504,410 total comments in the table? Should it be SUM(t1.cnt+t2.cnt)? Thanks! — Trevor M., Nov 12 '16 at 20:39
Awesome, thanks! I appreciate the mini lesson since I'm new to SQL in general actually. — Trevor M., Nov 12 '16 at 23:27

Self join in big query is running very slowly, am I following best practices?

1 Answers1