try below
SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
It does two things
First, it pre-aggregates data to avoid redundant joining
Second, it eliminates "potential outliers" - those authors who have just one post for subreddit. Of course, second item depends on your use case. But most likely it should be ok and thus resolving performance issue. if still slower than you expected - increase threshold to 2 or greater
Follow up: ... 22,545,850,104 ... seems incorrect ...
Should it be SUM(t1.cnt+t2.cnt)?
Sure it is incorrect, but it is exactly what you would get if you were able to run your query in question!
And my hope was that you will be able to catch this!
So, I am glad that fixing “performance” issue – opened your eyes on issue with logic in your original query!
So, yes, obviously 22,545,850,104 is incorrect number.
So, instead of
SUM(t1.cnt*t2.cnt) as NumOverlaps
you should use simple
SUM(1) as NumOverlaps as NumOverlaps
This will give you result that would be equivalent of using
EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps
in your original query
So, try below now:
SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit