
Okay, I am new to SQL and BigQuery and got an ambiguous column name error. I have checked the other answers on Stack Overflow, but could not find/understand an answer to my problem. The error I get is: Error: 2.40 - 2.68: Ambiguous column name subreddit.

For this code (which I adapted from another person's analysis of a similar thing):

#legacySQL

    # Creating list of number of users who authored at least 10 posts in pairs of subreddits: 
    SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
    FROM (SELECT subreddit, author, COUNT(1) as cnt 
         FROM (TABLE_QUERY([fh-bigquery:reddit_comments],
     'table_id CONTAINS "2017_" AND length(table_id) >= 5'))
         GROUP BY subreddit, author HAVING cnt > 5) t1

    JOIN (SELECT subreddit, author, COUNT(1) as cnt 
         FROM(TABLE_QUERY([fh-bigquery:reddit_comments],
     'table_id CONTAINS "2017_" AND length(table_id) >= 5'))
         GROUP BY subreddit, author HAVING cnt > 5) t2

    ON t1.author=t2.author
    WHERE t1.subreddit!=t2.subreddit
    GROUP BY t1.subreddit, t2.subreddit

Thanks for your help!

AAP
  • I don't think this query is generating that error. – Gordon Linoff Mar 05 '18 at 13:10
  • But that is what it tells me, I copied the code straight from the query. It allows me to run the Query but then stops in 1 min and gives me this error. – AAP Mar 05 '18 at 13:50
  • Can you provide job id - I wasn't able to reproduce error with that query. – Mosha Pasumansky Mar 05 '18 at 15:00
  • Job ID: reddit-analysis-of-subreddits:US.bquijob_50eb6a1e_161f63505d6 That is where it is to be saved I think – AAP Mar 05 '18 at 15:03
  • Update: If I specify a destination table for the results I get the 'Error: 2.40 - 2.68: Ambiguous column name subreddit.'. If I do not specify a destination table, it runs for about an hour and then tells me 'Response too large to return. Consider setting destinationTable or (for legacy SQL queries) setting allowLargeResults to true in your job configuration'. Ideas? – AAP Mar 05 '18 at 21:22

2 Answers


The error you got must have had a transient cause (probably a BigQuery cache issue), as I successfully ran the same query with Allow Large Results and Flatten Results enabled and a destination table specified. The results, though, were not correct, as all subreddits were listed. The cause must be the JOIN clause, which joins two identical tables (and could also have triggered the ambiguous column name error), forcing BigQuery to do a cross join, hence the multiplied results.

I recommend creating a table with the results from the targeted public dataset first, as follows:

    SELECT subreddit, author, COUNT(1) AS cnt
    FROM (TABLE_QUERY([fh-bigquery:reddit_comments],
      'table_id CONTAINS "2017_" AND length(table_id) >= 5'))
    GROUP BY subreddit, author
    HAVING cnt > 5

Then apply a different query to obtain your desired results using just one table (the one resulting from querying the public dataset), as sketched below. It's advisable to optimize your query and avoid SQL anti-patterns.
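For illustration, a minimal sketch of that second query, assuming the staged results were saved to a hypothetical destination table named [mydataset.subreddit_authors] (aliasing both subreddit columns also avoids the original ambiguity):

    #legacySQL
    -- Hypothetical staging table; substitute your own destination table.
    SELECT a.subreddit AS subreddit_a,
           b.subreddit AS subreddit_b,
           COUNT(1) AS NumOverlaps
    FROM [mydataset.subreddit_authors] AS a
    JOIN [mydataset.subreddit_authors] AS b
      ON a.author = b.author
    WHERE a.subreddit != b.subreddit
    GROUP BY subreddit_a, subreddit_b

Staging the aggregation once also means the public dataset is scanned a single time instead of twice.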

MonicaPC

Your select statement SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps produces three fields in the output, the first two of which will have the same name, subreddit, hence the error message Ambiguous column name subreddit.

To address the ambiguity, just use aliases as in the example below:

    SELECT t1.subreddit AS t1_subreddit, t2.subreddit AS t2_subreddit, SUM(1) AS NumOverlaps

It's that simple! So, assuming this was the only issue in your query, it should work now.
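For reference, a sketch of the full query with those aliases applied (legacy SQL, as in the question, and otherwise unchanged):

    #legacySQL
    SELECT t1.subreddit AS t1_subreddit, t2.subreddit AS t2_subreddit, SUM(1) AS NumOverlaps
    FROM (SELECT subreddit, author, COUNT(1) AS cnt
          FROM (TABLE_QUERY([fh-bigquery:reddit_comments],
            'table_id CONTAINS "2017_" AND length(table_id) >= 5'))
          GROUP BY subreddit, author HAVING cnt > 5) t1
    JOIN (SELECT subreddit, author, COUNT(1) AS cnt
          FROM (TABLE_QUERY([fh-bigquery:reddit_comments],
            'table_id CONTAINS "2017_" AND length(table_id) >= 5'))
          GROUP BY subreddit, author HAVING cnt > 5) t2
    ON t1.author = t2.author
    WHERE t1.subreddit != t2.subreddit
    GROUP BY t1_subreddit, t2_subreddit

Legacy SQL lets the GROUP BY reference output aliases, so the two differently named subreddit columns group cleanly.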

Mikhail Berlyant