0

I'm quite new to SQL & big query so this might be simple. I'm running some queries on the public dataset GDELT in BQ and have a question regarding the LIMIT. GDELT is massive (14.4 TB) and when I query for something, in this case a person, I could get up to 100k rows of results or more which is this case is too much. But when I use LIMIT it seems like it does not really partition the results evenly over the dates, causing me to get very random timelines. How does limit work and is there a way to get the results more evenly based on days?

SELECT DATE,V2Tone,DocumentIdentifier as URL, Themes, Persons, Locations
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE DATE>=20210610000000 and _PARTITIONTIME >= TIMESTAMP(@start_date)
AND DATE<=20210818999999 and _PARTITIONTIME <= TIMESTAMP(@end_date)
AND LOWER(DocumentIdentifier) like @url_topic
LIMIT @limit

When running this query and doing some preproc, I get the following time series: enter image description here

It's based on 15k results, but they are distributed very unevenly/randomly across the days (since there are over 500k results in total if I don't use limit). I would like to make a query that limits my output to 15k but partitions the data somewhat equally over the days.

Jh123
  • 83
  • 8

1 Answers1

0

you need to order by , when you are not sorting your result , the order of returned result is not guaranteed:

but if you are looking to get the same number of rows per day , you can use window functions:

select * from (
SELECT
    DATE,
    V2Tone,
    DocumentIdentifier as URL,
    Themes,
    Persons,
    Locations,
    row_number() over (partition by DATE) rn
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
    DATE >= 20210610000000 AND DATE <= 20210818999999
    and _PARTITIONDATE >= @start_date and _PARTITIONDATE <= @end_date
    AND LOWER(DocumentIdentifier) like @url_topic
) t where rn = @numberofrowsperday 

if you are passing date only you can use _PARTITIONDATE to filter the partitions.

eshirvana
  • 23,227
  • 3
  • 22
  • 38
  • But how do I then get a partition of the dates in an even interval? If I sort by dates I'll only get the first few results. Since the data set is so large if I only get 3 days despite filtering for 3 months. (keeping in mind that I'd rather not make sub queries if possible to avoid large BQ costs) – Jh123 Aug 18 '21 at 18:09
  • @Jh123 you need to procide more information, edit your question and add sample data and desired output . also remove screenshot and paste your code in text mode – eshirvana Aug 18 '21 at 19:12
  • I've added the plot of the data I get with the query, it's a little similar to the issue https://stackoverflow.com/questions/24630635/include-all-the-days-when-using-limit – Jh123 Aug 18 '21 at 20:02
  • @Jh123 so If I understand correctly , you want limited but same number of rows per day? – eshirvana Aug 20 '21 at 14:38