0
SELECT
DATE,
EXTRACT(YEAR FROM DATE) AS year,
FIPS as Country,
LOCATIONS,
AVG(TONE) as Avg_Tone,
AVG(Positive Score) as PositiveS,
AVG(Negative Score) as NegativeS,
COUNT(*),
From `gdelt-bq.gdeltv2.gkg_partitioned`,
`gdelt-bq.extra.sourcesbycountry` country,

Where
DATE(_PARTITIONTIME) BETWEEN TIMESTAMP('2002-01-01') AND TIMESTAMP('2020-12-31')
AND SourceCommonName=country.Domain
AND Location like '%CH%'
GROUP BY Year,Country
ORDER BY  Year,Country

Codebook link is http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf The V1.5TONE has TONE, Positive Score and Negative Score and so on. I want to count the average Tone by year. How to get it from Big Query.

janeluyip
  • 17
  • 5
  • Please show in the future some more of your work before asking. Providing a query which provides all needed columns is a must and it should be runnable or state which part is the problem. The V1.5TONE is not a column, it is the V2Tone, right? Also think of reducing the quered data amount if possible (for a test to help you, I do not want to spend 270 GB query data) – Samuel Mar 20 '22 at 17:13

1 Answers1

1

There is the need to cast and split the fields first.

  • The date is a value formated as "yyyymmdd....". Therefore, I suggest casting the value as a string and consider the first four characters as the year.

  • There is no V1.5TONE column, but V2Tone. It consists of a string with several digital numbers seperated by a comma. The string has to be split first. Then each component needs to be casted a decimal number.

  • The table gdelt-bq.extra.sourcesbycountry should map an url to a single country. It has duplicated countries to one url! To eliminate at least some duplicated value the inner select with the group by is used.

  • To obtain the values grouped by Year,Country all other dimensional columns need to be commented out.

SELECT
#DATE,
substr(cast(date as string),0,4) AS year,
FIPS as Country,
#LOCATIONS,
AVG(cast(split(V2Tone,",")[safe_offset(0)] as decimal )) as Avg_Tone,
AVG(cast(split(V2Tone,",")[safe_offset(1)] as decimal )) as PositiveS,
AVG(cast(split(V2Tone,",")[safe_offset(2)] as decimal )) as NegativeS,
COUNT(*) as counts,
From `gdelt-bq.gdeltv2.gkg_partitioned`
left join
(SELECT Domain, FIPS 
from `gdelt-bq.extra.sourcesbycountry`
group by 1,2) country
on  SourceCommonName=country.Domain

Where
DATE(_PARTITIONTIME) BETWEEN DATE('2020-01-01') AND DATE('2020-01-31')
AND Locations like '%CH%'
GROUP BY Year,Country
ORDER BY  Year,Country

Also in where segment: Do not mix DATE and TIMESTAMP.

Samuel
  • 2,923
  • 1
  • 4
  • 19
  • The data cannot get before 2015 year? But in website https://blog.gdeltproject.org/the-datasets-of-gdelt-as-of-february-2016/ the GKG have been extend back to 1979 by Fall 2016. – janeluyip Mar 22 '22 at 14:31
  • This dataset starts at 2015-02-19 (2015-02-17 has only 25 rows) `SELECT _PARTITIONTIME AS Time, COUNT(1) FROM `gdelt-bq.gdeltv2.gkg_partitioned` group by 1 order by 1 ` – Samuel Mar 24 '22 at 14:43