
I have a table in BigQuery with the following schema, containing population numbers per country:

country       STRING  
country_code  STRING
year_1960     INT
year_1961     INT
...
year_2019     INT

See here: bigquery-public-data.world_bank_global_population.population_by_country

So, I have already found the countries with the biggest growth over the last 50 years. But how can I find, for each of those countries, the year with the strongest year-on-year growth?

**Usually window functions such as `LAG` in combination with `MAX` would save the day, but here I would have to apply them across columns instead of rows.**

Is this possible somehow? Maybe using an array per country or using python in BigQuery?

Thanks in advance!

I tried to use the year 1960 as the offset of a `LAG` function, but this obviously does not work.
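To make concrete what "strongest year-on-year growth across columns" means, here is a minimal Python sketch over a single hypothetical wide row (country and population values are invented for illustration):

```python
# Hypothetical wide row, mirroring the year_1960 ... year_2019 columns.
row = {"country": "Freedonia", "country_code": "FD",
       "year_1960": 100, "year_1961": 103, "year_1962": 110, "year_1963": 112}

# Collect the year_* columns in chronological order.
years = sorted(k for k in row if k.startswith("year_"))

# Year-on-year growth: population[y] / population[y-1] - 1
yoy = {y: round(row[y] / row[prev] - 1, 4) for prev, y in zip(years, years[1:])}

best_year = max(yoy, key=yoy.get)
print(best_year, yoy[best_year])  # year_1962 0.068
```

This is exactly the computation the question is after; the difficulty is expressing the `zip(years, years[1:])` step in SQL when each year is a separate column rather than a separate row.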


1 Answer


You might consider the approach below.

CREATE TEMP FUNCTION json_keys(input STRING) RETURNS ARRAY<STRING> LANGUAGE js AS """
  return Object.keys(JSON.parse(input));
""";

CREATE TEMP FUNCTION json_values(input STRING) RETURNS ARRAY<STRING> LANGUAGE js AS """
  return Object.values(JSON.parse(input));
""";

WITH stacked_data AS (
  -- Unpivot: pair each column name (year) with its column value (population)
  -- by joining the keys array and the values array on their shared offset.
  SELECT country, country_code, year, population,
         ROUND(CAST(population AS INT64) / LAG(CAST(population AS INT64)) OVER w - 1, 4) AS YoY
    FROM `bigquery-public-data.world_bank_global_population.population_by_country` t,
         UNNEST (json_keys(TO_JSON_STRING(t))) year WITH OFFSET
    JOIN UNNEST (json_values(TO_JSON_STRING(t))) population WITH OFFSET USING (offset)
   WHERE year NOT IN ('country', 'country_code')
  WINDOW w AS (PARTITION BY country, country_code ORDER BY year)
)
-- For each country, keep the entire row having the maximum YoY value.
SELECT ANY_VALUE(t HAVING MAX YoY).*
  FROM stacked_data t
 GROUP BY t.country, t.country_code;

Query results

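The per-country `ANY_VALUE(t HAVING MAX YoY)` step can be mimicked outside BigQuery in plain Python. This is a rough sketch with made-up stacked rows, just to illustrate the "keep the whole row with the maximal YoY per group" semantics:

```python
# Rows shaped like the stacked_data CTE (values invented for illustration).
stacked = [
    {"country": "A", "year": "year_1961", "yoy": 0.03},
    {"country": "A", "year": "year_1962", "yoy": 0.07},
    {"country": "B", "year": "year_1961", "yoy": 0.01},
]

# Per-group argmax: keep, for each country, the entire row with the largest yoy.
best = {}
for r in stacked:
    if r["country"] not in best or r["yoy"] > best[r["country"]]["yoy"]:
        best[r["country"]] = r

print(sorted((c, r["year"]) for c, r in best.items()))
# [('A', 'year_1962'), ('B', 'year_1961')]
```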


Jaytiger
  • Thanks, this solution works! I am not that advanced yet, so I am wondering: why are the TEMP functions using JSON needed? – creative0101 Feb 13 '23 at 07:11
  • For unpivoting column names into row values, the query has to know the column names (i.e., `year_1960`, `year_1961`) of the table in advance. If the number of columns is small, you can write them out manually in the query, but this is not doable for a large number of columns, so we need a way to do it programmatically. `TO_JSON_STRING(t)` converts each row of the table to a JSON string where each key of the JSON object is a `column name` and each value is the corresponding `column value`. – Jaytiger Feb 13 '23 at 08:13
  • Unfortunately, BigQuery's built-in functions don't provide a way to extract JSON keys from a JSON string, therefore I defined the `json_keys()` UDF in my answer to extract the *column names* and unpivot them into values in the `year` column of the result set. – Jaytiger Feb 13 '23 at 11:02
  • Since the offset is not specified, I guess it just starts with the first available value, right? – creative0101 Feb 14 '23 at 05:57
  • Sorry, I didn't get what the *first available value* means. The *offset* is being used only to combine (or join) the *column name* and *column value* together. The column name from `json_keys(TO_JSON_STRING(t))` is `year`, and the window frame for calculating YoY is sorted by `year`, so the first value will be the population of the earliest year for each country. If that's not what you asked, kindly let me know. – Jaytiger Feb 14 '23 at 06:14
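To illustrate the discussion in the comments above, here is a small Python sketch (with invented population values) of what `TO_JSON_STRING(t)` plus the `json_keys()` / `json_values()` UDFs and the offset join accomplish:

```python
import json

# A row serialized the way TO_JSON_STRING(t) would emit it (hypothetical values).
row_json = json.dumps({"country": "Freedonia", "country_code": "FD",
                       "year_1960": 100, "year_1961": 103})

obj = json.loads(row_json)
keys = list(obj.keys())      # what the json_keys() UDF returns: the column names
values = list(obj.values())  # what the json_values() UDF returns: the column values

# Zipping keys and values by their shared position reproduces the
# UNNEST ... WITH OFFSET ... USING (offset) join in the answer;
# dropping the non-year keys matches the WHERE clause.
pairs = [(k, v) for k, v in zip(keys, values) if k not in ("country", "country_code")]
print(pairs)  # [('year_1960', 100), ('year_1961', 103)]
```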