1

I would like to use BigQuery instead of Pandas to create dummy variables (one-hot-encoding using multilabelbinarizer) for my categories. I have large number of columns, therefore I can't do it manually and hard code it

Test dataset (the actual one has many more variables than this one)

WITH table AS (
SELECT 1001 as ID, ['blue','green'] As Color, ['big'] AS size UNION ALL
SELECT 1002 as ID, ['green','yellow'] As Color, ['medium','large'] AS size UNION ALL
SELECT 1003 as ID, ['red'] As Color, ['big'] AS size UNION ALL
SELECT 1004 as ID, ['blue'] As Color, ['big'] AS size)

SELECT *
FROM table

EXPECTED output

output

I wish to store it as a table/dataframe as shown in the image. I have more columns like color,size, products, etc.

Related answer(not a list): one-hot-encoding (dummy variables) with BigQuery

PRData
  • 31
  • 4

2 Answers2

1

Below query will return your expected output.

WITH table AS (
  SELECT 1001 as ID, ['blue','green'] As Color, ['big'] AS size UNION ALL
  SELECT 1002 as ID, ['green','yellow'] As Color, ['medium','large'] AS size UNION ALL
  SELECT 1003 as ID, ['red'] As Color, ['big'] AS size UNION ALL
  SELECT 1004 as ID, ['blue'] As Color, ['big'] AS size
)
SELECT * FROM (
  SELECT ID, type, value FROM table UNPIVOT (values FOR type IN (Color, size)), UNNEST(values) value
) PIVOT (COUNT(1) FOR type || '_' || value IN (
    'Color_blue', 'Color_green', 'Color_yellow', 'Color_red', 'size_big', 'size_medium', 'size_large'
  ));

Query results

enter image description here

Based on @Mikhail's answer using a dynamic sql, you can partially generalize the query. (column names are still hard-coded.)

DECLARE Colors, Sizes ARRAY<STRING>;

CREATE TEMP TABLE sample_table AS (
  SELECT 1001 as ID, ['blue','green'] As Color, ['big'] AS size UNION ALL
  SELECT 1002 as ID, ['green','yellow'] As Color, ['medium','large'] AS size UNION ALL
  SELECT 1003 as ID, ['red'] As Color, ['big'] AS size UNION ALL
  SELECT 1004 as ID, ['blue'] As Color, ['big'] AS size
);

SET (Colors, Sizes) = (
  SELECT AS STRUCT ARRAY_AGG(DISTINCT IF(type = 'Color', value, NULL) IGNORE NULLS),
                   ARRAY_AGG(DISTINCT IF(type = 'size', value, NULL) IGNORE NULLS),
    FROM `your-project.your-dataset.input_table` UNPIVOT (values FOR type IN (Color, size)), UNNEST(values) value
);

EXECUTE IMMEDIATE FORMAT("""
  CREATE OR REPLACE TABLE `your-project.your-dataset.output_table` AS
  SELECT * FROM (
    SELECT ID, type, value FROM `your-project.your-dataset.input_table` UNPIVOT (values FOR type IN (Color, size)), UNNEST(values) value
  ) PIVOT (COUNT(1) FOR type || '_' || value IN (%s,%s)) ORDER BY ID;
""", (SELECT STRING_AGG(FORMAT("'Color_%s'", color)) FROM UNNEST(Colors) color),
     (SELECT STRING_AGG(FORMAT("'size_%s'", size)) FROM UNNEST(Sizes) size)
);
Jaytiger
  • 11,626
  • 2
  • 5
  • 15
  • Thanks alot :) I am new to dynamic sql. Can you help me in understanding the second query. The first query is simple but i need to hardcode all the values in the column which is difficult in my case as I have around 50 distinct values in each column. Also, can you help me with storing the final table into a dataframe/table – PRData Jan 05 '23 at 16:13
  • @PRData, you can use *CTAS(Create Table AS)* to store the query results. I've updated the answer. `CREATE OR REPLACE TABLE \`your-project.your-dataset.your_table\` AS` – Jaytiger Jan 05 '23 at 23:26
  • Thanks ..that works ..one more thing: How to read data from a table instead of the TEMP TABLE...like select * from `my-project.my_table` in the above code – PRData Jan 06 '23 at 14:31
  • @PRData, just to replace all the `sample_table` in the query with your permanent table name like `\`your-project.your-dataset.input_table\`` will work, I think. – Jaytiger Jan 06 '23 at 14:53
  • and make sure that `\`your-project.your-dataset.input_table`\` table schema is same as `sample_table` – Jaytiger Jan 06 '23 at 14:55
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/250901/discussion-between-prdata-and-jaytiger). – PRData Jan 06 '23 at 15:11
  • @PRData, above chat site is not accessible at the work. – Jaytiger Jan 07 '23 at 00:40
  • Just asked you to help in understanding the code – PRData Jan 09 '23 at 10:40
0

Consider below approach - most generic I can think of and does not require any knowledge about columns number and names

create temp function extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));""";

create temp function extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));""";

create temp table flatten_list as (
  select id, format('%s_%s', key, val) col from your_table t, 
  unnest([to_json_string((select as struct * except(id) from unnest([t])))]) json, 
  unnest(extract_keys(json)) key with offset
  join unnest(extract_values(json)) vals with offset
  using (offset), unnest(split(vals)) val
);

execute immediate format(
  'create temp table pivot_table as select * from flatten_list pivot (count(*) for col in (%s)) order by id', 
  (select string_agg("'" || col || "'", ","  order by col) 
  from (select distinct col from flatten_list)) 
);

select * from pivot_table;

if applied to sample data in your question - output is

enter image description here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230