3

I would like to use BigQuery instead of Pandas to create dummy variables (one-hot-encoding) for my categories. I will end up with about 200 columns, therefore I can't do it manually and hard code it

Test dataset (the actual one has many more variables than this one)

WITH table AS (
SELECT 1001 as ID, 'blue' As Color, 'big' AS size UNION ALL
SELECT 1002 as ID, 'yellow' As Color, 'medium' AS size UNION ALL
SELECT 1003 as ID, 'red' As Color, 'small' AS size UNION ALL
SELECT 1004 as ID, 'blue' As Color, 'small' AS size)

SELECT *
FROM table

enter image description here

Expected result:

enter image description here

Alex
  • 1,447
  • 7
  • 23
  • 48

1 Answers1

4

Below is for BigQuery Standard SQL

DECLARE Colors, Sizes ARRAY<STRING>;

SET (Colors, Sizes) = (SELECT AS STRUCT ARRAY_AGG(DISTINCT Color), ARRAY_AGG(DISTINCT Size) FROM `project.dataset.table`);

EXECUTE IMMEDIATE '''
CREATE TEMP TABLE result AS  -- added line
SELECT ID, ''' || (
  SELECT STRING_AGG("COUNTIF(Color = '" || Color || "') AS Color_" || Color ORDER BY Color)
  FROM UNNEST(Colors) AS Color
) || (
  SELECT ', ' || STRING_AGG("COUNTIF(Size = '" || Size || "') AS Size_" || Size ORDER BY Size)
  FROM UNNEST(Sizes) AS Size
) || '''
FROM `project.dataset.table`
GROUP BY ID
ORDER BY ID
''';  -- added `;`

SELECT * FROM result;  -- added line   

If to applied to sample data in your question - the output is as below

enter image description here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • thanks. It doesn't return a table though, but a number of "jobs". how do I save the output as a table on bigquery? – Alex Jul 23 '20 at 14:04
  • 1
    sure, sorry - I thought this would be obvious to use in needed way. see update in my answer though. take a look to two added lines and one updated (they have respective comments so it is easy to identify them) :o) – Mikhail Berlyant Jul 23 '20 at 14:20
  • sorry, maybe it's obvious but I didn't use EXECUTE IMMEDIATE before. Even after adding the new lines you've included, I still need to click on "view results" to see the table output. I want to automatically save the one hot encoded dataset – Alex Jul 23 '20 at 14:50
  • To save the results automatically you can change the settings in the query tab to persist to a table, please refer to this [doc](https://cloud.google.com/bigquery/docs/writing-results#permanent-table) – Yasser Mar 07 '23 at 07:41