4

I have a SQL table of the following format:

ID  Cat
1   A   
1   B
1   D
1   F
2   B
2   C
2   D
3   A
3   F

Now, I want to create a table with one ID per row, and multiple Cat's in a row. My desired output looks as follows:

ID  A  B  C  D  E  F
1   1  1  0  1  0  1
2   0  1  1  1  0  0
3   1  0  0  0  0  1

I have found:

Transform table to one-hot-encoding of single column value

However, I have more than 1000 Cat's, so I am looking for code to write this automatically, rather than manually. Who can help me with this?

Emil
  • 1,531
  • 3
  • 22
  • 47

2 Answers2

4

First let me transform the data you pasted into an actual table:

WITH data AS (
  SELECT REGEXP_EXTRACT(data2, '[0-9]') id, REGEXP_EXTRACT(data2, '[A-Z]') cat
  FROM (
    SELECT SPLIT("""1   A   
    1   B
    1   D
    1   F
    2   B
    2   C
    2   D
    3   A
    3   F""", '\n') AS data1
  ), UNNEST(data1) data2
)

SELECT * FROM data

enter image description here (try sharing a table next time)

Now we can do some manual 1-hot encoding:

SELECT id 
 , MAX(IF(cat='A',1,0)) cat_A
 , MAX(IF(cat='B',1,0)) cat_B
 , MAX(IF(cat='C',1,0)) cat_C
FROM data
GROUP BY id

enter image description here

Now we want to write a script that will automatically create the columns we want:

SELECT STRING_AGG(FORMAT("MAX(IF(cat='%s',1,0))cat_%s", cat, cat), ', ') 
FROM (
  SELECT DISTINCT cat
  FROM data
  ORDER BY 1
)

enter image description here

That generates a string that you can copy paste into a query, that 1-hot encodes your arrays/rows:

SELECT id
,
MAX(IF(cat='A',1,0))cat_A, MAX(IF(cat='B',1,0))cat_B, MAX(IF(cat='C',1,0))cat_C, MAX(IF(cat='D',1,0))cat_D, MAX(IF(cat='F',1,0))cat_F
FROM data
GROUP BY id

enter image description here

And that's exactly what the question was asking for. You can generate SQL with SQL, but you'll need to write a new query using that result.

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • 1
    Hmm not really working for me... I get "The query is too large. The maximum standard SQL query length is 1024.00K characters, including comments and white space characters."... I guess I would have to combine queries or something. This is getting silly. Python it is :P – Charles Dec 10 '20 at 13:49
0

BigQuery has no dynamic column with standardSQL, but depending on what you want to do on the next step, there might be a way to make it easier.

Following code sample groups Cat by ID and uses a JavaScript function to do one-hot encoding and return JSON string.

CREATE TEMP FUNCTION trans(cats ARRAY<STRING>)
RETURNS STRING
LANGUAGE js
AS
"""
// TODO: Doing one hot encoding for one cat and return as JSON string
return "{a:1}";
"""
;
WITH id_cat AS (
SELECT 1 as ID, 'A' As Cat UNION ALL
SELECT 1 as ID, 'B' As Cat UNION ALL
SELECT 1 as ID, 'C' As Cat UNION ALL
SELECT 2 as ID, 'A' As Cat UNION ALL
SELECT 3 as ID, 'C' As Cat)
SELECT ID, trans(ARRAY_AGG(Cat))
FROM id_cat
GROUP BY ID;
Yun Zhang
  • 5,185
  • 2
  • 10
  • 29