0

I have a large data set in Google BigQuery with millions of rows of dirty data (App tracking) that I am trying to clean up. One of my problems is that the same data got sent to different columns for different events triggered in the App. By this I mean that maybe the country was sent to custom dimension 1 for some events, but to custom dimension 147 for other events. I can't post actual data, but a SELECT * FROM table_with_dirty_data would produce something like this:

date | session | eventAction | cd001 | cd002    | cd004    | cd005 
-----|---------|-------------|-------|----------|----------|-------
1    | 1       | 'event_1'   | '1'   | 'Pizza'  | null     | '21'
1    | 1       | 'event_2'   | '10'  | '25'     | 'Pizza'  | '14.56'
1    | 1       | 'event_3'   | '3.1' | null     | '15'     | 'France'
1    | 2       | 'event_1'   | '6'   | 'Burger' | null     | '21'
1    | 2       | 'event_2'   | '21'  | '25'     | 'Burger' | '12.6'

The ultimate goal here is to end up with a clean table that can be analyzed. Something like this:

date | session | eventAction | country  | vendor   | product  | price
-----|---------|-------------|----------|----------|----------|-------
1    | 1       | 'event_1'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 1       | 'event_2'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 1       | 'event_3'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 2       | 'event_1'   | 'Spain'  | '25'     | 'Burger' | '12.6'
1    | 2       | 'event_2'   | 'Spain'  | '25'     | 'Burger' | '12.6'

I am aware that some events will need some degree of statistical imputation and data type casting, but for now I am only concerned with getting every variable into its own column. So I've created an auxiliary table (let's call it matrix) that looks like this:

event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_2'  | 'cd020'    | 'cd146'    | 'cd056'
'event_3'  | 'cd001'    | 'cd020'    | 'cd035'

and so on, where the value in each cell of column variable_# is the column name in the table_with_dirty_data where the information can be found. That is, variable_1 can be found in a column called cd020 for events with event_name 'event_1' and 'even_2', but in a column called cd001 for events with event_name 'event_3'. So basically what matrix does is to map to which custom dimension each variable was sent for each event.

I have hundreds of different events in table_with_dirty_data, and matrix holds all 200 GA custom dimensions, so doing something like

SELECT 
  CASE 
    WHEN event_name = 'event_1' THEN cd020
    WHEN event_name = 'event_2' THEN cd020
    WHEN event_name = 'event_3' THEN cd001
  END AS variable_1
  , CASE
      WHEN event_name = 'event_1' THEN cd035
      WHEN event_name = 'event_2' THEN cd146
      WHEN event_name = 'event_3' THEN cd020
    END AS variable_2
  , CASE
      WHEN event_name = 'event_1' THEN cd120
      WHEN event_name = 'event_2' THEN cd056
      WHEN event_name = 'event_3' THEN cd035
  END AS variable_3
FROM table_with_dirty_data

would take me a looooong time and is very error-prone. What I am trying to do is to use SELECT statements that return the column name (cd###) from the table_with_dirty_data where the information is stored and the use a WHILE to loop over all different events. So, for example, with event_name = 'event_1' it would be something like this:

SELECT 
  CASE
    WHEN event_name = 'event_1'
    THEN (SELECT variable_1 FROM matrix WHERE event_name = 'event_1')
  END AS variable_1
  , CASE
      WHEN event_name = 'event_1'
      THEN (SELECT variable_2 FROM matrix WHERE event_name = 'event_1')
    END AS variable_2
  , CASE
      WHEN event_name = 'event_1'
      THEN (SELECT variable_3 FROM matrix WHERE event_name = 'event_1')
    END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'

The goal here is that I can then loop over an array with all the event_names (something that can be easily achieved). Ultimately I need all events in one table, but I'm ok with having a table for curating each event, as long as it can be done programmatically (something I personally don't know if it's even possible in GBQ...would have to check on that).

The issue is that the SELECT statements I am using get evaluated to a string, so the result of the queries inside the CASE clause are string literals. For example, if I were to run it for event = 'event_1',

SELECT variable_1 FROM matrix WHERE event_name = 'event_1'

evaluates to

'cd020'

which then cause the outer query to become

SELECT 
  CASE
    WHEN event_name = 'event_1' THEN 'cd020'
  END AS variable_1
  , CASE
      WHEN event_name = 'event_1' THEN 'cd035'
    END AS variable_2
  , CASE
      WHEN event_name = 'event_1' THEN 'cd120'
    END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'

which produces a table like this

event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'

instead of returning the value stored in columns cd020, cd035 or cd120 for each time the event 'event_1' is found.

Does anyone know a way to unquote the result of those inner queries so that when the outer query executes, they are transformed into column names (so 'cd020' becomes cd020)??

PS: I am also open to a totally different strategy if anyone knows one.

Daniel
  • 1,005
  • 1
  • 16
  • 22
  • I can recommend you better ways, but can you please provide a few lines of sample data and expected output? Do you want to have a separate table for each event? – Sabri Karagönen Jun 02 '20 at 08:58
  • @SabriKaragönen I've edited the question. Hope it's clearer now. Really looking forward to hearing different ways. Thanks – Daniel Jun 02 '20 at 15:40

2 Answers2

1

Sounds like a case where EXECUTE IMMEDIATE might be helpful. I'm not sure I fully understand the mapping, but it should allow you to compose a dynamic statement based off your matrix table.

However, it is unclear based on the description how unwieldy the generated query might be, as you mention having hundreds of these kinds of events and you may start running into other problems such as query complexity or length limits.

shollyman
  • 4,216
  • 19
  • 17
0

I solved the problem using the EXECUTE IMMEDIATE clause as recommended by @shollyman. I did it for just one of the events but I believe this answers the original question (scaling it to other events is just a matter of writing a WHILE loop). I'll go step by step since I'm not using an actual repex.

Step one I declared a variable called event with the name of the event that the query needs to look for.

DECLARE event STRING DEFAULT 'event_1';

Then I declared a variable that holds the column name where the query needs to look for the variable.

DECLARE variable_name STRING DEFAULT (SELECT variable_1 FROM matrix WHERE event_name = event);

Then I wrote the query as I would normally do, but using the EXECUTE IMMEDIATE clause. I used triple double-quotes so that I could break it into several lines for readability).

EXECUTE IMMEDIATE CONCAT("""
  SELECT
    CASE WHEN event_name = '""", event, "' THEN ", variable_name, """ END AS variable_1 
  FROM table_with_dirty_data
  WHERE event_name = '""", event, """'
""");

If others are going to use this, do take note of the lonely single quotes that I used before or after some triple double quotes. I did that because, for example, the declared variable event, even though it's a string, seems to get concatenated as event (no single quotes around it) and that breaks the query execution.

Daniel
  • 1,005
  • 1
  • 16
  • 22