2

I have some weblog data in big query which I need to transform to make it easier to use and query. The data looks like:

Big_Query_Table

I want to extract and transform the data within the curled brackets after Results{…..} (colored blue). The data is of the form ‘(\d+((PQ)|(KL))+\d+)’ and there can be 1-20+ entries in the result array. I am only interested in the first 16 entries.

I have been able to extract the data within curled brackets into a new column, using Substr and regext_extract. But I'm unable to SPLIT it into columns (sometimes there is only 1 result and so the delimiter "," is missing. I'm new with regex, may be I can use something like ‘(\d+((PQ)|(KL))+\d+){1}’ etc. to split the data into multiple columns and then pivot it.

Ideal output in my case would be to transform it into something like:

Ouput Table

In the above solution, each row in original table is repeated from 1-16 times depending on the number of items in the Results array.

I’m not completely sure if it’s possible to do this in big query. I’ll be grateful if anyone can help me out a little here.

If this is not possible, then I can have 16 rows for every event with NULL values in Event_details for cases where there are less than 16 entries in result array.

In case both of these are not possible, the last solution would be to have it transformed into something like: back_up_output_table

The reason I want to transform the data is that in most of the cases I would need to find which result array items are appearing and in what order.

ravishchhabra
  • 81
  • 1
  • 6

2 Answers2

2

Check this out: Split string into multiple columns with bigquery. In their case its delimited by spaces. replace the \s with ','

something like:

SELECT  
Regexp_extract(StringToParse,r'^*{(?:[^,]*,){0}(\d+(?:(?:PQ)|(?:KL))+\d+)\s?') as Word0,
Regexp_extract(StringToParse,r'^*{(?:[^,]*,){1}(\d+(?:(?:PQ)|(?:KL))+\d+)\s?') as Word1,
Regexp_extract(StringToParse,r'^*{(?:[^,]*,){2}(\d+(?:(?:PQ)|(?:KL))+\d+)\s?') as Word2,
Regexp_extract(StringToParse,r'^*{(?:[^,]*,){3}(\d+(?:(?:PQ)|(?:KL))+\d+)\s?') as Word3,
FROM
(SELECT 'bla{1234PQ5,6789KL0,1234PQ5,6789KL0,123' as StringToParse)
Community
  • 1
  • 1
N.N.
  • 3,094
  • 21
  • 41
  • Thanks for the answer N.N. This helps me break the words into multiple columns. The only issue is with Word0 - its not being captured in this solution and is always coming as Null. Word0 is between "{" and a comma or }. I tried changing the code to: Regexp_extract(StringToParse,r'^(?:[^{]*,){0}(\d+(?:(?:AUCN)|(?:TRNL))+\d+),\s?') as Word0 but it didn't work – ravishchhabra Oct 10 '14 at 02:26
  • Also, can you suggest a way to transform this into the first solution that I had suggested. I was keeping this solution as my last option, since my first preference would be pivoting the data...so that the different items in results come into different rows – ravishchhabra Oct 10 '14 at 02:37
1

Use SPLIT()

BQ split documentation.

SELECT Event_ID, Event_UserID, Event_SessionID, Keyword,
SPLIT(REGEXP_EXTRACT(Event_details,"Results\{(.*)\}"),",") as Event_details_item
FROM mydata.mytable
David M Smith
  • 2,212
  • 4
  • 21
  • 27
  • Thanks for the answer David...I tried using SPLIT as well...but I'm somehow getting only the first item in the result array. There can be 1 or more items in the result array and I want to capture upto 16 items. – ravishchhabra Oct 10 '14 at 04:42
  • It seems to work: SELECT EventID,UserID ,SessionID,KeyWord,Row_Number() Over(Partition by EventID,UserID ,SessionID,KeyWord) as Rank, SPLIT(Regexp_extract(EventDetails,r'^*\{(.*)\}'),",") FROM (SELECT 123 as EventID, 456 as UserID, 789 as SessionID,'ABS' as KeyWord, 'bla{1234PQ5,6789KL0,1234PQ5,6789KL0,123}' as EventDetails). can you share your query so we can explore it? – N.N. Oct 10 '14 at 05:22