29

I have a page URL column components of which are delimited by /. I tried to run the SPLIT() function in BigQuery but it only gives the first value. I want all values in specific columns.

I don't understand how to use the Regexp_extract() example mentioned in Split string into multiple columns with bigquery.

I need something similar to REGEX_SPLIT_TO_TABLE(<String>, <DELIMITER>) which converts a single string into multiple columns.

Query:

SELECT PK, 
DATE(TIMESTAMP(CONCAT(SUBSTR(date,1,4),'-',SUBSTR(date,5,2),'-',SUBSTR(date,7,2),' 00:00:00'))) as visit_date,
hits_page_pagePath,
split(hits_page_pagePath,'/')
FROM [Intent.All2mon] limit 100
Community
  • 1
  • 1
Saumil Agrawal
  • 311
  • 1
  • 3
  • 6
  • what does the data look like? – Felipe Hoffa Nov 21 '14 at 12:20
  • Hi Felipe, Can you please help me understand how to prepare regular expression to extract information from each entity within '/' symbols? – Saumil Agrawal Nov 21 '14 at 14:10
  • You can't generate a variable number of columns in BigQuery, only a constant number of columns, so you can't do exactly what you want. If you can set an upper bound on the number of URL components, then you can generate that many columns and make them NULL when a particular URL has fewer components. – sprocket Nov 22 '14 at 19:19

5 Answers5

77

2018 standardSQL update:

#standardSQL
SELECT SPLIT(path, '/')[OFFSET(0)] part1,
       SPLIT(path, '/')[OFFSET(1)] part2,
       SPLIT(path, '/')[OFFSET(2)] part3
FROM (SELECT "/a/b/aaaa?c" path)

Now I understand you want them in different columns.

An alternative to the query you provided:

SELECT FIRST(SPLIT(path, '/')) part1,
       NTH(2, SPLIT(path, '/')) part2,
       NTH(3, SPLIT(path, '/')) part3
FROM (SELECT "/a/b/aaaa?c" path)

NTH(X, SPLIT(s)) will provide the Xth value from the SPLIT. FIRST(s) is the same as NTH(1, s)

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • 1
    can you provide a check if the [OFFSET(2)] doesn't exist in some cases so it won't overflow? Something like: `IF(SPLIT(path, '/')[OFFSET(2)].doesnt_exist), Null, SPLIT(path, '/')[OFFSET(2)]` ? – Lev May 24 '19 at 14:32
  • 3
    Prefix with SAFE. - or post a new question ;) – Felipe Hoffa May 24 '19 at 14:34
  • 3
    @FelipeHoffa, the second query doesn't seem to work in BigQuery. But the first one works cool! – Haipeng Su Jan 22 '20 at 18:09
6

You can also try the following with SPLIT function, however you need to know how many '/' your url would have or make enough entries so that if your url contains more '/' then you can still get those values in the seperate columns

  SPLIT(`url`, '/')[safe_ordinal(1)] AS `Col1`, 
  SPLIT(`url`, '/')[safe_ordinal(2)] AS `Col2`,
  SPLIT(`url`, '/')[safe_ordinal(3)] AS `Col3`, 
  SPLIT(`url`, '/')[safe_ordinal(4)] AS `Col4`,
  .
  .
  SPLIT(`url`, '/')[safe_ordinal(N)] AS `ColN`
LondonUK
  • 437
  • 1
  • 8
  • 20
2

in standard sql, you can use the following functions:

array[OFFSET(zero_based_offset)]
array[ORDINAL(one_based_ordinal)]

so

SELECT SPLIT(path, '/')[OFFSET(1)] part2,
       SPLIT(path, '/')[ORDINAL(2)] part2_again,
       SPLIT(path, '/')[ORDINAL(3)] part3
FROM (SELECT "/a/b/aaaa?c" path)

part2   part2_again part3    
a       a           b

part1 in this case, is empty string (before the first slash)

marengaz
  • 1,639
  • 18
  • 28
0

This works for me:

SELECT SPLIT(path, '/') part
FROM (SELECT "/a/b/aaaa?c" path)

Row part     
1   a    
2   b    
3   aaaa?c

Not sure why it wouldn't work for you. What does your data look like?

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • 1
    SPLIT() splits the column value into rows without adding a row_number against it. – Saumil Agrawal Nov 21 '14 at 13:49
  • What I need, is a way to segregate every instance of string between '/' into separate columns...without running a separate split() query for each possible instance... – Saumil Agrawal Nov 21 '14 at 13:50
  • I have built this query: SELECT date, hits_time, fullVisitorId, visitNumber, hits_hitNumber, hits_page_pagePath, REGEXP_EXTRACT(hits_page_pagePath,r'/(\w*)\/') as one, REGEXP_EXTRACT(hits_page_pagePath,r'/\w*\/(\w*)') as two, REGEXP_EXTRACT(hits_page_pagePath,r'/.*\/.*\/(.*)\/') as three from [Intent.All2mon] limit 1000 but it still doesnt separate out the strings... – Saumil Agrawal Nov 21 '14 at 15:09
  • that should have worked - can you share a sample dataset so I can run queries over your data? I don't have access to [Intent.All2mon] – Felipe Hoffa Nov 21 '14 at 19:21
  • 3
    NOTE: If you select a destination table and uncheck "flatten results", you will get a single row with a repeated parts column. – Michael Sheldon Nov 24 '14 at 17:34
0

Solved it in a way.

   SELECT
   date, 
   hits_time, 
   fullVisitorId, 
   visitNumber, 
   hits_hitNumber,
   X.page_path,
   REGEXP_EXTRACT(X.page_path,r'/(\w*)\/') as one,
   REGEXP_EXTRACT(X.page_path,r'/\w*\/(\w*)') as two,
   REGEXP_EXTRACT(X.page_path,r'/\w*\/\w*\/(\w*)') as three,
   REGEXP_EXTRACT(X.page_path,r'/\w*/\w*/\w*\/(\w*)\/.*') as four
   from
   (
   select 
   date, hits_time, fullVisitorId, visitNumber, hits_hitNumber,
   REGEXP_REPLACE (hits_page_pagePath, '-', '') as page_path
   from
   [Intent.All2mon]
   ) X 
   limit 1000
Saumil Agrawal
  • 311
  • 1
  • 3
  • 6