-2

I am working on a public dataset where I need to parse each row of the Table Posts. So, for example, three rows under the column tags have values:

1. <c#><.net>
2. <android> <python> <sockets> <android-studio><socket.io>
3. <android-studio><socket.io><.net>

I need to display the output as:

1. c#
2. .net
3. android
4. python
5. sockets
6. android-studio
7. socket.io

Is there a way I could do it in BigQuery?

madi
  • 19
  • 1
  • 4

2 Answers2

1

You can first use REGEXP_EXTRACT_ALL to match all the words inside < >'s, and then UNNEST the resulting arrays to flatted the output:

select e from (
select regexp_extract_all(x, r'<([^>]*)>') e
from unnest([
"<c#><.net>",
"<android> <python> <sockets> <android-studio><socket.io>",
"<android-studio><socket.io><.net>"
]) x), unnest(e) e

Produces

Row e   
1       c#
2       .net
3       android
4       python
5       sockets
6       android-studio
7       socket.io
8       android-studio
9       socket.io
10      .net

And to eliminate duplicates - run DISTINCT:

select distinct e from (
select regexp_extract_all(x, r'<([^>]*)>') e
from unnest([
"<c#><.net>",
"<android> <python> <sockets> <android-studio><socket.io>",
"<android-studio><socket.io><.net>"
]) x), unnest(e) e
Mosha Pasumansky
  • 13,206
  • 5
  • 32
  • 55
0

Below is for BigQuery Standard SQL

#standardSQL
SELECT DISTINCT tag
FROM `project.dataset.table`,
UNNEST(REGEXP_EXTRACT_ALL(tags, r'<([^>]*)>')) tag

You can test/play with above using sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '<c#><.net>' tags UNION ALL
  SELECT '<android> <python> <sockets> <android-studio><socket.io>' UNION ALL
  SELECT '<android-studio><socket.io><.net>'
)
SELECT DISTINCT tag
FROM `project.dataset.table`,
UNNEST(REGEXP_EXTRACT_ALL(tags, r'<([^>]*)>')) tag  

with result

Row tag  
1   c#   
2   .net     
3   android  
4   python   
5   sockets  
6   android-studio   
7   socket.io    
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230