0

I have a large table with tweets containing a username and tweet text per row. The tweet text has often mentionings (@username). I want to extract all usernames and build a new table for a social network analysis, where each row has one mentioning.

Example rows:

|-------------------|--------------------------------------|
|      username     |     tweet                            |
|-------------------|--------------------------------------|
|      userA        |     great stuff @userC and @userB    |
|-------------------|--------------------------------------|
|      userB        |     thanks for mentioning @userE     |
|-------------------|--------------------------------------|

Shall result in:

 |-------------------|--------------------------------------|
 |      tweet_by     |     repied_to                        |
 |-------------------|--------------------------------------|
 |      userA        |     userC                            |
 |-------------------|--------------------------------------|
 |      userA        |     userB                            |
 |-------------------|--------------------------------------|
 |      userB        |     userE                            |
 |-------------------|--------------------------------------|

I found this question, but I could not figure out a solution with split() and regexp_extract: Transform data in Google bigquery - extract text, split it into multiple columns and pivoting the data

Community
  • 1
  • 1
crisscross
  • 1,675
  • 2
  • 18
  • 28

1 Answers1

1

try below simple option. it should work as your extraction criteria is quite simple I think. Unless you want to handle some edge cases

SELECT 
  username AS tweet_by, 
  SPLIT(tweet, ' ') AS repied_to 
FROM YourTable
HAVING LEFT(repied_to, 1) = '@'

Added - to address potential use cases like

userA     great stuff @userC&@userB  
userB     thanks for mentioning  @userE!  
userC     great stuff  @userC,@userB  

Query

SELECT
  tweet_by,
  REPLACE(word, '@', '') AS repied_to
FROM (  
  SELECT 
    username AS tweet_by,
    SPLIT(REGEXP_REPLACE(tweet, r'([^\w@])', ' '), ' ') AS word 
  FROM YourTable
  HAVING LEFT(word, 1) = '@'
)
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Thanks a lot that works great. Can you maybe give an example of such an edge case? – crisscross Apr 01 '16 at 18:22
  • 1
    didn't mean anything in particular, but let's say if it is just arbitrary text as 'great stuff \@userC,\@userB ' - query in answer will just extract '\@userC,\@userB' - so in this/such cases you would need to put more logic into query in order to extract \@userC and \@userB separately - (btw- ignore \ in \@) – Mikhail Berlyant Apr 01 '16 at 20:13