pandas str.extractall on complete words

Question

I have a column of tweets. I want to get a list of all mentions inside the tweet using the regex:

\@(\w+)

I tried df.Tweets.str.extractall('\@(\w+)'), but it fails to match the entire word. My guess is that it tries to split each matched word across many columns. I get the following error:

AssertionError: 1 columns passed, passed data had 15 columns.

Note that '\@(\w)' works as expected and returns a result, but only the first letter of each mention. The + that should match the entire word is probably the root of the problem.

This is the ISIS dataset from Kaggle. For example, the first match is on
'Aslm Please share our new account after the previous one was suspended.@KhalidMaghrebi @seifulmaslul123 @CheerLeadUnited'
Using .extract() works fine but only finds the first mention. Using .extractall('\@(\w)') I get:

         0
  match
8 0      K
  1      s
  2      C

which makes sense. But extracting all the complete words gives an error.

The regex you're using and `extractall` are the way to go. I'm guessing it has something to do with the dataframe. We cannot tell unless you share it. – piRSquared, Jul 12 '16 at 15:00
The df in question is the ISIS Kaggle dataset. The first cell to match is `'@AbdirahmanBash2 @KhalidMaghrebi_ @IbnNabih1 @Polder_Mujahid Aslm, we have completed the translation with the exception of a few news'` – DeanLa, Jul 12 '16 at 15:01

score 2 · Answer 1 · answered Jul 12 '16 at 15:21

Apparently pandas puts each capture group into its own column, so the solution is to wrap the whole regex in an additional group:
df.Tweets.str.extractall('(\@(\w+))')

The only difference is the extra pair of parentheses wrapping the whole pattern inside the string.
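
For reference, here is a minimal sketch of the wrapped pattern on a tiny stand-in Series. The `tweets` data is made up from the example tweet in the question, not loaded from the Kaggle file, and the names `mentions` and `handles` are only illustrative. Column 0 holds the outer group (the mention with its `@`) and column 1 the inner group (the bare handle). As far as I can tell, recent pandas releases also accept the single-group pattern from the question, so the extra parentheses may only be needed on older versions.

```python
import pandas as pd

# Stand-in data (assumption): two rows instead of the real Tweets column.
tweets = pd.Series([
    "Aslm Please share our new account after the previous one was suspended."
    "@KhalidMaghrebi @seifulmaslul123 @CheerLeadUnited",
    "no mentions here",
])

# Outer group -> column 0 ("@name"), inner group -> column 1 ("name").
mentions = tweets.str.extractall(r"(@(\w+))")

# Collapse the (row, match) MultiIndex into one list of handles per tweet.
# Rows without any mention do not appear in the result at all.
handles = mentions[1].groupby(level=0).agg(list)
print(handles)
```

Grouping on index level 0 relies on `extractall` keeping the original row label as the first index level and the match number as the second.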

DeanLa


`ValueError: pattern contains no capture groups` – DeanLa Jul 12 '16 at 15:31
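
To illustrate the error quoted in the comment above: `extractall` needs at least one capture group, so a pattern with no parentheses at all is rejected. A small sketch, again on made-up data:

```python
import pandas as pd

# Stand-in data (assumption): a single tweet with two mentions.
s = pd.Series(["@KhalidMaghrebi @seifulmaslul123"])

try:
    s.str.extractall(r"@\w+")  # no parentheses, so no capture group
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```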