
I have the following dataframe:

import re
import pandas as pd
corpus = pd.DataFrame({"tweet": ["@blah Check tihs out @hay! This bear loves jumping on this plant!",
                                 "I can't bear the noise from that power plant. It makes me jump."]})

...and I want to remove the user mentions, i.e. "@blah" and "@hay".

I tried the following regex but this just removed the "@":

corpus["tweet"] = [re.sub(r'^@.*\s+$',' ', str(tweet)) for tweet in corpus["tweet"]]

What's the regex that I need to use to remove the whole username rather than just the @?

code_to_joy
  • try this `@\w+` – luigigi Jun 05 '20 at 06:13
  • @luigigi has a good solution for you. For more complex removals that are not covered by [`\w`](https://www.w3schools.com/jsref/jsref_regexp_wordchar.asp), try a [**lookbehind**](https://www.regular-expressions.info/lookaround.html). – leopardxpreload Jun 05 '20 at 06:16
  • Please be more precise. Do you wish to replace every substring that begins `'@'` and is followed by one or more lower case letters with an empty string? – Cary Swoveland Jun 05 '20 at 06:17

2 Answers


This will remove @ followed by one or more non-whitespace characters.

The trailing \s* also removes any whitespace after the mention. The question does not strictly ask for this, but it is probably what is intended: otherwise the spaces on either side of a removed @mention are left behind as a double space in the output.

re.sub(r'@\S+\s*', '', str(tweet))
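
As a minimal sketch of how that pattern behaves on the two tweets from the question (the names `MENTION`, `strip_mentions` and `cleaned` are only illustrative, and a plain list stands in for the dataframe column):

```python
import re

tweets = [
    "@blah Check tihs out @hay! This bear loves jumping on this plant!",
    "I can't bear the noise from that power plant. It makes me jump.",
]

# '@' followed by one or more non-whitespace characters,
# plus any whitespace that comes directly after the mention.
MENTION = re.compile(r"@\S+\s*")


def strip_mentions(text):
    """Remove every @mention and trim whitespace left at either end."""
    return MENTION.sub("", text).strip()


cleaned = [strip_mentions(t) for t in tweets]
for line in cleaned:
    print(line)
```

Because `\S+` takes everything up to the next space, the "!" attached to "@hay!" is removed along with the mention. On the dataframe itself, something like `corpus["tweet"].str.replace(r"@\S+\s*", "", regex=True)` should give the same result without the list comprehension.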
alani

Thanks to luigigi for this answer (it worked and it's really simple):

@\w+

code_to_joy
  • You wish to match `"@_______"` and `"@00000000"`? If you like @luigigi's suggestion perhaps suggest that he post an answer. @alaniwi is about to make a good point. – Cary Swoveland Jun 05 '20 at 06:23
  • The question title says any characters except whitespace. The \w may be closer to what you really intended, but it is worth noting that it will not match punctuation, for example. – alani Jun 05 '20 at 06:24
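
To make the difference raised in these comments concrete, here is a small comparison of the two patterns. The sample string is made up for illustration ("@user-name" is a hypothetical handle; "@_______" is the case from the comment above):

```python
import re

sample = "@blah Check tihs out @hay! cc @user-name @_______"

# \w matches letters, digits and underscore only, so it stops at punctuation.
word_mentions = re.findall(r"@\w+", sample)

# \S matches any non-whitespace character, so it keeps going until a space.
nonspace_mentions = re.findall(r"@\S+", sample)

print(word_mentions)
print(nonspace_mentions)
```

With `@\w+` the match for "@hay!" stops before the "!", and "@user-name" is only matched as far as "@user", so "-name" would be left behind after substitution. With `@\S+` the trailing "!" is swallowed together with the mention. Which one is right depends on what counts as a username in the data.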