
I have a dataframe which contains a lot of different emojis and I want to remove them. I looked at answers to similar questions but they didn't work for me.

    index| messages
    ----------------
    1    |Hello!
    2    |Good Morning
    3    |How are you ?
    4    | Good
    5    | Ländern

Now I want to remove all these emojis from the DataFrame so it looks like this:

    index| messages
    ----------------
    1    |Hello!
    2    |Good Morning   
    3    |How are you ?
    4    | Good 
    5    |Ländern

I tried the solution here, but unfortunately it also removes all non-English letters like "ä". How can I remove only the emojis from a DataFrame?

xjcl
Sam

3 Answers


This solution keeps all ASCII and Latin-1 characters, i.e. characters between U+0000 and U+00FF in this list, and drops everything else. For extended Latin plus Greek, use `< 1024` instead of `< 256`:

import pandas as pd

df = pd.DataFrame({'messages': ['Länder ❤️', 'Hello! ']})

filter_char = lambda c: ord(c) < 256
df['messages'] = df['messages'].apply(lambda s: ''.join(filter(filter_char, s)))

Result:

  messages
0  Länder 
1  Hello!

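Note that this does not work for Japanese text, for example, because those characters lie above U+00FF and are removed too. Another problem is that the heart "emoji" is actually a Dingbat (U+2764), which sits inside the Basic Multilingual Plane, so I can't simply keep the whole Basic Multilingual Plane of Unicode and drop the rest.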
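For text that mixes scripts, one possible alternative is to remove a list of emoji code point ranges instead of keeping only low code points. The sketch below is an assumption-laden illustration, not part of the original answer: `EMOJI_PATTERN` and `strip_emoji` are hypothetical names, and the ranges are a rough, non-exhaustive selection (they also catch some non-emoji symbols such as Mahjong tiles and playing cards).

```python
import re

# Rough, non-exhaustive set of ranges that contain most emojis (an assumption).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F000-\U0001FAFF"  # supplementary-plane symbols: pictographs, emoticons, flags
    "\u2600-\u27BF"          # Miscellaneous Symbols and Dingbats (includes the heart U+2764)
    "\uFE0F"                 # variation selector-16, often follows an emoji
    "\u200D"                 # zero-width joiner, glues emoji sequences together
    "]+"
)


def strip_emoji(text: str) -> str:
    """Remove characters that fall in the ranges listed above."""
    return EMOJI_PATTERN.sub("", text)


messages = [
    "Länder \u2764\uFE0F",              # heart plus variation selector
    "Hello! \U0001F1E9\U0001F1EA",      # flag built from two regional indicators
    "こんにちは \U0001F600",             # Japanese text plus a grinning face
]
cleaned = [strip_emoji(m) for m in messages]
print(cleaned)  # ['Länder ', 'Hello! ', 'こんにちは ']
```

On a DataFrame, the same function could be applied with `df['messages'] = df['messages'].map(strip_emoji)`. The trade-off is the reverse of the whitelist above: non-Latin letters survive, but any emoji outside the listed ranges is kept.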

xjcl
  • Works well, extra credits for the flag ;-) – Ruthger Righart Dec 02 '20 at 14:35
  • Thank you a lot this worked for me (Vielen Dank) – Sam Dec 02 '20 at 14:52
  • If this doesn't work for some cases, you can also try `filter(lambda c: c.isalpha(), s)` -- that should handle Japanese for example. But it does filter `!` -- oh well. – xjcl Dec 02 '20 at 15:00
  • We are not supposed to assign a lambda expression to a variable. `df['messages'] = df['messages'].apply(lambda s: ''.join(filter(lambda c: ord(c) < 256, s)))` would be the correct form. – Rahul Kumeriya Sep 08 '21 at 04:06

I think the following answers your question. I added some other characters for verification.

import pandas as pd
df = pd.DataFrame({'messages':['Hello! ', 'Good-Morning ', 'How are you ?', ' Goodé ', 'Ländern' ]})

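# Encoding to Latin-1 with errors='ignore' silently drops every character outside U+0000..U+00FF
df['messages'] = df['messages'].astype(str).apply(lambda x: x.encode('latin-1', 'ignore').decode('latin-1'))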
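To make the behaviour of the encode/decode round trip easier to see, here is a small standalone sketch without pandas. The helper name `keep_latin1` is hypothetical and only used for illustration. It shows that the trick keeps accented Latin letters, but also drops any other character outside Latin-1, such as the euro sign.

```python
def keep_latin1(text: str) -> str:
    """Drop every character that cannot be encoded as Latin-1 (code points above U+00FF)."""
    return text.encode("latin-1", "ignore").decode("latin-1")


print(keep_latin1("Goodé \U0001F600"))  # 'Goodé ' (the emoji is dropped, 'é' is kept)
print(keep_latin1("Ländern"))           # 'Ländern' (unchanged)
print(keep_latin1("\u20ac5"))           # '5' (the euro sign U+20AC is not in Latin-1)

# The round trip gives the same result as filtering on ord(c) < 256.
sample = "Goodé \U0001F600 \u20ac5"
assert keep_latin1(sample) == "".join(c for c in sample if ord(c) < 256)
```

In other words, this approach has the same limits as the `ord(c) < 256` filter: it is fine for Western European text, but it would also strip Japanese, Greek or Cyrillic letters.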
Ruthger Righart

You can use the emoji package:

import emoji

df = ...
df['messages'] = df['messages'].apply(lambda s: emoji.replace_emoji(s, ''))
Guru Stron