Remove non-English/unwanted (non-ASCII) values from a pandas column, or convert them to English characters

Question

I have values like '<U+6B66>' in one of my columns (company_name). Please suggest a robust method to remove them or convert them to readable strings.

If you have a pandas DataFrame, you can use apply/map. Please add a sample dataframe. – Pygirl, Sep 08 '21 at 08:28
What does U+6B66 represent? An emoji or symbol? If so, we can map them. – Pygirl, Sep 08 '21 at 08:28
Only the code below works for me to remove such rows, but the main issue is data loss, since most of the company_names are ambiguous like this. df.drop(df[df.company_name.str.contains(r'[^0-9a-zA-Z]')].index, inplace=True) – , Sep 08 '21 at 09:49

score 0 · Answer 1 · answered Sep 07 '21 at 17:26

Try using a regex and encode:

import re

string_unicode = " xyz <U+6B66> Æ for \u200c ab 23#. "
# Drop literal "<...>" placeholders, then discard any remaining non-ASCII characters
string_encode = re.sub(r'<[^>]*>', '', string_unicode)
string_encode = string_encode.encode("ascii", "ignore")

string_encode:

b' xyz   for  ab 23#. '
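
For a dataframe column, the same idea can be wrapped in a function and applied with map. This is only a minimal sketch: the sample rows, the helper name to_ascii and the output column company_name_clean are made up for illustration, and the pattern assumes the placeholders always look like <U+XXXX> with 4 to 6 hex digits.

```python
import re

import pandas as pd

# Assumed placeholder format: "<U+" followed by 4-6 hex digits and ">"
PLACEHOLDER = re.compile(r"<U\+[0-9A-Fa-f]{4,6}>")


def to_ascii(value):
    """Remove <U+XXXX> placeholders and any other non-ASCII characters."""
    if not isinstance(value, str):
        return value  # leave NaN/None and other non-string values untouched
    text = PLACEHOLDER.sub("", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    return " ".join(text.split())  # collapse the whitespace left behind


# Hypothetical sample data; only the column name comes from the question
df = pd.DataFrame({"company_name": ["Acme <U+6B66> Ltd", "Æther Corp", None]})
df["company_name_clean"] = df["company_name"].map(to_ascii)
print(df)
```

With this sample, "Acme <U+6B66> Ltd" becomes "Acme Ltd" and "Æther Corp" becomes "ther Corp". Unlike dropping rows, this keeps every row, but the non-ASCII characters themselves are still lost.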

Pygirl

Can you show the same for a dataframe column? – Sep 08 '21 at 05:44
The above solution does not work for me. – Sep 08 '21 at 05:47
Also, is there a way to convert the above characters to English (ASCII)? – Sep 08 '21 at 07:22
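
On converting rather than removing: one possible approach, sketched here with the standard library only, is to turn the literal <U+XXXX> markers back into the characters they name and then strip accents with unicodedata. The function names and the sample string are hypothetical. This handles accented Latin letters (é to e), but characters such as U+6B66, a CJK character, have no ASCII decomposition and are left unchanged; transliterating those would need a third-party package such as Unidecode, which is not shown here.

```python
import re
import unicodedata

# Assumed placeholder format: "<U+" followed by 4-6 hex digits and ">"
PLACEHOLDER = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")


def _decode(match):
    code = int(match.group(1), 16)
    # Leave the marker as is if it is not a valid Unicode code point
    return chr(code) if code <= 0x10FFFF else match.group(0)


def restore_placeholders(text):
    """Turn literal '<U+6B66>' markers back into the characters they name."""
    return PLACEHOLDER.sub(_decode, text)


def strip_accents(text):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


restored = restore_placeholders("Wuhan <U+6B66> Café")
print(restored)                 # the marker is replaced by the character U+6B66
print(strip_accents(restored))  # "é" becomes "e"; the CJK character stays
```

Either function can be applied to a column with df["company_name"].map(...), provided non-string values such as NaN are skipped first.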
