I have values like '<U+6B66>'... in one column (company_name). Please suggest a robust method to remove them or convert them to readable strings.
0
- A regex to remove `<.....>`? `re.sub(r'<[^>]*>', '', st)` – Pygirl Sep 07 '21 at 17:23
- If you have a pandas DataFrame, you can use apply/map. Please add a sample DataFrame. – Pygirl Sep 08 '21 at 08:28
- What does U+6B66 represent? An emoji or symbol? If so, we can map them. – Pygirl Sep 08 '21 at 08:28
- These are company names in Russian, Chinese, Japanese... – Sep 08 '21 at 09:48
- Only the code below works for me, and it removes such rows entirely. The main issue is data loss, since most of the company names are like this: `df.drop(df[df.company_name.str.contains(r'[^0-9a-zA-Z]')].index, inplace=True)` – Sep 08 '21 at 09:49
1 Answer
0
Try using regex and encode:
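import re

string_unicode = " xyz <U+6B66> Æ for \u200c ab 23#. "
# Remove every <...> token; [^>]* stops at the first closing bracket
string_encode = re.sub(r'<[^>]*>', '', string_unicode)
# Drop any remaining non-ASCII characters (the result is a bytes object)
string_encode = string_encode.encode("ascii", "ignore")
string_encode:
b' xyz   for  ab 23#. '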
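If the goal is to keep the company names rather than strip them, the `<U+XXXX>` tokens can be converted back to the characters they name. The following is only a minimal sketch: the helper name `decode_escapes` and the assumption that every token is `U+` followed by 4 to 6 hex digits are mine, not something confirmed by the question.

```python
import re

# Matches tokens such as <U+6B66>: "U+" followed by 4 to 6 hex digits.
# (Assumed token format; adjust the pattern if your data differs.)
ESCAPE = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")


def decode_escapes(text):
    """Replace each <U+XXXX> token with the character it names."""
    def to_char(match):
        code_point = int(match.group(1), 16)
        # Leave tokens outside the Unicode range untouched.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return ESCAPE.sub(to_char, text)


print(decode_escapes("<U+6B66><U+7530> Ltd"))  # 武田 Ltd
```

On a pandas column of strings this could presumably be applied with `df["company_name"].map(decode_escapes)`, which keeps the rows instead of dropping them.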

Pygirl