I have values like '<U+6B66>'... in one column (company_name). Please suggest a robust method to remove them or convert them to readable strings.

  • Use a regex to remove `<.....>`? `re.sub(r'<[^>]*>', '', st)` – Pygirl Sep 07 '21 at 17:23
  • If you have a pandas DataFrame, you can use apply/map. Please add a sample DataFrame. – Pygirl Sep 08 '21 at 08:28
  • What does U+6B66 represent? An emoji or a symbol? If so, we can map them. – Pygirl Sep 08 '21 at 08:28
  • These are company names in Russian, Chinese, Japanese... –  Sep 08 '21 at 09:48
  • Only the code below works for me, and it removes such rows entirely. The main issue is data loss, since most of the company_names contain such values. `df.drop(df[df.company_name.str.contains(r'[^0-9a-zA-Z]')].index, inplace=True)` –  Sep 08 '21 at 09:49

1 Answer

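Try using a regex and `encode`. Note that this removes the `<U+....>` placeholders and any other non-ASCII characters rather than converting them, and the result is a `bytes` object: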

import re

string_unicode = " xyz <U+6B66> Æ for \u200c ab 23#. "
# remove every <...> placeholder, then drop the remaining non-ASCII characters
string_encode = re.sub(r'<[^>]*>', '', string_unicode)
string_encode = string_encode.encode("ascii", "ignore")

string_encode:

b' xyz   for  ab 23#. '
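
If the goal is to keep the company names rather than drop them, the placeholders could instead be converted back to the characters they stand for. The following is only a sketch, assuming every placeholder has the exact form `<U+XXXX>` with 4 to 6 hex digits; the names `PLACEHOLDER` and `decode_placeholders` are made up for this example and are not part of any library.

```python
import re

# Assumed placeholder format: "<U+" followed by 4 to 6 hex digits and ">".
PLACEHOLDER = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")

def decode_placeholders(text):
    """Replace each <U+XXXX> placeholder with the character it names."""
    def to_char(match):
        code_point = int(match.group(1), 16)
        # Leave the placeholder untouched if the value is outside the Unicode range.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)
    return PLACEHOLDER.sub(to_char, text)

print(decode_placeholders("<U+6B66><U+7530> Ltd"))  # 武田 Ltd
```

On a DataFrame this could be applied with `df["company_name"].map(decode_placeholders, na_action="ignore")`, which skips missing values instead of passing them to the function.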
Pygirl