I want to replace HTML character entities with plain strings in a dataframe.

I tried the code below, but it doesn't convert them to strings.

import html
html.unescape(data)

Here is my dataframe. How can I do this?

For reference, this data is the result from the Google Cloud Translation API.

ID  A1                                             A2                                                  A3
1   I don't know if it doesn't meet                Actually it was hard for me to understand that...   I don't know if it doesn't meet my exp...
2   NaN                                            NaN                                                 NaN
3   I think it's a correct web design, at leas...  NaN                                                 This item costs ¥400 or £4.

purplecollar
  • What are you trying to convert to String? Where is the code where you add the data to the dataframe? What is the datatype of the values in your dataframe right now? – Vishakha Lall Feb 06 '20 at 05:45

1 Answer

If you didn't have any NaN's, you could simply use applymap() to have every cell processed by html.unescape.

So if you find it acceptable to convert NaN's to empty strings, you can use:

df.fillna("").applymap(html.unescape)

If you want to preserve NaN's, then a good solution is to use stack() to turn columns into another level of the index, which will suppress NaN entries. Then you can use apply() (since it's a Series now, not a DataFrame) and later unstack() to get it back to its original format:

df.stack().apply(html.unescape).unstack()

But note that this last method will drop any rows or columns made up entirely of NaN's. I'm not sure if that's acceptable to you.
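To see the effect of the stack()/unstack() approach, here is a minimal sketch on made-up data (the column values and index are hypothetical, chosen to resemble the question). The explicit dropna() is my own safeguard rather than part of the one-liner above, since NaN handling in stack() has varied between pandas versions:

```python
import html

import numpy as np
import pandas as pd

# Hypothetical sample data: row 2 is entirely NaN, as in the question.
df = pd.DataFrame(
    {
        "A1": ["I don&#39;t know", np.nan, "it&#39;s fine"],
        "A2": ["hard to understand", np.nan, np.nan],
    },
    index=[1, 2, 3],
)

# Assumption: dropna() changes nothing where stack() already drops NaN,
# and protects html.unescape on versions where stack() keeps NaN.
result = df.stack().dropna().apply(html.unescape).unstack()

print(result)  # row 2 is gone; the NaN in row 3, column A2 is preserved
```

Row 2 disappears because it has no non-NaN cell left to unstack, while the single NaN in row 3 survives.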

One more alternative is to use applymap() but use a lambda and only apply html.unescape to the terms that are not NaN:

df.applymap(lambda x: html.unescape(x) if pd.notnull(x) else x)
filbranden
  • If there are NaN's, why can't I use applymap? – purplecollar Feb 06 '20 at 06:38
  • If you applymap directly, `html.unescape` complains that it can't handle floats. NaN is technically a float number (or more exactly "Not-a-Number"). In any case, that function doesn't know how to handle it – filbranden Feb 06 '20 at 06:40
  • You can use something like `df.applymap(lambda x: html.unescape(x) if pd.notnull(x) else x)`, that will only call `html.unescape` on the terms that are not NaN... – filbranden Feb 06 '20 at 06:42
  • Note that `int` is also not iterable. You need to check if x is a str. – phen0menon Oct 18 '22 at 15:00
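
Following up on the last comment, a minimal sketch of the "check if x is a str" variant might look like this. The helper name `unescape_if_str` and the sample data are hypothetical. I use a column-wise `Series.map` instead of `applymap()` only so the sketch does not depend on a particular pandas version (as far as I know, `applymap()` was renamed to `DataFrame.map` in pandas 2.1):

```python
import html

import numpy as np
import pandas as pd


def unescape_if_str(value):
    """Unescape HTML entities in strings; return anything else unchanged."""
    return html.unescape(value) if isinstance(value, str) else value


# Hypothetical sample data mixing strings, NaN, and an integer.
df = pd.DataFrame(
    {
        "A1": ["I don&#39;t know", np.nan],
        "A3": ["costs &yen;400 or &pound;4", 7],
    }
)

# Column-wise Series.map visits every cell, much like applymap() in the answer.
cleaned = df.apply(lambda col: col.map(unescape_if_str))

print(cleaned)
```

Because the guard tests for `str` rather than for NaN, it leaves NaN, integers, and any other non-string values untouched.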