I am scraping a website that returns JSON data.

https://pastebin.com/R50eTqrD is the output of print(repr(string))

https://pastebin.com/VH6JrDMG is the output of print(string)

I am doing

resp = json.loads(resp)

But it's giving me this error:

ValueError: Invalid \escape: line 1 column 170 (char 169)

I found a solution here that suggested doing

resp = json.loads(HTMLParser().unescape(resp.decode('unicode-escape')))

But it now throws this error

UnicodeEncodeError: 'ascii' codec can't encode characters in position 51-59: ordinal not in range(128)

I have tried several solutions like

json.loads(HTMLParser().unescape(resp.decode('unicode-escape')).encode("utf-8"))

and many more, but none of them worked for me.

Umair Ayub

1 Answer

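The problem is the \x3E sequences in the string: \x is not a valid JSON escape (valid JSON would encode that character as \u003E), so the parser rejects it. If s holds the string, try this, which replaces each escape sequence with the actual character: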

json.loads(s.replace(r'\x3E', '\x3E'))
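If the site might emit other \xZZ sequences besides \x3E, the same idea can be generalized. The following is only a sketch, assuming the only defect in the payload is \xZZ escapes; `fix_hex_escapes` and `loads_lenient` are hypothetical helper names, not part of the json module. It rewrites each \xZZ as the equivalent \u00ZZ, which JSON does accept, and leaves an escaped backslash followed by xZZ untouched.

```python
import json
import re

# An even run of backslashes (escaped backslashes, kept as-is) followed by
# a \xZZ escape. The lookbehind stops the match from starting in the middle
# of a backslash run, so an escaped backslash before "xZZ" is not rewritten.
_HEX_ESCAPE = re.compile(r'(?<!\\)((?:\\\\)*)\\x([0-9a-fA-F]{2})')


def fix_hex_escapes(text):
    # Hypothetical helper: turn each \xZZ into \u00ZZ, which is valid JSON.
    return _HEX_ESCAPE.sub(lambda m: m.group(1) + '\\u00' + m.group(2), text)


def loads_lenient(text):
    # Hypothetical helper: repair the escapes, then parse as usual.
    return json.loads(fix_hex_escapes(text))


# Made-up sample in the same style as the scraped response.
sample = r'{"html": "\x3Cb\x3Ehi\x3C/b\x3E"}'
data = loads_lenient(sample)
```

A regex replacement function is used instead of a replacement template so that the backslash in \u00 does not need a second layer of escaping.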
Tomáš Linhart
  • What's "wrong" with `\x3E`? - Please provide codepoint values/escapes if relevant for disambiguation. – user2864740 Jan 03 '18 at 07:39
  • Ohh raw string vs normal.. got it.. the JSON source is encoding as `\x3E` (invalid/broken) and not `\u003E` (valid JSON encoding).. I wonder if it might return `\xZZ` for other values of ZZ in the future. – user2864740 Jan 03 '18 at 07:41
  • Can you please add a little explanation in the answer? – Umair Ayub Jan 03 '18 at 07:44
  • @Umair JSON parser gives exact position where it encounters an error - *char 169*. I looked at that part of the string and saw those characters. JSON probably can't unescape it so I replaced the escape sequence with the actual character. – Tomáš Linhart Jan 03 '18 at 07:59
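The debugging step described in the last comment (looking at the string around the reported character position) can be sketched as follows. This assumes Python 3.5 or later, where the parser raises `json.JSONDecodeError` with a `pos` attribute; the question itself appears to use Python 2, where only a plain ValueError with the position in its message is available. `error_context` is a hypothetical helper name.

```python
import json


def error_context(text, width=20):
    # Hypothetical helper: return the slice of text around the position
    # where json.loads fails, or None if the text parses cleanly.
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        start = max(exc.pos - width, 0)
        return text[start:exc.pos + width]
    return None


# Made-up sample containing the same kind of invalid escape.
bad = r'{"a": "x\x3Ey"}'
context = error_context(bad)
```

Printing `context` shows the characters on either side of the failure, which is usually enough to spot an invalid escape such as \x3E.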