I am making a REST API call with the Requests library:

import requests
response = requests.get("https://urltomaketheapicall", headers={'authorization': 'bearer {0}'.format("7777777777777777777777777777")}, timeout=5)

When I do data = response.json(), I get a key with these values:

{'devices': '....iPhone\xa05S, iPhone\xa06, iPhone\xa06\xa0Plus, iPhone\xa06S'}

When I do print(response.encoding), I get None.
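
A rough way to see where that None comes from (illustrative only, not output from my actual call):

# requests leaves response.encoding as None when the Content-Type header
# carries no charset; response.json() still falls back to UTF-8 detection
# for JSON bodies.
print(response.headers.get('Content-Type'))
print(response.apparent_encoding)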

When I do print(type(data['devices'])), I get <class 'str'>.

If I do print(data['devices']), I get '....iPhone 5S, iPhone 6, iPhone 6 Plus, iPhone 6S' without the special characters.

Now if I do

new_dict = {}
new_val = data['devices']
new_dict["devices"] = new_val
print(new_dict["devices"])

I will get the special characters in the new dictionary as well.

Any ideas?

I want to get rid of the special characters because I need to read this JSON into a PySpark dataframe, and with those characters I get a _corrupted_record:

# data is the dict returned by response.json()
rd = spark.sparkContext.parallelize([data])
df = spark.read.json(rd)

I want to avoid solutions like .replace("\xa0", " ").
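
For reference, spark.read.json also accepts an RDD of JSON strings, so one variant I have been looking at is serializing the dict before handing it over (a minimal sketch reusing the spark session and data dict from above; the json.dumps step is an assumption on my part):

import json

# Sketch only: hand Spark a valid JSON string instead of a Python dict; the
# \xa0 characters stay inside the "devices" value as ordinary characters.
rd = spark.sparkContext.parallelize([json.dumps(data)])
df = spark.read.json(rd)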

elvainch

1 Answer

\xa0 is a no-break space. It's simply part of the string. It only shows up escaped like that because you're dumping the repr of an entire dict; if you print the individual string, it prints as an actual no-break space:

>>> print({'a': '\xa0'})
{'a': '\xa0'}
>>> print('\xa0')
 
>>>
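
If you want to confirm exactly which character you're dealing with, the standard library's unicodedata module will name it:

>>> import unicodedata
>>> unicodedata.name('\xa0')
'NO-BREAK SPACE'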
deceze
  • Check my edit: with the special characters I cannot put it in a PySpark dataframe. – elvainch Jun 11 '20 at 08:10
  • Ask a separate question focused on that specifically, then. I don't know PySpark, so I can't say whether you're simply doing it wrong or whether it simply can't handle no-break spaces. – deceze Jun 11 '20 at 08:12