
I'm having a hard time trying to load an API response from import.io into a file or a list.

The endpoint I'm using is https://data.import.io/extractor/{0}/json/latest?_apikey={1}

Previously all my scripts were set up for plain JSON and everything worked well, but now they have decided to use JSON Lines, and somehow the response seems malformed.

The way I tried to adapt my scripts is to read the API response in the following way:

url_call = 'https://data.import.io/extractor/{0}/json/latest?_apikey={1}'.format(extractors_row_dict['id'], auth_key)
r = requests.get(url_call)

with open(temporary_json_file_path, 'w') as outfile:
    json.dump(r.content, outfile)

data = []
with open(temporary_json_file_path) as f:
    for line in f:
        data.append(json.loads(line))

The problem with this approach is that when I check data[0], the entire file content has been dumped into it...

data[1] gives IndexError: list index out of range

Here is an example of data[0][:300]:

u'{"url":"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de","result":{"extractorData":{"url":"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de","resourceId":"23455234","data":[{"group":[{"Brand":[{"text":"Brand","href":"https://www.example.com'

Does anyone have experience with this API's response? All the other JSON Lines reads I do from other sources work fine except this one.

EDIT based on comment:

print repr(open(temporary_json_file_path).read(300))

gives this:

'"{\\"url\\":\\"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de\\",\\"result\\":{\\"extractorData\\":{\\"url\\":\\"https://www.example.com/de/shop?condition[0]=new&page=1&lc=DE&l=de\\",\\"resourceId\\":\\"df8de15cede2e96fce5fe7e77180e848\\",\\"data\\":[{\\"group\\":[{\\"Brand\\":[{\\"text\\":\\"Bra'
  • Wait, what? Your output looks like you (or they) added the `repr()` of the API content, so the JSON lines are encoded as a Python literal. What does `print repr(open(temporary_json_file_path).read(300))` look like? – Martijn Pieters Nov 29 '16 at 10:08
  • Added edit to the question – johan855 Nov 29 '16 at 10:12
  • Yes, the data is *double-encoded*. This looks like a bug on the import.io side. How does their scraping work? Do you write some code? If so, don't encode to JSON on their side, because it looks like output is automatically JSON encoded. – Martijn Pieters Nov 29 '16 at 10:14
  • You'd have to use `json.loads()` on the contents *again* to unwrap the double-encoding now, but not encoding twice would be preferable. – Martijn Pieters Nov 29 '16 at 10:15
  • No, it's the raw output from the API. I'll try the double unwrap and check back if it works, otherwise I'll try and write a quick decoder myself. Thanks! – johan855 Nov 29 '16 at 10:17
  • Yes, but looking at the [import.io tour](https://www.import.io/builder/tour/), the extractor is some kind of service that builds that data set. The most likely explanation for the double encoding is that something pushed single-encoded data into the data set, which then gets encoded again. – Martijn Pieters Nov 29 '16 at 10:18
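A quick sketch of the double-unwrap suggested in the comments above (the record here is a hypothetical stand-in for one double-encoded line from the API):

    import json

    # Simulate a double-encoded line: the record was JSON-encoded once,
    # then the resulting string was JSON-encoded again.
    record = {"url": "https://www.example.com/de/shop", "result": {}}
    double_encoded = json.dumps(json.dumps(record))

    inner = json.loads(double_encoded)  # first pass yields a str containing JSON
    data = json.loads(inner)            # second pass yields the actual dict
    print(data["url"])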

1 Answer


You've got a bug in your code where you are double-encoding:

with open(temporary_json_file_path, 'w') as outfile:
    json.dump(r.content, outfile)

Try:

with open(temporary_json_file_path, 'w') as outfile:
    outfile.write(r.content)
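For completeness, a sketch of the corrected flow. A small fake two-record JSON Lines payload stands in for `r.content` here; note that on Python 3, `r.content` is bytes, so the file should be opened in `'wb'` mode:

    import json
    import os
    import tempfile

    # Fake JSON Lines payload standing in for r.content (raw bytes from the API).
    content = b'{"url": "https://www.example.com/a"}\n{"url": "https://www.example.com/b"}\n'

    path = os.path.join(tempfile.mkdtemp(), 'latest.jsonl')
    with open(path, 'wb') as outfile:
        outfile.write(content)  # write the bytes verbatim; no json.dump()

    data = []
    with open(path) as f:
        for line in f:
            data.append(json.loads(line))

    print(len(data))  # one record per line, so two entries

With `json.dump(r.content, ...)` removed, each line of the file is a single JSON document again and the line-by-line `json.loads()` loop from the question works unchanged.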