
I have a json file storing some user information, including id, name and url. The json file looks like this:

{"link": "https://www.example.com/user1", "id": 1, "name": "user1"}
{"link": "https://www.example.com/user1", "id": 2, "name": "user2"}

This file was written by a Scrapy spider. Now I want to read the urls from the json file and scrape each user's webpage, but I cannot load the data from the json file.

At the moment I have no idea how to get these urls. I think I should read the lines from the json file first. I tried the following code in the Python shell:

import json
f = open('links.jl')
line = json.load(f)

I got the following error message:

    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 138 column 497 (char 498 - 67908)

I did some searching online, which suggested that the json file may have formatting issues. But the json file was created and populated with items by a Scrapy pipeline. Does anybody have a clue what caused the error, and how to solve it? Any suggestions on reading the urls?

Thanks a lot.

petezurich
Olivia

5 Answers


Those are JSON lines, as the exporter name implies.

Take a look at scrapy.contrib.exporter and see the difference between JsonItemExporter and JsonLinesItemExporter.

This should do the trick:

import json

lines = []

# The file is in JSON Lines format: one JSON document per line,
# so parse each line individually rather than the file as a whole.
with open('links.jl', 'r') as f:
    for line in f:
        lines.append(json.loads(line))
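Since the end goal is the URLs, you can pull the "link" field out of each parsed line. A short sketch on sample data matching the question's format (the sample string stands in for the contents of links.jl):

```python
import json

# Two records in JSON Lines format, mirroring the question's links.jl.
sample = (
    '{"link": "https://www.example.com/user1", "id": 1, "name": "user1"}\n'
    '{"link": "https://www.example.com/user2", "id": 2, "name": "user2"}\n'
)

# Parse each non-blank line and collect its "link" value.
urls = [json.loads(line)["link"] for line in sample.splitlines() if line.strip()]
print(urls)
```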
petezurich
marius_5

Hmm... that exception is interesting... I'll just... leave this here (without warranty or good conscience).

import json
import re

parse_err = re.compile(
    r'Extra data: line \d+ column \d+'
    r' - line \d+ column \d+'
    r' \(char (\d*).*')

def recover_bad_json(data):
    while data:
        try:
            yield json.loads(data)
            return
        except ValueError, e:
            char = parse_err.match(e.args[0]).group(1)
            maybe_data, data = data[:int(char)], data[int(char):]
            yield json.loads(maybe_data)

CORPUS = r'''{"link": "https://www.domain.com/user1", "id": 1, "name": "user1"}

{"link": "https://www.domain.com/user1", "id": 2, "name": "user2"}
'''

gen_recovered = recover_bad_json(CORPUS)

# The corpus above contains two records, so the generator yields twice.
print gen_recovered.next()
print gen_recovered.next()
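On Python 3, note that the `except ValueError, e` syntax and `gen.next()` are gone, and `json.JSONDecodeError` exposes the failure offset directly as `e.pos`, so the regex scraping above isn't needed. A rough Python 3 equivalent, with the same caveats and the same lack of warranty:

```python
import json

def recover_bad_json(data):
    """Yield one parsed object at a time from concatenated JSON documents.

    Python 3 sketch: JSONDecodeError.pos is the offset where the
    "Extra data" begins, so split the string there and keep going.
    """
    while data:
        try:
            yield json.loads(data)
            return
        except json.JSONDecodeError as e:
            chunk, data = data[:e.pos], data[e.pos:]
            yield json.loads(chunk)

docs = list(recover_bad_json('{"a": 1} {"b": 2}'))
print(docs)
```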
SingleNegationElimination

If you suspect that a JSON document may be malformed, I recommend submitting the document to JSONLint. The tool will prettify the document formatting, and highlight any structural or style issues encountered during parsing. I've used this tool in the past to find extra commas and broken quotation marks in JSON document generators.
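If you'd rather check programmatically than paste into a web tool, json.loads itself reports where parsing went wrong; a small sketch (the check_json helper here is just for illustration):

```python
import json

def check_json(text):
    """Return None if text parses as a single JSON document,
    otherwise return the parser's error message."""
    try:
        json.loads(text)
        return None
    except ValueError as e:  # json.JSONDecodeError subclasses ValueError
        return str(e)

print(check_json('{"id": 1, "name": "user1"}'))  # valid: prints None
print(check_json('{"id": 1}\n{"id": 2}'))        # two documents: "Extra data" error
```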

Willi Ballenthin

I've seen this kind of formatting from poorly designed JSON APIs before. It may not be the best solution, but here is a small function I use to convert this kind of output into a dict that contains all of the result objects in a list.

import json

def json_parse(data):
    # Join the records with commas and wrap them in a single top-level object.
    d = data.strip().replace("\n\n", ",")
    d = '{"result":[' + d + ']}'
    return json.loads(d)

You may need to tinker with it a bit depending on the number of newlines separating the records. Read the file with .read(), call json_parse on the data, and you should be able to iterate over everything by accessing data["result"].

It would be better if you could have your scraping results produce valid JSON, but in the meantime something like this can work.
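For illustration, assuming the records are separated by exactly one blank line (two newlines) as the function expects, it behaves like this:

```python
import json

def json_parse(data):
    # Join the records with commas and wrap them in a single top-level object.
    d = data.strip().replace("\n\n", ",")
    d = '{"result":[' + d + ']}'
    return json.loads(d)

raw = '{"id": 1, "name": "user1"}\n\n{"id": 2, "name": "user2"}'
data = json_parse(raw)
for item in data["result"]:
    print(item["name"])
```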

Anorov

AFAIK, a JSON file should contain a single object. In your case you have several:

{"link": "https://www.domain.com/user1", "id": 1, "name": "user1"}

{"link": "https://www.domain.com/user1", "id": 2, "name": "user2"}

I would do something like:

Python 2.7.3 (default, Sep 26 2012, 21:51:14) 
>>> import json
>>> inpt_json = """{"link": "https://www.domain.com/user1", "id": 1, "name": "user1"}
...     
...     {"link": "https://www.domain.com/user1", "id": 2, "name": "user2"}"""

>>> for line in inpt_json.splitlines():
...     line = line.strip()
...     if line:
...             print json.loads(line)
... 
{u'link': u'https://www.domain.com/user1', u'id': 1, u'name': u'user1'}
{u'link': u'https://www.domain.com/user1', u'id': 2, u'name': u'user2'}
>>> 

So, saying "I have a json file storing some user information..." is not correct: Scrapy stores the output as a file with JSON-encoded lines.

warvariuc