
I have a json file storing some user information, including id, name and url. The json file looks like this:

{"link": "https://www.example.com/user1", "id": 1, "name": "user1"}
{"link": "https://www.example.com/user1", "id": 2, "name": "user2"}

This file was written by a Scrapy spider. Now I want to read the urls from the json file and scrape each user's webpage, but I cannot load the data from the json file.

At the moment I have no idea how to get these urls. I think I should read the lines from the json file first. I tried the following code in the Python shell:

import json
f = open('links.jl')
line = json.load(f)

I got the following error message:

    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 138 column 497 (char 498 - 67908)

I did some searching online, which suggested that the json file may have formatting issues. But the json file was created and populated with items by a Scrapy pipeline. Does anybody have a clue what caused the error, and how to solve it? Any suggestions on reading the urls?

Thanks a lot.

petezurich
Olivia

5 Answers


Those are JSON lines, as the exporter name implies.

Take a look at scrapy.contrib.exporter and see the difference between JsonItemExporter and JsonLinesItemExporter.

This should do the trick:

import json

lines = []

# The file is in JSON Lines format: one JSON document per line,
# so parse each line individually rather than the file as a whole.
with open('links.jl', 'r') as f:
    for line in f:
        lines.append(json.loads(line))
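Since the end goal is the URLs, you can pull the "link" field out of each parsed line. A short sketch on sample data matching the question's format (the sample string stands in for the contents of links.jl):

```python
import json

# Two records in JSON Lines format, mirroring the question's links.jl.
sample = (
    '{"link": "https://www.example.com/user1", "id": 1, "name": "user1"}\n'
    '{"link": "https://www.example.com/user2", "id": 2, "name": "user2"}\n'
)

# Parse each non-blank line and collect its "link" value.
urls = [json.loads(line)["link"] for line in sample.splitlines() if line.strip()]
print(urls)
```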
petezurich
marius_5

Hmm... that exception is interesting... I'll just... leave this here (without warranty or good conscience).

import json
import re

parse_err = re.compile(
    r'Extra data: line \d+ column \d+'
    r' - line \d+ column \d+'
    r' \(char (\d*).*')

def recover_bad_json(data):
    while data:
        try:
            yield json.loads(data)
            return
        except ValueError, e:
            char = parse_err.match(e.args[0]).group(1)
            maybe_data, data = data[:int(char)], data[int(char):]
            yield json.loads(maybe_data)

CORPUS = r'''{"link": "https://www.domain.com/user1", "id": 1, "name": "user1"}

{"link": "https://www.domain.com/user1", "id": 2, "name": "user2"}
'''

gen_recovered = recover_bad_json(CORPUS)

# The corpus above contains two records, so the generator yields twice.
print gen_recovered.next()
print gen_recovered.next()
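On Python 3, note that the `except ValueError, e` syntax and `gen.next()` are gone, and `json.JSONDecodeError` exposes the failure offset directly as `e.pos`, so the regex scraping above isn't needed. A rough Python 3 equivalent, with the same caveats and the same lack of warranty:

```python
import json

def recover_bad_json(data):
    """Yield one parsed object at a time from concatenated JSON documents.

    Python 3 sketch: JSONDecodeError.pos is the offset where the
    "Extra data" begins, so split the string there and keep going.
    """
    while data:
        try:
            yield json.loads(data)
            return
        except json.JSONDecodeError as e:
            chunk, data = data[:e.pos], data[e.pos:]
            yield json.loads(chunk)

docs = list(recover_bad_json('{"a": 1} {"b": 2}'))
print(docs)
```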
SingleNegationElimination

If you suspect that a JSON document may be malformed, I recommend submitting the document to JSONLint. The tool will prettify the document formatting, and highlight any structural or style issues encountered during parsing. I've used this tool in the past to find extra commas and broken quotation marks in JSON document generators.
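If you'd rather check programmatically than paste into a web tool, json.loads itself reports where parsing went wrong; a small sketch (the check_json helper here is just for illustration):

```python
import json

def check_json(text):
    """Return None if text parses as a single JSON document,
    otherwise return the parser's error message."""
    try:
        json.loads(text)
        return None
    except ValueError as e:  # json.JSONDecodeError subclasses ValueError
        return str(e)

print(check_json('{"id": 1, "name": "user1"}'))  # valid: prints None
print(check_json('{"id": 1}\n{"id": 2}'))        # two documents: "Extra data" error
```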

Willi Ballenthin

I've seen this kind of formatting from poorly designed JSON APIs before. It may not be the best solution, but here is a small function I use to convert this kind of output into a dict that contains all of the result objects in a list.

import json

def json_parse(data):
    # Join the records with commas and wrap them in a single top-level object.
    d = data.strip().replace("\n\n", ",")
    d = '{"result":[' + d + ']}'
    return json.loads(d)

You may need to tinker with it a bit depending on the number of newlines separating the records. Read the file with .read(), call json_parse on the data, and you should be able to iterate over everything by accessing data["result"].

It would be better if you could have your scraping results produce valid JSON, but in the meantime something like this can work.
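For illustration, assuming the records are separated by exactly one blank line (two newlines) as the function expects, it behaves like this:

```python
import json

def json_parse(data):
    # Join the records with commas and wrap them in a single top-level object.
    d = data.strip().replace("\n\n", ",")
    d = '{"result":[' + d + ']}'
    return json.loads(d)

raw = '{"id": 1, "name": "user1"}\n\n{"id": 2, "name": "user2"}'
data = json_parse(raw)
for item in data["result"]:
    print(item["name"])
```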

Anorov

AFAIK, a JSON file should contain a single object. In your case you have several:

{"link": "https://www.domain.com/user1", "id": 1, "name": "user1"}

{"link": "https://www.domain.com/user1", "id": 2, "name": "user2"}

I would do something like:

Python 2.7.3 (default, Sep 26 2012, 21:51:14) 
>>> import json
>>> inpt_json = """{"link": "https://www.domain.com/user1", "id": 1, "name": "user1"}
...     
...     {"link": "https://www.domain.com/user1", "id": 2, "name": "user2"}"""

>>> for line in inpt_json.splitlines():
...     line = line.strip()
...     if line:
...             print json.loads(line)
... 
{u'link': u'https://www.domain.com/user1', u'id': 1, u'name': u'user1'}
{u'link': u'https://www.domain.com/user1', u'id': 2, u'name': u'user2'}
>>> 

So, saying "I have a json file storing some user information..." is not correct: Scrapy stores the output as a file with JSON-encoded lines.

warvariuc