In the GitHub Archive (http://www.githubarchive.org/) that Ilya Grigorik provides, I found that in many of the .gz files some consecutive events are logged on the same line.
For example, take 2011-03-15-21.json.gz.
To download it, run: wget http://data.githubarchive.org/2011-03-15-21.json.gz
In this file, if you search for id 1484832, you will find two consecutive events (JSON objects) on the same line; see http://codebeautify.org/jsonviewer/2cb891
That line is the concatenation of
http://codebeautify.org/jsonviewer/c7e18e
and
http://codebeautify.org/jsonviewer/945d56
What is the impact? I was reading the file line by line and passing each line to Python's json.loads (why Python? Because I find it comfortable for dealing with JSON). For this line it reported invalid JSON, because the line is a concatenation of two JSON objects.
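A small sketch that reproduces the failure described above. The two events here are made up and stand in for the real archive line; the helper name try_load is hypothetical.

```python
import json

# Two made-up events glued together on one line, as in the archive file.
line = '{"id": 1, "type": "PushEvent"}{"id": 2, "type": "WatchEvent"}'

def try_load(text):
    """Return the parsed object, or an error message if parsing fails."""
    try:
        return json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        return "invalid: %s" % exc

# json.loads expects exactly one JSON value, so the second object is rejected.
print(try_load(line))
```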
Questions:
1) How did you handle this kind of bug when you processed the GitHub Archive data?
2) I already have the data locally, so how can I work around this problem? Should I write code specific to this case? The code I wrote looks like this:
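import json
# Naive split: assumes exactly two objects and that '}{' never occurs inside a string value.
jsonlist = line.split('}{')
# Python 3's json.loads takes no positional encoding argument; decode the bytes as ISO-8859-1 when reading the file instead.
first = json.loads(jsonlist[0] + '}')  # load and navigate through this json
second = json.loads('{' + jsonlist[1])  # load and navigate through this json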
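A possible alternative to splitting on '}{', sketched here with the standard library's json.JSONDecoder.raw_decode, which parses one JSON value and reports where it ended. This handles any number of objects per line and is not confused by '}{' inside a string. The function names iter_json_objects and iter_events are hypothetical, and ISO-8859-1 is only the encoding assumed in the code above, not something confirmed about the archive.

```python
import gzip
import json

def iter_json_objects(line):
    """Yield every JSON value found back to back in `line`."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(line)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and line[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Parse one value starting at pos; pos becomes the index just past it.
        obj, pos = decoder.raw_decode(line, pos)
        yield obj

def iter_events(path, encoding="ISO-8859-1"):
    """Yield every event in a gzipped archive file, one or more per line."""
    with gzip.open(path, "rt", encoding=encoding) as fh:
        for line in fh:
            for event in iter_json_objects(line):
                yield event
```

With the local copy of the file, something like `for event in iter_events("2011-03-15-21.json.gz"): ...` would then visit both events on the problem line separately.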