I'm trying to parse and sift through a very big JSON file (about 9 GB) containing tweet metadata. That's why I'm using ijson, since it was the library the community recommended most for files this size. I'm still pretty new to it, but I rigged up this function, which should store values in a list based on certain conditions. While looping through the different JSON objects, it shows me the following error:

parse error: unallowed token at this point in JSON text
          sitive": false, "lang": "en"},  {"created_at": "Thu Mar 19 1
                     (right here) ------^

I'm not sure what I need to change for this to work. I've got this file after using the Twarc library to hydrate tweets. I'm attaching my sample code below. Did anybody ever encounter this before?

Sample Code:

import ijson

# march_20_tweets_path holds the path to the hydrated tweets file produced by Twarc
with open(march_20_tweets_path, 'rb') as input_file:
    jsonobj = ijson.items(input_file, 'item', multiple_values=True)
    jsons = (o for o in jsonobj if o['place'] is not None)  # error shows here
    for tweet in jsons:
        ...  # extracting and storing values
rick458
  • That looks like an unprintable control character has crept into your JSON input. – rici Jun 24 '21 at 15:13
  • Oh, any idea how it could be fixed? This is the default output I got from Twarc – rick458 Jun 24 '21 at 16:13
  • A good start would be to look at the indicated part of the file using a hex editor or something which shows you the actual contents, and see what the offending character is (if my theory is right). – rici Jun 24 '21 at 16:38
  • Alright then, got to look into a hex editor then since the current file is way too large to be viewed by anything usual. Thanks! – rick458 Jun 25 '21 at 17:36
  • `grep -A1 -F '...' | hd` would be a good start (replace ... with text from the error; read `man grep` and `man hd` if you have doubts) – rici Jun 25 '21 at 18:53
  • Alright sure, will do! – rick458 Jun 26 '21 at 19:56

1 Answer

This is probably (but not necessarily!) a wrongly written "line-delimited" JSON file. The (yajl-originated, only displayed by ijson) error message doesn't necessarily show the original characters, so it's not immediately evident what the real cause for the error is.

However, this is easily reproducible with a small example (ijson.dump is only available on the master version of ijson, not yet released onto PyPI):

$> echo '0,1' | python -m ijson.dump -M
[...]
ijson.common.IncompleteJSONError: parse error: unallowed token at this point in JSON text
                                      0,1 
                     (right here) ------^

Here there are two top-level JSON documents consisting of the single values 0 and 1. However, the comma doesn't belong to either document, and thus the token (i.e., the comma) is not allowed.

A more relatable example:

$> echo '{},{}' | python -m ijson.dump -M
[...]
ijson.common.IncompleteJSONError: parse error: unallowed token at this point in JSON text
                                     {}, {} 
                     (right here) ------^

If you add new lines in between but keep the comma, the error is the same:

$> echo -e '{},\n\n{}' | python -m ijson.dump -M
[...]
ijson.common.IncompleteJSONError: parse error: unallowed token at this point in JSON text
                                     {},  {} 
                     (right here) ------^

Remove the comma and there's no issue:

$> echo -e '{}\n\n{}' | python -m ijson.dump -M
#: name, value
--------------
0: start_map, None
1: end_map, None
2: start_map, None
3: end_map, None
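
The same rule can be checked from Python without the ijson.dump tool. The sketch below uses the standard library's json.JSONDecoder.raw_decode as a stand-in for the streaming parser (it is not what ijson or yajl do internally) to read consecutive top-level documents, skipping only whitespace between them. The helper name read_documents is made up for illustration.

```python
import json

def read_documents(text):
    """Decode consecutive top-level JSON documents separated only by whitespace."""
    decoder = json.JSONDecoder()
    docs, pos = [], 0
    while pos < len(text):
        # Whitespace (including newlines) between documents is fine; skip it.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode parses one document starting at pos and returns where it ended.
        doc, pos = decoder.raw_decode(text, pos)
        docs.append(doc)
    return docs

# Newline-separated documents decode without complaint.
print(read_documents('{}\n\n{}'))

try:
    read_documents('{},\n\n{}')
except json.JSONDecodeError as exc:
    # The comma belongs to neither document, so decoding fails at its offset.
    print('failed at offset', exc.pos)
```

The failure is reported at the comma itself, which matches where the "(right here)" arrow points in the messages above.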

In summary:

  • Most probably the "line delimited" JSON file is not only delimited by newlines, but it also has commas in between the top-level objects.
  • If that's indeed the case, you can use a tool like sed to remove the trailing commas from your file. You could even do this in a command pipeline to avoid writing the result back to disk, something like sed 's/,[[:space:]]*$//' input.json | python your-script.py, with your-script.py reading the contents from sys.stdin.buffer. Anchoring the pattern to the end of the line with $ matters: without it, sed would also strip a comma from inside any line that has no trailing one.
  • I could be completely wrong and the source of the unallowed token is somewhere else.
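
If sed isn't handy, the clean-up can also be done in Python. This is a minimal sketch, assuming the file really holds one JSON document per line with a stray comma at the end of some lines (which hasn't been confirmed for this file). strip_trailing_commas is a hypothetical helper, the sample data is made up, and the per-line json.loads is a swapped-in alternative to ijson that only works under that one-document-per-line assumption.

```python
import io
import json

def strip_trailing_commas(lines):
    """Yield each non-blank line without trailing whitespace or a trailing comma."""
    for line in lines:
        cleaned = line.rstrip()
        if cleaned.endswith(b','):
            cleaned = cleaned[:-1]
        if cleaned:
            yield cleaned

# Small in-memory stand-in for the real file, which would be opened in binary mode.
sample = io.BytesIO(
    b'{"place": null, "lang": "en"},\n'
    b'\n'
    b'{"place": {"name": "X"}, "lang": "en"}\n'
)

# Each cleaned line is now a complete JSON document on its own.
tweets = [json.loads(line) for line in strip_trailing_commas(sample)]
with_place = [t for t in tweets if t['place'] is not None]
print(len(tweets), len(with_place))
```

For a 9 GB file you would iterate over the open file object and filter as you go, rather than building the tweets list, so that only one line is held in memory at a time.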
Rodrigo Tobar
  • Oh wow, thanks for such a detailed explanation of what it could be! I'll surely check out the `sed` tool and see how it works out. I really hope that's the unallowed token though. – rick458 Jun 25 '21 at 17:35