6

I have a file that contains a stream of JSON dictionaries like this:

{"menu": "a"}{"c": []}{"d": [3, 2]}{"e": "}"}

It also contains nested dictionaries, and it looks like I cannot rely on a newline being a separator. I need a parser that can be used like this:

for d in getobjects(f):
  handle_dict(d)

Ideally, the iteration would happen only at the root level. Is there a Python parser that handles all of JSON's quirks? I am interested in a solution that works on files that won't fit into RAM.

d33tah
  • I'd try to split at `}{` or with a regex at `}\s*{`. Neither is allowed in JSON outside of strings. If you have that inside strings, it's going to be much more complex. – Klaus D. Jun 12 '15 at 17:53
  • I can't be sure I don't have that inside strings. – d33tah Jun 12 '15 at 17:55
  • Have a look at a JSON parser with streaming API. Using Google, I came across https://pypi.python.org/pypi/ijson/ See especially the example with geographical objects. – s.bandara Jun 12 '15 at 18:05
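
As a side note on the streaming suggestion above, here is a minimal sketch of how ijson could consume such a stream. It assumes a recent ijson release that supports the `multiple_values` option; the filename is a placeholder and `handle_dict` is the callback from the question.

import ijson  # third-party: pip install ijson

with open('stream.json') as f:
    # The empty prefix '' selects each top-level value, and
    # multiple_values=True tells ijson to keep parsing after the
    # first complete JSON document.
    for d in ijson.items(f, '', multiple_values=True):
        handle_dict(d)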

3 Answers

6

I think `JSONDecoder.raw_decode` may be what you're looking for. You may have to do some string formatting to get it in the perfect format depending on newlines and such, but with a bit of work, you'll probably be able to get something working. See this example.

import json
jstring = '{"menu": "a"}{"c": []}{"d": [3, 2]}{"e": "}"}'
substr = jstring
decoder = json.JSONDecoder()

while len(substr) > 0:
    # raw_decode() returns the decoded object and the index just past
    # the JSON text that was parsed.
    data, index = decoder.raw_decode(substr)
    print data
    # Chop the decoded object off the front of the string and repeat.
    substr = substr[index:]

Gives output:

{u'menu': u'a'}
{u'c': []}
{u'd': [3, 2]}
{u'e': u'}'}
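
As a side note, `raw_decode` also takes a second `idx` argument (the position to start parsing at), which lets the same loop walk the string without re-slicing it on every iteration. A small sketch; note that `raw_decode` does not skip white space in front of the value at `idx`:

idx = 0
while idx < len(jstring):
    # raw_decode returns the object and the absolute index just past it.
    data, idx = decoder.raw_decode(jstring, idx)
    print(data)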
Blair
  • How would you go about making it work with `sys.stdin`? – d33tah Jun 12 '15 at 17:57
  • Replace this line: `jstring = '{"menu": "a"}{"c": []}{"d": [3, 2]}{"e": "}"}'` with something like `jstring = sys.stdin.read()` – Blair Jun 12 '15 at 18:08
  • 1
    It's a huge file, reading it all into the memory is not an option. – d33tah Jun 12 '15 at 18:10
  • 1
    If a single json dict in your file could be bigger than what you want to read into memory, you'll have to find another solution. If they are all small like the ones in your example, you could use exception handling to read the file in manageable-sized chunks. Basically put the interior of the loop in a try block and if you catch an exception (probably a `ValueError`), read in another chunk. – Blair Jun 12 '15 at 18:16
  • I went ahead and wrote the code for the suggested solution above. It's not exactly hard but you do have to be a little careful since `.raw_decode()` doesn't like leading white space. – steveha Jun 13 '15 at 07:10
2

Here you go: a tested solution based on the answer from @Blair.

This should be able to handle input files of arbitrary size. It is a generator, so it yields dictionary objects one at a time as it parses them out of the JSON input file.

If you run it as a stand-alone script, it runs three test cases (in the `if __name__ == "__main__"` block).

Of course, to make this read from standard input you would simply pass `sys.stdin` as the input file argument, as shown below.
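
A minimal usage sketch, reusing the `handle_dict` callback from the question:

import sys

for d in json_objects_from_file(sys.stdin):
    handle_dict(d)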

import json


_DECODER = json.JSONDecoder()

_DEFAULT_CHUNK_SIZE = 4096
_MB = (1024 * 1024)
_LARGEST_JSON_OBJECT_ACCEPTED = 16 * _MB  # default to 16 megabytes

def json_objects_from_file(input_file,
            chunk_size=_DEFAULT_CHUNK_SIZE,
            max_size=_LARGEST_JSON_OBJECT_ACCEPTED):
    """
    Read an input file, and yield up each JSON object parsed from the file.

    Allocates minimal memory so should be suitable for large input files.
    """
    buf = ''
    while True:
        temp = input_file.read(chunk_size)
        if not temp:
            break

        # Accumulate more input to the buffer.
        #
        # The decoder is confused by leading white space before an object.
        # So, strip any leading white space if any.
        buf = (buf + temp).lstrip()
        while True:
            try:
                # Try to decode a JSON object.
                x, i = _DECODER.raw_decode(buf)
            except ValueError:
                # Either the input is garbage or we got a partial JSON object.
                # If it's a partial, maybe appending more input will finish it,
                # so catch the error and keep reading input.

                # Note that if you feed in a huge file full of garbage, the
                # buffer will grow very large.  Blow up before reading an
                # excessive amount of data.

                if len(buf) >= max_size:
                    raise ValueError("either bad input or too-large JSON object.")
                break
            # Chop the decoded JSON out of the buffer, stripping any leading
            # white space.  Always consume the decoded value, even when it is
            # not a dict; otherwise this inner loop would spin forever on a
            # non-dict top-level value.
            buf = buf[i:].lstrip()
            # If we got back a dict, we got a whole JSON object.  Yield it.
            if isinstance(x, dict):
                yield x
    buf = buf.strip()
    if buf:
        if len(buf) > 70:
            buf = buf[:70] + '...'
        raise ValueError('leftover stuff from input: "{}"'.format(buf))

if __name__ == "__main__":
    from StringIO import StringIO

    jstring = '{"menu":\n"a"}{"c": []\n}\n{\n"d": [3,\n 2]}{\n"e":\n "}"}'
    f = StringIO(jstring)
    correct = [{u'menu': u'a'}, {u'c': []}, {u'd': [3, 2]}, {u'e': u'}'}]

    result = list(json_objects_from_file(f, chunk_size=3))
    assert result == correct

    f = StringIO(' ' * (17 * _MB))
    correct = []

    result = list(json_objects_from_file(f, chunk_size=_MB))
    assert result == correct

    f = StringIO('x' * (17 * _MB))
    correct = "ok"

    try:
        result = list(json_objects_from_file(f, chunk_size=_MB))
    except ValueError:
        result = correct
    assert result == correct
steveha
0

Here is a partial solution, but it keeps slowing down as the input grows: each pass through the loop reads another chunk yet decodes at most one object from the buffer, so the undecoded leftovers (and the cost of copying them around) keep growing.

#!/usr/bin/env pypy

import json
import cStringIO
import sys

def main():
    BUFSIZE = 10240
    f = sys.stdin
    decoder = json.JSONDecoder()
    io = cStringIO.StringIO()

    do_continue = True
    while True:
        read = f.read(BUFSIZE)
        if len(read) < BUFSIZE:
            do_continue = False
        io.write(read)
        try:
            data, offset = decoder.raw_decode(io.getvalue())
            print(data)
            rest = io.getvalue()[offset:]
            if rest.startswith('\n'):
                rest = rest[1:]
            io = cStringIO.StringIO()
            io.write(rest)
        except ValueError as e:
            #print(e)
            #print(repr(io.getvalue()))
            # Partial object: wait for more input.  But if there is no
            # more input coming, give up instead of looping forever on
            # the undecodable leftovers.
            if not do_continue:
                break
            continue
        if not do_continue:
            break

if __name__ == '__main__':
    main()
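
A sketch of how to avoid that slowdown, in the same spirit as the generator answer above: drain every complete object out of the buffer before reading the next chunk, so the leftovers stay small. (`handle_dict` is again the callback from the question.)

import json
import sys

BUFSIZE = 10240
decoder = json.JSONDecoder()
buf = ''
while True:
    chunk = sys.stdin.read(BUFSIZE)
    buf = (buf + chunk).lstrip()
    # Drain all complete objects before reading more input, so the
    # leftover buffer (and the cost of copying it) stays small.
    while True:
        try:
            data, offset = decoder.raw_decode(buf)
        except ValueError:
            break  # partial object: need more input
        handle_dict(data)
        buf = buf[offset:].lstrip()
    if not chunk:
        break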
d33tah