28

I am trying to process fairly large JSON files (possibly up to 200 MB). The structure of each file is basically an array of objects.

So something along the lines of:

[
  {"property":"value", "property2":"value2"},
  {"prop":"val"},
  ...
  {"foo":"bar"}
]

Each object has arbitrary properties and does not necessarily share them with the other objects in the array.

I want to process each object in the array. Because the file is potentially huge, I cannot slurp its whole content into memory, decode the JSON, and iterate over the resulting PHP array.

Ideally, I would like to read the file incrementally, fetching just enough data for one object at a time and processing it. A SAX-type approach would be OK if a similar library were available for JSON.

Any suggestions on how best to deal with this problem?

The Mighty Rubber Duck
  • 2
    For maintenance purposes I'd like to stick to one language. I'm not familiar with Python either, so that would raise other issues if I need to update it for some reason. Thanks for offering though! – The Mighty Rubber Duck Oct 29 '10 at 07:43

6 Answers

16

I decided to work on an event-based parser. It's not quite done yet, and I will edit the question with a link to my work when I roll out a satisfactory version.

EDIT:

I finally worked out a version of the parser that I am satisfied with. It's available on GitHub:

https://github.com/kuma-giyomu/JSONParser

There's probably room for improvement, and I welcome feedback.

The Mighty Rubber Duck
  • Any progress on this event based parser? – David Higgins Jun 06 '11 at 23:57
  • My json file contains an json_decod'ed array of objects. [{"prop1": "valu", "prop2": "val2", "prop3": "val3", "pro4": "val4"}, {"prop1": "valu", "prop2": "val2", "prop3": "val3", "pro4": "val4"}..... ] Parsing fails for this data. Any recommendation? – Gaurav Phapale Dec 30 '13 at 12:04
  • @GauravPhapale It seems that the parser does not currently support top level arrays. Should be a breeze to fix though. – The Mighty Rubber Duck Jan 06 '14 at 01:00
  • 1
    @GauravPhapale I pushed an update that fixes the broken behaviour and got rid of another bug (strings not being accepted in arrays). That should teach me to write exhaustive tests. – The Mighty Rubber Duck Jan 06 '14 at 03:09
6

Recently I made a library called JSON Machine, which efficiently parses unpredictably big JSON files. Usage is via simple foreach. I use it myself for my project.

Example:

foreach (JsonMachine::fromFile('employees.json') as $employee) {
    $employee['name']; // etc
}

See https://github.com/halaxa/json-machine

Filip Halaxa
  • @gumuruh I guess because my answer is much more recent. – Filip Halaxa Dec 03 '20 at 16:30
  • I know I'm late, and I'll probably open a Github issue request, but how do you use your tool `Json Machine` without installing it via Composer? It does mention you can clone the repo but it's not recommended. Any other safe way? – Robin Jul 30 '21 at 10:33
2

This is a simple, streaming parser for processing large JSON documents. Use it for parsing very large JSON documents to avoid loading the entire thing into memory, which is how just about every other JSON parser for PHP works.

https://github.com/salsify/jsonstreamingparser

Aaron Averill
2

Something like this exists, but only for C++ and Java. Unless you can access one of those libraries from PHP, there is no streaming implementation in PHP as far as I know, only json_decode(). However, if the JSON is structured that simply, it is easy to read the file up to the next } and then pass the JSON received so far to json_decode(). It is better to do that buffered: read 10 KB and split on }; if none is found, read another 10 KB, otherwise process the values found. Then read the next block, and so on.
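
As a rough sketch of that buffered approach (not part of the original answer): the generator below reads the file in fixed-size chunks and, instead of splitting on every `}`, counts brace depth outside string literals, so nested objects and braces inside strings do not cut an object short. The name `streamJsonObjects`, the default chunk size, and the assumption that the top level is an array of objects are all illustrative. It uses only `fread()` and `json_decode()`; `JSON_THROW_ON_ERROR` needs PHP 7.3 or later.

```php
<?php
/**
 * Yields each top-level object of a JSON array as a decoded PHP array,
 * keeping only one object's text in memory at a time.
 *
 * @param resource $stream    Readable stream positioned at the start of the JSON.
 * @param int      $chunkSize Number of bytes to read per fread() call.
 */
function streamJsonObjects($stream, int $chunkSize = 10240): Generator
{
    $depth = 0;        // current nesting depth of { } pairs
    $inString = false; // true while inside a JSON string literal
    $escaped = false;  // true if the previous string character was a backslash
    $buffer = '';      // text of the object currently being collected

    while (($chunk = fread($stream, $chunkSize)) !== false && $chunk !== '') {
        $length = strlen($chunk);
        for ($i = 0; $i < $length; $i++) {
            $char = $chunk[$i];
            if ($depth > 0) {
                $buffer .= $char; // inside an object: keep every byte
            }
            if ($inString) {
                // Braces inside strings must not affect the depth counter.
                if ($escaped) {
                    $escaped = false;
                } elseif ($char === '\\') {
                    $escaped = true;
                } elseif ($char === '"') {
                    $inString = false;
                }
                continue;
            }
            if ($char === '"') {
                $inString = true;
            } elseif ($char === '{') {
                if ($depth === 0) {
                    $buffer = $char; // a new top-level object starts here
                }
                $depth++;
            } elseif ($char === '}') {
                $depth--;
                if ($depth === 0) {
                    // One complete object collected: decode it and hand it over.
                    yield json_decode($buffer, true, 512, JSON_THROW_ON_ERROR);
                    $buffer = '';
                }
            }
        }
    }
}

// Usage with an in-memory stream; a real file would use fopen($path, 'rb').
$stream = fopen('php://memory', 'w+');
fwrite($stream, '[{"property":"value","nested":{"text":"a } brace"}},{"foo":"bar"}]');
rewind($stream);
foreach (streamJsonObjects($stream, 8) as $object) {
    echo json_encode($object), PHP_EOL;
}
fclose($stream);
```

Scanning byte by byte in PHP is slow compared with a native parser, so this trades speed for bounded memory; a dedicated streaming library is likely the better choice for anything beyond simple inputs.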

joni
  • Well, the objects can potentially have objects as properties. I have no control over the content of the objects themselves. Sounds like a job for a lexer/parser, or I could slice it by hand by counting `{` and `}`'s. I'd like to avoid getting down to that though. – The Mighty Rubber Duck Oct 29 '10 at 06:30
0

There is http://github.com/sfalvo/php-yajl/, but I haven't used it myself.

Alex Jasmin
0

I know that the JSON streaming parser https://github.com/salsify/jsonstreamingparser has already been mentioned. But as I have recently(ish) added a new listener to it to try and make it easier to use out of the box I thought I would (for a change) put some information out about what it does...

There is a very good write-up about the basic parser at https://www.salsify.com/blog/engineering/json-streaming-parser-for-php, but the issue I had with the standard setup was that you always had to write a listener to process a file. This is not always a simple task and can also take a certain amount of maintenance if/when the JSON changes. So I wrote the RegexListener.

The basic principle is to let you say which elements you are interested in (via a regular expression) and give it a callback that says what to do when it finds the data. While reading the JSON, it keeps track of the path to each component, similar to a directory structure: /name/forename or, for arrays, /items/item/2/partid. This path is what the regex matches against.

An example (from the source on GitHub):

$filename = __DIR__.'/../tests/data/example.json';
$listener = new RegexListener([
    '/1/name' => function ($data): void {
        echo PHP_EOL."Extract the second 'name' element...".PHP_EOL;
        echo '/1/name='.print_r($data, true).PHP_EOL;
    },
    '(/\d*)' => function ($data, $path): void {
        echo PHP_EOL."Extract each base element and print 'name'...".PHP_EOL;
        echo $path.'='.$data['name'].PHP_EOL;
    },
    '(/.*/nested array)' => function ($data, $path): void {
        echo PHP_EOL."Extract 'nested array' element...".PHP_EOL;
        echo $path.'='.print_r($data, true).PHP_EOL;
    },
]);
$parser = new Parser(fopen($filename, 'r'), $listener);
$parser->parse();

Just a couple of explanations...

'/1/name' => function ($data)

The /1 is the second element in an array (0-based), which allows you to access particular instances of elements. /name is the name element. The value is then passed to the closure as $data.

"(/\d*)" => function ($data, $path )

This selects each element of an array and passes them one at a time. Because the pattern uses a capture group, the matched path is passed as $path. So when a file contains a set of records, you can process the items one at a time and know which element you are on without having to keep track yourself.

The last one

'(/.*/nested array)' => function ($data, $path):

effectively scans for any elements called nested array and passes each one along with where it is in the document.
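
To make the path matching more concrete, here is a small self-contained sketch. It is not the library's implementation and it does not stream: it walks an already decoded array, builds the same kind of paths (`/1/name`, `/0`), and calls every handler whose pattern matches. The function names `walkPaths` and `dispatchByPath` are made up for this illustration, and anchoring each pattern to the whole path is an assumption about how the matching behaves.

```php
<?php
/**
 * Walks a decoded value and calls $visit with the path of every element.
 * Paths use "/" separators and zero-based array indexes, e.g. "/1/name".
 */
function walkPaths($value, string $path, callable $visit): void
{
    if ($path !== '') {
        $visit($path, $value);
    }
    if (is_array($value)) {
        foreach ($value as $key => $child) {
            walkPaths($child, $path . '/' . $key, $visit);
        }
    }
}

/**
 * Passes every element whose path matches a pattern to that pattern's callback.
 *
 * @param array $document Decoded JSON (associative arrays).
 * @param array $handlers Map of regex pattern => callable($data, $path).
 */
function dispatchByPath(array $document, array $handlers): void
{
    walkPaths($document, '', function (string $path, $value) use ($handlers): void {
        foreach ($handlers as $pattern => $callback) {
            // Assumption: a pattern has to match the whole path, not a part of it.
            if (preg_match('~^' . $pattern . '$~', $path) === 1) {
                $callback($value, $path);
            }
        }
    });
}

// Usage: the same style of patterns as in the example above.
$document = json_decode('[{"name":"Ann"},{"name":"Bob","nested array":[1,2]}]', true);
dispatchByPath($document, [
    '/1/name' => function ($data): void {
        echo '/1/name=' . $data . PHP_EOL;
    },
    '(/\d*)' => function ($data, $path): void {
        echo $path . '=' . $data['name'] . PHP_EOL;
    },
]);
```

Unlike the streaming listener, this sketch visits a parent before its children, so the order in which callbacks fire differs from what the real parser produces.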

Another useful feature I found was that if in a large JSON file, you just wanted the summary details at the top, you can grab those bits and then just stop...

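$filename = __DIR__.'/../tests/data/ratherBig.json';
$listener = new RegexListener();
$parser = new Parser(fopen($filename, 'rb'), $listener);
// The callback needs $parser in scope so that it can stop parsing early.
$listener->setMatch(["/total_rows" => function ($data) use ($parser) {
    echo "/total_rows=".$data.PHP_EOL;
    $parser->stop();
}]);
// Start reading; parsing ends as soon as /total_rows has been handled.
$parser->parse();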

This saves time when you are not interested in the remaining content.

One thing to note is that these callbacks react to the content: each one is triggered when the end of its matching content is found, so they may fire in various orders. Also, the parser only keeps track of the content you are interested in and discards everything else.

If you find any interesting features (sometimes horribly known as bugs), please let me know or report an issue on the GitHub page.

Nigel Ren