
I'm working on a cron script that hits an API, receives a JSON file (a large array of objects) and stores it locally. Once that is complete, another script needs to parse the downloaded JSON file and insert each object into a MySQL database.

I'm currently using file_get_contents() along with json_decode(). This attempts to read the whole file into memory before processing it. That would be fine except that my JSON files will usually range from 250MB to 1GB+. I know I can increase my PHP memory limit, but that doesn't seem like the best answer to me. I'm aware that I can use fopen() and fgets() to read the file in line by line, but I need to read the file in by JSON object.
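
Roughly what I'm doing now (simplified; the file path and the loop body are just placeholders):

```php
// Current approach: the whole file ends up in memory twice,
// once as a string and again as the decoded array.
$json = file_get_contents('/path/to/feed.json');
$data = json_decode($json, true);

foreach ($data as $obj) {
    // insert $obj into MySQL
}
```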

Is there a way to read in the file per object, or is there another similar approach?

Dan Ramos
    This [post](http://stackoverflow.com/questions/4049428/processing-large-json-files-in-php) may help you... – Mar 12 '13 at 22:36
  • Why are the JSON files so large? – Wayne Whitty Mar 12 '13 at 22:36
    Good grief! A 1GB response from an API call?? That's insane. Did the developers never hear of the concept of pagination? – Spudley Mar 12 '13 at 22:36
  • Not with the existing `json_decode` functions; reading line by line and parsing those yourself seems to be your only choice. There may be third-party libraries for this, for instance [this one](https://github.com/kuma-giyomu/JSONParser) (never used it, can't say anything about it, just a quick Google turned it up). – Wrikken Mar 12 '13 at 22:37
    If the JSON files are database dumps, there should be another way around this... – Gntem Mar 12 '13 at 22:37
  • 1GB is definitely on the larger end. The data grows with time, so if I run it every 24 hrs I shouldn't really have files over 300-350MB. I had a feeling I was going to have to manually write this parser, just figured I'd check and see if there was something else already out there. – Dan Ramos Mar 12 '13 at 22:42
  • If you control the API that's generating such heavy files, or you `fopen` the API call and start reading it, you could try setting a size threshold in MB; when it is reached, the script starts outputting to another file. That way you keep things manageable, since 10 files of 100MB each (1GB) won't be a pain to process. – Gntem Mar 12 '13 at 23:04
  • For this purpose I recommend CSV files, one object per line, so techniques for streaming the file line by line work fine. – santiago arizti Aug 02 '17 at 00:33

3 Answers


Try this lib: https://github.com/shevron/ext-jsonreader

The existing ext/json which is shipped with PHP is very convenient and simple to use - but it is inefficient when working with large amounts of JSON data, as it requires reading the entire JSON data into memory (e.g. using file_get_contents()) and then converting it into a PHP variable at once - for large data sets, this takes up a lot of memory.

JSONReader is designed for memory efficiency - it works on streams and can read JSON data from any PHP stream without loading the entire data into memory. It also allows the developer to extract specific values from a JSON stream without decoding and loading all data into memory.
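
A rough sketch of how it would be used, going by the project's README (I haven't verified the exact constants and property names, so treat them as assumptions and check the repository):

```php
// Sketch only: the API shape (open/read/close, tokenType, value) is assumed
// from the ext-jsonreader README; it is a pull parser similar to XMLReader.
$reader = new JSONReader();
$reader->open('/path/to/feed.json'); // works on any PHP stream

while ($reader->read()) {
    // Inspect $reader->tokenType and $reader->value here and build up
    // one object at a time instead of decoding the whole array.
}

$reader->close();
```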

Pawel Dubiel

This really depends on what the JSON files contain.

If opening the file in one shot into memory is not an option, your only other option, as you alluded to, is fopen/fgets.

Reading line by line is possible, and if these JSON objects have a consistent structure, you can easily detect when a JSON object in the file starts and ends.

Once you collect a whole object, you insert it into a db, then go on to the next one.

There isn't much more to it. The algorithm to detect the beginning and end of a JSON object may get complicated depending on your data source, but I have done something like this before with a far more complex structure (XML) and it worked fine.
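
A minimal sketch of that idea (assuming the file is one big top-level array of objects; the file path and the storeObject() helper are placeholders, and it tracks quotes so braces inside string values don't confuse the depth counter):

```php
// Stream the file in chunks, track {} depth, and decode one object at a time.
function storeObject(array $obj) {
    // insert $obj into MySQL here (prepared statement, etc.)
}

$fh = fopen('/path/to/feed.json', 'r');
if ($fh === false) {
    die("Could not open file\n");
}

$buffer   = '';     // the object currently being collected
$depth    = 0;      // current {} nesting depth
$inString = false;  // inside a "..." string?
$escaped  = false;  // was the previous character a backslash?

while (!feof($fh)) {
    $chunk = fread($fh, 8192);
    $len   = strlen($chunk);

    for ($i = 0; $i < $len; $i++) {
        $ch = $chunk[$i];

        if ($depth > 0) {
            $buffer .= $ch;
        }

        if ($inString) {
            if ($escaped) {
                $escaped = false;
            } elseif ($ch === '\\') {
                $escaped = true;
            } elseif ($ch === '"') {
                $inString = false;
            }
            continue;
        }

        if ($ch === '"') {
            $inString = true;
        } elseif ($ch === '{') {
            if ($depth === 0) {
                $buffer = '{';              // start of a new top-level object
            }
            $depth++;
        } elseif ($ch === '}') {
            $depth--;
            if ($depth === 0) {
                $obj = json_decode($buffer, true);  // one complete object
                if (is_array($obj)) {
                    storeObject($obj);
                }
                $buffer = '';
            }
        }
    }
}

fclose($fh);
```

Only one object's worth of text is ever held in memory at a time, so the total file size doesn't matter.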

Kovo
  • The structure is pretty basic: 1 large array of objects, each object with the same 3 properties. I'm assuming I would do an `fgets()`, parse that individual string to find all of the JSON objects in it and insert them into the database. I would then reset the pointer to the end of the last successfully found JSON object and repeat. Is that what you were thinking? – Dan Ramos Mar 12 '13 at 22:46
  • Exactly. Since the files have a large variance in size (200MB to 1GB, etc.), it is best to employ a method that works regardless of the size of the file. – Kovo Mar 13 '13 at 00:21

Best possible solution:

Use some sort of delimiter (pagination, timestamp, object ID, etc.) that allows you to read the data in smaller chunks over multiple requests. This solution assumes that you have some sort of control over how these JSON files are generated. I'm basing my assumption on:

> This would be fine except for the fact that my JSON files will usually range from 250MB-1GB+.

Reading in and processing 1GB of JSON data is simply ridiculous. A better approach is most definitely needed.
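
A rough sketch of what that could look like, assuming a hypothetical `page`/`per_page` parameter on the API (the endpoint, parameter names and page size are made up for illustration):

```php
// Hypothetical paginated fetch: the endpoint and query parameters are
// illustrative only and depend entirely on the API you control.
$page = 1;

do {
    $url  = 'https://api.example.com/export?page=' . $page . '&per_page=5000';
    $json = file_get_contents($url);    // each response is now a manageable size
    $rows = json_decode($json, true);

    if (!is_array($rows)) {
        break;                          // request failed or response wasn't JSON
    }

    foreach ($rows as $row) {
        // insert $row into MySQL
    }

    $page++;
} while (count($rows) > 0);             // an empty page means we're done
```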

Wayne Whitty