10

I know my question has a lot of answers on the internet, but it seems I can't find a good one, so I will try to explain what I have and hope for the best.

What I'm trying to do is read a big JSON file that might have a more complex structure (nested objects with big arrays) than this, but as a simple example:

{
  "data": {
    "time": [
      1,
      2,
      3,
      4,
      5,
       ...
    ],
    "values": [
      1,
      2,
      3,
      4,
      6,
       ...
    ]
  }
}

This file might be 200MB or more, and I'm using file_get_contents() and json_decode() to read the data from the file.

Then I put the result in a variable, loop over the time array, take the time value at the current index, get the corresponding value by index from the values array, and save the time and the value in the database. But this takes a lot of CPU and memory. Is there a better way to do this?

Better functions to use, a better JSON structure, or maybe a better data format than JSON?

my code:

$data = json_decode(file_get_contents(storage_path("test/ts/ts_big_data.json")), true);

foreach ($data["time"] as $timeIndex => $timeValue) {
    saveInDataBase($timeValue, $data["values"][$timeIndex]);
}

Thanks in advance for any help.

Update 06/29/2020:

Here is another, more complex JSON structure example:

{
  "data": {
    "set_1": {
      "sub_set_1": {
        "info_1": {
          "details_1": {
            "data_1": [1,2,3,4,5,...],
            "data_2": [1,2,3,4,5,...],
            "data_3": [1,2,3,4,5,...],
            "data_4": [1,2,3,4,5,...],
            "data_5": 10254552
          },
          "details_2": [
            [1,2,3,4,5,...],
            [1,2,3,4,5,...],
            [1,2,3,4,5,...]
          ]
        },
        "info_2": {
          "details_1": {
            "data_1": {
              "arr_1": [1,2,3,4,5,...],
              "arr_2": [1,2,3,4,5,...]
            },
            "data_2": {
              "arr_1": [1,2,3,4,5,...],
              "arr_2": [1,2,3,4,5,...]
            },
            "data_5": {
              "text": "some text"
            }
          },
          "details_2": [1,2,3,4,5,...]
        }
      }, ...
    }, ...
  }
}

The file size might be around 500MB or more, and the arrays inside this JSON file might hold around 100MB of data or more.

My question is: how can I get any piece of this data and navigate between the nodes in the most efficient way, without using too much RAM and CPU? I can't read the file line by line, because I need to be able to get any piece of the data when I have to.

Is Python, for example, more suitable for handling this big data more efficiently than PHP?

Please, if you can provide a detailed answer, I think it will be a big help for everyone who is looking to do this big-data stuff with PHP.

Fadi
  • Can you share some code? Try to use [next](https://www.php.net/manual/en/function.next.php) instead of using an index to reach the data – Ôrel Jun 25 '20 at 15:50
  • If you can get the file just to be written as a CSV with `time,value` then it would just be reading a line at a time. Either that or each item being `{ "time":"12:00", "value": 1}` – Nigel Ren Jun 25 '20 at 16:07
  • @NigelRen thanks, but what can we do in case I have a complex structure, not a simple one like this? – Fadi Jun 25 '20 at 16:12
  • This again depends on the structure, being able to optimize something you cannot see is not that easy. – Nigel Ren Jun 25 '20 at 16:17
  • you might want to look at this https://stackify.com/a-guide-to-streams-in-php-in-depth-tutorial-with-examples/ – opensource-developer Jun 25 '20 at 16:19
  • @opensource-developer, [this](https://stackoverflow.com/questions/4049428/processing-large-json-files-in-php) might be more relevant, but this only presents the data in parts. If you need to link different bits of the data together this is where it becomes a problem. – Nigel Ren Jun 25 '20 at 16:21
  • A common solution to this is to store complex data in multiple simpler structures. Your real problem here is likely the sheer size of the storage file, and the fact that you can't parse it line by line because it's JSON. Also, make sure `saveInDataBase()` isn't reconnecting to the database each time (login is a very time-consuming operation). Also consider generating INSERT/UPDATE queries that modify 500 or 1000 rows at a time and then running those, rather than running INSERTs/UPDATEs that only modify 1 row. (Each run is a new TCP request to the DB, which is also slow.) – Nicholas Summers Jun 25 '20 at 16:26
  • @NicholasSummers thanks for your advice on the database; I'm already doing a transaction, saving 1000 points at a time, and then committing the transaction. My real problem is with `file_get_contents()` and `json_decode()`. – Fadi Jun 25 '20 at 16:31
  • Maybe you can use the [MySQL Shell](https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-json.html) and import the JSON directly from the file into a table and then use the standard [json commands](https://dev.mysql.com/doc/refman/8.0/en/json.html) for your queries? (I never tried that BTW) – Alon Eitan Jun 25 '20 at 16:44
  • Maybe this can help? https://stackoverflow.com/questions/4049428/processing-large-json-files-in-php – bestprogrammerintheworld Jun 25 '20 at 17:36
  • Where does this JSON file come from? Are you able to manipulate the format or structure of the JSON nodes? And is there any upper bound on the size of the JSON file? – Vinay Jul 05 '20 at 05:06

7 Answers

10

JSON is a great format and a way better alternative to XML. In the end, JSON is almost one-to-one convertible to XML and back.

Big files can get bigger, so we don't want to read all of it into memory and we don't want to parse the whole file. I had the same issue with XXL-size JSON files.

I think the issue lies not in a specific programming language, but in the realisation and specifics of the formats.

I have 3 solutions for you:

  1. Native PHP implementation (preferred)

Almost as fast as the streaming XMLReader, there is a library: https://github.com/pcrov/JsonReader. Example:

use pcrov\JsonReader\JsonReader;

$reader = new JsonReader();
$reader->open("data.json");

while ($reader->read("type")) {
    echo $reader->value(), "\n";
}
$reader->close();

This library will not read the whole file into memory or parse all the lines. It steps through the JSON document node by node, on demand, as you traverse the tree. (A sketch of how it could be applied to the time/values structure from the question follows after the list below.)

  2. Switch formats (cons: multiple conversions)

Preprocess the file into a different format like XML or CSV. There are very lightweight Node.js libs like https://www.npmjs.com/package/json2csv to convert JSON to CSV.

  3. Use some NoSQL DB (cons: additional complex software to install and maintain)

For example Redis, or CouchDB (import the JSON file into CouchDB).
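
Coming back to option 1, here is a rough sketch for the time/values structure from the question. It assumes the pcrov/JsonReader API as documented in its README; storage_path() and saveInDataBase() are the helpers from the question. Note that value() materializes each array, so the two arrays still end up in memory, but the rest of the document never does - which matters even more for the nested structure in the question's update.

use pcrov\JsonReader\JsonReader;

$reader = new JsonReader();
$reader->open(storage_path("test/ts/ts_big_data.json"));

// read($name) skips ahead to the next element with that name; this relies on
// "time" appearing before "values" in the file, as in the question's example.
$reader->read("time");
$times = $reader->value();

$reader->read("values");
$values = $reader->value();

$reader->close();

foreach ($times as $i => $time) {
    saveInDataBase($time, $values[$i]);
}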

Max Kaps 4bis.nl
  • @JEY agreed. In my experience I've used all 3 options; JsonReader, which uses streams, is the best. I've changed the order. (Please change your comment so readers won't be confused.) – Max Kaps 4bis.nl Jul 02 '20 at 15:17
  • I didn’t know that a streaming parser existed for JSON, good to know! I think that is definitely a great option. – Chris Haas Jul 04 '20 at 12:54
  • Importing into Redis is fine, but you will suffer while reading it (other requests will also have to wait). Since it will be one huge string (200MB), you have to wait for / transfer the whole data over the network. – Ersoy Jul 05 '20 at 11:17
  • @MaxKaps4bis.nl thanks for your help. I think your solution is not 100% there, but `JsonReader` is the best answer so far; I will accept your answer so you can get the bounty, and I will do some tests with `JsonReader` this week, put together some examples, and add them as an answer. – Fadi Jul 05 '20 at 11:23
  • @Ersoy you need to do the processing in the background for sure, and if you want you can upload the file as a request from some UI, or send it by API for background uploading, so this is OK – Fadi Jul 05 '20 at 11:24
  • Background processing is for the client, i am talking about the server @Fadi. Redis is single threaded and when you read/write 220MB all other requests have to wait until the read/write process is completed. – Ersoy Jul 05 '20 at 11:30
  • Another streaming parser is json-streaming-parser, which is also a PHP implementation; each has its own benefits/problems. – Nigel Ren Jul 05 '20 at 14:07
5

Your problem is basically related to the memory management performed by each specific programming language that you might use in order to access the data from a huge (storage purpose) file.

For example, when you chain the operations in a single statement, as in the code you just mentioned (below),

$data = json_decode(file_get_contents(storage_path("test/ts/ts_big_data.json")), true);

what happens is that the memory used by the Zend engine at runtime increases too much, because it has to allocate memory units to store references about each ongoing file handle involved in your code statement - keeping in memory a pointer as well as the actual opened file - until this file finally gets overwritten and the memory buffer is released (freed) again. It's no wonder that if you force the execution of both the file_get_contents() function that reads the file into a string and the json_decode() function in one statement, you force the interpreter to keep all 3 "things" in memory: the file itself, the reference created (the string), and the structure (the decoded JSON).

On the contrary, if you break the statement into several ones, the memory held by the first data structure (the file) will be unloaded once the operation of "getting its content" and writing it into another variable (or file) has fully completed. As long as you don't define a variable in which to save the data, it will stay in memory (as a blob - with no name, no storage address, just content). For this reason, it is much more CPU- and RAM-effective - when working with big data - to break everything into smaller steps.

So you have first to start by simply rewriting your code as follows:

$somefile = file_get_contents(storage_path("test/ts/ts_big_data.json"));

$data = json_decode($somefile, true);

When the first line gets executed, the memory held by ts_big_data.json gets released (think of it as being purged and made available again to other processes).

When the second line gets executed, $somefile's memory buffer gets released too. The take-away point is that instead of always having 3 memory buffers used just to store the data structures, you'll only have 2 at any time - ignoring, of course, the other memory used to actually construct the file. Not to mention that when working with arrays (and JSON files are exactly that: arrays), dynamically allocated memory increases dramatically and not linearly, as we might tend to think. The bottom line is that instead of a 50% loss in performance just on storage allocation for the files (3 big files taking 50% more space than just 2 of them), we are better off handling the execution of the functions 'touching' these huge files in smaller steps.
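
If you want to check what splitting the statement (and freeing intermediates) actually buys you, here is a small sketch - not part of the original answer - using PHP's built-in memory functions:

$before = memory_get_usage();

$somefile = file_get_contents(storage_path("test/ts/ts_big_data.json"));
echo "after file_get_contents: ", memory_get_usage() - $before, " bytes\n";

$data = json_decode($somefile, true);
unset($somefile); // drop the raw JSON string as soon as it has been decoded

echo "after json_decode + unset: ", memory_get_usage() - $before, " bytes\n";
echo "peak: ", memory_get_peak_usage(), " bytes\n";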

In order to understand this, imagine that you access only what is needed at a certain moment in time (this is also a principle called YAGNI - You Aren't Gonna Need It - from the Extreme Programming practices; see the reference here: https://wiki.c2.com/?YouArentGonnaNeedIt), something inherited since the old C or COBOL times.

The next approach to follow is to break the file into more pieces, but in a structured way (a relationally dependent data structure), as in database tables.

Obviously, you have to save the data pieces again as blobs in the database. The advantage is that retrieving data from a DB is much faster than from a file (due to the allocation of indexes by the SQL engine when generating and updating the tables). A table with one or two indexes can be accessed lightning fast by a structured query. Again, the indexes are pointers to the main storage of the data.

One important topic, however, if you still want to work with the JSON (content and type of data storage, instead of tables in a DB), is that you cannot update it locally without changing it globally. I am not sure what you meant by reading the time-related function values in the JSON file. Do you mean that your JSON file is continuously changing? Better to break it into several tables so each one can change separately without affecting the whole mega-structure of the data. Easier to manage, easier to maintain, easier to locate changes.

My understanding is that the best solution would be to split the same file into several JSON files where you strip out the values you don't need. BY THE WAY, DO YOU ACTUALLY NEED ALL THE STORED DATA?

I won't come up with code now unless you explain the above issues to me (so we can have a conversation), and I will then edit my answer accordingly. I wrote a question yesterday related to the handling of blobs - and storing them on the server - in order to accelerate the execution of a data update on a server using a cron process. My data was about 25MB+, not 500MB+ as in your case, but I need to understand the use case for your situation.

One more thing: how was the file that you must process created? Why do you only manage the final form of it instead of intervening in how it is fed with data? My opinion is that you might stop storing data into it as previously done (and thus stop adding to your pain), turn its purpose into historic data storage only from now on, and then store the future data in something more elastic (such as MongoDB or another NoSQL database).

You probably don't need code so much as a solid and useful strategy and way of working with your data first.

Programming comes last, after you have decided on the detailed architecture of your web project.

Eve
  • It isn't necessarily Fadi who has downvoted; I hope that the OP can see beyond what may sometimes be tactical downvoting to make answers look bad. But as you can see, without any feedback the downvote is meaningless, and although it may be annoying, it's the kind of game some people play. – Nigel Ren Jul 02 '20 at 16:58
1

My approach would be to read the JSON file in chunks.

If these JSON objects have a consistent structure, you can easily detect when a JSON object in the file starts and ends.

Once you collect a whole object, you insert it into a db, then go on to the next one.

There isn't much more to it. The algorithm for detecting the beginning and end of a JSON object may get complicated depending on your data source, but I have done something like this before with a far more complex structure (XML) and it worked fine.

Above answer is taken from => Parse large JSON file

Please see the reference below; it can be helpful for your case:

=> https://laracasts.com/discuss/channels/general-discussion/how-to-open-a-28-gb-json-file-in-php
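
To make the chunked approach concrete, here is a rough sketch (not from the linked answer): it streams the file with fread() and tracks brace depth to cut out complete top-level objects, so only one object is decoded at a time. The function name and callback are made up, and it only helps when the file holds many smaller objects rather than one giant one - which is exactly the kind of restructuring suggested in the question's comments.

// Sketch: stream a file containing many JSON objects and hand each complete
// top-level object to a callback, without loading the whole file into memory.
function streamJsonObjects(string $path, callable $onObject): void
{
    $handle = fopen($path, 'rb');
    $buffer = '';
    $depth = 0;
    $inString = false;
    $escaped = false;

    while (!feof($handle)) {
        $chunk = fread($handle, 8192);
        for ($i = 0, $len = strlen($chunk); $i < $len; $i++) {
            $char = $chunk[$i];

            if ($depth > 0) {
                $buffer .= $char;
            }

            if ($inString) {
                // Inside a string: only track escapes and the closing quote.
                if ($escaped) {
                    $escaped = false;
                } elseif ($char === '\\') {
                    $escaped = true;
                } elseif ($char === '"') {
                    $inString = false;
                }
                continue;
            }

            if ($char === '"') {
                $inString = true;
            } elseif ($char === '{') {
                if ($depth === 0) {
                    $buffer = '{'; // start collecting a new top-level object
                }
                $depth++;
            } elseif ($char === '}') {
                $depth--;
                if ($depth === 0) {
                    $onObject(json_decode($buffer, true));
                    $buffer = '';
                }
            }
        }
    }
    fclose($handle);
}

// Usage: insert each record into the database as soon as it is complete.
streamJsonObjects(storage_path("test/ts/records.json"), function (array $record) {
    // saveInDataBase($record['time'], $record['value']); // hypothetical record shape
});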

Aabir Hussain
  • Can you please provide working code with example data? Like, can you please put the needed steps in your code: first how we should use the parser, then how to loop over the data, and how we can navigate the nodes? – Fadi Jun 30 '20 at 10:33
0

My question is: how can I get any piece of this data and navigate between the nodes in the most efficient way, without using too much RAM and CPU? I can't read the file line by line, because I need to be able to get any piece of the data when I have to.

It's plain-text JSON and you have no indexes, so it's impossible to parse your data without iterating over it line by line. The solution is to serialize your data once and for all and store it in a database (I'm thinking SQLite for fast setup).

If you absolutely can't store your data in a database, or can't retrieve it in SQLite format, you have no other choice but to create a queue job which will parse it over time.
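
A minimal sketch of the one-off import this answer suggests, using PDO with SQLite (the database file name, table name, and the structure from the question's first example are assumptions): decode once, bulk-insert inside a transaction with a prepared statement, and later lookups hit an indexed table instead of the raw JSON.

$pdo = new PDO('sqlite:' . storage_path('ts.sqlite'));
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE IF NOT EXISTS ts (time INTEGER PRIMARY KEY, value REAL)');

$data = json_decode(file_get_contents(storage_path('test/ts/ts_big_data.json')), true);

// One transaction plus one prepared statement keeps the import fast.
$pdo->beginTransaction();
$stmt = $pdo->prepare('INSERT INTO ts (time, value) VALUES (?, ?)');
foreach ($data['data']['time'] as $i => $time) {
    $stmt->execute([$time, $data['data']['values'][$i]]);
}
$pdo->commit();

// Later reads are cheap and indexed, e.g.:
// SELECT value FROM ts WHERE time BETWEEN 100 AND 200;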

user8555937
0

As you correctly say, you won't get around reading it line by line. Using SQL as suggested just moves the problem to another environment. I would personally do it this way:

  1. When a new JSON file comes in, put it into storage - easiest would be S3 with Storage::disk('s3')->put(...); (https://laravel.com/docs/7.x/filesystem) - and put a job into a queue. You could use the Laravel queue or, what I prefer, RabbitMQ. Add a new entry to the queue, like {'job': 'parseMyJSON', 'path': 'https://path-on.s3'}
  2. Create a new server instance that can access the queue
  3. Write a worker instance of your app that can take a job from the queue, and run it on the new server from step 2. Whenever you put a job into the queue, the worker will get the JSON file from S3 and do the necessary work. Then it will take the next job from the queue, one by one.

Whether this worker instance is written in Python or PHP, you have to test what will work faster. The advantage of this is that you can scale the workers to as many as you need, and it won't affect the performance of your web app. I hope this helps you.
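
A rough sketch of the worker job from step 3, assuming Laravel's queue system and an s3 filesystem disk; everything except the framework classes (the job name, the parseAndStore() helper, the paths) is made up for illustration.

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Storage;

class ParseMyJson implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    private string $path;

    public function __construct(string $path)
    {
        $this->path = $path;
    }

    public function handle(): void
    {
        // Copy the S3 object to a local temp file so a streaming parser
        // (e.g. pcrov/JsonReader) can work on it without loading it all at once.
        $local = tempnam(sys_get_temp_dir(), 'bigjson');
        file_put_contents($local, Storage::disk('s3')->readStream($this->path));

        // parseAndStore($local); // hypothetical: stream-parse and save to the DB

        unlink($local);
    }
}

// Dispatching from the web app after the upload:
// ParseMyJson::dispatch('uploads/ts_big_data.json');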

Maxi
  • Thanks for your help. I'm already using Laravel queues and workers so I can do the job in the background and the main app is not affected much. Your solution sounds great, but I think it's not the case for my question, because, for example, what if I can't use S3 or two servers? I'm looking for a way, or a data structure, that can handle big JSON files without taking so much memory and CPU - that's my case. And really, thanks for your help again. – Fadi Jun 30 '20 at 09:20
-1

Try Reducing Your Bulk Data Complexity For Faster File I/O

JSON is a great format to store data in, but it comes at the cost of needing to read the entire file to parse it.

Making your data structure simpler but more spread out across several files can allow you to read a file line by line, which is much faster than all at once. This also comes with the benefit of not needing to store the entire file in RAM all at once, so it is more friendly to resource-limited environments.

This might look something like this:

objects.json

{
  "data": {
    "times_file": "/some/path/objects/object-123/object-123-times.csv",
    "values_file": "/some/path/objects/object-123/object-123-times.csv"
  }
}

object-123-times.csv

1
2
3
4
...

This would allow you to store your bulk data in a simpler but easier to access format. You could then use something like fgetcsv() to parse each line.
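
A minimal sketch of reading the two per-object files in parallel, line by line (the paths mirror the objects.json example above; saveInDataBase() is the helper from the question):

$index = json_decode(file_get_contents('objects.json'), true);

$timesHandle  = fopen($index['data']['times_file'], 'rb');
$valuesHandle = fopen($index['data']['values_file'], 'rb');

// Read the two files in lockstep: line N of the times file pairs with
// line N of the values file, so only one row is held in memory at a time.
while (($time = fgetcsv($timesHandle)) !== false
    && ($value = fgetcsv($valuesHandle)) !== false) {
    saveInDataBase($time[0], $value[0]);
}

fclose($timesHandle);
fclose($valuesHandle);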

Nicholas Summers
-2

You may split your arrays into chunks using the array_chunk() function.

The array_chunk() function is a built-in PHP function used to split an array into chunks of a given size, depending on the parameters passed to it. The last chunk may contain fewer elements than the requested chunk size.

Check the examples in this link
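
As a small sketch of how this suggestion could be applied on top of the decoded array from the question, batching the inserts 1000 rows at a time (note, as the comments below point out, this does not address the cost of decoding the file itself; saveBatchInDataBase() is a hypothetical bulk-insert helper):

$data = json_decode(file_get_contents(storage_path("test/ts/ts_big_data.json")), true);

// Pair each time with its value, then insert the pairs 1000 rows per batch.
$rows = array_map(null, $data['data']['time'], $data['data']['values']);

foreach (array_chunk($rows, 1000) as $batch) {
    // saveBatchInDataBase($batch); // hypothetical bulk insert of up to 1000 rows
}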

Foued MOUSSI
  • Thanks for your help, but I think the first problem is with `file_get_contents()` and `json_decode()`, right? They take so much CPU and memory to open and convert the JSON file to an array. Please let me know what you think and how you would use `array_chunk()` in this case. – Fadi Jun 25 '20 at 16:04
  • This advice will only generate a deeper / more complex data structure. – mickmackusa Jun 25 '20 at 16:19
  • If anything, this may actually make it take longer – Nicholas Summers Jun 25 '20 at 16:27
  • @mickmackusa, Thank you for your comment, could you please provide more explanations ? – Foued MOUSSI Jun 25 '20 at 16:57