17

I have a PHP script that builds a binary search tree over a rather large CSV file (5MB+). This is nice and all, but it takes about 3 seconds to read/parse/index the file.

Now I thought I could use serialize() and unserialize() to quicken the process. When the CSV file has not changed in the meantime, there is no point in parsing it again.

To my horror I find that calling serialize() on my index object takes 5 seconds and produces a huge (19MB) text file, whereas unserialize() takes an unbearable 27 seconds to read it back. Improvements look a bit different. ;-)

So - is there a faster mechanism to store/restore large object graphs to/from disk in PHP?

(To clarify: I'm looking for something that takes significantly less than the aforementioned 3 seconds to do the de-serialization job.)

Tomalak
  • 332,285
  • 67
  • 532
  • 628
  • Why not store the information that is in the file into a database? – RJD22 Mar 30 '10 at 13:24
  • Because the script is part of a tool that specifically does not want to use a database dependency. – Tomalak Mar 30 '10 at 13:26
  • What do your index objects look like? – user187291 Mar 30 '10 at 13:29
  • If you have full access to the web service writing a PHP extension module specifically for faster IP2country searches could be an option. Also a service that monitors the CSV file modification date and provides the data via a named pipe could also fit your needs. – Robert Mar 30 '10 at 13:32
  • @stereofrog: It is a tree of nested node objects, each having a `$value` (float), a `$payload` (string) and `$left` and `$right` node references. Nothing fancy, but it contains > 100,000 of such objects. – Tomalak Mar 30 '10 at 13:33
  • @Robert: I am looking for a self-contained, PHP-only solution, something that has no implications on platform or other installed software (like, a DB server). – Tomalak Mar 30 '10 at 13:35
  • can the tree be expected to be reasonably balanced? – goat Mar 30 '10 at 15:57
  • Do you _really_ need the entire thing in memory as a tree? Or, ultimately, maybe you just want to be able to find a payload fast given a value? How many lookups do you do per script execution? – goat Mar 30 '10 at 16:03

8 Answers

13

var_export should be lots faster as PHP won't have to process the string at all:

// export the process CSV to export.php
$php_array = read_parse_and_index_csv($csv); // takes 3 seconds
$export = var_export($php_array, true);
file_put_contents('export.php', '<?php $php_array = ' . $export . '; ?>');

Then include export.php when you need it:

include 'export.php';

Depending on your web server setup, you may have to chmod export.php first so the web server can read it.
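Combined with a freshness check against the CSV's modification time (which addresses the "when the CSV file has not changed" requirement from the question), the whole caching step might look like the sketch below. `read_parse_and_index_csv()` is a stand-in for the OP's 3-second parser; the file and function names are placeholders.

```php
<?php
// Hypothetical sketch: rebuild the index only when the CSV is newer
// than the cache file; otherwise include the pre-built PHP array.
function read_parse_and_index_csv(string $csvFile): array {
    // Stand-in for the OP's 3-second CSV parser.
    return ['value' => 1.5, 'payload' => 'example'];
}

function load_index(string $csvFile, string $cacheFile): array {
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($csvFile)) {
        return include $cacheFile;   // the cache file ends in "return <array>;"
    }
    $index = read_parse_and_index_csv($csvFile);
    file_put_contents($cacheFile, '<?php return ' . var_export($index, true) . ';');
    return $index;
}
```

With an opcode cache enabled, the second `include` can skip parsing entirely, which is where the speed-up over unserialize() comes from.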

dave1010
  • 15,135
  • 7
  • 67
  • 64
  • 8
    I know this is old, but there is a better way, still using the same code. Instead of having `file_put_contents('export.php', '<?php $php_array = ' . $export . '; ?>');`, just use `file_put_contents('export.php', '<?php return ' . $export . ';');`. And instead of `include 'export.php';`, use `$data = include 'export.php';`. – Ismael Miguel Mar 06 '15 at 09:35
  • This is an awesome solution. I always use var_export 'ed datas in includes, and this makes it a little easier ! – Gfra54 Oct 15 '15 at 10:10
  • Reading 27MB of data in var_export format was horribly slow. Creating the var_export was very quick. – William Desportes May 06 '21 at 18:55
6

Try igbinary...did wonders for me:

http://pecl.php.net/package/igbinary
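igbinary is a drop-in replacement for serialize()/unserialize() that writes a compact binary format. A minimal sketch (the fallback branch is only there so the snippet also runs where the PECL extension isn't loaded):

```php
<?php
// Sketch: igbinary produces a much smaller binary blob than PHP's
// native text serialization and is typically faster to read back.
function store_index($index, string $file): void {
    $blob = function_exists('igbinary_serialize')
        ? igbinary_serialize($index)   // requires the igbinary PECL extension
        : serialize($index);           // fallback for this sketch only
    file_put_contents($file, $blob);
}

function restore_index(string $file) {
    $blob = file_get_contents($file);
    return function_exists('igbinary_unserialize')
        ? igbinary_unserialize($blob)
        : unserialize($blob);
}
```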

Asad Hasan
  • 301
  • 4
  • 9
5

First you have to change the way your program works: divide the CSV file into smaller chunks. This is an IP datastore, I assume.

Convert all IP addresses to integers.

That way, when a query comes in, you know which chunk to look at. PHP has ip2long() and long2ip() for this. So partition the 0 to 2^32 address range and split the data into, say, 100 smaller files (roughly 50K each instead of 5000K). This approach gives you much quicker (de)serialization.
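The partitioning step might be sketched like this (the bucket count of 100 is the example figure from above; a 64-bit PHP build is assumed so the integer arithmetic doesn't overflow):

```php
<?php
// Sketch: map an IP address to one of 100 buckets covering the
// 0..2^32 range, so a lookup only has to load one small file.
const BUCKETS = 100;

function bucket_for(string $ip): int
{
    // ip2long() can return a negative int on 32-bit builds;
    // sprintf('%u', ...) normalizes it to the unsigned value.
    $n = (int) sprintf('%u', ip2long($ip));
    // floor(n * BUCKETS / 2^32), computed in exact integer math.
    return ($n * BUCKETS) >> 32;
}
```

Each bucket would then be serialized to its own file (e.g. `chunk-42.dat`, a made-up naming scheme), and a query only unserializes the one bucket its address falls into.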

Think smart, code tidy ;)

Peter O.
  • 32,158
  • 14
  • 82
  • 96
Baris CUHADAR
  • 59
  • 1
  • 2
4

It seems that the answer to your question is no.

Even if you discover a "binary serialization format" option, most likely even that would be too slow for what you envisage.

So, what you may have to look into using (as others have mentioned) is a database, memcached, or an online web service.

I'd like to add the following ideas as well:

  • caching of requests/responses
  • your PHP script does not shutdown but becomes a network server to answer queries
  • or, dare I say it, change the data structure and method of query you are currently using
zaf
  • 22,776
  • 12
  • 65
  • 95
  • You have a rich data source which offers many creative ideas, I'm sure you'll come up with something very smooth. – zaf Apr 08 '10 at 17:58
2

I see two options here:

String serialization; in the simplest form something like

  write => implode("\x01", (array) $node);
  read  => explode() + $node->payload = $a[0]; $node->value = $a[1]; etc.

Binary serialization with pack():

  write => pack("fNNa*", $node->value, $node->left, $node->right, $node->payload);
  read  => $node = (object) unpack("fvalue/Nleft/Nright/a*payload", $data);

(`N`, a 32-bit unsigned integer, is used for the child links: the 16-bit `n` could only address 65,535 nodes, fewer than the >100,000 in your tree.)

It would be interesting to benchmark both options and compare the results.
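A fleshed-out sketch of the pack() option, assuming each node is stored as a length-prefixed record and child links are node indices (the record layout here is made up for illustration; `G` is a big-endian double, available since PHP 7.1):

```php
<?php
// Sketch: fixed-layout binary record per node.
// 'G' = big-endian double (the node's float value),
// 'N' = 32-bit unsigned child index, 'a*' = raw payload bytes.
function pack_node(object $node): string
{
    $body = pack('GNN', $node->value, $node->left, $node->right) . $node->payload;
    // Length prefix, since 'a*' would otherwise swallow the next record.
    return pack('N', strlen($body)) . $body;
}

function unpack_node(string $body): object
{
    return (object) unpack('Gvalue/Nleft/Nright/a*payload', $body);
}
```

Because pack() only handles flat values, the tree has to be flattened into an array of such records first; the `left`/`right` indices then rebuild the graph on read.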

user187291
  • 53,363
  • 19
  • 95
  • 127
  • The tree has a root node. Would it be enough to `pack()` that root node, I mean would it pack the entire graph? – Tomalak Mar 30 '10 at 14:19
  • 2
    Then it is not an option, I'm afraid. :-\ – Tomalak Mar 30 '10 at 15:09
  • @Tomalak I would like to enlist your help on an unrelated question here on stack overflow about passing byte arrays to a COM object method by reference. Here it is http://stackoverflow.com/questions/42189245/how-to-pass-an-array-of-bytes-reference-to-a-com-object-method As I pored over the internet I came across related questions posted by persons who were stuck in the same rut here https://bugs.php.net/bug.php?id=41286&thanks=3 I am banking on your expertise to show me how to do it if you don't mind please. I will be so grateful for your help. – Joseph Feb 14 '17 at 08:59
1

If you want speed, writing to or reading from the file system is less than optimal.

In most cases, a database server will be able to store and retrieve data much more efficiently than a PHP script that is reading/writing files.

Another possibility would be something like Memcached.

Object serialization is not known for its performance but for its ease of use, and it is definitely not suited to handling large amounts of data.

selfawaresoup
  • 15,473
  • 7
  • 36
  • 47
  • Is there no binary serialization format for PHP that writes memory bytes to the disk and simply reads them back again? If the CSV is all strings and the index object actually contains less info than the text file, why must its serialized form be so bloated? – Tomalak Mar 30 '10 at 13:38
  • @Tomalak: check out pack/unpack – Robert Mar 30 '10 at 13:47
  • @Robert: Looks like pack works for individual values only, not for complex objects. – Tomalak Mar 30 '10 at 14:00
  • @tomalak: serialize is slower because it does a lot of things that you don't always see when it comes to objects and classes. It also relies heavily on recursion to build a string representation of nested data structures which may also be slow. I think, when you already have table oriented data (csv) a relational database is the best option. – selfawaresoup Mar 30 '10 at 17:32
0

SQLite comes with PHP, so you could use that as your database. Otherwise you could try using sessions; then you don't have to serialize anything, you're just saving the raw PHP object.
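A minimal sketch of the SQLite route, using the bundled SQLite3 extension (the table layout is made up; an on-disk file path would replace `:memory:` for persistence). An index on the value column gives you B-tree lookups without holding the whole structure in memory:

```php
<?php
// Sketch: store (value, payload) rows in SQLite and let its on-disk
// B-tree index replace the hand-built binary search tree.
$db = new SQLite3(':memory:');   // use a file path to persist between runs
$db->exec('CREATE TABLE IF NOT EXISTS idx (value REAL, payload TEXT)');
$db->exec('CREATE INDEX IF NOT EXISTS idx_value ON idx (value)');

// Insert one example row with a prepared statement.
$stmt = $db->prepare('INSERT INTO idx (value, payload) VALUES (:v, :p)');
$stmt->bindValue(':v', 1.5);
$stmt->bindValue(':p', 'example-row');
$stmt->execute();

// Look up a payload by value.
$q = $db->prepare('SELECT payload FROM idx WHERE value = :v');
$q->bindValue(':v', 1.5);
$row = $q->execute()->fetchArray(SQLITE3_ASSOC);
```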

Brent Baisley
  • 12,641
  • 2
  • 26
  • 39
  • Can I share the object between sessions in PHP? – Tomalak Sep 08 '10 at 08:44
  • You couldn't share it between different sessions. Although you could probably get everyone using the same session by setting a custom session ID. Otherwise you would have to look into using shared memory. http://php.net/manual/en/book.shmop.php – Brent Baisley Sep 08 '10 at 10:33
  • Just a quick note in case anyone stumbles upon it - do **NOT** use sessions for storing large objects, and even more so - do **NOT** let people share the same session. This defeats the purpose of using a session in the first place - and, since only one user can access one session id at a time, it will effectively limit request processing to only **one**! Session has to load from disk/database anyway! – SteveB Mar 04 '15 at 15:28
  • 1
    @SteveB Admittedly, the contexts were obscure, but i have used large data-sets in shared/fixed sessions in multiple apps before. If you are building a-typical apps, a-typical solutions are often good ones. – hiburn8 Jul 04 '19 at 15:00
  • 1
    @hiburn8 I can agree with that. If you're fixing a particular issue then it might be a sound idea. Exploring every option available is something I would respect. I might have been too prejudiced based on my experiences. – SteveB Jul 05 '19 at 16:17
0

What about using something like JSON for a format for storing/loading the data? I have no idea how fast the JSON parser is in PHP, but it's usually a fast operation in most languages and it's a lightweight format.

http://php.net/manual/en/book.json.php
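Since JSON cannot represent object references, the tree would have to be flattened first; one way (a sketch, with child links stored as array indices and -1 as a made-up "no child" marker) looks like this:

```php
<?php
// Sketch: flatten the tree into an array of rows before encoding,
// because JSON has no notion of references between objects.
$nodes = [
    ['value' => 1.5, 'payload' => 'root', 'left' => 1, 'right' => -1],
    ['value' => 0.5, 'payload' => 'leaf', 'left' => -1, 'right' => -1],
];
$json = json_encode($nodes);
$back = json_decode($json, true);   // true => decode as arrays, not stdClass
```

Note that json_encode() requires string values to be UTF-8 encoded, as mentioned in the comments.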

Daniel Beardsley
  • 19,907
  • 21
  • 66
  • 79
  • Yes that would work for data, not for object graphs. I was looking for something that dumps the entire object graph to disk so I would have no penalty for re-creating it (in terms of parsing, error checking, object construction). – Tomalak Jul 12 '11 at 20:47
  • JSON cannot represent references. It can represent hierarchies. It's not even necessary to have cyclical references, as soon as there is a `parent` reference, it's over. Besides, serializing/un-serializing is absolutely not what I had in mind. – Tomalak Jul 14 '11 at 09:05
  • You are right, it cannot represent references. Though a `parent` reference would make the object graph cyclical, i.e. being able to get to someplace you had previously been. Hmm... you could have a `sibling` reference and it would still be a-cyclical, making my previous statement wrong. – Daniel Beardsley Jul 17 '11 at 01:49
  • I don't know about _fast_ or memory-efficient, but I have an almost-working [implementation of a JSON-serializer](http://stackoverflow.com/questions/10489876/serialize-unserialize-php-object-graph-to-json-partially-working-solution-need) (and un-serializer) for object-graphs, which does support cyclical references. I don't know if this is what you're looking for - my gut feeling is, the amount of data you're wrestling with is probably better off in a database. – mindplay.dk May 08 '12 at 02:53
  • A restriction for JSON is that json_encode requires that the string values are in UTF-8 encoding. – Jānis Elmeris Sep 09 '13 at 12:31