
Possible Duplicate:
Least memory intensive way to read a file in PHP

I have a problem with speed vs. memory usage.

I have a script which needs to run very quickly. All it does is load multiple files of 1–100 MB, each consisting of a list of values, and check how many of those values exist in another list.

My preferred way of doing this is to load the values from the file into an array with explode(), then loop through that array and check whether each value exists using isset().

The problem is that there are too many values: it uses more than 10 GB of memory (I don't know why it uses so much). So I have resorted to loading the values from the file a few at a time, instead of exploding the whole file. This cuts memory usage right down, but is VERY slow.
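For reference, the chunked reading described above can be sketched with stream_get_line(), which reads up to a delimiter without ever building the full array ('bigfile.txt' and the check list are the placeholder names from the question):

```php
<?php
// Sketch: read the '|'-delimited file one value at a time instead of
// exploding it all at once. Only one value is in memory at a time.

$check = array_flip(array('lots', 'of', 'values', 'here'));

$matches = 0;
$fh = fopen('bigfile.txt', 'r');
while (($value = stream_get_line($fh, 1048576, '|')) !== false) {
    if (isset($check[$value])) {
        $matches++;
    }
}
fclose($fh);

echo $matches;
```

This keeps memory flat, but the per-value function-call overhead is likely what makes the chunked approach feel slow compared to one big explode().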

Is there a better method?

Code Example:

$check = array('lots', 'of', 'values', 'here');
$check = array_flip($check); // values become keys, for O(1) isset() lookups

$values = explode('|', file_get_contents('bigfile.txt'));

$matches = 0;
foreach ($values as $value) {
    if (isset($check[$value])) $matches++;
}
Alasdair
  • You should provide the code and test data if you really expect to get some useful answers. – hakre Nov 20 '11 at 12:49
  • It's not a duplicate of that question at all, a very different issue. The code and data is as I described, you don't need to see the code to understand the problem. – Alasdair Nov 20 '11 at 12:50
  • What about code? With some implementation details it would be much easier to give optimization tips. And maybe example files which you read in. – breiti Nov 20 '11 at 12:54
  • I just added some code while you were commenting since 2 people asked, I thought it was quite obvious. No need to be rude. The question is not a duplicate, but it took you all of 10 seconds to mark it as so, unlikely you even had time to understand the question. – Alasdair Nov 20 '11 at 12:59
  • Thanks for editing. I'd still argue it's a duplicate though. When memory is an issue you don't want to load the file into memory but read it sequentially. Apparently it's a CSV-like file, because you explode on |, so try the various approaches given in the duplicate. – Gordon Nov 20 '11 at 13:01
  • The issue is not in loading the file. I am already loading it in partially, which is too slow, I know how to do this, the problem is that it does not work, so the answer to the other question is irrelevant to this one. My concern is how to make it both fast & not max out the memory. – Alasdair Nov 20 '11 at 13:03
  • You could load the whole file and then use [`strtok`](http://php.net/strtok) to tokenize the contents. That way you don't need an array with the file contents (doubled memory usage _at least_) – knittl Nov 20 '11 at 13:15
  • I am not convinced but will remove the possible duplicate. You should clarify the question though to explain what you are looking for in an answer or rather what you are not looking for. – Gordon Nov 20 '11 at 13:16
  • `strtok` sounds like it could be what I'm looking for. I will test it. Why not post that as an answer? Otherwise I'll have nothing to accept if it works. – Alasdair Nov 20 '11 at 13:19
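knittl's strtok() suggestion could look like the sketch below. It still holds one copy of the file in memory (from file_get_contents()), but avoids the second, much larger exploded array; the file name and check list are taken from the question's example:

```php
<?php
// Sketch of the strtok() approach: walk the raw string token by token
// instead of exploding it into a second array of all the values.

$check = array_flip(array('lots', 'of', 'values', 'here'));
$data  = file_get_contents('bigfile.txt'); // one copy of the file

$matches = 0;
$token = strtok($data, '|');
while ($token !== false) {
    if (isset($check[$token])) {
        $matches++;
    }
    $token = strtok('|'); // subsequent calls continue from the internal pointer
}

echo $matches;
```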

4 Answers


Maybe you could code your own C extension of PHP (see e.g. this question), or code a small utility program in C and have PHP run it (perhaps using popen)?
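A minimal sketch of the popen() route; `countmatches` is a hypothetical compiled C utility (not part of PHP) that is assumed to take the data file and the check list and print a single number on stdout:

```php
<?php
// Hypothetical: delegate the heavy lifting to an external C program and
// read its one-line result back through a pipe.

$handle = popen('./countmatches bigfile.txt checklist.txt', 'r');
if ($handle === false) {
    die('failed to start countmatches');
}

$matches = (int) trim(fgets($handle));
pclose($handle);

echo $matches;
```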

Basile Starynkevitch
  • Actually I have already converted the PHP to C++ code using HipHop, so it's not a language issue, it's an implementation issue. Unless there is an entirely different way of storing the variables. – Alasdair Nov 20 '11 at 12:52
  • No, native C++ or C data structures are more efficient than PHP ones, and an automated conversion has to keep the original PHP ones. – Basile Starynkevitch Nov 20 '11 at 12:56
  • Understood. Though I'm not sure how to do this, as I have no real experience with C or C++. – Alasdair Nov 20 '11 at 13:01
  • You can try any other language whose implementation is more efficient than PHP: OCaml, Common Lisp, Haskell, Java, ... or you can put the information not in a file, but in a database... – Basile Starynkevitch Nov 20 '11 at 13:04

This seems like a classic use case for some form of key/value-oriented NoSQL datastore (MongoDB, CouchDB, Riak), or maybe even just a large memcache instance.

Assuming you can load the large data files into the datastore ahead of when you need to search, and that you'll use the loaded data more than once, you should see some impressive gains (as long as your queries, mapreduce jobs, etc. aren't awful). Judging by the size of your data, you may want to look at a datastore that doesn't need to hold everything in memory to be quick.

There are plenty of PHP drivers (and tutorials) for each of the datastores I mentioned above.
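As a rough illustration with the Memcached extension (memcache being one of the stores mentioned above); the key prefix, server address, and file layout are assumptions:

```php
<?php
// Illustrative only: pre-load the check list into memcache once, then
// stream the big file and test each value for membership.

$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// One-off load step, done ahead of the searches.
foreach (array('lots', 'of', 'values', 'here') as $v) {
    $mc->set('check:' . $v, 1);
}

// Query step: no large PHP array is ever built.
$matches = 0;
$fh = fopen('bigfile.txt', 'r');
while (($value = stream_get_line($fh, 1048576, '|')) !== false) {
    if ($mc->get('check:' . $value) !== false) {
        $matches++;
    }
}
fclose($fh);

echo $matches;
```

In practice you would batch the lookups with Memcached::getMulti() to cut the per-value network round trips, which would otherwise dominate the runtime.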

James Butler

Open the files and read through them line by line. Maybe use MySQL, either for the import (LOAD DATA INFILE), for the resulting data, or both.
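A hedged sketch of that idea with PDO; the table names, column name, and file path are assumptions, `check_values` is presumed indexed and pre-filled with the check list, and LOAD DATA INFILE must be permitted by the server:

```php
<?php
// Assumed schema: big_values(value) and check_values(value).

$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

// Bulk-import the '|'-separated file, treating '|' as the row separator.
$pdo->exec("LOAD DATA INFILE '/path/to/bigfile.txt'
            INTO TABLE big_values
            LINES TERMINATED BY '|' (value)");

// Let the database do the membership counting via an indexed join.
$stmt = $pdo->query(
    "SELECT COUNT(*) FROM big_values b
     JOIN check_values c ON c.value = b.value"
);
echo $stmt->fetchColumn();
```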

Joop Eggen

It seems you need a proper search engine.

The Sphinx search server can be used to search your values really fast.

Your Common Sense