
Imagine you have the following array of integers:

array(1, 2, 1, 0, 0, 1, 2, 4, 3, 2, [...] );

The array goes on like this up to one million entries; only instead of being hardcoded, the integers have been pre-generated and stored in a JSON-formatted file (approximately 2 MB in size). The order of these integers matters: I can't randomly generate the array every time, because it should be consistent and always have the same values at the same indexes.

If this file is read back in PHP afterwards (e.g. using file_get_contents + json_decode), it takes 700 to 900 ms just to get the array back. "Okay," I thought, "that's probably reasonable, since json_decode has to parse about 2 million characters; let's cache it." APC caches it in an entry that takes about 68 MB, which is probably normal since zvals are large. Retrieving this array back from APC, however, also takes a good 600 ms, which in my eyes is still way too much.

Edit: APC does serialize/unserialize to store and retrieve content, which with a million-item array is a lengthy and heavy process.
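
For reference, a rough sketch of how these timings could be reproduced; the file name and cache key below are placeholders:

<?php
// Time the plain file_get_contents + json_decode path.
$start = microtime(true);
$array = json_decode(file_get_contents('integers.json'), true);
printf("json_decode: %.3f s\n", microtime(true) - $start);

// Cache it in APC, then time fetching it back.
apc_store('integers', $array);
$start = microtime(true);
$array = apc_fetch('integers');
printf("apc_fetch:   %.3f s\n", microtime(true) - $start);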

So the questions:

  • Should I expect this latency if I intend to load a one-million-entry array in PHP, no matter the data store or the method? As far as I understand, APC stores the zval itself, so theoretically retrieving it from APC should be as fast as it can possibly get (no parsing, no conversion, no disk access).

  • Why is APC so slow for something so seemingly simple?

  • Is there any efficient way to load a one-million-entry array entirely in memory using PHP, assuming RAM usage is not a problem?

  • If I were to access only slices of this array based on indexes (e.g. loading the chunk from index 15 to index 76) and never actually have the entire array in memory (yes, I understand this is the sane way of doing it, but I wanted to know all the sides), what would be the most efficient data store for the complete array? Obviously not an RDBMS; I'm thinking Redis, as sketched below, but I would be happy to hear other ideas.
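
A rough sketch of what slice access could look like with a Redis list via the phpredis extension; the key name and connection details are placeholders:

<?php
// Connect to a local Redis instance (placeholder host/port).
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// One-time load: push the integers onto a list (Redis stores them as strings).
// foreach ($integers as $n) { $redis->rPush('integers', $n); }

// Fetch only indexes 15 through 76, inclusive, without loading the rest.
$slice = $redis->lRange('integers', 15, 76);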

  • Have you tried [SplFixedArray](http://php.net/manual/en/class.splfixedarray.php)? – Buddy Jul 28 '12 at 15:24
  • @Buddy yup, not much difference, probably uses less memory but APC takes equally long. – Mahn Jul 28 '12 at 15:27
  • If the numbers are small and the array is static, can't you use a single 1 MB string object instead? – 6502 Jul 28 '12 at 15:31
  • @6502 Good idea. I'm going to look into that. – Mahn Jul 28 '12 at 15:33
  • Redis [lists](http://redis.io/topics/data-types#lists) are also worth a try. The only problem here is that you'll need to store ints as strings, but I'd definitely try that. – Buddy Jul 28 '12 at 16:03
  • While I would look for external solutions to this problem, it wouldn't be very hard to write your own PHP extension that loaded a sequence of integers from a file in a compact, single allocated block of memory and exposed it as an array to the userland script. – Matthew Jul 28 '12 at 17:58
  • You should ask yourself why you would need a single array containing 1 million values. For fun? Hint: break the one big array into 1000 (or more) smaller arrays. – ajreal Jul 28 '12 at 18:18
  • @Buddy yep, that's also something I'm looking into, still haven't decided though. – Mahn Jul 30 '12 at 01:29

4 Answers


Say the integers are all 0-15. Then you can store 2 per byte:

<?php
$data = '';
for ($i = 0; $i < 500000; ++$i)
  $data .= chr(mt_rand(0, 255));

echo serialize($data);

To run: php ints.php > ints.ser

Now you have a file with a 500,000-byte string containing 1,000,000 random integers from 0 to 15.

To load:

<?php
$data = unserialize(file_get_contents('ints.ser'));

// Each byte packs two values: the high nibble holds the even index,
// the low nibble holds the odd index.
function get_data_at($data, $i)
{
  $data = ord($data[$i >> 1]);

  return ($i & 1) ? $data & 0xf : $data >> 4;
}

for ($i = 0; $i < 1000; ++$i)
  echo get_data_at($data, $i), "\n";

The loading time on my machine is about .002 seconds.

Of course this might not be directly applicable to your situation, but it will be much faster than a bloated PHP array of a million entries. Quite frankly, having an array that large in PHP is never the proper solution.

I'm not saying this is the proper solution either, but it definitely is workable if it fits your parameters.

Note that if your array had integers in the 0-255 range, you could get rid of the packing and just access the data as ord($data[$i]). In that case, your string would be 1M bytes long.

Finally, according to the documentation of file_get_contents(), PHP will memory-map the file if supported. If so, your best performance would come from dumping raw bytes to a file and using it like:

$ints = file_get_contents('ints.raw');
echo ord($ints[25]);

This assumes that ints.raw is exactly one million bytes long.
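
A possible extension, sketched under the assumption that caching a plain string in APC avoids the heavy per-element unserialization the question ran into; the cache key is a placeholder:

<?php
// Cache the raw byte string; fetching a plain string back from APC is cheap
// compared to rebuilding a million-element array.
if (($ints = apc_fetch('ints_raw')) === false) {
    $ints = file_get_contents('ints.raw');
    apc_store('ints_raw', $ints);
}

echo ord($ints[25]);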

Matthew
  • Matthew, thanks for the longhand version of my suggestion. A couple of points: Yes, `f_g_s()` uses mmap on the file in most cases (e.g. not for an NFS-mounted file), but it still then copies the contents into a locally allocated string. Yes, the `ord($ints[NNN])` access generates the most efficient opcode sequence. Your 2 ms is because the file contents are VFAT cached. This may not be the case on a production server. – TerryE Jul 29 '12 at 09:04
  • This is basically what I imagined when 6502 and TerryE suggested it, but it's great to see it written nonetheless. Those 2 ms are probably because the file was cached, like @TerryE mentions, but there's a very good chance this can be cached in APC just fine with no unserializing overhead. I'll have a look at it. – Mahn Jul 30 '12 at 01:50
  • Indeed, storing and retrieving this from APC can be done in a breeze. I'm going to accept this answer because performance- and memory-wise it's as good as it can possibly get if one is to store the entire array in memory; whether or not I should be storing the entire thing in memory is another question I have yet to decide on, but in the meantime this will do a better job. – Mahn Jul 30 '12 at 03:52

APC stores the data serialized, so it has to be unserialized as it is loaded back from APC. That's where your overhead is.

The most efficient way of loading it is to write it out to a file as PHP code and include() it, but you're never going to have any level of efficiency with an array containing a million elements... it takes an enormous amount of memory, and it takes time to load. This is why databases were invented, so what is your problem with a database?
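
A minimal sketch of that include() approach, assuming the array is exported once to a PHP source file (the file name is a placeholder); with an opcode cache the generated file is only compiled once:

<?php
// One-time export: write the array out as native PHP source.
file_put_contents('ints.php', '<?php return ' . var_export($array, true) . ';');

// On later requests include() hands the array back; an opcode cache means
// the 2 MB of generated source is not re-parsed every time.
$array = include 'ints.php';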

EDIT

If you want to speed up serialize/unserialize, take a look at the igbinary extension.
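
A minimal sketch of igbinary as a drop-in for serialize()/unserialize(), assuming the extension is installed (APC can also be pointed at it through the apc.serializer ini setting); the file name is a placeholder:

<?php
// igbinary produces a compact binary format that is faster to decode
// than PHP's default text serialization.
file_put_contents('ints.igb', igbinary_serialize($array));

$array = igbinary_unserialize(file_get_contents('ints.igb'));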

Mark Baker
  • Yes, I never heard of a database :) Nothing against databases, I just figured since the data is meant to be static something simple that lives in memory like APC would do a better job; it's a pity to find out APC serializes the data though, I thought that was not the case. – Mahn Jul 28 '12 at 15:27
  • You'll find that most caches (APC, memcache, redis, etc) will need to serialize the data because they're designed as cross-platform tools, so not designed specifically for PHP datatypes/zvals. – Mark Baker Jul 28 '12 at 15:31
  • @MarkBaker but that's the thing, APC was created specifically as a PHP extension and should theoretically work with zvals directly. – Mahn Jul 28 '12 at 15:31
  • APC was written for PHP as an opcode cache, with a side benefit of storing serialized user data, but the userdata storage wasn't specific to PHP zvals but simple string storage (hence serialized) – Mark Baker Jul 28 '12 at 15:36
  • @MarkBaker thanks for mentioning igbinary, I'll check it out. – Mahn Jul 30 '12 at 03:54

> I can't randomly generate it every time because it should be consistent and always have the same values at the same indexes.

Have you ever read up on pseudo-random numbers? There's this little thing called a seed which addresses this issue.
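
A minimal sketch of that idea; the seed value and the 0-4 range are assumptions based on the sample data, and the sequence is only reproducible on the same PHP version / mt_rand implementation:

<?php
// With a fixed seed, mt_rand() yields the same sequence on every run,
// so the array can be regenerated on demand instead of being stored.
mt_srand(42);                      // arbitrary but fixed seed
$values = array();
for ($i = 0; $i < 1000000; ++$i) {
    $values[] = mt_rand(0, 4);     // assumed value range from the sample
}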

Also, benchmark your options and claims. Have you timed the file_get_contents vs. the json_decode? There is a trade-off to be made here between storage and access costs. E.g. if your numbers are 0..9 (or 0..255), then it may be easier to store them in a 2 MB string and use an access function on it. 2 MB will load faster, whether from the FS or from APC.

TerryE
  • Yeah, algorithmically generating the same list of integers based on a fixed seed is one of the options I was thinking of, and possibly the most elegant; I will look into pseudo-random numbers and see if it could fit what I need. – Mahn Jul 30 '12 at 01:41

As Mark said, this is why databases were created: to allow you to search (and manipulate, but you might not need that) data effectively based on your regular usage patterns. It might also be faster than implementing your own search using the array. I'm guessing we're talking about somewhere close to 200-300 MB of data (before serialization) being serialized and unserialized each time you access the array.

If you want to speed it up, try assigning each element of the array separately; you might trade function call overhead for time spent in serialization. You could also extend this with your own extension, wrapping your dataset in a small retrieval interface.
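
A sketch of one way to do that, storing the array in fixed-size chunks rather than literally one APC key per element; the chunk size and key prefix are placeholders:

<?php
// Store the array in chunks so each apc_fetch() only unserializes a small
// array instead of the full million entries.
$chunkSize = 10000;
foreach (array_chunk($array, $chunkSize) as $n => $chunk) {
    apc_store("ints.$n", $chunk);
}

// Read back indexes 15..76 without materialising the whole array
// (both indexes fall inside chunk 0 for this chunk size).
$slice = array_slice(apc_fetch('ints.0'), 15, 62);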

I'm guessing the reason why you can't directly store the zvals is that they contain internal state, and you simply can't just point the variable symbol table to the previous table.

MatsLindh