
I'm doing scientific research, processing millions of combinations of multi-megabyte arrays.

To answer this question you will need knowledge/experience of all of the following:

  • how HHVM is able to cache data structures in RAM between requests
  • how to tell HHVM that data structures will be constant
  • how to declare array index and value types

I need to process the arrays in their entirety, so it's a lot of data to be loaded and processed (millions of requests within minutes on a LAN). The faster I can complete requests, the quicker I can complete my work. If HHVM has to do the work of loading this data on each request, it accounts for a significant fraction of the time to complete the request (sometimes more than half, depending on the complexity of the analysis I'm doing at the time).

I have found a method that allows me to keep these data structures cached in RAM (no loading from files, no interpreting code, no pushing to the array hundreds of thousands of times for no reason, no pointless repetitive unserialize, etc.), and thus I have eliminated this massive, measurable delay.

I have 3 questions regarding how I can make this even faster:

  1. Is the way I'm doing it now creating a global scope penalty?
  2. How can I declare my arrays as constant and tell HHVM what data types to expect? If I declare my arrays as constant, is it even necessary to declare the types for HHVM?
  3. Instead of using nested arrays, would it be faster to use 3 separate data structures (ImmVector, PackedArray), or to define a class?

Keep in mind that anything that prevents HHVM from caching the data structure in RAM between requests should be regarded as unacceptable.

Lookuptable35543.php

<?php
$data = [
    ["uuid (20 chars)", 5336, 7373],
    ["uuid (20 chars)", 5336, 7373],
    #more lines as above
];
?>

Some of these files are many MB in size, and there are a lot of them.

Main.php

<?php
function main() {
  require '/path/to/Lookuptable35543.php';
  #(Do stuff with $data)
}
?>

This is working quite well: as Main.php gets thousands of requests in a short period of time, HHVM keeps Lookuptable35543.php's data structure in memory, avoiding pointless processing and IO, as it just sits in RAM, ready for use. (I have more than enough RAM.)

Unfortunately, the only way I know of to make HHVM hold the lookup table in RAM is to set $data in the global scope inside my lookup####.php file (and then require the lookup file into a function in the data-processing file, Main.php). This way HHVM doesn't bother reloading the file or re-executing the code to create $data, because it can see that $data can be determined at compile time and will never change during runtime. This works, but I don't know if there is a penalty from having $data exist in the lookup####.php file's global scope. (Or maybe it's not global at all, because it is required into Main.php's function?)

  1. What if I return $data from a function inside Lookup.php and call that function from Main.php, like this?
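A minimal sketch of what I mean:

Lookup.php

<?php
function getData() {
  return [
    ["uuid (20 chars)", 5336, 7373],
    #more lines as above
  ];
}

Main.php

<?php
function main() {
  require '/path/to/Lookup.php';
  $data = getData();
  #(Do stuff with $data)
}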

Would the HHVM JIT keep the result of getData() in RAM? Somehow I associate functions with unpredictability... but maybe HHVM is clever enough to know that the function's result can be determined at compile time and never changes?

I can't put the lookup table inside Main.php because I require different lookup tables based on the type of request.

  2. Is there a way I can tell HHVM that my outer array will always have an integer index that never changes, and that the values of the outer array will always be arrays? Perhaps I need to use ImmVector? Then, is there a way to tell HHVM that my inner array will always be a fixed-length string followed by 2 integers (always, with no extra elements, and contents that never change)?

I'd prefer not to use OO or create a class. How can I declare types, procedural style? If a class is absolutely necessary can you please give example code suitable for my requirements above?

  3. Will it be faster if I don't nest arrays? I just realized I could have one array with an integer index and fixed-length string values, then a 2nd array with an integer index and integer values, and a 3rd one with an integer index and integer values (see the sketch below).
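A sketch of the flattened layout I have in mind (variable names and sample values are illustrative):

<?php
$uuids = ['uuid1', 'uuid2'];  # int index => fixed-length string
$vals1 = [425, 658];          # int index => int
$vals2 = [244, 836];          # int index => int
# Row $n of the nested version becomes [$uuids[$n], $vals1[$n], $vals2[$n]].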

If you're not familiar with this HHVM caching technique, please do not waste our mutual time suggesting a database, Redis, APC, unserialize, etc. The fastest option is for HHVM to just keep my various $data variables in RAM. Even unserializing $data from a ramdisk file is slow, because the entire data structure must be parsed as a string and converted into a data structure in memory for every request. APC has the same problem as far as I know. I don't even want to have to copy $data. The lookup tables are immutable and read-only; they must just stay fully structured in RAM. My current caching solution (at the top of this question) has already given me huge gains, but as per my 3 questions I think there may be more gains to be had.

In case you're wondering, I have measured the latency of various data-loading and caching methods. Now I basically want to keep the caching setup I have, but give the HHVM JIT maximum confidence about how to type my data, so it can save time by not running type checks or even bounds (array size) checks.

Edit: OK, so nobody has been able to give me any code examples yet, so I'm just trying things out.

Here's what I've found out so far.

  • const arrays don't work yet in HHVM. const foo = ['uuid1',43,43]; throws an error saying HHVM only supports constants with scalar values.

  • Vector with array values: I don't know how it will perform yet... I expect it will be better than a normal array. This is valid HH code (see the sketch below).
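Something like this (a minimal sketch; the sample values are illustrative):

<?hh
$iv = ImmVector {
    ['uuid1', 425, 244],
    ['uuid2', 658, 836],
};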

This is progress, because HHVM should be able to cache this in the same way: HHVM knows this whole structure is constant, and HHVM knows the indexes are all integers.

What I'm still not entirely happy about with this structure is the following. Consider this code:

for ($n = 0; $n < count($iv); ++$n) if ($x > $iv[$n][1]) dosomething();

Will HHVM perform a type check on $iv[$n][1] on every loop iteration? In my definition of $iv above, there is nothing that says the 2nd element of the inner array will be an integer. How can I improve on this? Can disabling the type checker be of any use? Does this only hide errors from the external type checker, or does it prevent HHVM from constantly doing runtime type checks? (I'm thinking it's the first thing.)

Perhaps if I could make my own user-defined type that would solve the problem?

<?hh
#I don't know what mechanisms for UDT's exist, so this code is made-up
CreateUDT foo = <string,int,int>;
$iv = ImmVector<foo> {
    ['uuid1',425,244],
    ['uuid2',658,836]
};
print_r($iv);

I found a reference to this in the Hack collections literal syntax (Vector<Foo>); unfortunately, it might not be available to use yet.
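In the meantime, the closest real Hack I can see is a tuple type alias plus a typed function return. An untested sketch (assuming tuple type aliases and tuple() literals work the way I expect inside collection generics):

<?hh
# Untested sketch: a tuple type alias standing in for the made-up UDT above.
type Foo = (string, int, int);

function getIv(): ImmVector<Foo> {
  return ImmVector {
    tuple('uuid1', 425, 244),
    tuple('uuid2', 658, 836),
  };
}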

user1862165

2 Answers


I'm a software engineer at Facebook working on HHVM.

This entire question reeks of premature optimization to me. Have you done profiling and determined that loading this array is actually a bottleneck for your app? (Not just microbenchmarks, but how it actually affects the performance, latency, RPS, etc. of realistic pageloads.) Have you also isolated it from other effects? For example, if this array is a cache or some sort of precomputed data, you need to isolate the win of precomputing the data from the actual time to load it by caching it in various different ways.

In general, HHVM is very good at dealing with arrays, since they are so hot in nearly every codepath -- and in particular at constant arrays like this one. To your questions about how to inform it of the shape and types of things in the arrays, HHVM can figure that all out for itself, and is very good at doing so on constant arrays composed entirely of constants. (And the ways it thinks about arrays aren't quite the ways you think about arrays, so it can probably do a better job anyway!) Basically, unless profiling says this is actually a hotspot -- which I'm pretty skeptical of -- I wouldn't worry too much about it. A couple general notes to be aware of:

  • Measure every performance diff. Don't prematurely optimize -- use profiling to guide. The developer productivity lost by premature optimizations getting in the way can be lethal.
  • Get things out of toplevel ("pseudomains") as much as possible. A function which returns a static or constant array should be just fine, and will in general help HHVM optimize code even better (see the sketch after this list).
  • Avoid references as much as possible, especially in this array if you care about performance so much.
  • You should probably look into repo authoritative mode, which can help HHVM optimize lots of things even more -- but in particular for this case, the more aggressive inlining that repo auth mode can do might be a win.
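A minimal sketch of the "function returning a static array" pattern mentioned above (the function name is illustrative):

<?php
// Keeps the table out of the pseudomain; HHVM can see that the static
// array is initialized once from constants and is never modified.
function getLookupTable() {
  static $data = [
    ["uuid (20 chars)", 5336, 7373],
    // ...more rows as above
  ];
  return $data;
}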

Edit, aside:

because then the entire data structure must be parsed as a string and converted into a data structure in memory for every request. APC has the same problem as far as i know

This is exactly what I mean by premature optimization: you're rejecting APC without even trying it, even though it might be a cleaner way of doing what you want. It turns out that, in most cases, HHVM can actually optimize away the serialization/deserialization of storing arrays in APC, particularly if they are constant arrays that are never modified. As above, HHVM is very good at optimizing lots of common patterns. Just write code that's clean, profile it, and fix the hotspots.
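For reference, the APC pattern I have in mind is a sketch like this (the key name is illustrative):

<?php
// Fetch the table from APC; on a miss, build it once and store it.
$data = apc_fetch('lookuptable35543', $found);
if (!$found) {
  require '/path/to/Lookuptable35543.php'; // defines $data
  apc_store('lookuptable35543', $data);
}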

Josh Watzman
  • It is so ridiculous that you assume I'm doing premature optimization. I have even explicitly said that I've measured the performance differences. You're wasting both of our time. I just need to know how to define immutable data structures that HHVM can keep cached in RAM. When was the last time you loaded about 100MB of arrays into memory to service each request, with millions of variations, running as fast as HHVM can go? I've worked on this codebase for over 6 months and have profiled and improved it constantly. I stopped working on it before the Hack spec was out; I'm resuming work now. – user1862165 May 11 '15 at 08:34
  • Instead of answering my question you assume I'm doing it wrong, even though there is no reason to believe I am prematurely optimizing. I said I measure, and you ignored that. I've tried unserialize (lol, I used it primarily before my current method) and I've tried APC (though maybe not every possible way with APC). If you read my question it should be obvious that there must be a better way than parsing 100MB of serialized arrays on every request. I'm doing scientific research, crunching a lot of data. PHP is not ideal because of the request model, but with HHVM it's sort of possible to overcome that. – user1862165 May 11 '15 at 08:41
  • In the past I was not aware of a way to tell HHVM that a certain APC cache is constant. If I am able to, and can optimize away the unserialization, then I have the burden of managing when to cache and expire data. At the moment HHVM is managing that nicely for me. That still does not address my question one bit about telling HHVM what type things are. – user1862165 May 11 '15 at 08:45
  • So how about just answering the questions? – user1862165 May 11 '15 at 08:56
  • I did answer your question. As I said above, HHVM is already very good at optimizing static arrays, and is very likely going to do a better job of figuring out the right way to type and cache the array than you are. It does not, in fact, reparse and regenerate the array on every request -- it's actually smart enough to have only one copy of the array in the entire process, shared among every request. Repo authoritative mode is my only recommendation for improving this. – Josh Watzman May 11 '15 at 16:54

Okay, I've solved my first question.

  1. I don't have any global-scope issues. My require is being done from inside function main(), so it's as if the code from lookuptable####.php were inserted into function main(). HHVM docs: "If the include occurs inside a function..." Basically, if you were to open lookuptable####.php, it looks like the code is in global scope, but that's not the file being requested from HHVM. Main.php is the one being requested, thus there is no code in global scope.

  2. I think I've answered my 2nd question; it's currently at the bottom of my question. I'm not 100% convinced, but I'm pretty happy to move ahead and test it.

user1862165
  • After wasting my time with all of this, the HHVM guys came back and said that ImmVectors are currently slower than arrays. Although they should in theory be faster, they're not, because they haven't been optimized yet. They're probably implemented as PHP classes. – user1862165 Jun 05 '15 at 12:50