Not sure why there isn't yet a "hack" tag (sorry to list in PHP), but...

I am wondering whether, and how, it would be possible to walk an array on multiple threads using Hack's multithreaded/async feature. I don't really need this, but I'm curious and it might be useful.

I've looked at the documentation for "Hack"'s async feature

http://docs.hhvm.com/manual/en/hack.async.php

and it's a bit difficult.

Here is the basic idea of what I would like to make (or see done):

a) Split the array into x sections and process them on x "threads", or b) create x threads that each process the next available item, i.e. when a thread finishes an item, it asks the parent thread for a new one to process. Hack doesn't do "threads", but the same thing is represented by an async function.

Basically, the end goal is to quickly optimize a standard foreach block to run on multiple threads, so minimal code change is required, and also to see what hack can do and how it works.

I've come up with some code as a sample, but I think I've totally got the idea wrong.

class ArrayWalkAsync
{
    protected $array;
    protected $threads = Array();
    protected $current_index = 0;
    protected $max_index;
    // number of workers; kept separate from the $threads array above
    protected $thread_count = 4;

    public async function array_walk($array): Awaitable<void>
    {
        $this->array = $array;
        $this->max_index = count($array) - 1;
        $result = Array();
        for ($i=0;$i<$this->thread_count;$i++)
        {
            $this->threads[] = new ArrayWalkThread();
        }
        $continue = true;
        while($continue)
        {
            $awaitables = Array();
            for ($i=0;$i<$this->thread_count;$i++)
            {
                $a = $this->proccesNextItem($i);
                if ($a)
                {
                    // collect the awaitable so it can be awaited below
                    $awaitables[] = $a;
                } else {
                    $continue = false;
                }
            }
            // wait for each
            foreach ($awaitables as $awaitable_i)
            {
                $item = await $awaitable_i;
                // do something with the result
            }
        }
    }

    protected function proccesNextItem($thread_id)
    {
        if ($this->current_index > $this->max_index)
        {
            return false;
        }
        $a = new ArrayWalkItem();
        $a->value = $this->array[$this->current_index];
        $a->index = $this->current_index;
        $this->current_index++;
        return $this->threads[$thread_id]->process($a,$this);
    }

    public function processArrayItem($item)
    {
        $value = $item->value;
        sleep(1);
        $item->result = 1;
    }

}


class ArrayWalkThread
{
     async function process($value,$parent): Awaitable<?ArrayWalkItem>
     {
        // $value is the ArrayWalkItem handed over by the parent
        $parent->processArrayItem($value);
        return $value;
     }

}

class ArrayWalkItem
{
    public $index;
    public $value;
    public $result;
}
Josh Watzman
user1122069

1 Answer

Hack's async functions aren't going to do what you want. In Hack, async functions are not threads. It's a mechanism to hide IO latency and data fetching, not to do more than one computation at once. (This is the same as in C#, from where the Hack feature derives.)

This blog post on async functions has a good explanation:

For several months now, Hack has had a feature available called async which enables writing code that cooperatively multitasks. This is somewhat similar to threading, in that multiple code paths are executed in parallel, however it avoids the lock contention issues common to multithreaded code by only actually executing one section at any given moment.

“What’s the use of that?”, I hear you ask. You’re still bound to one CPU, so it should take the same amount of time to execute your code, right? Well, that’s technically true, but script code execution isn’t the only thing causing latency in your application. The biggest piece of it probably comes from waiting for backend databases to respond to queries.

[...]

While [an http] call is busy sitting on its hands waiting for a response, there’s no reason you shouldn’t be able to do other things, maybe even fire off more requests. The same goes for database queries, which can take just as long, or even filesystem access which is faster than network, but can still introduce lag times of several milliseconds, and those all add up!

Sorry for the confusion on this point -- you're not the only one to try to erroneously use async this way. The current docs do a terrible job of explaining this. We're doing a revamp of the docs; the current draft does a somewhat better job, but I'm going to go file a task to make sure it's crystal clear before we launch the new docs.

Josh Watzman
  • Can you write a HHVM function called array_walk_multithreaded($array,$callable,$threads) and array_walk_multithreaded_greatest_value (returns index of the greatest result of $callable)? If $callable can only access $key and $value, then it could be relatively easy to make. Maybe FB doesn't need, but it is likely that they process arrays also. – user1122069 Oct 06 '15 at 14:26
  • I mean, to write it in c++ and let it run like a plugin. – user1122069 Oct 06 '15 at 15:28
  • The expense of spawning the threads, marshaling data from the PHP VM into C++ and then between the threads (and back into the VM), dealing with the locking issues, etc are going to *far* outweigh the benefit of doing this in parallel. In many applications, CPU is rarely your bottleneck -- you're going to spend most of your time blocked on IO, which is exactly what async functions are supposed to help with. – Josh Watzman Oct 06 '15 at 18:43
  • For the record, my script uses 100% CPU for a good duration, and another app of mine (live and with customers) used so much CPU that I added sleep statements to throttle it. I agree that it is not in general useful - I was told regarding API rate limits that "for long running tasks just setup a progress bar - what is the difference between 10 minutes and 40 minutes". If I'm curious enough, then I'll learn C++ and port my script there. – user1122069 Oct 07 '15 at 02:37
  • Yeah, the request-based model of PHP is not really well suited for long-running requests/scripts (on HHVM or PHP5/PHP7). It's certainly doable if you're careful, but you might be better off in another language. If you really want to use PHP, one option might be to split up the work into large, coarse work units, and then use curl requests to localhost to trigger them off in separate shorter-lived requests, controlled by a master who is doing little but farming them out. – Josh Watzman Oct 07 '15 at 06:45
  • I've built my threaded script with C++. It's my first C++ experience, and it runs 9x faster on 12 cores, at least according to initial tests. Roughly consistent with the number of cores, it is 4x faster on 4 cores. – user1122069 Oct 08 '15 at 00:47
  • Unfortunately, it is slower, even on not-threaded function on the second set of iterations (50 instead of 10,000). Can't figure out how that could be. Must be some slight change in the algorithm. – user1122069 Oct 08 '15 at 17:25
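
For anyone landing here later, option (a) from the question (split the array into x sections, one real thread per section) can be sketched in C++ roughly as follows. This is only an illustration, not the script mentioned in the comments: the name array_walk_multithreaded is borrowed from the comment above and is hypothetical (it is not an HHVM or standard-library function), and the sketch assumes the callback touches only the index and element it is given, so no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, named after the function proposed in the comments.
// Calls fn(index, element) for every element, splitting the vector into
// contiguous chunks with one std::thread per chunk.
// Assumption: fn only reads/writes the element it is handed (no shared state).
template <typename T, typename Fn>
void array_walk_multithreaded(std::vector<T>& items, Fn fn, std::size_t threads) {
    if (items.empty()) return;
    // Use at least one thread and never more threads than elements.
    threads = std::max<std::size_t>(1, std::min(threads, items.size()));
    const std::size_t chunk = (items.size() + threads - 1) / threads;  // ceiling division

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(begin + chunk, items.size());
        if (begin >= end) break;  // no elements left for this worker
        workers.emplace_back([&items, &fn, begin, end] {
            for (std::size_t i = begin; i < end; ++i) fn(i, items[i]);
        });
    }
    // Block until every chunk has been processed.
    for (auto& w : workers) w.join();
}
```

A call would look like array_walk_multithreaded(values, [](std::size_t i, int& v) { v *= 2; }, 4). Whether this beats a plain loop depends on how expensive the callback is relative to the thread start-up cost raised in the comments above, so it is worth measuring on the real workload.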