
I've written a product-syncing script between a local server running a merchant application and a remote web server hosting the store's e-shop...

For the full sync option I need to sync about 5,000+ products, with their images etc... Even though size variations of the same product (for example shoes that come in different sizes) share the same product image, I still need to check the existence of around 3,500 images...

So, for the first run, I uploaded all product images through FTP except for a couple of them, and let the script run to check whether it would upload those couple of missing images...

The problem is that the script ran for 4 hours, which is unacceptable... I mean, it didn't re-upload every image... It just checked every single image to determine whether to skip it or upload it (through ftp_put()).

I was performing the check like this:

if (stripos(get_headers(DESTINATION_URL . "{$path}/{$file}")[0], '200 OK') === false) {

which is pretty fast, but obviously not fast enough for the sync to finish in a reasonable amount of time...
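In context, each check is one blocking HTTP round trip, executed sequentially inside the main products loop, roughly like this (a simplified sketch; DESTINATION_URL, $ftp, $path and $localDir stand in for values my script already has, and a single flat folder is shown for brevity):

// simplified sketch of the existing sequential per-file check
foreach (scandir($localDir) as $file) {
    if ($file === '.' || $file === '..') {
        continue;
    }
    $headers = get_headers(DESTINATION_URL . "{$path}/{$file}");
    if ($headers === false || stripos($headers[0], '200 OK') === false) {
        // image missing on the remote server, so upload it
        ftp_put($ftp, "{$path}/{$file}", "{$localDir}/{$file}", FTP_BINARY);
    }
}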

How do you people handle situations like this, where you have to check the existence of a HUGE number of remote files?


As a last resort, I'm left with using ftp_nlist() to download a list of the remote files and then writing an algorithm to more or less do a file compare between the local and remote files...

I tried it, and it takes ages, literally 30+ minutes, for the recursive algorithm to build the file list... You see, my files are not in one single folder... The whole tree spans 1,956 folders, and the file list consists of 3,653 product image files and growing... Also note that I didn't even use the size "trick" (used in conjunction with ftp_nlist()) to determine whether an entry is a file or a folder, but instead used the newer ftp_mlsd(), which explicitly returns a type param that holds that info... You can read more here: PHP FTP recursive directory listing
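For reference, the recursive lister is roughly this (a trimmed-down sketch; $ftp stands in for an already-open, logged-in FTP connection):

// trimmed-down sketch of the recursive remote listing via ftp_mlsd() (PHP 7.2+)
function remote_file_list($ftp, string $dir): array
{
    $files = [];
    $entries = ftp_mlsd($ftp, $dir);
    if ($entries === false) {
        return $files;
    }
    foreach ($entries as $entry) {
        // MLSD explicitly reports the entry type: "file", "dir",
        // plus "cdir"/"pdir" for the current and parent directories
        if ($entry['type'] === 'cdir' || $entry['type'] === 'pdir') {
            continue;
        }
        if ($entry['type'] === 'dir') {
            $files = array_merge($files, remote_file_list($ftp, "{$dir}/{$entry['name']}"));
        } else {
            $files[] = "{$dir}/{$entry['name']}";
        }
    }
    return $files;
}

That's one MLSD round trip per folder, so with 1,956 folders the listing alone accounts for those 30+ minutes...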

1 Answer


curl_multi is probably the fastest way. Unfortunately curl_multi is rather difficult to use; an example helps a lot, imo. Checking URLs between two 1 Gbps dedicated servers in two different datacenters in Canada, this script manages to check around 3,000 URLs per second using 500 concurrent TCP connections (and it can be made even faster by re-using curl handles instead of opening and closing them).

<?php
declare(strict_types=1);
$urls=array();
for($i=0;$i<100000;++$i){
    $urls[]="http://ratma.net/";
}
validate_urls($urls, 500, 1000, false, false);
// if return_fault_reason is false, the return is a simple array of strings: the urls that validated.
// otherwise it's an array with the url as the key, containing array(bool validated, int curl_error_code, string reason) for every url
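// for illustration only (hypothetical URLs), a result with return_fault_reason=true could look like:
// array(
//     "http://example.com/img/ok.jpg"      => array(true, 0, "got a http 200 code, which is considered a success"),
//     "http://example.com/img/missing.jpg" => array(false, -1, "got a http 404 code, which is considered an error"),
// )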
function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason = false) : array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (!is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); //?
        }
    }
    unset($foo);
    // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $work = function () use (&$ret, &$workers, &$mh, &$return_fault_reason) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            curl_multi_exec($mh, $still_running);
            if ($still_running < count($workers)) {
                break;
            }
            $cms=curl_multi_select($mh, 10);
            //var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            //echo "NOT FALSE!";
            //var_dump($info);
            {
                if ($info['msg'] !== CURLMSG_DONE) {
                    continue;
                }
                if ($info['result'] !== CURLE_OK) {
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
                    }
                } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
                    }
                } else {
                    $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                    if ($code[0] === "3") {
                        if ($consider_http_300_redirect_as_error) {
                            if ($return_fault_reason) {
                                $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
                            }
                        } else {
                            if ($return_fault_reason) {
                                $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
                            } else {
                                $ret[] = $workers[(int)$info['handle']];
                            }
                        }
                    } elseif ($code[0] === "2") {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " code, which is considered a success");
                        } else {
                            $ret[] = $workers[(int)$info['handle']];
                        }
                    } else {
                        // all non-2xx and non-3xx are always considered errors (500 internal server error, 400 client error, 404 not found, etcetc)
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                        }
                    }
                }
                curl_multi_remove_handle($mh, $info['handle']);
                assert(isset($workers[(int)$info['handle']]));
                unset($workers[(int)$info['handle']]);
                curl_close($info['handle']);
            }
        }
        //echo "NO MORE INFO!";
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            //echo "TOO MANY WORKERS!\n";
            $work();
        }
        $neww = curl_init($url);
        if (!$neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(false, -1, "curl_init() failed");
            }
            continue;
        }
        $workers[(int)$neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_NOBODY => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        //echo "WAITING FOR WORKERS TO BECOME 0!";
        //var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}
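For the image check from the question, the usage would be roughly this (a sketch only; DESTINATION_URL and the $products loop are placeholders for whatever the sync script already has, and concurrency/timeout should be tuned to what the target server can handle, see the comments below):

// rough usage sketch for validate_urls() above: build every candidate image URL
// first, validate them in one batch, then diff to find the missing ones
$urls = array();
foreach ($products as $product) {
    $urls[] = DESTINATION_URL . "{$product['path']}/{$product['image_file']}";
}
$urls = array_values(array_unique($urls));

// start with a modest concurrency (50-100) on shared hosting and a generous timeout
$ok = validate_urls($urls, 100, 2000, false, false);

// everything that did not validate is considered missing and still needs ftp_put()
$missing = array_diff($urls, $ok);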
hanshenrik
  • I'll study the code, try to implement it in my case, and will report back. If the whole job turns out to be overkill, I might end up using `ftp_nlist()`, but I hope I'll manage to use your proposed way! Thanks a lot! – Faye D. Oct 13 '21 at 20:51
  • For the script to work the way it's supposed to, I'll have to gather all 3500+ image URLs in an array and then have your script do the check, right? Otherwise, if I call it while my main products loop runs, it'll run 3500+ times without a benefit in time, right? – Faye D. Oct 13 '21 at 20:57
  • @FayeD. correct. – hanshenrik Oct 13 '21 at 20:58
  • OK, I'll do the necessary modifications to my code and will come back for the verdict! Fingers crossed! :D Thank you very much! – Faye D. Oct 13 '21 at 20:59
  • @FayeD. glhf, btw you probably shouldn't start your tests with 500 connections; some servers can't handle it. In particular, if your target server is a shared webhost, you risk being auto-IP-banned. If your target server is nginx or lighttpd you'd probably be fine, but Apache or IIS may choke. Try starting with 50 or 100, not 500 (unless you know it's safe :P ) – hanshenrik Oct 13 '21 at 21:04
  • I can't tell yet about the production server for when the e-shop gets completed, as I'm on a staging environment while I'm setting it up. My staging environment is indeed a shared HG account, so I'll start low on that setting... Anything around 2-3 mins is OK... 4-5 hrs isn't... Hehehe! Thanks for the heads up! – Faye D. Oct 13 '21 at 21:10
  • Your script is fantastic! It checked 3653 images in under 2 minutes! Basically it's so fast that HG's TOS for CPU usage wasn't able to detect the resources burst at all, as the script ran for under 90 seconds! LOL... Ah, there is a small inconsistency in the arguments expected by `validate_urls()` (4) and the ones you pass in your example to it (5)... Anyway, I'll probably simplify it a bit, but even like that it’s working great! Thank you so much! – Faye D. Oct 14 '21 at 00:49
  • Additionally, although this function you've built is indeed a marvelous piece of code, which you've obviously built after some serious research and thinking, it should be noted that it needs some tweaking in terms of setting the ideal arguments - according to the web server's capabilities - in order for it to return consistent and reliable results... I.e., I ran the command `count(array_diff($urls, validate_urls(array_values(array_unique($urls)), 100, 1000, false, false)))` (echoed via print_r) a dozen times to get – Faye D. Oct 14 '21 at 01:40
  • the supposedly non-existent images, and I got anything between 580 and 605 addresses out of 3,653, where only 1 was truly non-existent. When I increased the timeout value to 1500, I again kept getting around 570 failing addresses. It was only after I set the timeout to 2000 that I got the correct 1 non-existent image file, no matter how many times I ran the command... Nevertheless, I want to thank you once again SO much for sharing this great piece of code! :) – Faye D. Oct 14 '21 at 01:41