
I have an XML file locally. It contains data from a marketplace. It roughly looks like this:

<offer id="2113">
    <picture>https://anotherserver.com/image1.jpg</picture>
    <picture>https://anotherserver.com/image2.jpg</picture>
</offer>
<offer id="2117">
    <picture>https://anotherserver.com/image3.jpg</picture>
    <picture>https://anotherserver.com/image4.jpg</picture>
</offer>
...

What I want is to save the images from those <picture> nodes locally.

There are about 9,000 offers and about 14,000 images.

When I iterate through them, I see that the images are being copied from that other server, but at some point it gives a 504 Gateway Timeout.

The thing is that sometimes the error comes after 2,000 images, sometimes after far more or far fewer.

I tried getting only one image 12,000 times from that server (i.e. only https://anotherserver.com/image3.jpg) but it still gave the same error.

As far as I've read, that other server is blocking my requests after a certain number of them.

I tried using PHP sleep(20) after every 100th image, but it still gave me the same error (sleep(180) likewise). When I tried a local image, but referenced by its full URL, it didn't give any errors. I tried a second (non-local) server and the same thing occurred.

I use the PHP copy() function to fetch the image from that server. I also used file_get_contents() for testing purposes, but got the same error.

I have

set_time_limit(300000);
ini_set('default_socket_timeout', 300000);

as well but no luck.
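
Roughly, the loop looks like this (simplified; offers.xml and the images/ folder are placeholder names, and I'm assuming here the offers can be read with SimpleXML):

// Simplified sketch of the loop: read the local XML and copy every <picture> locally.
$xml = simplexml_load_file('offers.xml');            // placeholder name for the local XML file
$count = 0;

foreach ($xml->xpath('//offer') as $offer) {
    foreach ($offer->picture as $picture) {
        $url  = (string) $picture;
        $dest = 'images/' . basename(parse_url($url, PHP_URL_PATH));

        if (!@copy($url, $dest)) {
            // This is where the failures show up after a while (504 Gateway Timeout).
            error_log("Failed to copy $url");
        }

        // Pausing every 100th image (sleep(20), sleep(180)) made no difference.
        if (++$count % 100 === 0) {
            sleep(20);
        }
    }
}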

Is there any way to do this without chunking requests?

Does this error occur on some particular image? It would be great to catch this error, or just keep track of the response delay and send the next request after some time, if that can be done.

Is there any constant time in seconds that I have to wait in order to get those requests rollin'?

And pls give me non-curl answers if possible.

UPDATE

cURL and exec(wget) didn't work either. They both ran into the same error.

Can the remote server be tweaked so it doesn't block me (if that is what it's doing)?

P.S. If I do: echo "<img src = 'https://anotherserver.com/image1.jpg'" /> in a loop for all 12,000 images, they show up just fine.

temo
  • `if I do: echo "` PS your quotes are wrong here. You have a double quote in the src attribute and not one at the end of the string, which is a syntax error. I suppose it's probably a typo in the question. – ArtisticPhoenix Feb 04 '19 at 19:20
  • Typically I get around things like this by using proxies; that way you can spread the requests over several IP addresses. But this: `And pls give me non-curl answers if possible.` – ArtisticPhoenix Feb 04 '19 at 19:21
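
For reference, a rough sketch of that proxy idea without cURL, using a stream context; the proxy address below is a placeholder, and it assumes the images are also reachable over plain http:

// Hypothetical: route a download through an HTTP proxy with file_get_contents().
$context = stream_context_create(array(
    'http' => array(
        'proxy'           => 'tcp://203.0.113.10:8080', // placeholder proxy address
        'request_fulluri' => true,                      // most HTTP proxies expect the full URL
        'timeout'         => 30,
    ),
));
$data = file_get_contents('http://anotherserver.com/image1.jpg', false, $context);
if ($data !== false) {
    file_put_contents('image1.jpg', $data);
}

Cycling through several such proxies spreads the requests over different IP addresses, which is the point of the suggestion above.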

2 Answers


Since you're accessing content on a server you have no control over, only the server administrators know the blocking rules in place.

But you have a few options, as follows:

  • Run batches of 1000 or so, then sleep for a few hours.
  • Split the requests up between several computers, each requesting part of the information.
  • Maybe even something as simple as changing the requesting user agent info every 1000 or so images would be good enough to bypass the blocking mechanism.
  • Or some combination of all of the above; a rough sketch of the batching and user-agent ideas follows below.
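
A rough, non-cURL sketch of the batching and user-agent ideas above, using file_get_contents() with a stream context; the batch size, pause length and user-agent strings are arbitrary, and $imageURLs is assumed to be the flat list of picture URLs:

// Hypothetical: download in batches, pause between batches,
// and rotate the User-Agent header from batch to batch.
$userAgents = array(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14)',
);
$batchSize = 1000;   // images per batch
$pause     = 3600;   // seconds to wait between batches

foreach (array_chunk($imageURLs, $batchSize) as $i => $batch) {
    $context = stream_context_create(array(
        'http' => array(
            'header'  => 'User-Agent: ' . $userAgents[$i % count($userAgents)] . "\r\n",
            'timeout' => 30,
        ),
    ));
    foreach ($batch as $url) {
        $data = @file_get_contents($url, false, $context);
        if ($data !== false) {
            file_put_contents('images/' . basename(parse_url($url, PHP_URL_PATH)), $data);
        }
    }
    sleep($pause);   // wait before starting the next batch
}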
Difster
  • User Agent didn't help. Batches of 1000 wouldn't either, because waiting a few hours is not an option. When I try to save the files the error occurs, but when I just `echo "<img ...>"` they appear on the browser page. I guess I could try to go over all of them via JavaScript, get base64 of the images and save that data as an image file? Or am I missing something very important? – temo Nov 01 '18 at 08:21
  • I'm curious as to why curl isn't an option for you? Will it display ALL of the images in the browser for you? I recently did a project where I used wget to grab tens of thousands of HTML pages from one site. It wasn't PHP though; I ran it as a shell script. – Difster Nov 01 '18 at 08:28
  • So I could do exec() and run wget and pass the image URLs in variables, and it would work, right? I haven't ever used wget; can I use it like that? I need this to work as fast as possible, and as I've read, cURL is way slower. – temo Nov 01 '18 at 08:44
  • Give me a few and I'll sanitize the script I used and paste it into an answer (it's too long for a comment). – Difster Nov 01 '18 at 08:53
  • I ran a different script to log in and save a cookie, but here's an example. It grabbed entire HTML pages though...

        #!/bin/bash
        m=1
        while [ $m -lt 20350 ]; do
            wget --load-cookies cookies.txt \
                --header="Accept: text/html" \
                --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36" \
                http://somewebsite.com/editcontactform.cfm?id=$m
            let m=m+1
            sleep .25
        done

    – Difster Nov 01 '18 at 08:57
  • I got `wget -P --cut-dirs=3 -np -nH /home/subd.mywebsite.com/images/ -A jpeg,jpg,bmp,gif,png https://anotherdomain.com/images/detailed/27/image3.png` and this creates folders. I want to save only the image file in my directory. I don't know what parameters I should use for that; I have added parameters that I found on Stack Overflow, but none of them is working. Also, when saving a file, can I rename the image and name it however I want? – temo Nov 01 '18 at 09:10
  • Yes, you can rename images however you want. – Difster Nov 01 '18 at 09:13
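
For what it's worth, a sketch of how that could look from PHP with exec(); the destination directory and file name are placeholders, and wget's -O option simply writes the download to the exact file you name:

// Hypothetical: fetch one image with wget and save it under whatever name you choose.
$url  = 'https://anotherdomain.com/images/detailed/27/image3.png';
$dest = '/home/subd.mywebsite.com/images/any-name-you-like.png';   // no extra folders are created
exec('/usr/bin/wget -q -O ' . escapeshellarg($dest) . ' ' . escapeshellarg($url), $output, $status);
if ($status !== 0) {
    error_log("wget failed for $url (exit code $status)");
}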
  • You almost saved my day. I mean, I still have one issue: `exec(wget)` or `exec(/usr/bin/wget)` aren't working. When I run the same command in the terminal, the image is saved as it should be, but with exec() I can't seem to run any command at all. Can you answer that as well? Or should I ask a new question? – temo Nov 01 '18 at 09:52
  • I ran it as a shell command, not in PHP. Maybe that would help. – Difster Nov 01 '18 at 09:56
  • I had a good feeling about this, but it didn't work; with wget the same error occurs. Any other suggestions? – temo Nov 01 '18 at 12:23
  • Nothing I can think of other than using multiple computers (IP addresses) and doing it in batches. How many are you getting at a time? You might only need to wait 30 minutes between batches; it's hard to say. – Difster Nov 01 '18 at 23:00
  • That's the thing. One time I just executed the script and all went well; all of the images were copied. I'm thinking I'm gonna use multiple IP addresses, and I'll post an answer when I have one. Just why are the images displayed when I put echo "<img ...>"? Is it some other kind of request? It is still a request, and it's 12,000 of them anyway... – temo Nov 02 '18 at 06:23
  • @temo Did you rule out whether the echo "<img ...>" trick works in private mode? Maybe it only worked because you had a session cookie stored in your browser. If that's the case, you should be able to use this cookie with curl/wget. – Bolli Feb 05 '19 at 00:25

I would suggest you try the following:

  1. Reuse the previously opened connection using cURL:

$imageURLs = array('https://anotherserver.com/image1.jpg', 'https://anotherserver.com/image2.jpg', ...);
$notDownloaded = array();

// Create one cURL handle and reuse it for every request, so the
// connection to the remote server can be kept alive between downloads.
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);      // don't write response headers into the saved file
curl_setopt($ch, CURLOPT_TIMEOUT, 10);    // give up on a single image after 10 seconds

foreach ($imageURLs as $URL) {
    // Save each image under its original file name in the current directory.
    $filepath = parse_url($URL, PHP_URL_PATH);
    $fp = fopen(basename($filepath), "w");
    curl_setopt($ch, CURLOPT_FILE, $fp);  // stream the response body straight into the file
    curl_setopt($ch, CURLOPT_URL, $URL);
    curl_exec($ch);
    fclose($fp);
    // Remember every URL that came back as 504 so it can be retried later.
    if (curl_getinfo($ch, CURLINFO_RESPONSE_CODE) == 504) {
        $notDownloaded[] = $URL;
    }
}
curl_close($ch);
// check to see if $notDownloaded is empty
  2. If the images are accessible via both https and http, try to use http instead (this will at least speed up the downloading).
  3. Check the response headers when the 504 is returned, as well as when you load the URL in your browser, and make sure there are no X-RateLimit-* headers (see the sketch below). By the way, what are the response headers, actually?
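
A small sketch of how those headers could be inspected without cURL, using get_headers(); the header names below are just the common rate-limiting ones and the remote server may use different ones entirely:

// Hypothetical: ask for the response headers of one image and look for rate-limit hints.
$context = stream_context_create(array('http' => array('method' => 'HEAD')));
$headers = get_headers('https://anotherserver.com/image1.jpg', true, $context);
print_r($headers);   // dump everything the server sends back

foreach (array('X-RateLimit-Limit', 'X-RateLimit-Remaining', 'Retry-After') as $name) {
    if (isset($headers[$name])) {   // note: header-name casing may vary
        $value = is_array($headers[$name]) ? implode(', ', $headers[$name]) : $headers[$name];
        echo "$name: $value\n";
    }
}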
Constantine
  • I'll give this a shot. – temo Feb 05 '19 at 11:12
  • I tried this when you posted but no luck. I moved to other segments of the project and soon have to come back to this one. Hope I will have some answer by that time. – temo Nov 05 '19 at 12:28