
I am trying to download the contents of a robots.txt file.

My original problem is described here: PHP file_exists() for URL/robots.txt returns false

Line 22 is: $f = fopen($file, 'r');

I get this error:

PHP Error[2]: fopen(http://www1.macys.com/robots.txt): failed to open stream: Redirection limit reached, aborting
    in file /host/chapache/host/apache/www/home/flaviuspogacian/proiecte/Mickey_ClosetAffair_Discovery/webroot/protected/modules/crawler/components/Robots.php at line 22
#0 /host/chapache/host/apache/www/home/flaviuspogacian/proiecte/Mickey_ClosetAffair_Discovery/webroot/protected/modules/crawler/components/Robots.php(22): fopen()

This happens with the following code, where $website_id is a number and $website is a URL like http://www.domain.com/:

public function read_website_save_2_db($website_id, $website) {
    $slashes = 0;
    for ($i = 0; $i < strlen($website); $i++)
        if ($website[$i] == '/')
            $slashes++;
    if ($slashes == 2)
        $file = $website . '/robots.txt';
    else
        $file = $website . 'robots.txt';
    echo $website_id . ' ' . $file . PHP_EOL;
    try {
        $f = fopen($file, 'r');
        if (($f) || (strpos(get_headers($file, 1), "404") !== FALSE)) {
            fclose($f);
            echo 'exists' . PHP_EOL;
            $curl_tool = new CurlTool();
            $content = $curl_tool->downloadFile($file, ROBOTS_TXT_FILES . 'robots_' . $website_id . '.txt');
            //if the file exists on local disk, delete it
            if (file_exists(ROBOTS_TXT_FILES . 'robots_' . $website_id . '.txt'))
                unlink(ROBOTS_TXT_FILES . 'robots_' . $website_id . '.txt');
            echo ROBOTS_TXT_FILES . 'robots_' . $website_id . '.txt', $content . PHP_EOL;
            file_put_contents(ROBOTS_TXT_FILES . 'robots_' . $website_id . '.txt', $content);
        } else {
            echo 'maybe it\'s not there' . PHP_EOL;
        }
    } catch (Exception $e) {
        echo 'EXCEPTION ' . $e . PHP_EOL;
    }
}

Ionut Flavius Pogacian
  • You can use [`file_get_contents`](http://www.php.net/manual/en/function.file-get-contents.php) instead of `fopen();fread();fclose();`. Oh, and this won't solve your problem, it's just advice. :) – Leri Aug 15 '12 at 11:35
  • `Redirection limit reached, aborting` sounds like quite a clear error message. The URL is redirecting in a loop apparently. – deceze Aug 15 '12 at 11:36
  • Since it loads OK in a browser with no redirects, try adding a sensible `User-Agent:` header – DaveRandom Aug 15 '12 at 11:41
  • Found the error: it's not `strpos(get_headers($file, 1), "404") !== FALSE`, it's `strpos(get_headers($file, 0), "404") !== FALSE` – Ionut Flavius Pogacian Aug 15 '12 at 11:45
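
Following the suggestions in the comments above (fetch with `file_get_contents` and send a browser-like `User-Agent` header), here is a minimal sketch of what that could look like. The agent string, the URL, and the option values are illustrative assumptions, not something from the original question:

// Sketch only: fetch robots.txt with an explicit User-Agent header, since some
// sites redirect in a loop when no recognisable agent is sent.
$context = stream_context_create(array(
    'http' => array(
        'method'        => 'GET',
        'header'        => "User-Agent: MyCrawler/1.0\r\n", // placeholder agent string
        'max_redirects' => 20,    // how many redirects to follow before giving up
        'ignore_errors' => true,  // return the body even on 4xx/5xx responses
    ),
));

$content = file_get_contents('http://www1.macys.com/robots.txt', false, $context);

if ($content !== false) {
    echo $content . PHP_EOL;
} else {
    echo 'Download failed' . PHP_EOL;
}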

1 Answer


Parts of your code seem messy. I would do something like this instead (but of course don't echo from within the function; it's just for the example):

public function read_website_save_2_db($website_id, $website) {
  // Normalize the trailing slash and build the robots.txt URL
  $url = rtrim($website, '/') . '/robots.txt';
  // @ suppresses the PHP warning; on failure $content is false
  $content = @file_get_contents($url);
  $status = 0;
  $success = false;
  // $http_response_header is populated by the HTTP stream wrapper after the request
  if( !empty($http_response_header) ) {
    foreach($http_response_header as $header) {
      // Look for the status line, e.g. "HTTP/1.1 200 OK"
      if(substr($header, 0, 6) == 'HTTP/1') {
        // Keep everything after the protocol, e.g. "200 OK"
        $status = trim(substr($header, strpos($header, ' '), strlen($header)));
        $success = strnatcasecmp($status, '200 OK') === 0;
        break;
      }
    }
  }
  if(!$success) {
    echo 'Request failed with status '.$status;
  }
  elseif(!$content) {
    echo 'Website responded with empty robots.txt';
  }
  else {
    file_put_contents(ROBOTS_TXT_FILES . 'robots_' . $website_id . '.txt', $content);
    echo 'Wii, we have downloaded a copy of '.$url;
  }
}
xCander
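
For completeness, a hypothetical call to the method above; the class name, the numeric id, and the ROBOTS_TXT_FILES value are assumptions made for the example:

// Hypothetical usage sketch; Robots is the component class named in the question's
// stack trace, and the constant below is an assumed writable directory.
define('ROBOTS_TXT_FILES', '/tmp/robots/');

$robots = new Robots();
$robots->read_website_save_2_db(1, 'http://www.example.com/');
// On success this prints something like:
// Wii, we have downloaded a copy of http://www.example.com/robots.txt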