
I have a script which is essentially a crawler to index news articles. The script works fine on one server (main http server), but I am trying to move it to a dedicated platform and one section will not function.

The part that fails uses a simple function (from SO) to check whether a string (a URL found by the crawler) matches an exclusion list stored locally in a .txt file.

I have used var_dump to confirm that the .txt file is being read, and everything looks OK.

On the new server the check consistently fails to unset or echo any positives, but on the other server everything works.

The important part is as follows:

<?php
ini_set('display_errors', 1);
$linkurl_reg = '/href="http:\/\/metro.co.uk(.+?)"/is';

function endsWith($haystack, $needle)
{
    return $needle === "" || substr($haystack, -strlen($needle)) === $needle;
}

$data = file_get_contents("http://metro.co.uk");
preg_match_all($linkurl_reg, $data, $new_links);

$exclusion_list = explode("\n", file_get_contents('../F/exclusion_list.txt'));

var_dump($exclusion_list); // just to check we got the file ok

for ($i = 0; $i < count($new_links[1]); $i++) {
    for ($ii = 0; $ii < count($exclusion_list); $ii++) {
        if (endsWith($new_links[1][$i], $exclusion_list[$ii])) {
            echo 'unset ';
            unset($new_links[1][$i]);
        } else {
            echo 'not unset ';
        }
    }
}
?>

The strange thing is that if I set the exclusion list to a single value, e.g.

$exclusion_list[0] = "xmlrpc.php"; 

instead of

$exclusion_list = explode("\n",file_get_contents('../F/exclusion_list.txt'));

it will work for that particular string.

If anybody has any ideas, please share them. I have been staring at this for 3 days now and am completely stumped.

Things I have tried:

encoding the $exclusion_list array to UTF before exploding.

encoding the $exclusion_list strings to UTF in the loop

tested the function with normal strings

writing the strings in manually rather than reading them from the array or file_get_contents (annoyingly, this works)

changing the file extension from .txt to various other things

updating the PHP version on the non-working server

replacing "\n" with "\r" and "\n\r" during explode

I have even tried swapping the function for some of the others found on SO. Strangely, I get the same results (it works with strings I define but not with anything retrieved from the exclusion_list file).

For the life of me I have no idea why one would work and not the other.

Current PHP version: 5.4.36-0+deb7u3 (non working server)

Current PHP version: 5.2.17 (working server)

Requested var_dump of $exclusion_list (non-working server):

array(9) {
  [0]=>
  string(6) ".jpeg"
  [1]=>
  string(5) ".jpg"
  [2]=>
  string(5) ".gif"
  [3]=>
  string(5) ".css"
  [4]=>
  string(5) ".xml"
  [5]=>
  string(11) "xmlrpc.php"
  [6]=>
  string(21) "metro.co.uk" target="
  [7]=>
  string(20) "metro.co.uk/osd.xml"
  [8]=>
  string(32) "metro.co.uk/terms/#privacypolicy"
}

Requested var_dump of $exclusion_list (working server):

array(9) {
  [0]=>
  string(5) ".jpeg"
  [1]=>
  string(4) ".jpg"
  [2]=>
  string(4) ".gif"
  [3]=>
  string(4) ".css"
  [4]=>
  string(4) ".xml"
  [5]=>
  string(10) "xmlrpc.php"
  [6]=>
  string(20) "metro.co.uk" target="
  [7]=>
  string(19) "metro.co.uk/osd.xml"
  [8]=>
  string(32) "metro.co.uk/terms/#privacypolicy"
}
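
The one-byte difference between the two dumps (string(6) versus string(5) for ".jpeg") points to an invisible trailing character. A minimal sketch for making such a character visible, assuming the file uses "\r\n" line endings; the sample contents below are made up for illustration:

```php
<?php
// Hypothetical file contents, assuming Windows-style "\r\n" line endings.
$contents = ".jpeg\r\n.jpg\r\nxmlrpc.php";

$lines = explode("\n", $contents);

foreach ($lines as $line) {
    // strlen() counts the invisible "\r"; bin2hex() shows it as a trailing "0d".
    // json_encode() prints control characters in escaped form, e.g. "\r".
    printf("%-16s len=%d hex=%s\n", json_encode($line), strlen($line), bin2hex($line));
}
```

In this sketch the last entry carries no extra byte because no line break follows it, which would be consistent with the final entry being string(32) in both dumps.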

Both servers run Linux, and neither text file was created or edited on a Windows platform.

3 Answers

1

Make sure the lines in your *.txt file are separated by \n, not \r\n, which is what you get if you save the file in a Windows program.

Otherwise, after you explode on "\n", the strings will still end with "\r" and thus will not fulfill the endsWith() condition.

This code should work on both machines:

$exclusion_list = explode("\n",str_replace("\r", "", file_get_contents('../F/exclusion_list.txt')));
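
A minimal sketch of the effect described above, assuming the file was saved with "\r\n" line endings. The file contents and the URL are made up for illustration; endsWith() is the function from the question:

```php
<?php
function endsWith($haystack, $needle)
{
    return $needle === "" || substr($haystack, -strlen($needle)) === $needle;
}

// Hypothetical file contents with "\r\n" line endings (an assumption).
$contents = ".jpg\r\nxmlrpc.php\r\n.css";
$url = "/wp/xmlrpc.php"; // made-up link for illustration

$raw   = explode("\n", $contents);                        // entries keep their "\r"
$clean = explode("\n", str_replace("\r", "", $contents)); // "\r" stripped first

var_dump(endsWith($url, $raw[1]));   // bool(false): needle is "xmlrpc.php\r"
var_dump(endsWith($url, $clean[1])); // bool(true): needle is "xmlrpc.php"
```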
h3n
  • Both servers are Linux, unfortunately. I'll add this to the list of things I tried; there have been so many that I completely forgot I had tried this. Thank you anyway, though! – user3846011 Feb 18 '15 at 17:00
  • It doesn't matter whether the servers are Linux; what matters is which machine the txt file was saved on. Did you try replacing your explode command with the version I gave you above? – h3n Feb 18 '15 at 17:03
  • And if this doesn't work, please put echo $needle . " | " . substr($haystack, -strlen($needle)); into your endsWith() function and give me the result. – h3n Feb 18 '15 at 17:10
  • Honestly, you have no idea how grateful I am. I am not sure why this didn't work when I changed the "\n" in explode, but I would never have thought to check again. You have literally restored my faith in society :-) – user3846011 Feb 18 '15 at 17:17
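
One possible explanation for why changing the explode delimiter to "\r" or "\n\r" did not help, sketched under the assumption that the file uses "\r\n" line endings (the sample contents are made up):

```php
<?php
// Hypothetical file contents, assuming "\r\n" line endings.
$contents = ".jpg\r\nxmlrpc.php\r\n.css";

$byCr   = explode("\r", $contents);   // later entries keep a leading "\n"
$byLfCr = explode("\n\r", $contents); // "\n\r" never occurs, so nothing is split
$byCrLf = explode("\r\n", $contents); // matches the assumed separator

// json_encode() prints control characters in escaped form so they are visible.
echo json_encode($byCr), "\n";
echo json_encode($byLfCr), "\n";
echo json_encode($byCrLf), "\n";
```

Only the "\r\n" delimiter yields clean entries here; stripping "\r" before exploding on "\n", as in the answer, gives the same result for this sample.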
0

If one of your servers or computers is running Windows, you probably have a problem with line endings: \r\n on Windows and \n on Unix (and I think \r on classic Mac OS, but I'm not sure).
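
A hedged sketch of one way to read the list regardless of which of these line-ending styles the file uses. It swaps explode() for preg_split() with the PCRE \R escape, which matches "\r\n", "\n" and "\r"; the helper name readLines and the sample strings are made up for illustration:

```php
<?php
// Hypothetical helper: splits on any common line ending and drops empty lines.
function readLines($contents)
{
    return preg_split('/\R/', $contents, -1, PREG_SPLIT_NO_EMPTY);
}

$windows = ".jpg\r\n.css\r\n"; // "\r\n" line endings
$unix    = ".jpg\n.css\n";     // "\n" line endings
$oldMac  = ".jpg\r.css\r";     // "\r" line endings

var_dump(readLines($windows) === readLines($unix)); // bool(true)
var_dump(readLines($unix) === readLines($oldMac));  // bool(true)
```

PREG_SPLIT_NO_EMPTY also discards the empty string that a trailing line break would otherwise produce, which matters here because endsWith() returns true for an empty needle.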

0

There may be some issue with the file. Try using a different file and check whether the same issue occurs.

parveen