I've been parsing tons of RSS feeds using PHP's simplexml_load_file and it works like a charm. Now I'm trying to do the same for the RSS feed of the Financial Times. When I do...

$rss = simplexml_load_file("http://www.ft.com/rss/world");

... I get:

Warning: simplexml_load_file(): http://www.ft.com/rss/world:11: parser error : Opening and ending tag mismatch: link line 8 and head in rss.php on line 6

Warning: simplexml_load_file(): oat:left;margin-right:20px;margin-top:3px;width:35px;height:31px;}</style></head in rss.php on line 6

Warning: simplexml_load_file(): ^ in rss.php on line 6

Warning: simplexml_load_file(): http://www.ft.com/rss/world:37: parser error : Opening and ending tag mismatch: input line 37 and li in rss.php on line 6

Warning: simplexml_load_file(): ^ in rss.php on line 6

and many, many more warnings (around 100).

I've searched Stack Overflow for answers, but I can't find anything that seems to apply to this case. What am I missing here?

TheBigDoubleA

2 Answers

Some websites only respond properly when the HTTP request carries a User-Agent header. Because PHP's user_agent setting is empty by default (which seems a sane setting privacy-wise), you need to set it before making the request:

ini_set('user_agent', "Godzilla/42.4 (Gabba Gandalf Client 7.3; C128; Z80) Lord of the RSS Weed Edition (KHTML, like Gold Dust Day Gecko) Chrome/97.0.43043.0 Safari/1337.42");

$rss = simplexml_load_file("http://www.ft.com/rss/world");
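
If changing the setting globally is undesirable, the user agent can also be set for a single request through a stream context. The following is only a minimal sketch under a few assumptions: loadFeed is a hypothetical helper name, the 10-second timeout is arbitrary, and it fetches the body with file_get_contents and parses it with simplexml_load_string instead of calling simplexml_load_file directly.

```php
<?php
// Hypothetical helper (not part of PHP): fetch a feed with an explicit
// User-Agent for this one request and parse it, without touching php.ini.
function loadFeed(string $url, string $userAgent): ?SimpleXMLElement
{
    // The 'http' options are used for http:// and https:// URLs.
    $context = stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary choice for this sketch
        ],
    ]);

    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        return null; // the request itself failed
    }

    // Keep libxml quiet while parsing, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml === false ? null : $xml;
}
```

A call such as loadFeed("http://www.ft.com/rss/world", "MyFeedReader/1.0") would then return a SimpleXMLElement, or null if the response is not well-formed XML. The agent string here is a placeholder; whether a particular site accepts it is something to check.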
hakre

Your code works for me here. Try omitting LIBXML_NOWARNING & LIBXML_NOERROR (which suppress any errors you might be getting) to see where it went wrong.
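
To see what the parser objects to without a wall of warnings, the libxml errors can also be collected and inspected. This is a rough sketch, assuming the document is already available as a string; parseWithErrors is a hypothetical helper name.

```php
<?php
// Hypothetical helper: parse a string and return the parsed element (or null)
// together with the messages libxml recorded while parsing.
function parseWithErrors(string $source): array
{
    // Record errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $xml = simplexml_load_string($source);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Leave libxml in the state we found it.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$xml === false ? null : $xml, $messages];
}
```

If the first element of the result is null, the messages in the second element show where parsing failed, for example a tag mismatch inside an HTML head section.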

Othi
  • Have you tried with the FT feed? I omitted the LIBXML flags, yet the result is still the same: var_dump shows false. Please bear in mind that this code works fine for most other feeds... – TheBigDoubleA May 22 '14 at 12:32
  • It appears you're getting HTML from the URL. Try fetching it with file_get_contents and echoing it to see what your web server is receiving. Maybe they're filtering some user agents from fetching their feed. – Othi May 22 '14 at 13:06
  • You're right: I get an HTML page, which is this one: http://www.ft.com/gfdlgjfdglkfjdgd. How can I overcome this? – TheBigDoubleA May 22 '14 at 14:16
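
The check suggested in the comments can be roughed out as a small heuristic that runs before the body is handed to SimpleXML. It is only a sketch: looksLikeHtml is a hypothetical helper, and looking at the first 512 bytes is an arbitrary cut-off that could misfire on unusual feeds.

```php
<?php
// Hypothetical helper: guess whether a response body is an HTML page
// rather than an XML feed by inspecting the start of the body.
function looksLikeHtml(string $body): bool
{
    // Ignore leading whitespace and compare case-insensitively.
    $head = strtolower(substr(ltrim($body), 0, 512));

    return strpos($head, '<!doctype html') !== false
        || strpos($head, '<html') !== false;
}
```

When this returns true for the body fetched from the feed URL, the server has answered with a web page instead of the feed, which fits the user-agent filtering described in the other answer.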