I'm finishing a function to load RSS/XML in PHP using DOMDocument.

In most cases it works perfectly; with some RSS sources, however, it does not.

I also have a general file_get_contents-style function that uses cURL for general web requests, and it works fine too. Using it, I was able to compare the headers of the sources that do and do not work. Unless some functionality or configuration of DOMDocument can easily correct this, I will consider using file_get_contents or cURL to load the raw content and then process it without DOMDocument. Extra code aside, is there any reason DOMDocument::load would be preferred over these other methods?

What works

http://feeds.marketwatch.com/marketwatch/bulletins

Headers:

HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8
Vary: Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site
feedburnerv2: 
Last-Modified: Fri, 28 Oct 2022 16:34:04 GMT
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 28 Oct 2022 17:45:20 GMT
Strict-Transport-Security: max-age=31536000
Accept-CH: Sec-CH-UA-Arch, Sec-CH-UA-Bitness, Sec-CH-UA-Full-Version, Sec-CH-UA-Full-Version-List, Sec-CH-UA-Model, Sec-CH-UA-WoW64, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version
Report-To: {"group":"RaichuFeedServer","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/RaichuFeedServer/external"}]}
Cross-Origin-Opener-Policy: same-origin; report-to="RaichuFeedServer"
Content-Security-Policy: script-src 'report-sample' 'nonce-63RlJD8aykRdOjuRncZTYA' 'unsafe-inline';object-src 'none';base-uri 'self';report-uri /_/RaichuFeedServer/cspreport;worker-src 'self'
Cross-Origin-Resource-Policy: same-site
Permissions-Policy: ch-ua-arch=*, ch-ua-bitness=*, ch-ua-full-version=*, ch-ua-full-version-list=*, ch-ua-model=*, ch-ua-wow64=*, ch-ua-platform=*, ch-ua-platform-version=*
Content-Encoding: gzip
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Length: 1392
Server: GSE

https://news.un.org/feed/subscribe/en/news/region/americas/feed/rss.xml

Headers:

HTTP/2 200 
date: Fri, 28 Oct 2022 17:18:27 GMT
x-content-type-options: nosniff
cache-control: public, max-age=60, s-maxage=600, stale-while-revalidate=3600
x-adv-varnish: Cache-enabled
x-ttl: 3600
access-control-allow-origin: *
content-type: application/rss+xml; charset=utf-8
x-cacheable: YES
content-encoding: gzip
vary: Accept-Encoding
x-varnish: 860816948 847250401
age: 2091
x-varnish-cache: HIT
x-cache-ttl-remaining: 1508.452
x-cache-age: 2091
x-cache-hits: 254
accept-ranges: bytes
content-length: 7148
strict-transport-security: max-age=16000000; includeSubDomains; preload;

What does NOT work

(These files are structurally identical to the ones that work. However, my browser downloads them automatically instead of displaying them as XML documents the way it does for the working URLs, perhaps for reasons visible in the headers.)

https://www.nasdaq.com/feed/rssoutbound?category=Commodities

Headers:

HTTP/2 200 
accept-ranges: bytes
content-language: en
content-type: application/rss+xml
server: nginx
x-age: 0
x-ah-environment: prod
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-generator: Drupal 9 (https://www.drupal.org)
x-request-id: v-22cb399e-56e5-11ed-88ad-9760c11edbc3
x-ua-compatible: IE=edge
content-encoding: gzip
content-length: 4650
expires: Fri, 28 Oct 2022 17:44:43 GMT
cache-control: max-age=0, no-cache, no-store
pragma: no-cache
date: Fri, 28 Oct 2022 17:44:43 GMT
vary: Accept-Encoding
server-timing: cdn-cache; desc=HIT
server-timing: edge; dur=1

--

Other than the Content-Type header (along with the download-versus-display browser behavior), I cannot imagine what would cause DOMDocument::load to fail. When I try to load the URL that does NOT work, it takes about a minute before the if ( !$xml->load( $url ) ) error branch is triggered. This doesn't make sense to me. What is DOMDocument doing during that time? After the minute, these errors are returned:

DOMDocument::load(https://www.nasdaq.com/feed/rssoutbound?category=Commodities): failed to open stream: HTTP request failed!

DOMDocument::load(): I/O warning : failed to load external entity "https://www.nasdaq.com/feed/rssoutbound?category=Commodities"

Changing DOMDocument::load to DOMDocument::loadXML at least returns the error response immediately, along with this additional error: DOMDocument::loadXML(): Start tag expected, '<' not found in Entity, line: 1
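For what it's worth, the loadXML error is probably expected rather than a second symptom: DOMDocument::loadXML() parses the string it is given as markup instead of treating it as a location, so a URL passed to it is read as document text that starts with "h" rather than "<". A minimal sketch of the difference, where the inline feed is made-up sample data and the variable names are only illustrative:

```php
<?php
// Collect libxml errors in memory instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$url = 'https://www.nasdaq.com/feed/rssoutbound?category=Commodities';

// loadXML() treats its argument as XML source, so the URL text itself is parsed.
// No network request is made here, which is why the failure is immediate.
$fromUrlText   = new DOMDocument();
$urlTextParsed = $fromUrlText->loadXML($url);
$urlTextErrors = libxml_get_errors();
libxml_clear_errors();

// Hypothetical sample markup; a real feed body would go here.
$sampleFeed = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<rss version="2.0"><channel><title>Sample</title></channel></rss>';

$fromMarkup   = new DOMDocument();
$markupParsed = $fromMarkup->loadXML($sampleFeed);

var_dump($urlTextParsed); // false: the URL text is not well-formed XML
var_dump($markupParsed);  // true: actual markup parses fine
```

So loadXML only becomes useful once the feed body has already been fetched by some other means.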

I have no problem checking whether the document fails to load and, if so, grabbing the raw content and parsing it by any means necessary, were it not for the unexplained 60-second wait. I would really prefer not to make the user sit through 60 seconds of unexplained "connectivity" issues that clearly are not connectivity issues. If this is going to be a regular occurrence with established RSS/XML providers, I am inclined to use the raw content/parse option instead of DOMDocument::load.
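The raw content/parse option does not have to give up DOMDocument entirely: the body can be fetched with cURL and then handed to DOMDocument::loadXML(). The following is only a sketch under assumptions. The function names fetchFeedBody and parseFeedXml, the timeout values, and the User-Agent string are all made up for illustration and have not been verified against any particular provider.

```php
<?php
/**
 * Fetch a URL with cURL and return the response body, or null on failure.
 * The timeouts and the User-Agent string are illustrative assumptions.
 */
function fetchFeedBody(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,               // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole request
        CURLOPT_ENCODING       => '',              // accept and decode gzip/deflate
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleFeedReader/1.0)',
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}

/**
 * Parse an XML string into a DOMDocument, or return null if it is not well-formed.
 */
function parseFeedXml(string $xml): ?DOMDocument
{
    if (trim($xml) === '') {
        return null; // nothing to parse
    }
    $previous = libxml_use_internal_errors(true); // keep parse errors out of the output
    $doc = new DOMDocument();
    $ok  = $doc->loadXML($xml, LIBXML_NONET);     // do not fetch external entities
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc : null;
}

// Usage sketch (performs a network request, so it is left commented out):
// $body = fetchFeedBody('https://www.nasdaq.com/feed/rssoutbound?category=Commodities');
// $doc  = $body === null ? null : parseFeedXml($body);
```

Keeping the fetch and the parse separate means a slow or blocked request fails after the chosen timeout, and the existing DOM-based processing can stay unchanged.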

Why does DOMDocument::load fail to properly load a syntactically identical file?

John Smith
  • It could be that the server does not like the user agent PHP is sending. That has been happening on other sites lately. I did a test with wget, and that just hangs too. I'm guessing that no matter what you use to download the feed, the user agent will need to be changed. – Jason K Oct 28 '22 at 19:13
  • @JasonK Looks like you may be right. Using file_get_contents instead of DOMDocument::load returns the same error. cURL works, however, perhaps because it simulates a browser request. For no particular reason, I was under the impression that DOMDocument::load is quicker and less resource-intensive than a cURL request. Might this be true? – John Smith Oct 28 '22 at 19:41
  • Take a look at [stream_context_create()](https://www.php.net/manual/en/function.stream-context-create.php). This will allow you to change the header options for the native PHP file interactions. I don't know whether cURL is more process-intensive or not. – Jason K Oct 28 '22 at 19:57
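
A rough sketch of the stream context suggestion from the comments, assuming the user agent really is the cause (that has not been confirmed for this feed). DOMDocument::load() goes through PHP's stream wrappers, and libxml_set_streams_context() sets the context used for the next libxml load. The roughly one-minute wait may also simply be PHP's default_socket_timeout, which defaults to 60 seconds, so the sketch sets an explicit timeout as well. The function names, the User-Agent string, and the timeout value are illustrative assumptions.

```php
<?php
/**
 * Build an HTTP stream context with an explicit User-Agent and timeout.
 * Both default values are illustrative assumptions, not known-good settings.
 * The 'http' options are also used for https:// URLs.
 */
function buildFeedContext(
    string $userAgent = 'Mozilla/5.0 (compatible; ExampleFeedReader/1.0)',
    float $timeoutSeconds = 10.0
) {
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => $timeoutSeconds, // read timeout in seconds
            'header'     => "Accept: application/rss+xml, application/xml, text/xml\r\n",
        ],
    ]);
}

/**
 * Load a feed through DOMDocument::load() using the given stream context.
 * Returns null when the request or the parse fails.
 */
function loadFeedWithContext(string $url, $context): ?DOMDocument
{
    $previous = libxml_use_internal_errors(true); // keep parse errors out of the output
    libxml_set_streams_context($context);         // applies to the next libxml load
    $doc = new DOMDocument();
    $ok  = @$doc->load($url);                     // stream warnings suppressed; failure shows in $ok
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc : null;
}

// Usage sketch (performs a network request, so it is left commented out):
// $doc = loadFeedWithContext(
//     'https://www.nasdaq.com/feed/rssoutbound?category=Commodities',
//     buildFeedContext()
// );
```

If the server still stalls with a browser-like user agent, the shorter timeout at least bounds how long the caller waits before falling back to another method.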