I'm finishing a function that loads RSS/XML in PHP using DOMDocument.
In most cases it works perfectly; however, with some RSS sources it does not.
I also have a general-purpose, file_get_contents-style function that uses cURL for web requests, and it works fine. Using it, I was able to compare the response headers of the sources that do and do not work. Unless some DOMDocument functionality or configuration can easily correct this, I will consider using file_get_contents or cURL to load the raw content and then process it without DOMDocument. Extra code aside, is there any reason DOMDocument::load would be preferred over these other methods?
What works
http://feeds.marketwatch.com/marketwatch/bulletins
Headers:
HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8
Vary: Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site
feedburnerv2:
Last-Modified: Fri, 28 Oct 2022 16:34:04 GMT
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 28 Oct 2022 17:45:20 GMT
Strict-Transport-Security: max-age=31536000
Accept-CH: Sec-CH-UA-Arch, Sec-CH-UA-Bitness, Sec-CH-UA-Full-Version, Sec-CH-UA-Full-Version-List, Sec-CH-UA-Model, Sec-CH-UA-WoW64, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version
Report-To: {"group":"RaichuFeedServer","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/RaichuFeedServer/external"}]}
Cross-Origin-Opener-Policy: same-origin; report-to="RaichuFeedServer"
Content-Security-Policy: script-src 'report-sample' 'nonce-63RlJD8aykRdOjuRncZTYA' 'unsafe-inline';object-src 'none';base-uri 'self';report-uri /_/RaichuFeedServer/cspreport;worker-src 'self'
Cross-Origin-Resource-Policy: same-site
Permissions-Policy: ch-ua-arch=*, ch-ua-bitness=*, ch-ua-full-version=*, ch-ua-full-version-list=*, ch-ua-model=*, ch-ua-wow64=*, ch-ua-platform=*, ch-ua-platform-version=*
Content-Encoding: gzip
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Length: 1392
Server: GSE
https://news.un.org/feed/subscribe/en/news/region/americas/feed/rss.xml
Headers:
HTTP/2 200
date: Fri, 28 Oct 2022 17:18:27 GMT
x-content-type-options: nosniff
cache-control: public, max-age=60, s-maxage=600, stale-while-revalidate=3600
x-adv-varnish: Cache-enabled
x-ttl: 3600
access-control-allow-origin: *
content-type: application/rss+xml; charset=utf-8
x-cacheable: YES
content-encoding: gzip
vary: Accept-Encoding
x-varnish: 860816948 847250401
age: 2091
x-varnish-cache: HIT
x-cache-ttl-remaining: 1508.452
x-cache-age: 2091
x-cache-hits: 254
accept-ranges: bytes
content-length: 7148
strict-transport-security: max-age=16000000; includeSubDomains; preload;
What does NOT work
(These files are structurally identical to the ones that work. However, my browser downloads them automatically instead of displaying them as XML documents as it does for the working URLs, perhaps for the reasons shown in the headers.)
https://www.nasdaq.com/feed/rssoutbound?category=Commodities
Headers:
HTTP/2 200
accept-ranges: bytes
content-language: en
content-type: application/rss+xml
server: nginx
x-age: 0
x-ah-environment: prod
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-generator: Drupal 9 (https://www.drupal.org)
x-request-id: v-22cb399e-56e5-11ed-88ad-9760c11edbc3
x-ua-compatible: IE=edge
content-encoding: gzip
content-length: 4650
expires: Fri, 28 Oct 2022 17:44:43 GMT
cache-control: max-age=0, no-cache, no-store
pragma: no-cache
date: Fri, 28 Oct 2022 17:44:43 GMT
vary: Accept-Encoding
server-timing: cdn-cache; desc=HIT
server-timing: edge; dur=1
--
Other than the Content-Type header (along with the auto-download vs. display browser behavior), I cannot imagine what would cause DOMDocument::load to fail. When I try to load the URL that does NOT work, it takes about a minute before the if ( !$xml->load( $url ) ) error branch is triggered. This doesn't make sense to me. What is DOMDocument doing during that time? After the minute, these errors are returned:
DOMDocument::load(https://www.nasdaq.com/feed/rssoutbound?category=Commodities): failed to open stream: HTTP request failed!
DOMDocument::load(): I/O warning : failed to load external entity "https://www.nasdaq.com/feed/rssoutbound?category=Commodities"
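The roughly one-minute wait matches PHP's default default_socket_timeout of 60 seconds, which suggests the request hangs until the stream times out rather than DOMDocument doing any work of its own. A minimal sketch of bounding that wait, and of sending a browser-like User-Agent in case the server ignores PHP's default request, might look like this. The function names make_feed_context and load_feed, the 10-second timeout, and the User-Agent string are assumptions for illustration; whether the User-Agent is what this server objects to is a guess, not something the headers above confirm.

```php
<?php
// Hypothetical helper: builds the HTTP stream context libxml will use.
// The timeout and User-Agent values are illustrative assumptions.
function make_feed_context(
    int $timeoutSeconds = 10,
    string $userAgent = 'Mozilla/5.0 (compatible; FeedLoader/1.0)'
) {
    return stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds, // read timeout in seconds
            'user_agent' => $userAgent,
            'header'     => "Accept: application/rss+xml, application/xml, text/xml\r\n",
        ],
    ]);
}

// Hypothetical helper: loads a feed with DOMDocument::load under that context.
// Returns the document on success, or null on failure.
function load_feed(string $url, int $timeoutSeconds = 10): ?DOMDocument
{
    // Apply the context to the next libxml document load.
    libxml_set_streams_context(make_feed_context($timeoutSeconds));

    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    $xml = new DOMDocument();
    $ok  = @$xml->load($url); // @ hides the stream-level "failed to open stream" warning

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $xml : null;
}
```

With this in place, a call such as load_feed($url) should give up after about the configured timeout instead of the default 60 seconds, so the fallback path can start sooner.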
Changing DOMDocument::load to DOMDocument::loadXML at least returns the error response immediately, along with this one: DOMDocument::loadXML(): Start tag expected, '<' not found in Entity, line: 1
I have no problem checking whether the document fails to load and, if it does, grabbing the raw content and parsing it "by any means necessary", were it not for the unexplained 60-second wait. I would really prefer not to make the user sit through 60 seconds of apparent "connectivity" issues that clearly aren't. If this is going to be a regular occurrence with established RSS/XML providers, I would rather skip DOMDocument::load entirely and use the raw content/parse option.
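DOMDocument::loadXML parses a string of markup rather than fetching a URL, so handing it the URL itself would produce exactly this kind of "Start tag expected" error. A minimal sketch of the raw-content route mentioned earlier, assuming the cURL extension is available, is to fetch the body with cURL and then pass that string to loadXML. The function names fetch_raw and parse_feed_string, the timeouts, and the User-Agent string are hypothetical.

```php
<?php
// Hypothetical helper: fetches the response body with cURL.
// Returns the body as a string, or null if the request fails.
function fetch_raw(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,     // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole request
        CURLOPT_ENCODING       => '',    // accept and decode gzip/deflate
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; FeedLoader/1.0)', // assumed value
    ]);
    $body = curl_exec($ch);

    return is_string($body) ? $body : null;
}

// Hypothetical helper: parses an XML string with DOMDocument::loadXML.
// Returns the document on success, or null if the string is not well-formed XML.
function parse_feed_string(string $raw): ?DOMDocument
{
    if (trim($raw) === '') {
        return null; // loadXML rejects an empty string
    }

    // Collect parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    $xml = new DOMDocument();
    $ok  = $xml->loadXML($raw);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $xml : null;
}
```

Usage would be along the lines of $raw = fetch_raw($url); followed by $xml = ($raw !== null) ? parse_feed_string($raw) : null;, which keeps DOMDocument for parsing while leaving the HTTP request under cURL's control.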
Why does DOMDocument::load fail to properly load a syntactically identical file?