1

The goal:

  • import an external XML file (for this example, it's inline)
  • get the <loc> value, save into variable
  • find the <xhtml:link> that has the hreflang="fr-ca" attribute, get the href value, save into variable
  • insert both in the DB

Problem I have: I cannot get PHP to even recognize that xhtml:link is a child node of the <url> item; even when I simply print the nodeValue of the <url>, it omits all <xhtml:link> child nodes.

Code I am using/tried:

<?php
$xml = <<< XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <loc>https://www.example.com/ca/en/cat/categories/series/07660/</loc>
  <lastmod>2018-11-07</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="en-ae" href="https://www.example.com/ae/en/cat/categories/series/07660/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="de-at" href="https://www.example.com/at/de/cat/07660/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="en-au" href="https://www.example.com/au/en/cat/categories/series/07660/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="en-ca" href="https://www.example.com/ca/en/cat/categories/series/07660/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="fr-ca" href="https://www.example.com/ca/fr/cat/categories/series/07660/" />
</url>
<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <loc>https://www.example.com/ca/en/cat/categories/series/07683/</loc>
  <lastmod>2018-11-07</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="en-ae" href="https://www.example.com/ae/en/cat/categories/series/07683/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="de-at" href="https://www.example.com/at/de/cat/07683/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="en-au" href="https://www.example.com/au/en/cat/categories/series/07683/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="fr-be" href="https://www.example.com/be/fr/collections/07683/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="nl-be" href="https://www.example.com/be/nl/collecties/07683/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="en-bh" href="https://www.example.com/bh/en/cat/07683/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="en-ca" href="https://www.example.com/ca/en/cat/categories/series/07683/" />
  <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="fr-ca" href="https://www.example.com/ca/fr/cat/categories/series/07683/" />
</url>
</urlset>
XML;

$urlsxml = new DOMDocument;
$urlsxml->loadXML($xml);
$urls = $urlsxml->getElementsByTagName('url');

for ($i = 0; $i < $urls->length; $i++) {

      echo $urls->item($i)->nodeValue;
      echo $urls->getElementsByTagName("xhtml:link")->attributes->getNamedItem("hreflang")->nodeValue;

      // INSERT INTO DB

}

?>

Out of ideas; any help would be appreciated.

taketheleap

3 Answers

1

The XML uses two namespaces: http://www.sitemaps.org/schemas/sitemap/0.9 without a prefix (the default namespace) and http://www.w3.org/1999/xhtml with the prefix xhtml. To read XML with namespaces you should use the *NS variants of the DOM methods.

$urls = $urlsxml->getElementsByTagNameNS(
  'http://www.sitemaps.org/schemas/sitemap/0.9', 'url'
);

$urls[$i]->getElementsByTagNameNS('http://www.w3.org/1999/xhtml', 'link');

The first argument is the namespace URI, the second argument is the local name (the node name without the prefix). It would be a good idea to use a constant/variable for the namespace URIs in this case.
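Put together, the *NS approach could look like the following. This is a minimal sketch, assuming a shortened copy of the sitemap from the question; the constant names NS_SITEMAP and NS_XHTML and the $rows array are only illustrative.

```php
<?php
// Namespace URIs kept in constants (illustrative names).
const NS_SITEMAP = 'http://www.sitemaps.org/schemas/sitemap/0.9';
const NS_XHTML   = 'http://www.w3.org/1999/xhtml';

// Shortened stand-in for the sitemap from the question.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
  <loc>https://www.example.com/ca/en/cat/categories/series/07660/</loc>
  <xhtml:link rel="alternate" hreflang="en-ca" href="https://www.example.com/ca/en/cat/categories/series/07660/" />
  <xhtml:link rel="alternate" hreflang="fr-ca" href="https://www.example.com/ca/fr/cat/categories/series/07660/" />
</url>
</urlset>
XML;

$document = new DOMDocument;
$document->loadXML($xml);

$rows = [];
foreach ($document->getElementsByTagNameNS(NS_SITEMAP, 'url') as $url) {
    // <loc> lives in the sitemap namespace
    $locNode = $url->getElementsByTagNameNS(NS_SITEMAP, 'loc')->item(0);
    $row = ['loc' => $locNode ? trim($locNode->textContent) : '', 'fr-ca' => ''];

    // <xhtml:link> lives in the XHTML namespace; pick the fr-ca alternate
    foreach ($url->getElementsByTagNameNS(NS_XHTML, 'link') as $link) {
        if ($link->getAttribute('hreflang') === 'fr-ca') {
            $row['fr-ca'] = $link->getAttribute('href');
            break;
        }
    }
    $rows[] = $row;
}
print_r($rows);
```

Each entry of $rows then holds the two values that would go into the DB; a url without an fr-ca alternate keeps an empty string.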

A more comfortable option is XPath. It allows you to use location paths and conditions to fetch nodes.

$document = new DOMDocument;
$document->loadXML($xml);
// create an xpath instance for the document
$xpath = new DOMXPath($document);
// register the namespaces for your own prefixes
$xpath->registerNamespace('s', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

// iterate all sitemap url elements
foreach ($xpath->evaluate('//s:url') as $url) {
  $data = [
    // get the sitemap loc child element as a string
    'loc' => $xpath->evaluate('string(s:loc)', $url),
    // get the href attribute of the xhtml link element (with language condition)
    'fr-ca' => $xpath->evaluate('string(x:link[@hreflang="fr-ca"]/@href)', $url),
  ];
  var_dump($data);
}

Output:

array(2) { 
  ["loc"]=> 
  string(58) "https://www.example.com/ca/en/cat/categories/series/07660/" 
  ["fr-ca"]=> 
  string(58) "https://www.example.com/ca/fr/cat/categories/series/07660/" 
} 
array(2) { 
  ["loc"]=> 
  string(58) "https://www.example.com/ca/en/cat/categories/series/07683/" 
  ["fr-ca"]=> 
  string(58) "https://www.example.com/ca/fr/cat/categories/series/07683/" 
}

The string() function in XPath casts the first node in a list into a string. It allows you to avoid the explicit access to the node object properties. For example, $xpath->evaluate('s:loc', $url)->item(0)->textContent; can be written as $xpath->evaluate('string(s:loc)', $url);. Unlike the property access, the XPath cast will not fail with an error if no matching node exists; it returns an empty string.
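The difference shows up when a url has no fr-ca alternate. A small sketch, assuming a hypothetical one-entry sitemap that only carries an fr-be link; the variable names $viaString and $viaNode are made up for the illustration.

```php
<?php
// Hypothetical sitemap entry without an fr-ca alternate.
$xml = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
  <loc>https://www.example.com/be/fr/collections/07683/</loc>
  <xhtml:link rel="alternate" hreflang="fr-be" href="https://www.example.com/be/fr/collections/07683/" />
</url>
</urlset>
XML;

$document = new DOMDocument;
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('s', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

$url = $xpath->evaluate('//s:url')->item(0);

// string() on an empty node list yields '' instead of failing
$viaString = $xpath->evaluate('string(x:link[@hreflang="fr-ca"]/@href)', $url);

// without string() the list is empty, so item(0) is null and must be guarded
$node = $xpath->evaluate('x:link[@hreflang="fr-ca"]/@href', $url)->item(0);
$viaNode = $node !== null ? $node->nodeValue : null;

var_dump($viaString, $viaNode); // string(0) "" and NULL
```

With the cast, the missing link simply becomes an empty string that can be checked before inserting into the DB.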

ThW
0

Inserting into your DB is beyond the scope of the code here, but parsing the XML can be as simple as the following. It reads a locally saved copy of the XML rather than using the heredoc syntax; the file name is only for identification.

Initially I thought this would require the namespace to be registered and used in the XPath expressions, but that was not the case. A simple XPath query for each url node was sufficient, using the parent url node as the context node for the query. This works because DOMXPath::query() by default registers the namespaces that are in scope on the context node, so the document's own xhtml prefix can be used in the expression.

$file='so-stack-xml-namespace.xml';


libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=true;
$dom->recover=true;
$dom->strictErrorChecking=true;
$dom->load( $file );
libxml_clear_errors();

$xp=new DOMXPath( $dom );

$urls=$dom->getElementsByTagName('url');
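foreach( $urls as $url ){
    /* nodeValue of the url node would concatenate the text of every child, so read loc directly */
    $href=$url->getElementsByTagName('loc')->item(0)->nodeValue;
    /* item(0) is null when the url has no fr-ca alternate */
    $link=$xp->query('xhtml:link[@hreflang="fr-ca"]',$url)->item(0);
    $frca=$link ? $link->getAttribute('href') : '';
    /* do something with the variables...add to DB */
    printf('href:%s<br />frca:%s<br /><br />', $href,$frca);
}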
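Using the xhtml prefix without registering it depends on the prefix the document happens to declare. A minimal sketch of the more defensive variant, assuming a hypothetical sitemap that binds the XHTML namespace to h instead of xhtml; registering the prefix on the DOMXPath object keeps the same query working. The $found array is only there for the illustration.

```php
<?php
// Hypothetical sitemap that uses the prefix "h" for the XHTML namespace.
$xml = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:h="http://www.w3.org/1999/xhtml">
<url>
  <loc>https://www.example.com/ca/en/cat/categories/series/07660/</loc>
  <h:link rel="alternate" hreflang="fr-ca" href="https://www.example.com/ca/fr/cat/categories/series/07660/" />
</url>
</urlset>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml);

$xp = new DOMXPath($dom);
// bind our own prefix to the namespace URI, independent of the document's prefix
$xp->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

$found = [];
foreach ($dom->getElementsByTagName('url') as $url) {
    $link = $xp->query('xhtml:link[@hreflang="fr-ca"]', $url)->item(0);
    $found[] = $link !== null ? $link->getAttribute('href') : null;
}
var_dump($found);
```

The expression matches on the namespace URI behind the prefix, so the h:link element is found even though the query says xhtml:link.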
Professor Abronsius
-1

If you read your XML file into a variable, you can extract the values with plain string functions in a loop:

$xml = file_get_contents("your_xml_file");
$tags = explode("<", $xml);
$loc = "not found";
$frhref = "not found";

foreach ($tags as $tag){
    if(strpos($tag, "loc>") === 0){
        $loc = substr($tag, 4);
    }
    if(strpos($tag, "xhtml:link") === 0){
        $at = strpos($tag, "hreflang") + 9;
        $lang = substr($tag, $at, 7);
        if($lang == '"fr-ca"'){
            $at = strpos($tag, "href=") + 6;
            $_href = substr($tag, $at);
            $until = strpos($_href, '"');
            $frhref = substr($_href, 0, $until);
        }
    }
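    if(strpos($tag, "/url>") === 0){
        // end of one url entry: output the pair (put them in your db), then reset
        echo $loc, " ", $frhref, "\n";
        $loc = $frhref = "not found";
    }
}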

I tested it with your content: https://3v4l.org/1laON

Marco somefox