Symfony's DomCrawler does not find a specific tag

Question

I'm using DomCrawler to get data from Google Play pages. It works in 99% of cases, but I stumbled upon a page where it cannot find a specific div. I checked the HTML and the div is definitely there. My code is:

$autoloader = require __DIR__.'/vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;

$app_id = 'com.balintinfotech.sinhalesekeyboardfree';

$response = file_get_contents('https://play.google.com/store/apps/details?id='.$app_id);
$crawler = new Crawler($response);
echo $crawler->filter('div[itemprop="datePublished"]')->text();

When I run that specific page I get

PHP Fatal error: Uncaught InvalidArgumentException: The current node list is empty.

However, if I use any other ID, I get the desired result. What exactly is it about that page that breaks DomCrawler?

Does this only happen on this one page for you? I was able to get it working: `14 de marzo de 2017` (by just copy/pasting your code) — ishegg, Sep 13 '17 at 19:50
@ishegg Just on this page. I see you got your result in Spanish, so this only affects the English page. — John Baker, Sep 13 '17 at 19:57
@ishegg can you try using the following URL `https://play.google.com/store/apps/details?id=com.balintinfotech.sinhalesekeyboardfree&hl=en` — John Baker, Sep 13 '17 at 20:00

score 1 · Accepted Answer · answered Sep 13 '17 at 20:10

As you correctly figured out, this happens in the English version but not in the Spanish one.

One difference I could spot was a comment by a user saying නියමයි ඈ. Something there seems to be bothering the Crawler. If you replace the null character (\x00) with an empty string, it correctly gets what you're looking for:

<?php
require __DIR__.'/vendor/autoload.php'; // Composer autoloader, needed for the Crawler class

$app_id = 'com.balintinfotech.sinhalesekeyboardfree';
$response = file_get_contents('https://play.google.com/store/apps/details?hl=en&id='.$app_id);
// Strip null bytes before the HTML reaches the parser
$response = str_replace("\x00", "", $response);
$crawler = new Symfony\Component\DomCrawler\Crawler($response);
var_dump($crawler->filter('div[itemprop="datePublished"]')->text()); // string(14) "March 14, 2017"

I'll try to look more into this.
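
To confirm that a fetched page actually contains null bytes before stripping them, something like the following could be used. This is only a minimal sketch: `countNullBytes()` and `stripNullBytes()` are hypothetical helper names (not part of Symfony or PHP), and `$sample` is a made-up string standing in for the Google Play response.

```php
<?php
// Hypothetical helpers; the names are illustrative, not part of Symfony or PHP.

/** Count the NUL bytes (\x00) in a fetched HTML string. */
function countNullBytes(string $html): int
{
    return substr_count($html, "\x00");
}

/** Return the HTML with every NUL byte removed. */
function stripNullBytes(string $html): string
{
    return str_replace("\x00", '', $html);
}

// Made-up sample standing in for the real response: one NUL byte inside a review.
$sample = '<p>review' . "\x00" . 'text</p>'
        . '<div itemprop="datePublished">March 14, 2017</div>';

$before = countNullBytes($sample);
$clean  = stripNullBytes($sample);
$after  = countNullBytes($clean);

echo "NUL bytes before: $before, after: $after\n";
```

In the real script, `$clean` would be what gets passed to `new Crawler(...)`; a non-zero count before cleaning would suggest the page is affected by the same problem.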

answered by ishegg

Nice catch, I wonder if it's a bug in DomCrawler. Had to delete my previous reply, as encoding to UTF-8 did not actually work. – John Baker Sep 13 '17 at 21:02
It's not. Notice that it's `file_get_contents()` that truncates the result when it finds the null character; `DomCrawler` is doing its job just fine. So the problem seems to be on the PHP side of things. It might even go deeper. – ishegg Sep 13 '17 at 23:24
It doesn't get truncated on my end. I get the whole HTML. – John Baker Sep 13 '17 at 23:33
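
One way to check locally whether a string returned by `file_get_contents()` stops at a null byte is to compare its length with the offset of the first `\x00`. The sketch below assumes a temporary file as a stand-in for the HTTP response, so it only shows the behaviour for local files and says nothing definitive about the `https://` wrapper; the payload and variable names are made up for illustration.

```php
<?php
// Write a string containing a NUL byte to a temp file, then read it back.
$payload = 'before' . "\x00" . 'after';          // 6 + 1 + 5 = 12 bytes
$path    = tempnam(sys_get_temp_dir(), 'nul');   // temp file standing in for the response
file_put_contents($path, $payload);
$readBack = file_get_contents($path);
unlink($path);

// If reading stopped at the NUL byte, strlen($readBack) would equal $nulOffset.
$nulOffset = strpos($readBack, "\x00");
printf(
    "bytes written: %d, bytes read: %d, NUL at offset: %d\n",
    strlen($payload),
    strlen($readBack),
    $nulOffset
);
```

Running the same `strlen()` / `strpos()` comparison on the real `$response` would show whether the HTML continues past the first null byte on a given setup.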
