I'm using DomCrawler to get data from a Google Play page. It works in 99% of cases, but I stumbled upon a page where it cannot find a specific div. I checked the HTML and the div is definitely there. My code is:

$autoloader = require __DIR__.'/vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;

$app_id = 'com.balintinfotech.sinhalesekeyboardfree';

$response = file_get_contents('https://play.google.com/store/apps/details?id='.$app_id);
$crawler = new Crawler($response);
echo $crawler->filter('div[itemprop="datePublished"]')->text();

When I run it against that specific page, I get:

PHP Fatal error: Uncaught InvalidArgumentException: The current node list is empty.

However, if I use any other ID, I get the desired result. What exactly is it about that page that breaks DomCrawler?

John Baker
  • Does this only happen on this one page for you? I was able to get it working: `14 de marzo de 2017` (by just copy/pasting your code) – ishegg Sep 13 '17 at 19:50
  • @ishegg Just on this page. I see you got your result in Spanish, so this only affects the English page. – John Baker Sep 13 '17 at 19:57
  • @ishegg Can you try using the following URL? `https://play.google.com/store/apps/details?id=com.balintinfotech.sinhalesekeyboardfree&hl=en` – John Baker Sep 13 '17 at 20:00

1 Answer

As you correctly figured out, this happens in the English version but not in the Spanish one.

One difference I could spot was a comment by a user saying නියමයි ඈ. Something there seems to trip up the Crawler. If you replace the null character (\x00) with an empty string, it correctly finds what you're looking for:

<?php
$app_id = 'com.balintinfotech.sinhalesekeyboardfree';
$response = file_get_contents('https://play.google.com/store/apps/details?hl=en&id='.$app_id);
$response = str_replace("\x00", "", $response);
$crawler = new Symfony\Component\DomCrawler\Crawler($response);
var_dump($crawler->filter('div[itemprop="datePublished"]')->text()); // string(14) "March 14, 2017"

I'll try to look more into this.
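
To check whether a given response actually contains null bytes before handing it to the Crawler, something along these lines might help. It is only a sketch: `firstNullByteOffset()` and `stripNullBytes()` are hypothetical helper names, and `$sample` is a made-up stand-in for the fetched page, not the real Google Play markup.

```php
<?php
// Hypothetical helpers for illustration; they are not part of DomCrawler or PHP.

/** Returns the byte offset of the first NUL byte, or null if there is none. */
function firstNullByteOffset(string $html): ?int
{
    $pos = strpos($html, "\x00");
    return $pos === false ? null : $pos;
}

/** Removes every NUL byte from the markup (same str_replace as above). */
function stripNullBytes(string $html): string
{
    return str_replace("\x00", '', $html);
}

// Made-up sample: a NUL byte sits in a comment before the div we want.
$sample = "<p>review\x00</p><div itemprop=\"datePublished\">March 14, 2017</div>";

$offset = firstNullByteOffset($sample);
$clean  = stripNullBytes($sample);

// Report where the first NUL is and how many bytes were removed.
printf("first NUL at: %s\n", $offset === null ? 'none' : $offset);
printf("bytes before/after: %d/%d\n", strlen($sample), strlen($clean));
```

Running the same check on the real `$response` right after `file_get_contents()` should show whether the fetched string already ends at the first null byte or whether the full HTML arrives and the trouble only starts at parsing.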

ishegg
  • Nice catch, I wonder if it's a bug in DomCrawler. Had to delete my previous reply, as encoding to UTF-8 did not actually work. – John Baker Sep 13 '17 at 21:02
  • It's not. Notice it's `file_get_contents()` that truncates the result when it finds the null character, `DomCrawler` is doing its job just fine. So the problem seems to be on the PHP side of things. It might even go deeper. – ishegg Sep 13 '17 at 23:24
  • It doesn't get truncated on my end. I get the whole HTML. – John Baker Sep 13 '17 at 23:33