I would like to extract all the team names and links from this page:

https://www.transfermarkt.fr/ligue-1/startseite/wettbewerb/FR1

I'm using DOMXPath to match elements, but the following code does not return anything.

function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER,false);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$url = 'https://www.transfermarkt.fr/ligue-1/startseite/wettbewerb/FR1';
$html = get_data($url);

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);

foreach($xpath->query('//*[contains(concat( " ", @class, " " ), concat( " ", "hide-for-pad", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "tooltipstered", " " ))]') as $v) {
    echo $v->getAttribute("href") . PHP_EOL;
}

Do you know why?

Thanks for any help.

PacPac

1 Answer

Sometimes the HTML is altered by JavaScript once the page is loaded. I've had a look at the page and I think you should be able to get the details with the following (please check that the correct URL is displayed)...

$teams = $xpath->query('//td[@class="zentriert no-border-rechts"]/a[contains(concat( " ", @class, " " ), concat( " ", "vereinprofil_tooltip", " " ))]');
foreach($teams as $v) {
    echo $v->getAttribute("href") . " - ";
    echo $v->firstChild->getAttribute("alt").PHP_EOL;
}

This may give duplicates, so one option is to build a list of the teams and URLs like this...

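// $teams still holds the query result from above, so collect into a separate array.
$teamList = [];
foreach($teams as $v) {
    // Keying by team name means a repeated team overwrites its earlier entry.
    $teamList[$v->firstChild->getAttribute("alt")] = $v->getAttribute("href");
}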

This gives you an array with the team name as the key and the URL as the value.
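
As a minimal sketch of the de-duplication idea, the snippet below runs the same kind of query against a small inline HTML sample instead of the live page. The markup in `$sampleHtml`, the team names and URLs, and the variable name `$teamList` are all assumptions made for illustration; the real page's structure may differ or change.

```php
<?php
// Hypothetical markup: every team link appears twice, which is one way
// the query could return 40 results for 20 teams.
$sampleHtml = <<<HTML
<table>
  <tr>
    <td class="zentriert no-border-rechts"><a class="vereinprofil_tooltip" href="/team-a/startseite/verein/1"><img alt="Team A" src="a.png"></a></td>
    <td class="zentriert no-border-rechts"><a class="vereinprofil_tooltip" href="/team-b/startseite/verein/2"><img alt="Team B" src="b.png"></a></td>
  </tr>
  <tr>
    <td class="zentriert no-border-rechts"><a class="vereinprofil_tooltip" href="/team-a/startseite/verein/1"><img alt="Team A" src="a.png"></a></td>
    <td class="zentriert no-border-rechts"><a class="vereinprofil_tooltip" href="/team-b/startseite/verein/2"><img alt="Team B" src="b.png"></a></td>
  </tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($sampleHtml);
$xpath = new DOMXPath($dom);

// Same idea as the query above: links carrying the vereinprofil_tooltip class token.
$links = $xpath->query(
    '//td[@class="zentriert no-border-rechts"]'
    . '/a[contains(concat(" ", @class, " "), " vereinprofil_tooltip ")]'
);

$teamList = [];
foreach ($links as $link) {
    // Look the <img> up by tag name so stray whitespace nodes cannot break the lookup.
    $img = $link->getElementsByTagName('img')->item(0);
    if ($img === null) {
        continue;
    }
    // A repeated team name overwrites the earlier entry, so each team is kept once.
    $teamList[$img->getAttribute('alt')] = $link->getAttribute('href');
}

echo $links->length . ' links, ' . count($teamList) . ' unique teams' . PHP_EOL;
print_r($teamList);
```

With this sample, the four matched links collapse to two entries. Keying on the team name assumes names are unique; keying on the `href` instead would be a reasonable alternative if they are not.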

Nigel Ren
  • It works. Thanks. Could you please explain why your query works and mine doesn't? Also, with your current code I get 40 teams instead of 20. – PacPac Jun 18 '19 at 15:37
  • Rather than looking at the HTML through a browser, I will always save the HTML retrieved and look through that for markers (such as the classes for the various elements). I looked for the string `tooltipstered` in the HTML and I can't find it anywhere, so there is obviously something going on there. In my code I've found something which looks as though it matches. There is a problem in that if the server code is changed, then this may fail at any point. – Nigel Ren Jun 18 '19 at 15:41
  • Last question: why doesn't this query work? `//td[@class="zentriert no-border-rechts"]/a[@class="vereinprofil_tooltip tooltipstered"]` – PacPac Jun 18 '19 at 15:57
  • If you look at the source code of the HTML (at least the bit I'm looking at) only has ` – Nigel Ren Jun 18 '19 at 17:50
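
The diagnosis discussed in the comments above can be sketched roughly as follows: search the HTML the server actually sends for the class name, then compare an exact `@class` match with a token-based match. The inline `$rawHtml` is a hypothetical stand-in for the output of `get_data($url)`, assuming the server-sent markup lacks `tooltipstered` (as reported above), and `countMatches` is a helper name made up for this example.

```php
<?php
// Hypothetical stand-in for get_data($url): markup as sent by the server,
// before any JavaScript has run in a browser.
$rawHtml = '<table><tr><td class="zentriert no-border-rechts">'
    . '<a class="vereinprofil_tooltip" href="/team-a/startseite/verein/1">'
    . '<img alt="Team A" src="a.png"></a></td></tr></table>';

// Step 1: check whether the class the original query relied on is present at all.
$hasTooltipstered = strpos($rawHtml, 'tooltipstered') !== false;
var_dump($hasTooltipstered);

$dom = new DOMDocument();
$dom->loadHTML($rawHtml);
$xpath = new DOMXPath($dom);

// Hypothetical helper: number of nodes an XPath expression selects.
function countMatches(DOMXPath $xpath, string $expression): int
{
    return $xpath->query($expression)->length;
}

$cell = '//td[@class="zentriert no-border-rechts"]';

// Exact comparison: the whole class attribute must equal this string,
// so it fails when "tooltipstered" is missing from the attribute.
$exact = countMatches($xpath, $cell . '/a[@class="vereinprofil_tooltip tooltipstered"]');

// Token comparison: only requires that one class token is present.
$token = countMatches(
    $xpath,
    $cell . '/a[contains(concat(" ", @class, " "), " vereinprofil_tooltip ")]'
);

echo "exact: $exact, token: $token" . PHP_EOL;
```

On this sample the exact match selects nothing while the token match selects the link. In practice, writing the fetched HTML to a file with `file_put_contents` and searching it in an editor gives the same kind of check against the real response.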