I have an app that automatically visits URLs it finds in links. It works well as long as the URL doesn't contain Unicode characters.

For example, I have a link:

<a href="https://example.com/catalog/kraków/list.html">Kraków</a>

The source contains a literal ó character, not a percent-encoded one. When I try to do:

$href = $crawler->filter('a')->attr('href');
$html = file_get_contents($href);

It returns a 404 error. If I visit that URL in the browser, it works, because the browser replaces ó with %C3%B3.
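For reference, %C3%B3 is just the percent-encoded form of the two UTF-8 bytes that make up ó. A minimal sketch of that mapping, assuming the script file itself is saved as UTF-8 (the variable names are only for illustration):

```php
<?php
// "ó" is stored as two bytes in UTF-8: 0xC3 0xB3.
$char = 'ó';
$bytes = bin2hex($char);

// rawurlencode() writes every non-ASCII byte as %XX.
$encoded = rawurlencode($char);

echo $bytes, PHP_EOL;   // c3b3
echo $encoded, PHP_EOL; // %C3%B3
```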

What should I do to make it possible to visit that URL via file_get_contents()?

Robo Robok
  • possible duplicate of https://stackoverflow.com/questions/31097744/file-get-contents-fails-with-special-characters-in-url ? – Manzolo Sep 05 '19 at 18:36

1 Answer

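rawurlencode can be used to encode URL parts. The following snippet extracts the path /catalog/kraków/list.html and encodes each segment (catalog, kraków and list.html) separately instead of encoding the entire URL, so the / separators are preserved. rawurlencode is used rather than urlencode because urlencode turns a space into +, which in a path is a literal plus sign rather than a space.

Check out the following solution:

function encodeUri($uri){
    $urlParts = parse_url($uri);

    // Encode each path segment on its own so the '/' separators survive.
    // A segment that already contains '%' is assumed to be encoded and is left as-is.
    $path = implode('/', array_map(function($pathPart){
        return strpos($pathPart, '%') !== false ? $pathPart : rawurlencode($pathPart);
    }, explode('/', $urlParts['path'] ?? '')));

    // Port and query are optional, so only append them when present.
    $port = array_key_exists('port', $urlParts) ? ':' . $urlParts['port'] : '';
    $query = array_key_exists('query', $urlParts) ? '?' . $urlParts['query'] : '';

    return $urlParts['scheme'] . '://' . $urlParts['host'] . $port . $path . $query;
}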


$href = $crawler->filter('a')->attr('href');
$html = file_get_contents(encodeUri($href)); // requests https://example.com/catalog/krak%C3%B3w/list.html

parse_url docs: https://www.php.net/manual/en/function.parse-url.php
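An alternative that avoids taking the URL apart at all is to percent-encode only the bytes that may not appear literally in a URL (non-ASCII bytes, spaces and control characters) and leave everything else, including existing %XX sequences, the port and the query string, untouched. The following is a minimal sketch of that idea, not a full URL normalizer. The function name encodeNonAscii is hypothetical, and it assumes the input is UTF-8 and the host name is plain ASCII (an internationalized domain name would need Punycode instead of percent-encoding).

```php
<?php
/**
 * Hypothetical helper: percent-encode every byte outside the printable
 * ASCII range 0x21-0x7E, and nothing else.
 */
function encodeNonAscii(string $url): string
{
    // No /u modifier: the pattern matches single bytes, so each byte of a
    // multi-byte UTF-8 character is encoded separately (ó -> %C3%B3).
    $result = preg_replace_callback(
        '/[^\x21-\x7E]/',
        function (array $match): string {
            return rawurlencode($match[0]);
        },
        $url
    );

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $result ?? $url;
}

echo encodeNonAscii('https://example.com:8080/catalog/kraków/list.html?q=a%20b'), PHP_EOL;
// https://example.com:8080/catalog/krak%C3%B3w/list.html?q=a%20b
```

Because the % character is never touched, running the function on an already encoded URL returns it unchanged, so it is safe to apply to every href regardless of whether the path or the query is already encoded.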

MaartenDev
  • It loses query string. I understand the idea and I'm going to implement it on the path part of the URL. – Robo Robok Sep 05 '19 at 18:45
  • Updated the answer to include the query path :) @RoboRobok – MaartenDev Sep 05 '19 at 18:47
  • Now it causes an error when the query string is missing and adds an (in theory) redundant `?` if there's no query. – Robo Robok Sep 05 '19 at 18:49
  • Sorry about that, fixed it @RoboRobok – MaartenDev Sep 05 '19 at 18:49
  • It looks better and better, but you know. Then there's port etc. Also, if the URL is already encoded, it will cause trouble too. Isn't there any better way? – Robo Robok Sep 05 '19 at 18:51
  • Fair points. The port requirement is easy to implement using the `parse_url` function: https://www.php.net/manual/en/function.parse-url.php. I added a safeguard against double encoding @RoboRobok – MaartenDev Sep 05 '19 at 18:55
  • That condition is not always good. Query string can be urlencoded and the path not, for example. – Robo Robok Sep 05 '19 at 18:57
  • Ah you are right, moved it to the Query string, urlencoding the domain is not supported in http so that edge case doesn't have to be covered. @RoboRobok – MaartenDev Sep 05 '19 at 19:01