I am using `urlopen` from `urllib.request` in Python 3.5.1 (64-bit version on Windows 10) to load content from www.wordreference.com for a French project. Somehow, whenever I request any page other than the site's root URL, the page content is loaded from yahoo.com instead.

Here, I print the first 350 characters from http://www.wordreference.com:

>>> from urllib import request
>>> page = request.urlopen("http://www.wordreference.com")
>>> content = page.read()
>>> print(content.decode()[:350])
<!DOCTYPE html>

<html lang="en">

<head>

<title>English to French, Italian, German &amp; Spanish Dictionary -
WordReference.com</title>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<meta name="description" content="Free online dictionaries - Spanish, French,
Italian, German and more. Conjugations, audio pronunciations and

Next, I requested a specific document on the domain:

>>> page = request.urlopen("http://www.wordreference.com/enfr/test")
>>> content = page.read()
>>> print(content.decode()[:350])
<!DOCTYPE html>
<html id="atomic" lang="en-US" class="atomic my3columns  l-out Pos-r https fp
fp-v2 rc1 fp-default mini-uh-on viewer-right ltr desktop Desktop bkt201">
<head>

<title>Yahoo</title><meta http-equiv="x-dns-prefetch-control" content="on"
<link rel="dns-prefetch" href="//s.yimg.com"><link rel="preconnect"
href="//s.yimg.com"><li

The last request takes about six seconds longer to read (which could be my slow internet) and the content comes straight from http://www.yahoo.com/. I can access the above URLs fine in a web browser.

Why is this happening? Is this something related to Windows 10? I have tried this on other domains and the problem does not occur.

cp289
  • wordreference detects you're not using a browser and "blocks" your request by redirecting it. Consider setting a browser user agent in urllib. – Quentin Pradet Jul 05 '16 at 06:37
  • I just tried with Postman and with the same code you posted, and I could not reproduce the problem. I also suggest using the [`urllib3`](https://urllib3.readthedocs.io/en/latest/) or [`requests`](http://docs.python-requests.org/en/master/) modules. – smac89 Jul 05 '16 at 06:37
  • My guess is that the request sent by the `urllib` library is misinterpreted by the server, which then redirects it to Yahoo. The same redirection also happens when using `curl` to send the request, which returns an "Object moved to here." page. When using another library, such as `requests`, it works as expected.
    – DeepSpace Jul 05 '16 at 06:41
  • @QuentinPradet That seems to be the solution. After setting a user agent using `urllib.request.Request` the regular page content loads. – cp289 Jul 05 '16 at 23:38
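
The fix confirmed in the last comment can be sketched as follows. This is a minimal sketch, assuming the server only inspects the `User-Agent` header; the helper names `build_request` and `fetch_prefix` and the user-agent string are illustrative and do not come from the original post.

```python
from urllib import request

# Assumed browser-like User-Agent string; any common browser value should do.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_prefix(url, length=350):
    """Fetch url and return (final_url, first `length` characters of the body)."""
    with request.urlopen(build_request(url)) as page:
        # geturl() reports the URL actually served, which exposes any redirect.
        final_url = page.geturl()
        text = page.read().decode("utf-8", errors="replace")
    return final_url, text[:length]
```

Calling `fetch_prefix("http://www.wordreference.com/enfr/test")` and printing both values should show whether the request was still redirected: if the returned URL no longer matches the requested one, the server sent the client elsewhere.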

1 Answer

I tried the following code, which uses the `requests` library, and it works:

import requests
page = requests.get("http://www.wordreference.com/enfr/test")
content = page.text  # requests has already decoded the body to str
print(content[:350])
Sijan Bhandari