I'm learning Python, and I'm trying to request a website using the requests library. I'm doing the following:

import requests
r = requests.get("http://www.charitystars.com")
print(r)

However, I get <Response [504]>, which I assume is an error, because soup = BeautifulSoup(r.content) then gives me an empty result. With other websites I get <Response [200]>, and the soup works. So I wonder why the request doesn't work on the first website, and what Response 504 actually means.

tony
  • https://en.wikipedia.org/wiki/List_of_HTTP_status_codes – jwodder Feb 02 '17 at 23:40
  • @jwodder Thank you. Still, I don't get it. What does it mean? Is it just temporarily down, or is there a way to work around this? – tony Feb 02 '17 at 23:46
  • `5xx` mostly means that the server has some internal problem and you have to wait until the admins fix it. – furas Feb 02 '17 at 23:53
  • @furas OK, so it is a problem on their end, not on mine. For example, I read somewhere that certain websites require authorization in order to scrape the data. (I'm a beginner, sorry) – tony Feb 03 '17 at 00:03
  • Every page is different and may need a different solution. Some check the `user-agent` header to display data correctly. You may need `authorization` if you use an API, i.e. special URLs that return pure data as JSON without all the HTML. – furas Feb 03 '17 at 00:10
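As a rough illustration of the status classes mentioned in these comments, the standard library's `http.HTTPStatus` can translate a numeric code into its official phrase. The helper name `describe_status` below is made up for this sketch; it is not part of `requests`.

```python
from http import HTTPStatus

# The first digit of a status code gives the general class of the response.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib does not know
    return f"{status.value} {status.phrase} ({STATUS_CLASSES[code // 100]})"


print(describe_status(200))
print(describe_status(504))
```

With a real response you would pass `r.status_code` to the helper. Keep in mind that the class is only a hint about where the problem lies; it does not tell you why the server answered that way.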

2 Answers

This page doesn't like scripts/bots, so it checks the User-Agent header.

It may also need this information to display the correct page, which can differ for desktop, tablet, and smartphone.

import requests

headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get("http://www.charitystars.com/", headers=headers)

print(r.status_code)

BTW: by default, requests sends "User-Agent": "python-requests/2.12.1" (the version number depends on the installed release).

You can use the site http://httpbin.org to see what your requests look like.

import requests

r = requests.get("http://httpbin.org/get")

print(r.text)
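If you only want to compare the headers, a sketch like the following shows the User-Agent that `requests` would send with and without the custom header, without contacting any server. It assumes `requests` is installed; the helper name `outgoing_user_agent` is invented for this example.

```python
import requests

URL = "http://www.charitystars.com/"


def outgoing_user_agent(headers=None):
    """Return the User-Agent requests would send (hypothetical helper, no network I/O)."""
    with requests.Session() as session:
        # prepare_request() merges the session's default headers with ours,
        # but does not send anything.
        prepared = session.prepare_request(
            requests.Request("GET", URL, headers=headers)
        )
    return prepared.headers["User-Agent"]


default_ua = outgoing_user_agent()
custom_ua = outgoing_user_agent({"User-Agent": "Mozilla/5.0"})

print(default_ua)  # starts with "python-requests/"; the version depends on your install
print(custom_ua)
```

The first value is what the server sees when no headers are passed, which is presumably what this site rejects; the second is the value from the snippet above.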
furas
  • Could you please explain why it returns a 200 code if I specify the headers? Thank you! – tony Feb 03 '17 at 00:05
  • Some servers check this header to recognize your browser and its capabilities, and then they can use different methods to display the page. They also use it to recognize scripts/bots and refuse access. – furas Feb 03 '17 at 00:07
  • BTW: try `r = requests.get("http://httpbin.org/get")` and `print(r.text)` and you will see that `requests` by default uses `"User-Agent": "python-requests/2.12.1"` – furas Feb 03 '17 at 00:16

I got error 504 because of a load balancer timeout. The solution was to run the affected function in the background. My cloud provider offers that; check whether yours does.

Also, your cloud provider may be denying access to that website. Check whether they have a whitelist in place.

Hope it helps.

Al Martins