I want to use Python requests with the Splash browser (https://splash.readthedocs.io/en/stable/) and custom headers to crawl some data from a website. Before starting the crawl, I used http://xhaus.com/headers to check which headers I actually send. It turns out that the headers I set are not the ones that arrive.

import random

import requests

def random_user_agent():
    # Pick one user agent at random from a file with one entry per line.
    with open('user-agents.txt', 'r') as f:
        user_agents = [line.strip() for line in f if line.strip()]
    return random.choice(user_agents)

def build_headers():
    # Start from the requests defaults and override only the User-Agent.
    headers = requests.utils.default_headers()
    headers.update({'User-Agent': random_user_agent()})
    return headers

splash = 'http://localhost:8050/render.html'
headers = build_headers()
url_h = 'http://xhaus.com/headers'
page = requests.get(splash, params={'url': url_h}, headers=headers)

After I run this code, `headers` contains the following, including the user agent I picked:

{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

However, when I parse the page returned by the website I mentioned, it reports a different user agent:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

...

<td class="even">
       User-Agent
      </td>
      <td class="even">
       <b>
        Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) splash Safari/538.1
       </b>
      </td>

...
jwodder
Ostap Didenko
  • What is `splash` here? If I do `page = requests.get(url_h, headers=headers)` with your other code as it is, I am getting correct user-agent. – Vikas Ojha Aug 28 '17 at 15:58
  • That works for me as well! The trick is to run it with Splash. It's a browser for rendering JavaScript: https://splash.readthedocs.io/en/stable/ – Ostap Didenko Aug 28 '17 at 20:18
  • I am not able to find the documentation for using `Splash` with `Requests`. I think you might need to use this - http://splash.readthedocs.io/en/stable/scripting-ref.html?highlight=user%20agent#splash-set-user-agent – Vikas Ojha Aug 29 '17 at 06:17
  • Thanks, I haven't found it either. I had also checked the link you provided before asking here, but I have no idea how to use it. I'd like to find an example of how it can be used with plain Python rather than with Scrapy. – Ostap Didenko Aug 29 '17 at 07:32
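
Following up on the `splash:set_user_agent` link in the comments, here is a minimal sketch of how this might look with plain `requests`, no Scrapy. The likely reason for the mismatch is that headers passed to `requests.get` are sent to the Splash server itself; Splash then makes its own request to the target site with its own user agent. The sketch shows two options as I understand the Splash HTTP API: POSTing JSON to `render.html` with a `headers` argument, and sending a small Lua script to the `/execute` endpoint. It assumes Splash is listening on `localhost:8050` as in the question; the function names and the `ua` argument name are made up for illustration.

```python
# Assumption: Splash runs locally on port 8050, as in the question.
SPLASH_RENDER = 'http://localhost:8050/render.html'
SPLASH_EXECUTE = 'http://localhost:8050/execute'

# Lua script executed inside Splash. `splash.args.ua` and `splash.args.url`
# are filled from the request parameters built below ("ua" is a made-up name).
LUA_SOURCE = """
function main(splash)
    splash:set_user_agent(splash.args.ua)
    assert(splash:go(splash.args.url))
    return splash:html()
end
"""


def build_render_payload(url, user_agent):
    """JSON body for render.html: headers Splash should use for the page request."""
    return {'url': url, 'headers': {'User-Agent': user_agent}}


def build_execute_params(url, user_agent):
    """Query parameters for /execute: the script plus its arguments."""
    return {'lua_source': LUA_SOURCE, 'url': url, 'ua': user_agent}


def fetch_with_headers(url, user_agent, timeout=30):
    """Option 1: POST a JSON body to render.html with a `headers` argument."""
    import requests  # third-party; imported here so the builders stay dependency-free
    resp = requests.post(SPLASH_RENDER,
                         json=build_render_payload(url, user_agent),
                         timeout=timeout)
    resp.raise_for_status()
    return resp.text


def fetch_with_script(url, user_agent, timeout=30):
    """Option 2: run the Lua script above through the /execute endpoint."""
    import requests
    resp = requests.get(SPLASH_EXECUTE,
                        params=build_execute_params(url, user_agent),
                        timeout=timeout)
    resp.raise_for_status()
    return resp.text
```

With a Splash instance running, calling `fetch_with_headers('http://xhaus.com/headers', random_user_agent())` and parsing the result with BeautifulSoup as in the question should show whether the custom user agent reaches the target site. As far as I can tell from the Splash docs, the `headers` argument of `render.html` is only honoured for JSON POST requests, which is why the first option uses `requests.post(..., json=...)` instead of query parameters.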

0 Answers