4

BackgroundInfo:

I am scraping amazon. I need to set up the session cookies before using requests.session.get() to get the final version of the page source code of a url.

Code:

import requests

# I am currently working in China, so it's cn. 
# Use the homepage to get cookies. Then use it later to scrape data.
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = requests.get(homepage,headers = headers)
cookies = response.cookies

#set up the Session object, so as to preserve the cookies between requests.
session = requests.Session()
session.headers = headers
session.cookies = cookies

#now begin download the source code
url = 'https://www.amazon.cn/TCL-%E7%8E%8B%E7%89%8C-L65C2-CUDG-65%E8%8B%B1%E5%AF%B8-%E6%96%B0%E7%9A%84HDR%E6%8A%80%E6%9C%AF-%E5%85%A8%E6%96%B0%E7%9A%84%E9%87%8F%E5%AD%90%E7%82%B9%E6%8A%80%E6%9C%AF-%E9%BB%91%E8%89%B2/dp/B01FXB0ZG4/ref=sr_1_2?ie=UTF8&qid=1476165637&sr=8-2&keywords=L65C2-CUDG'
response = session.get(url)

Desired Result:

When navigate to the amazon homepage in Chrome, the cookies should be something like:

Amazon HomePage

As you can find in the cookies part,which I underscore in red, part of the cookies set by the response to our request to the homepage is "ubid-acbcn", which is also part of the request header, probably left from last visit.

So that is the cookie I want, which I attempted to get by the above code.

In python code, it should be a cookieJar, or a dictionary. Either way, its content should be something that contains 'ubid-acbcn' and 'session-id':

{'ubid-acbcn':'453-7613662-1073007','session-id':'455-1363863-7141553','otherparts':'otherparts'}

What I am getting instead: The 'session-id' is there, but the 'ubid-acbcn' is missing.

>>homepage = 'http://www.amazon.cn'
>>headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
>>response = requests.get(homepage,headers = headers)
>>cookies = response.cookies
>>print(cookies.get_dict()):
>>{'session-id': '456-2975694-3270026','otherparts':'otherparts'}

Related Info:

  • OS: WINDOWS 10
  • PYTHON: 3.5
  • requests: 2.11.1

I am sorry for being a bit verbose.

What I tried and figure:

  1. I googled for certain keywords, but nobody seems to be facing this problem.
  2. I figure it might be something to do with the amazon anti-scraping measure. But other than change my headers to disguise myself as a human, there isn't much I know I should do.
  3. I have also entertained the possibility that tt might not be a case of missing cookie. But rather I have not set up my requests.get(homepage,headers = headers) properly, hence the response.cookie is not as expected. Given this,I have tried to copying the request header in my browser, leaving out only the cookie part, but still the response cookie is missing the 'ubid-acbcn' part. Maybe some other parameter has to be set up?
Sam Chan
  • 510
  • 1
  • 8
  • 20

2 Answers2

0

You're trying to get cookies from simple "nameless" GET request. But if to sent it "on behalf" of Session you can get required ubid-acbcn value:

session = requests.Session()
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = session.get(homepage,headers = headers)
cookies = response.cookies
print(cookies.get_dict())

Output:

{'ubid-acbcn': '456-2652288-5841140' ...}
Andersson
  • 51,635
  • 17
  • 77
  • 129
  • 1
    It's strange. I have tried this today, but it did not work. Since you are clearly getting what I what. Something must be wrong with my attempt. I have a VPN on the background, but tried to disable it, still won't get the desired result. Later I notice it might be that I have browsed the website, so I go to Chrome setting to delete the cookies for amazon.cn specifically, it still won't work. – Sam Chan Dec 15 '16 at 12:32
0

The cookies being set are from other pages/resources, probably loaded by JavaScript code. So you probably need to used selenium web driver for it. Check out the link for detail discussion.

not getting all cookie info using python requests module

Jawad Ahmad Khan
  • 279
  • 6
  • 19