BackgroundInfo:
I am scraping amazon. I need to set up the session cookies before using requests.session.get() to get the final version of the page source code of a url.
Code:
import requests
# I am currently working in China, so it's cn.
# Use the homepage to get cookies. Then use it later to scrape data.
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = requests.get(homepage,headers = headers)
cookies = response.cookies
#set up the Session object, so as to preserve the cookies between requests.
session = requests.Session()
session.headers = headers
session.cookies = cookies
#now begin download the source code
url = 'https://www.amazon.cn/TCL-%E7%8E%8B%E7%89%8C-L65C2-CUDG-65%E8%8B%B1%E5%AF%B8-%E6%96%B0%E7%9A%84HDR%E6%8A%80%E6%9C%AF-%E5%85%A8%E6%96%B0%E7%9A%84%E9%87%8F%E5%AD%90%E7%82%B9%E6%8A%80%E6%9C%AF-%E9%BB%91%E8%89%B2/dp/B01FXB0ZG4/ref=sr_1_2?ie=UTF8&qid=1476165637&sr=8-2&keywords=L65C2-CUDG'
response = session.get(url)
Desired Result:
When navigate to the amazon homepage in Chrome, the cookies should be something like:
As you can find in the cookies part,which I underscore in red, part of the cookies set by the response to our request to the homepage is "ubid-acbcn", which is also part of the request header, probably left from last visit.
So that is the cookie I want, which I attempted to get by the above code.
In python code, it should be a cookieJar, or a dictionary. Either way, its content should be something that contains 'ubid-acbcn' and 'session-id':
{'ubid-acbcn':'453-7613662-1073007','session-id':'455-1363863-7141553','otherparts':'otherparts'}
What I am getting instead: The 'session-id' is there, but the 'ubid-acbcn' is missing.
>>homepage = 'http://www.amazon.cn'
>>headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
>>response = requests.get(homepage,headers = headers)
>>cookies = response.cookies
>>print(cookies.get_dict()):
>>{'session-id': '456-2975694-3270026','otherparts':'otherparts'}
Related Info:
- OS: WINDOWS 10
- PYTHON: 3.5
- requests: 2.11.1
I am sorry for being a bit verbose.
What I tried and figure:
- I googled for certain keywords, but nobody seems to be facing this problem.
- I figure it might be something to do with the amazon anti-scraping measure. But other than change my headers to disguise myself as a human, there isn't much I know I should do.
- I have also entertained the possibility that tt might not be a case of missing cookie. But rather I have not set up my requests.get(homepage,headers = headers) properly, hence the response.cookie is not as expected. Given this,I have tried to copying the request header in my browser, leaving out only the cookie part, but still the response cookie is missing the 'ubid-acbcn' part. Maybe some other parameter has to be set up?