Python article collection from a website that requires cookies

Question

I'm trying to collect articles from the databases at infoweb.newsbank.com for research I'm doing at University. So far this is my code:

from bs4 import BeautifulSoup
import requests
import urllib.request
from requests import session
import http.cookiejar


mainLink  = "http://infoweb.newsbank.com.proxy.lib.uiowa.edu/iw-search/we/InfoWeb?p_product=AWNB&p_theme=aggregated5&p_action=doc&p_docid=14D12E120CD13C18&p_docnum=2&p_queryname=4"




def articleCrawler(mainUrl):
    response = urllib.request.urlopen(mainUrl)
    soup = BeautifulSoup(response)
    linkList = []
    for link in soup.find_all('a'):
        print(link)

articleCrawler(mainLink)

Unfortunately, I get back this response:

<html>
<head>
<title>Cookie Required</title>
</head>
<body>
This is cookie.htm from the doc subdirectory.
<p>
<hr>
<p>

Licensing agreements for these databases require that access be extended
only to authorized users.  Once you have been validated by this system,
a "cookie" is sent to your browser as an ongoing indication of your authorization to
access these databases.  It will only need to be set once during login.
<p>
As you access databases, they may also use cookies.  Your ability to use those databases
may depend on whether or not you allow those cookies to be set.
<p>
To login again, click <a href="login">here</a>.
</p></p></p></hr></p></body>
</html>

<a href="login">here</a>

I've tried using http.cookiejar, but I am not familiar with the library. I am using Python 3. Does anyone know how to accept the cookie and access the article? Thank you.

score 2 · Answer 1 · answered Apr 11 '14 at 22:27

I'm not familiar with Python 3, but in Python 2 the standard way to accept cookies is to incorporate an HTTPCookieProcessor as one of the handlers in your OpenerDirector.

So, something like this:

import cookielib, urllib, urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))

opener is now ready to open a URL (presumably with a username and password) and place any cookies it receives into its integrated CookieJar:

params = urllib.urlencode({'username': 'someuser', 'password': 'somepass'})
opener.open(LOGIN_URL, params)

If the login was successful, opener will now have whatever authentication token the server gave it sitting around in cookie form. Then you just access the link you wanted in the first place:

f = opener.open(mainLink)
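
For Python 3, the same steps might look roughly like the sketch below. It is untested against the actual site: the form field names ('username', 'password') and the helper names build_cookie_opener, encode_login_form, and fetch_with_login are assumptions for illustration, and the real login form behind the proxy may expect different fields or extra hidden values.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores any cookies it receives in jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_login_form(username, password):
    """URL-encode the login fields as bytes (Python 3 requires a bytes POST body)."""
    # Assumed field names; inspect the real login form to find the right ones.
    fields = {"username": username, "password": password}
    return urllib.parse.urlencode(fields).encode("utf-8")


def fetch_with_login(login_url, target_url, username, password):
    """Log in first, then fetch target_url with the cookies the login set."""
    opener, _jar = build_cookie_opener()
    # Passing a data argument turns the request into a POST.
    opener.open(login_url, encode_login_form(username, password)).close()
    # The same opener sends the stored cookies back on this request.
    with opener.open(target_url) as response:
        return response.read()
```

The bytes returned by fetch_with_login(LOGIN_URL, mainLink, 'someuser', 'somepass') can then be handed to BeautifulSoup in place of the urlopen response in the question.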

As for Python 3: cookielib was renamed http.cookiejar, and urllib2 was split into urllib.request and urllib.parse, so http.cookiejar.CookieJar is the class you want there. The module for creating HTTP cookie content as a server, rather than receiving cookie content as a client, is http.cookies.
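
If you still get the "Cookie Required" page after logging in, one way to debug it is to keep a reference to the CookieJar you passed to HTTPCookieProcessor and look at what it holds. A small sketch, where cookie_summary is a hypothetical helper name and not part of the standard library:

```python
import http.cookiejar


def cookie_summary(jar):
    """Return a (domain, path, name) tuple for every cookie stored in jar."""
    # Iterating over a CookieJar yields http.cookiejar.Cookie objects.
    return [(cookie.domain, cookie.path, cookie.name) for cookie in jar]
```

An empty list after the login request suggests the server never sent a cookie, which usually means the login itself failed (wrong URL, wrong field names, or wrong credentials).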

Ok, I will check that out and comment back later. Thank you. — Solsma Dev, Apr 11 '14 at 22:44
