
So I'm looking into urllib3 because it has connection pooling and is thread-safe (so performance is better, especially for crawling), but the documentation is minimal, to say the least. urllib2 has build_opener, so something like this works:

#!/usr/bin/python
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")

But urllib3 has no build_opener method, so the only way I have figured out so far is to manually put it in the header:

#!/usr/bin/python
import urllib3
http_pool = urllib3.connection_from_url("http://example.com")
myheaders = {'Cookie':'some cookie data'}
r = http_pool.get_url("http://example.org/", headers=myheaders)

But I am hoping there is a better way and that one of you can tell me what it is. Also can someone tag this with "urllib3" please.

bigredbob
    @bigredbob, tagged as you asked. I've looked at urllib3's sources and it seems to have none of the tweaks and turns of urllib2, including `Opener` objects, so I doubt there's a magic wand for you. Let's hope it matures with time, as it's pretty unripe as of now!-) – Alex Martelli Mar 11 '10 at 06:08

6 Answers


You're correct, there's no immediately better way to do this right now. I would be more than happy to accept a patch if you have a congruent improvement.

One thing to keep in mind, urllib3's HTTPConnectionPool is intended to be a "pool of connections" to a specific host, as opposed to a stateful client. In that context, it makes sense to keep the tracking of cookies outside of the actual pool.
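To make the "keep cookie state outside the pool" idea concrete, here is a minimal sketch. The `CookieStore` class is hypothetical, not part of urllib3; it just turns incoming `Set-Cookie` values into an outgoing `Cookie` header, using only the standard library:

```python
from http.cookies import SimpleCookie

class CookieStore:
    """Tracks cookies separately from any urllib3 connection pool."""

    def __init__(self):
        self._cookies = {}

    def store(self, set_cookie_value):
        # Parse a single Set-Cookie header value and remember name=value.
        parsed = SimpleCookie(set_cookie_value)
        for name, morsel in parsed.items():
            self._cookies[name] = morsel.value

    def cookie_header(self):
        # Build the value for an outgoing Cookie request header.
        return "; ".join(f"{k}={v}" for k, v in self._cookies.items())

store = CookieStore()
store.store("sid=abc123; Path=/; HttpOnly")
store.store("theme=dark")
print(store.cookie_header())  # sid=abc123; theme=dark
```

You would call `store.store(...)` on each response's `Set-Cookie` header and pass `{'Cookie': store.cookie_header()}` in the `headers` argument of each pool request.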

  • shazow (the author of urllib3)
    If I knew how to patch this I would love to, but I'm not that good. It's good to know I'm not doing it wrong at least. – bigredbob Mar 11 '10 at 08:03

Is there not a problem with multiple cookies?

Some servers return multiple Set-Cookie headers, but urllib3 stores the headers in a dict and a dict does not allow multiple entries with the same key.
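The dict behaviour described here can be seen in isolation: assigning a second value under the same key silently replaces the first.

```python
headers = {}
headers["Set-Cookie"] = "a=1"
headers["Set-Cookie"] = "b=2"  # overwrites; the first cookie is gone
print(headers)  # {'Set-Cookie': 'b=2'}
```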

httplib2 has a similar problem.

Or maybe not: it turns out that the readheaders method of the HTTPMessage class in the httplib package -- which both urllib3 and httplib2 use -- has the following comment:

If multiple header fields with the same name occur, they are combined according to the rules in RFC 2616 sec 4.2:

    Appending each subsequent field-value to the first, each separated
    by a comma. The order in which header fields with the same field-name
    are received is significant to the interpretation of the combined
    field value.

So no headers are lost.

There is, however, a problem if there are commas within a header value. I have not yet figured out what is going on here, but from skimming RFC 2616 ("Hypertext Transfer Protocol -- HTTP/1.1") and RFC 2965 ("HTTP State Management Mechanism") I get the impression that any commas within a header value are supposed to be quoted.
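A tiny illustration of that comma problem (the header value here is made up): once two Set-Cookie values are joined with ", ", a comma inside an `expires` date makes a naive split produce the wrong pieces.

```python
# Two cookies, joined per RFC 2616 sec 4.2; the expires date of the
# first cookie contains its own comma.
combined = "a=1; expires=Wed, 09 Jun 2021 10:18:14 GMT, b=2"

naive = combined.split(", ")
print(len(naive))  # 3 pieces, not the 2 cookies that were actually sent
```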

    RFC6265 says Set-Cookie has to be special cased to handle this. httplib is not doing so in python 2 ... see http://bugs.python.org/issue1660009 – reteptilian Jan 14 '16 at 18:56

You need to set 'Cookie', not 'Set-Cookie'; 'Set-Cookie' is sent by the web server.

And cookies are just headers, so there's nothing wrong with doing it that way.
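In other words, the two headers travel in opposite directions: the server sends Set-Cookie in the response, and the client echoes only the name=value part back in a Cookie header on later requests. A pure-string sketch (the cookie value is made up):

```python
set_cookie = "session=xyz789; Path=/; HttpOnly"  # arrives in a response
cookie = set_cookie.split(";", 1)[0]             # goes out in the next request
print(cookie)  # session=xyz789
```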

YOU

Given a CookieJar and a PoolManager:

import urllib.request  # the CookieJar itself comes from http.cookiejar

# Assumes: jar is an http.cookiejar.CookieJar, pool_manager is a
# urllib3.PoolManager, and url is the target URL.

# A dummy Request used to hold request data in a form understood by CookieJar
dummy_request = urllib.request.Request(url, headers={"X-Example": "client-specified-header"})

# Add a Cookie header to dummy_request based on the contents of the CookieJar
jar.add_cookie_header(dummy_request)

# Actually make the request with urllib3
response = pool_manager.request("GET", url, headers=dict(dummy_request.header_items()), redirect=False)

# Populate the CookieJar with any new cookies
jar.extract_cookies(response, dummy_request)

Note that it's important to disable redirects, because otherwise urllib3 will follow the redirects, disclosing any cookies intended for the original host to the redirect target hosts!
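With redirects disabled you have to follow Location headers yourself, consulting and updating the jar at every hop. A minimal sketch of that control loop, with a stub `fetch` function standing in for a real `pool_manager.request` call (the loop is the point, not the stub; in real use you would repeat the CookieJar steps above on each hop):

```python
def follow_redirects(fetch, url, max_hops=5):
    """Fetch url, following Location headers up to max_hops times."""
    for _ in range(max_hops):
        status, headers = fetch(url)
        if status not in (301, 302, 303, 307, 308):
            return url, status
        # Cookies scoped to the old host stay behind in the jar.
        url = headers["Location"]
    raise RuntimeError("too many redirects")

# Stub fetch: /start redirects to /done, which returns 200.
def fake_fetch(url):
    if url.endswith("/start"):
        return 302, {"Location": "http://example.com/done"}
    return 200, {}

final_url, status = follow_redirects(fake_fetch, "http://example.com/start")
print(final_url, status)  # http://example.com/done 200
```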

Sam Morris

You should use the requests library. It uses urllib3 but makes things like adding cookies trivial.

https://github.com/kennethreitz/requests

import requests
r1 = requests.get(url, cookies={'somename':'somevalue'})
print(r1.content)
rd108
  • Is this supposed to add a cookie to my browser? If so, what are some reasons it would not work? I tested this on my local website running on apache server, Windows 10, Python 2.75. – Chris Nielsen Aug 08 '17 at 21:51
  • More to the point you'd use the `requests.Session` class, which manages cookies on behalf of the client. – Sam Morris Mar 05 '23 at 17:21

You can use code like this:

import urllib3

# Create the pool once and reuse it across calls
http = urllib3.PoolManager()

def getHtml(url):
    r = http.request('GET', url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/31.0.1650.16 Safari/537.36',
        'Cookie': 'cookie_name=cookie_value',
    })
    return r.data  # HTML bytes

Replace cookie_name and cookie_value with your own cookie name and value.

Adrian B
  • I'll just use the requests library which lets you put the cookies right in the request and does propagate them. – Walt Howard Feb 23 '21 at 20:59