2

[https://tools.usps.com/go/TrackConfirmAction.action?tRef=fullpage&tLc=1&text28777=&tLabels=LN594080445CN]

import urllib
url='https://tools.usps.com/go/TrackConfirmAction.action?tRef=fullpage&tLc=1&text28777=&tLabels=LN594080445CN'  
page=urllib.urlopen(url).read()

but get page=''

my version:2.7.6

Roger
  • 85
  • 1
  • 9

3 Answers3

2

Hum I try with the python package requests and first have an error : requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

It seems it redirects from url to another and loop like that. Maybe it failed with urllib. Also I checked doc of urlopen and seems to have some problem with https request.

Whatever I found this piece of code which fix your issue :

import requests

url='https://tools.usps.com/go/TrackConfirmAction.action?tRef=fullpage&tLc=1&text28777=&tLabels=LN594080445CN'

s = requests.session()
s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'

response = s.get(url)
print response.text

Creating the session fix the error about the max redirection error. More information in this question. In response.text you have all the code page, which what you want I guess.

Of course you need to add YOUR specific user agent. For the example there is mine but yours might be different. You can find it here hope it helps !

Community
  • 1
  • 1
Bestasttung
  • 2,388
  • 4
  • 22
  • 34
1

I tried this URL and found the error is due to a HTTP Error 301. It should be caused by the anti-spider mechanism of this website. You have to design a sophisticated user agent to get the page, instead of just a simple urlopen.

Timothy
  • 4,467
  • 5
  • 28
  • 51
1

I suggest not using urllib in any modern Python context. Use "Requests" ("HTTP for Humans") instead.

But before that, as @Skyler said, the result is a redirect, and your first stop should be to see what curl reports:

$ curl -I 'https://tools.usps.com/go/TrackConfirmAction.action?tRef=fullpage&tLc=1&text28777=&tLabels=LN594080445CN\]'
HTTP/1.1 301 Moved Permanently
Server: AkamaiGHost
Content-Length: 0
Location: https://www.usps.com/root/global/server_responses/webtools-msg.htm
Date: Wed, 31 Dec 2014 10:43:14 GMT
Connection: keep-alive

No big deal, but you can see that the URL it redirects to states:

To learn about integrating the free Postal Service® Address and Tracking API's into your application, please visit www.usps.com/webtools.

Also fair enough. I suggest going there and signing up. No point scraping HTML when there's a proper way to do it.

BUT, if you really want to pull the raw HTML via code: first get it working via Curl.

Pull up the Chrome developer tools and reload the page. Right click and look for "copy as CURL". You can edit the link. The following worked for me though it could probably be trimmed more:

curl 'https://tools.usps.com/go/TrackConfirmAction.action?tRef=fullpage&tLc=1&text28777=&tLabels=LN594080445CN\]' \
-H 'Accept-Encoding: gzip, deflate, sdch' \
-H 'Accept-Language: en-US,en;q=0.8' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
--compressed

This can be trimmed. The code below works with the nice requests module:

import requests

headers = {
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}
r = requests.get('https://tools.usps.com/go/TrackConfirmAction.action?tRef=fullpage&tLc=1&text28777=&tLabels=LN594080445CN]', headers=headers)

print "Status: %s" % r.status_code
print "Content-type: %s" % r.headers['content-type']
print "Content length: %d" % len(r.text)

Running:

$ python demo.py
Status: 200
Content-type: text/html
Content length: 55142

Even cleaner:

params = {
    'tRef': 'fullpage',
    'tLc': '1',
    'text28777': '',
    'tLabels': 'LN594080445CN]',
}

r = requests.get('https://tools.usps.com/go/TrackConfirmAction.action',
        params=params,
        headers=headers)

As I said though, I think this is not the right choice. Use the USPS API.

Andrew E
  • 7,697
  • 3
  • 42
  • 38