I'm trying to use urllib to get the content of this URL: "https://blockexplorer.com/block-index/0". When a browser loads this link, it is redirected to another URL: "https://blockexplorer.com/block/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f".

Here is my code:

import urllib

link = "https://blockexplorer.com/block-index/0"
f = urllib.urlopen(link)
myfile = f.read()
print myfile

But I get the message "Cannot GET /block-index/0". How can I get the content of the page that the block-index URL above redirects to?

Please help me solve this issue.

Thanks a lot.

phuong
  • I'm confused: do you want the contents of the page it redirects to, or of the page that does the redirection? – Mohammad Ali Apr 02 '17 at 16:03
  • I think this is what you are looking for: http://stackoverflow.com/a/3556287/1699398 – jmercouris Apr 02 '17 at 17:59
  • @MohammadAli: I want to get the final page that it redirects to, "https://blockexplorer.com/block/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f" – phuong Apr 03 '17 at 03:52
  • @jmercouris: Following your approach just gives me the original URL, "https://blockexplorer.com/block-index/0". It does not give me the URL "https://blockexplorer.com/block/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f", which contains the content that I want – phuong Apr 03 '17 at 03:57
  • @JameLenon why is it that when I visit the link I am not redirected? – Mohammad Ali Apr 03 '17 at 05:01
  • @MohammadAli: when you go to https://blockexplorer.com/block-index/0, it redirects to "https://blockexplorer.com/block/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f". Please try it – phuong Apr 03 '17 at 08:26
  • @JameLenon I've added my own answer below, please try it – Mohammad Ali Apr 08 '17 at 16:53

2 Answers

If you are willing to use the Python requests library, you could try the following code:

import requests
r = requests.get('https://blockexplorer.com/block-index/0', allow_redirects=True)

This should give you the contents of the page after any redirects have been followed; the body is available as r.text.
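If installing requests is not an option, the standard library also follows redirects and reports the final URL. Below is a minimal Python 3 sketch of that behaviour. It does not contact blockexplorer.com: it starts a small local stand-in server, and the paths, handler, and function names (`RedirectHandler`, `fetch_following_redirects`, `run_demo`) are made up for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical paths standing in for the real site's URLs.
INDEX_PATH = "/block-index/0"
BLOCK_PATH = "/block/genesis"


class RedirectHandler(BaseHTTPRequestHandler):
    """Local stand-in server: redirects the index URL to the block page."""

    def do_GET(self):
        if self.path == INDEX_PATH:
            # Answer the index URL with a redirect to the block page.
            self.send_response(302)
            self.send_header("Location", BLOCK_PATH)
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == BLOCK_PATH:
            body = b"block page"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_following_redirects(url, opener=None):
    """Return (final_url, body); urllib follows redirects on its own."""
    opener = opener or urllib.request.build_opener()
    with opener.open(url, timeout=5) as response:
        return response.geturl(), response.read()


def run_demo():
    """Start the stand-in server, fetch the index URL, return the result."""
    server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Bypass any configured proxy, since the demo server is local.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        base = "http://127.0.0.1:%d" % server.server_port
        return fetch_following_redirects(base + INDEX_PATH, opener)
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_demo()` returns the final URL (ending in the block path) together with the body of the page that was redirected to. Against a real site you would call `fetch_following_redirects(url)` directly. Note that this only covers HTTP redirects; a redirect performed by JavaScript in the browser will not be followed.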

Mohammad Ali

The site you are trying to crawl does not accept the Accept header value */* (the default for urllib), but it does accept text/html. You can crawl it with the following code:

import urllib2

link = "http://blockexplorer.com/block-index/0"
r = urllib2.Request(url=link)
r.add_header('Accept', 'text/html')
response = urllib2.urlopen(r)
print(response.read())

But I think you will run into more problems later. The data is not present in the HTML itself; it is retrieved dynamically via JavaScript (AngularJS).
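For anyone on Python 3, where urllib2 has been merged into urllib.request, the same idea might look roughly like the sketch below. The helper names `build_html_request` and `fetch_html` are my own, not part of any library, and the site's behaviour may have changed since this was written.

```python
import urllib.request


def build_html_request(url):
    """Build a GET request that explicitly asks for text/html."""
    request = urllib.request.Request(url)
    request.add_header("Accept", "text/html")
    return request


def fetch_html(url, timeout=10):
    """Fetch the URL and return the decoded body (needs network access)."""
    with urllib.request.urlopen(build_html_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

You would then call `fetch_html("http://blockexplorer.com/block-index/0")` and print the result. Building the request in its own function makes it easy to check which headers are being sent before going over the network.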

Rafael
  • Thanks for your answer, but it doesn't work for me. I want to get the same content as the "https://blockexplorer.com/block/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f" page. – phuong Apr 03 '17 at 03:25
  • But did you receive HTML, or the same `Cannot GET /block-index/0`? – Rafael Apr 03 '17 at 09:57
  • If you are still receiving the same `Cannot GET /block-index/0`, try to [urllib2.urlopen like Chrome](https://gist.github.com/rafaelhdr/a05984cea8f929f29a0c3eb173f8dcdc). – Rafael Apr 03 '17 at 10:07
  • No, the error "Cannot GET /block-index/0" no longer appears. But the content is very strange, not what I want, which is the page as it looks after loading: https://blockexplorer.com/block/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f – phuong Apr 03 '17 at 11:39
  • Not HTML, right? Is it strange output with many question marks (`?`)? If so, this happens because of the gzip encoding (from my gist). Just remove the gzip part; the new header is `r.add_header('Accept-Encoding', 'deflate, sdch, br')`. – Rafael Apr 03 '17 at 12:44
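
The garbled output described in the last comments is what a gzip-compressed body looks like when it is printed without being decompressed. Instead of dropping gzip from the Accept-Encoding header, another option is to decompress the body yourself when the response's Content-Encoding header says gzip. A minimal Python 3 sketch, where `decode_body` is a hypothetical helper and only gzip is handled:

```python
import gzip


def decode_body(raw, content_encoding):
    """Return the decompressed body if it was gzip-encoded, else the raw bytes.

    `content_encoding` is the value of the response's Content-Encoding
    header, or None if the header is absent.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    # Other encodings (deflate, br, ...) are not handled in this sketch.
    return raw
```

With a urllib response this could be used as `decode_body(response.read(), response.headers.get("Content-Encoding"))`.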