I would like to download a text file from a web page using urlretrieve. It sometimes works, but most of the time it downloads unreadable text.

Here is my code, including the URL:

import os
import urllib.error
import urllib.request

# Install a global opener that sends a browser-like User-Agent header.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')]
urllib.request.install_opener(opener)

url = 'http://www.17500.cn/getData/ssq.TXT'
try:
    # The data/ directory must already exist in the current working directory.
    urllib.request.urlretrieve(url, os.getcwd() + '/data/data - all.txt')
except urllib.error.HTTPError as e:
    print('failure')

However, when I open data - all.txt, I get: ? Y?K?堽??R逆a{PU类,憕7>翰*嬊蓀傛0@?瑫襅?威J鸰?迭怔W踎?m?邒?纯?я?锖束+鳢^祸讀?茔?頬

  • What do you mean by unreadable text? Can you post some examples? – drum Apr 04 '18 at 04:06
  • for example: ? Y?K?堽??R逆a{PU类,憕7>翰*嬊蓀傛0@?瑫襅?威J鸰?迭怔W踎?m?邒?纯?я?锖束+鳢^祸讀?茔?頬 – ZHANG Juenjie Apr 04 '18 at 04:22
  • Why not use `wget http://www.17500.cn/getData/ssq.TXT -O out`? – NVRM Apr 04 '18 at 04:28
  • @Crptopat I tried your suggestion, but I still get the same garbled text. The file does display correctly when I open this URL in my browser, though. – ZHANG Juenjie Apr 04 '18 at 04:53
  • It seems like the website is somehow preventing users from automating data scraping. In a browser, the data renders fine. With tools such as wget, curl, or Python, however, the site sometimes returns garbage. – drum Apr 04 '18 at 04:53
  • @drum Then how can I solve this? – ZHANG Juenjie Apr 04 '18 at 04:56
  • You can try checking whether the data is garbage or properly formatted. I can't think of a good way. – drum Apr 04 '18 at 05:00
  • The following command worked: `wget --header='Accept: text/html' http://www.17500.cn/getData/ssq.TXT -O out.txt` – ZHANG Juenjie Apr 04 '18 at 05:38
  • @ZHANGJuenjie Do not edit your title to indicate that your problem has been solved. Instead, publish your solution as an answer and mark it as accepted; that is how SO indicates that a problem has been solved. – eyllanesc Apr 04 '18 at 05:41
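
For reference, the wget command reported to work in the comments above can be approximated in Python by sending the same `Accept: text/html` header through `urllib.request`. The sketch below has not been run against the site. It also guards against one possible cause of the garbage, a gzip-compressed response body written to disk without decompression; that cause is an assumption, not something the thread confirms. The helper names `maybe_gunzip` and `fetch` are hypothetical.

```python
import gzip
import urllib.request

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(raw: bytes) -> bytes:
    """Decompress raw if it looks like gzip data; otherwise return it unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download url with the Accept header from the working wget command."""
    request = urllib.request.Request(
        url,
        headers={
            "User-Agent": "Mozilla/5.0",
            "Accept": "text/html",  # same header that wget was given
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: unreadable output may be an undecoded gzip body.
        return maybe_gunzip(response.read())
```

Calling `fetch('http://www.17500.cn/getData/ssq.TXT')` and writing the returned bytes to a file opened in `'wb'` mode would stand in for the `urlretrieve` call. Because the check only looks at the first two bytes, an uncompressed response passes through unchanged.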
