Python: downloading a file that resists usual techniques

Question

I am trying to write Python code to download and save a file from this URL: http://obiee.banrep.gov.co/analytics/saw.dll?Download&Format=excel&Extension=.xls&BypassCache=true&lang=es&NQUser=publico&NQPassword=publico&Path=/shared/Consulta%20Series%20Estadisticas%20desde%20Excel/1.%20IPC%20base%202008/1.3.%20Por%20rango%20de%20fechas/1.3.2.%20Por%20grupo%20de%20gasto&ViewState=h09v965dvurdtkj0iuni7m1kbe&ContainerID=o%3ago%7er%3areport&RootViewID=go

The expected result should be to download and save the served Excel file.

The file is behind some sort of Oracle database. It downloads fine in any browser, and the "Live HTTP headers" Firefox extension tells me it's a GET request. I've tried the usual techniques, but I always end up downloading "saw.dll", which is a simple XML file and not the expected Excel file.

Here's what I tried:

import urllib, urllib2, shutil  # Python 2 modules

url = 'http://obiee.banrep.gov.co/analytics/saw.dll?Download'
values = {
    'Format' : 'excel',
    'Extension' : '.xls',
    'BypassCache' : 'true',
    'lang' : 'es',
    'NQUser' : 'publico',
    'NQPassword' : 'publico',
    'Path' : '/shared/Consulta Series Estadisticas desde Excel/1. IPC base 2008/1.3. Por rango de fechas/1.3.2. Por grupo de gasto',
    'ViewState' : 'h09v965dvurdtkj0iuni7m1kbe',
    'ContainerID' : 'o%3ago%7er%3areport',
    'RootViewID' : 'go',
}

# Encode the parameters; passing them as data makes urllib2 send a POST
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
# Copy the response body to disk
myfile = open('test.xls', 'wb')
shutil.copyfileobj(response.fp, myfile)
myfile.close()

Other code I tried:

import requests,shutil

response = requests.get("http://obiee.banrep.gov.co/analytics/saw.dll?Download&Format=excel&Extension=.xls&BypassCache=true&lang=es&NQUser=publico&NQPassword=publico&Path=/shared/Consulta%20Series%20Estadisticas%20desde%20Excel/1.%20IPC%20base%202008/1.3.%20Por%20rango%20de%20fechas/1.3.2.%20Por%20grupo%20de%20gasto&ViewState=h09v965dvurdtkj0iuni7m1kbe&ContainerID=o%3ago%7er%3areport&RootViewID=go",stream=True)

with open('test.xls', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response

I also tried other stuff such as using wget, putting some delay between the request and the saving, etc.

Any ideas?

Thanks, best.

.xls is an XML format... I don't suppose you've tried opening the file in Excel? – Adam Barnes Nov 11 '16 at 16:59
I did, but the expected file is "1.3.2. Por grupo de gasto.xls", which is a data file. Opening saw.dll (the file my code actually downloads) in Excel works, but it's just a plain XML file that I don't need... – benzineengine Nov 11 '16 at 17:02

score 2 · Accepted Answer · answered Nov 11 '16 at 17:33

Did you try changing the user agent?

...
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
requests.get(url=url, stream=True, headers=headers)

Maybe the server returns different responses to different user agents.
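For readers without `requests`, the same idea can be sketched with the Python 3 standard library. This is only a minimal sketch, not the answerer's code: `build_request` and `download` are hypothetical helper names, and the User-Agent string is simply the one quoted above. Whether the server accepts other strings is not something this sketch can tell you.

```python
import shutil
import urllib.request

# User-Agent string quoted in the answer above; other values are untested.
BROWSER_UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/39.0.2171.95 Safari/537.36')


def build_request(url, user_agent=BROWSER_UA):
    """Return a GET request carrying a browser-like User-Agent header."""
    # No `data` argument is passed, so urllib issues a GET, not a POST.
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def download(url, dest, user_agent=BROWSER_UA):
    """Stream the response body for `url` into the file `dest`."""
    with urllib.request.urlopen(build_request(url, user_agent)) as resp, \
            open(dest, 'wb') as out_file:
        shutil.copyfileobj(resp, out_file)
```

Calling `download(url, 'test.xls')` with the full URL from the question would then be the equivalent of the `requests` call.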

answered Nov 11 '16 at 17:33

Jean Cassol


Actually, I tried something like that using `user_agent = 'Mozilla 5.0 (Windows 7; Win64; x64)'`, `file_name = "test.xls"`, `u = urllib2.Request(url, headers = {'User-Agent' : user_agent})`, but with no luck. Your suggestion did work for me with the user agent you suggested, though! Many thanks! – benzineengine Nov 11 '16 at 17:49
Did changing the user-agent work for you? I tried it and didn't have any luck. – tdelaney Nov 11 '16 at 17:49
That's interesting. "Mozilla/5.0" wasn't sufficient, but "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1)" was. Looks like you have to be more chatty than I expected. – tdelaney Nov 11 '16 at 17:58

score 0 · Answer 2 · edited May 23 '17 at 12:07

This code actually works for me:

import requests,shutil

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response=requests.get(url='http://obiee.banrep.gov.co/analytics/saw.dll?Download&Format=excel&Extension=.xls&BypassCache=true&lang=es&NQUser=publico&NQPassword=publico&Path=/shared/Consulta%20Series%20Estadisticas%20desde%20Excel/1.%20IPC%20base%202008/1.3.%20Por%20rango%20de%20fechas/1.3.2.%20Por%20grupo%20de%20gasto&ViewState=h09v965dvurdtkj0iuni7m1kbe&ContainerID=o%3ago%7er%3areport&RootViewID=go', stream=True, headers=headers)
with open('test.xls', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response

This is the answer suggested by Jean Cassol above. Many thanks.
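Since the failure mode in the question was silently saving the wrong payload, it may help to check what the server says it is sending before writing to disk. The sketch below assumes the server names the export in a `Content-Disposition` header, which this thread does not confirm. `served_filename` and `looks_like_excel` are hypothetical helper names.

```python
from email.message import Message


def served_filename(content_disposition):
    """Extract the filename parameter from a Content-Disposition value."""
    msg = Message()
    msg['Content-Disposition'] = content_disposition
    return msg.get_filename()  # None when no filename parameter is present


def looks_like_excel(content_disposition):
    """Return True when the served filename ends in .xls (case-insensitive)."""
    name = served_filename(content_disposition or '')
    return bool(name) and name.lower().endswith('.xls')
```

With the code above, one could call `looks_like_excel(response.headers.get('Content-Disposition'))` and only save the body when it returns True.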
