3

I would like to automate the download of CSV files from the World Bank's dataset.

My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.

If you paste this URL in your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python is not working in the present situation:

import csv
import urllib2

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)
myData = csv.reader(remoteCSV)

How should I modify my code in order to download the file coming from the query to the API?

jonrsharpe
SirC

4 Answers

3

The URL returns a zip archive rather than a bare CSV file. The code below downloads the zip, opens it in memory, and gives you a csv reader for whichever file inside it you want.

import urllib2
import StringIO
from zipfile import ZipFile
import csv

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)

# We create a StringIO object so that we can work on the results of the request (a string) as though it is a file.
sio = StringIO.StringIO()
sio.write(remoteCSV.read())

# We now create a ZipFile object pointed to by 'z' and we can do a few things here:
z = ZipFile(sio, 'r')

# A list with the names of all the files in the zip you just downloaded
print z.namelist()

# We can use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
# Opens the 2nd file in the zip
with z.open(z.namelist()[1]) as f:
    csvr = csv.reader(f)
    for row in csvr:
        print row

For more information, see the ZipFile and StringIO documentation.
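
The code above targets Python 2 (urllib2, StringIO, the print statement). As a rough Python 3 sketch of the same idea, the standard library's io.BytesIO and urllib.request can stand in for StringIO and urllib2. The helper names fetch_bytes and read_zip_member_rows are made up for this illustration, and the "utf-8-sig" encoding is an assumption about the files in the archive:

```python
import csv
import io
import urllib.request
import zipfile

URL = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"


def fetch_bytes(url):
    """Download the raw response body (here, a zip archive) as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_zip_member_rows(zip_bytes, member_name=None):
    """Return the CSV rows of one file stored inside an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Fall back to the first member when no name is given.
        name = member_name or archive.namelist()[0]
        with archive.open(name) as raw:
            # archive.open() yields bytes, but csv.reader expects text.
            # Assumption: the file is UTF-8, possibly with a BOM.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.reader(text))


# Example usage (performs a network request):
# payload = fetch_bytes(URL)
# print(zipfile.ZipFile(io.BytesIO(payload)).namelist())
# rows = read_zip_member_rows(payload, "name_of_the_data_file.csv")
```

Selecting the member by name rather than by position avoids depending on the order of files inside the archive, which may change.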

MrAlexBailey
  • Thanks, it works and it has solved my problem and I have learnt something new about StringIO library. – SirC Mar 20 '15 at 15:55
2
import os
import urllib
import zipfile
from StringIO import StringIO

url = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"

# Download the zip archive and wrap it in a file-like object.
package = StringIO(urllib.urlopen(url).read())
archive = zipfile.ZipFile(package, 'r')
pwd = os.path.abspath(os.curdir)

# Write every file in the archive to the current directory.
for filename in archive.namelist():
    path = os.path.join(pwd, filename)
    # 'wb' writes the bytes unchanged on every platform.
    with open(path, 'wb') as fp:
        fp.write(archive.read(filename))
    print filename, 'downloaded successfully'

From here you can use your approach to handle CSV files.
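
For Python 3, a similar result can probably be had with ZipFile.extract, which writes a member to disk and returns the path it created. This is only a sketch: extract_csv_files and its target_dir parameter are hypothetical names, and zip_bytes is assumed to hold the already downloaded response body.

```python
import io
import os
import zipfile


def extract_csv_files(zip_bytes, target_dir):
    """Extract every .csv member of an in-memory zip into target_dir.

    Returns the list of paths that were written.
    """
    os.makedirs(target_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                # extract() copies the bytes unchanged and returns the new path.
                written.append(archive.extract(name, path=target_dir))
    return written


# Example usage, given the response body in `payload`:
# for path in extract_csv_files(payload, "worldbank_data"):
#     print(path, "extracted")
```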

Mauro Baraldi
  • Thank you, both answers work fine. I marked the other one as the answer to my question just because it is a bit more "didactic". – SirC Mar 20 '15 at 15:57
1

We have a script that automates access to, and data extraction from, World Bank World Development Indicators such as: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS

The script does the following:

  1. Downloading the metadata and data
  2. Extracting metadata and data
  3. Converting to a Data Package

The script is written in Python and uses Python 3.0. It has no dependencies outside the standard library. Try it:

python scripts/get.py

python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS

You can also read our analysis of World Bank data:

https://datahub.io/awesome/world-bank

anuveyatsu
-1

This is more a suggestion than a solution: you can use pd.read_csv to read a CSV file directly from a URL.

import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')
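
Note that the World Bank URL in the question returns a zip archive that appears to contain several files, and pd.read_csv only decompresses zip archives holding a single data file. One possible workaround, sketched here for Python 3, is to open the wanted member with zipfile and hand the file object to pd.read_csv. The function name read_member_as_dataframe is hypothetical, and the skiprows parameter is included on the assumption that the data file may start with a few preamble lines before the header; inspect the file to find the right value.

```python
import io
import zipfile

import pandas as pd


def read_member_as_dataframe(zip_bytes, member_name, skiprows=0):
    """Load one CSV member of an in-memory zip archive into a DataFrame."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as raw:
            # read_csv accepts binary file-like objects.
            # skiprows is an assumption about the file layout (see above).
            return pd.read_csv(raw, skiprows=skiprows)


# Example usage, given the response body in `payload`:
# df = read_member_as_dataframe(payload, "name_of_the_data_file.csv", skiprows=4)
```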
Kathirmani Sukumar