
I'm just getting to grips with pandas (which is awesome), and I need to read compressed genomics files from FTP sites into a pandas DataFrame. This is what I tried, and it produced a ton of errors:

from pandas.io.parsers import *

chr1 = 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/chr_rpts/chr_1.txt.gz'

CHR1 = read_csv(chr1, sep='\t', compression = 'gzip', skiprows = 10)

print type(CHR1)
print CHR1.head(10)

Ideally I'd like to do something like this:

from pandas.io.data import *
AAPL = DataReader('AAPL', 'yahoo', start = '01/01/2006')
Cath Penfold
I don't think pandas are intelligent enough to retrieve files using FTP. – Feb 18 '13 at 21:01

1 Answer


The interesting part of this question is how to stream a gzipped file from FTP. That is discussed here, where it's claimed that the following works in Python 3.2 (but not in 2.x, and it won't be backported). On my system this is the case:

import urllib.request as ur
from gzip import GzipFile

import pandas as pd

# chr1 is the URL string defined in the question
req = ur.Request(chr1)  # gz file on ftp (ensure it starts with 'ftp://')
z_f = ur.urlopen(req)

# this line *may* work (but I haven't been able to confirm it)
# df = pd.read_csv(z_f, sep='\t', compression='gzip', skiprows=10)

# this works (*)
f = GzipFile(fileobj=z_f, mode="r")
df = pd.read_csv(f, sep='\t', skiprows=10)

(*) Here f is "file-like", in the sense that we can perform a readline (read it line-by-line), rather than having to download/open the entire file.
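To illustrate what "file-like" means here without touching the network, the following is a minimal sketch in which an in-memory buffer stands in for the FTP response. The sample data and the names `raw` and `payload` are made up for the illustration; only the `GzipFile(fileobj=...)` wrapping mirrors the code above.

```python
import gzip
import io

# Hypothetical stand-in for the FTP response: a small gzip payload in memory.
raw = b"col_a\tcol_b\n1\t2\n3\t4\n"
payload = io.BytesIO(gzip.compress(raw))

# Wrap the compressed stream; data is read and decompressed incrementally.
f = gzip.GzipFile(fileobj=payload, mode="rb")

# readline() returns one decompressed line at a time.
first_line = f.readline()

# Iterating over the file object yields the remaining lines.
remaining = [line for line in f]

print(first_line)
print(remaining)
```

Because pandas only needs an object it can read from, the same wrapped object can be handed to `read_csv` in place of a filename.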


Note: I couldn't get the ftplib library to do a readline; it wasn't clear whether it ought to.
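For what it's worth, `ftplib.FTP.retrbinary` hands the data to a callback in chunks rather than exposing a readline. One possible workaround, sketched below, is to collect those chunks in a `BytesIO` buffer and wrap that in a `GzipFile`. This is not streaming: it assumes the whole compressed file fits in memory and that the server accepts anonymous login. The helper name `open_ftp_gz` is hypothetical.

```python
import ftplib
import gzip
import io
from urllib.parse import urlparse


def open_ftp_gz(url):
    """Download a .gz file over FTP into memory and return a file-like
    object that yields the decompressed bytes (hypothetical helper)."""
    parts = urlparse(url)
    buf = io.BytesIO()
    with ftplib.FTP(parts.hostname) as ftp:
        ftp.login()  # anonymous login; an assumption about the server
        # retrbinary calls buf.write once per received chunk
        ftp.retrbinary("RETR " + parts.path, buf.write)
    buf.seek(0)  # rewind so GzipFile reads from the start
    return gzip.GzipFile(fileobj=buf, mode="rb")


# Usage (needs network access, so it is left commented out):
# f = open_ftp_gz(chr1)
# df = pd.read_csv(f, sep='\t', skiprows=10)
```

The returned object supports `readline`, so it can be passed to `read_csv` in the same way as `f` above, at the cost of buffering the download first.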

Andy Hayden