178

I'm trying to unzip a csv file and pass it into pandas so I can work on the file.
The code I have tried so far is:

import requests, zipfile, StringIO
r = requests.get('http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2013_CSV.zip')
z = zipfile.ZipFile(StringIO.StringIO(r.content))
crime2013 = pandas.read_csv(z.read('crime_incidents_2013_CSV.csv'))

After the last line, although python is able to get the file, I get a "does not exist" at the end of the error.

Can someone tell me what I'm doing incorrectly?

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
user2793667
  • 1,781
  • 2
  • 11
  • 3

6 Answers6

272

If you want to read a zipped or a tar.gz file into pandas dataframe, the read_csv methods includes this particular implementation.

df = pd.read_csv('filename.zip')

Or the long form:

df = pd.read_csv('filename.zip', compression='zip', header=0, sep=',', quotechar='"')

Description of the compression argument from the docs:

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’ For on-the-fly decompression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression.

New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.

rjurney
  • 4,824
  • 5
  • 41
  • 62
Suchit
  • 2,829
  • 1
  • 10
  • 3
  • 8
    There isn't support for zipped files, only gzip and bz2. This is irritating, because zip is pretty common. I imagine this is because zip isn't open source? – T.C. Proctor Oct 15 '15 at 16:28
  • 35
    zip is now supported in pandas 0.18.1 – krackoder May 05 '16 at 02:16
  • 1
    This solutions works for a gzipped file but not for .tar.gz files (Pandas 0.19.2) Tar.gz is not supported by Pandas! See: https://github.com/pandas-dev/pandas/issues/12005#issuecomment-170277346 – tector Jun 14 '17 at 12:33
  • Could you please tell us whether there is any particular reason to use `quotechar`? – Herpes Free Engineer Mar 27 '18 at 16:24
  • This answer shows a `.tar.gz` file, but it probably only works with a `.gz` file. – William Entriken Apr 01 '18 at 02:45
  • To add few tips on this answer. As of `pandas 0.24.1`, these compression types are supported `compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’`. And you need to have only the csv file in the compressed format. I use this in Mac to compress. Type in command line `zip -r9 your_zip_file.zip your_file.csv` – addicted Feb 06 '19 at 07:56
  • 1
    @addicted, I've updated the answer with the new compression formats – Pierre H. Feb 28 '19 at 15:04
  • Edited to be current and clarified the simple API for zip support. – rjurney Aug 24 '19 at 20:57
  • How would you expect it to work for a `tar.gz` file? If multiple csv files are tar'd, which one should pandas open? – dslack Feb 23 '20 at 01:59
  • any chance this works if the zipfile is password protected? – PV8 Mar 06 '20 at 12:47
  • apparently the read_excel method doesn't work the same way. My zip file is online and when I open it, its got a file with an .xlsx extension so therefore read_csv doens't like it. – leeprevost Apr 15 '20 at 16:31
  • I just got a BSOD in windows 10 by trying to read a huge compressed (with .xz) CSV file with read_csv; watch out! – tobi delbruck Jun 07 '23 at 06:57
48

I think you want to open the ZipFile, which returns a file-like object, rather than read:

In [11]: crime2013 = pd.read_csv(z.open('crime_incidents_2013_CSV.csv'))

In [12]: crime2013
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24567 entries, 0 to 24566
Data columns (total 15 columns):
CCN                            24567  non-null values
REPORTDATETIME                 24567  non-null values
SHIFT                          24567  non-null values
OFFENSE                        24567  non-null values
METHOD                         24567  non-null values
LASTMODIFIEDDATE               24567  non-null values
BLOCKSITEADDRESS               24567  non-null values
BLOCKXCOORD                    24567  non-null values
BLOCKYCOORD                    24567  non-null values
WARD                           24563  non-null values
ANC                            24567  non-null values
DISTRICT                       24567  non-null values
PSA                            24567  non-null values
NEIGHBORHOODCLUSTER            24263  non-null values
BUSINESSIMPROVEMENTDISTRICT    3613  non-null values
dtypes: float64(4), int64(1), object(10)
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • 4
    Note: you can parse the date columns when reading: `pd.read_csv(z.open('crime_incidents_2013_CSV.csv'), parse_dates=['REPORTDATETIME', 'LASTMODIFIEDDATE'])` – Andy Hayden Sep 19 '13 at 02:42
  • 1
    To read the first file: `pd.read_csv(z.open(z.infolist()[0].filename))` – user3226167 Sep 15 '17 at 10:37
31

It seems you don't even have to specify the compression any more. The following snippet loads the data from filename.zip into df.

import pandas as pd
df = pd.read_csv('filename.zip')

(Of course you will need to specify separator, header, etc. if they are different from the defaults.)

Tapa Dipti Sitaula
  • 712
  • 1
  • 8
  • 13
25

For "zip" files, you can use import zipfile and your code will be working simply with these lines:

import zipfile
import pandas as pd
with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
   with z.open("Crime_Incidents_in_2013.csv") as f:
      train = pd.read_csv(f, header=0, delimiter="\t")
      print(train.head())    # print the first 5 rows

And the result will be:

X,Y,CCN,REPORT_DAT,SHIFT,METHOD,OFFENSE,BLOCK,XBLOCK,YBLOCK,WARD,ANC,DISTRICT,PSA,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,XCOORD,YCOORD,LATITUDE,LONGITUDE,BID,START_DATE,END_DATE,OBJECTID
0  -77.054968548763071,38.899775938598317,0925135...                                                                                                                                                               
1  -76.967309569035052,38.872119553647011,1003352...                                                                                                                                                               
2  -76.996184958456539,38.927921847721443,1101010...                                                                                                                                                               
3  -76.943077541353617,38.883686046653935,1104551...                                                                                                                                                               
4  -76.939209158039446,38.892278093281632,1125028...
imanzabet
  • 2,752
  • 2
  • 26
  • 19
  • 4
    This is required if you have multiple files in the zipfile.. at least, this is what I have been able to get to work. – blacktj Jan 14 '21 at 14:49
7

I guess what your looking is the following

from io import BytesIO
import requests
import pandas as pd

result = requests.get("https://www.xxx.zzz/file.zip")
df = pd.read_csv(BytesIO(result.content),compression='zip', header=0, sep=',', quotechar='"')

Read these article to understand why: https://medium.com/dev-bits/ultimate-guide-for-working-with-i-o-streams-and-zip-archives-in-python-3-6f3cf96dca50

MajorDaxx
  • 131
  • 1
  • 4
2

https://www.kaggle.com/jboysen/quick-gz-pandas-tutorial

Please follow this link.

import pandas as pd
traffic_station_df = pd.read_csv('C:\\Folders\\Jupiter_Feed.txt.gz', compression='gzip',
                                 header=1, sep='\t', quotechar='"')

#traffic_station_df['Address'] = 'address'

#traffic_station_df.append(traffic_station_df)
print(traffic_station_df)
  • Welcome to Stack Overflow! While this code may answer the question, providing additional context either as comments with the code or as a separate paragraph regarding how and/or why it solves the problem would improve the answer's long-term value. – Sardar Usama Jul 04 '20 at 07:50