I have many .7z files, each containing many large CSV files (more than 1 GB). How can I read these in Python (especially with pandas and Dask dataframes)? Should I change the compression format to something else?
1 Answer
I believe you should be able to open the file using
import lzma
import pandas as pd

with lzma.open("myfile.7z", "r") as f:
    df = pd.read_csv(f, ...)
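This is, strictly speaking, meant for the xz file format, and it may not work for 7z, which is an archive container that can hold several files rather than a single compressed stream. If it fails, you will need to use libarchive.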
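To find out which case applies before handing a file to lzma, one option is to look at its first few bytes. The sketch below uses only the standard library; the helper names sniff_format and open_xz_text are made up for illustration, and it assumes the file starts with the standard xz or 7z signature.

```python
import lzma

# Leading signature bytes of each format.
XZ_MAGIC = b"\xfd7zXZ\x00"
SEVENZ_MAGIC = b"7z\xbc\xaf\x27\x1c"


def sniff_format(path):
    """Return 'xz', '7z', or 'unknown' based on the file's first six bytes."""
    with open(path, "rb") as f:
        head = f.read(6)
    if head == XZ_MAGIC:
        return "xz"
    if head == SEVENZ_MAGIC:
        return "7z"
    return "unknown"


def open_xz_text(path):
    """Open an xz file in text mode; this helper refuses anything that is not xz."""
    kind = sniff_format(path)
    if kind != "xz":
        raise ValueError(f"{path} looks like {kind!r}; this helper only handles xz")
    return lzma.open(path, "rt", encoding="utf-8")
```

A file reported as '7z' is a real archive and needs an archive reader such as libarchive; a file reported as 'xz' can be passed to pd.read_csv through open_xz_text.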
For use with Dask, you can do the above for each file with dask.delayed.
dd.read_csv also lets you pass compression='xz' directly; however, random access within a compressed file is likely to be inefficient at best, so you should add blocksize=None to force one partition per file:
import dask.dataframe as dd

df = dd.read_csv('myfiles.*.7z', compression='xz', blocksize=None)

mdurant