
I have many .7z files, each containing many large CSV files (more than 1 GB). How can I read these in Python, especially with pandas and Dask DataFrames? Should I switch to a different compression format?

Eghbal

1 Answer


I believe you should be able to open the file using

import lzma
import pandas as pd

with lzma.open("myfile.7z", "r") as f:
    df = pd.read_csv(f)  # add any read_csv options you need

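This is, strictly speaking, meant for the xz file format. A .7z file is an archive container that can hold several files rather than a single compressed stream, so lzma.open is unlikely to read it directly. If it fails, you will need a library that understands the container, such as libarchive.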
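As a rough sketch of the libarchive route, assuming the libarchive-c Python bindings (and the underlying libarchive shared library) are installed: the helper name read_csvs_from_archive is made up for illustration, and it simply loads every member whose name ends in .csv. Note that it buffers each member fully in memory before parsing, which may be too much for multi-gigabyte CSVs.

```python
import io

import libarchive  # assumption: the libarchive-c bindings
import pandas as pd


def read_csvs_from_archive(path):
    """Return {member name: DataFrame} for every .csv member of an archive.

    Hypothetical helper, not part of any library.
    """
    frames = {}
    with libarchive.file_reader(path) as archive:
        for entry in archive:
            name = entry.pathname
            if not name.endswith(".csv"):
                continue
            # Members must be consumed in order while iterating the archive,
            # so collect this member's bytes before moving to the next entry.
            data = b"".join(entry.get_blocks())
            frames[name] = pd.read_csv(io.BytesIO(data))
    return frames
```

Something like frames = read_csvs_from_archive("myfile.7z") would then give one DataFrame per CSV in the archive.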

For use with Dask, you can do the above for each file with dask.delayed. If the files can be read as xz, dd.read_csv also lets you pass compression='xz' directly; however, random access within a compressed file is likely to be inefficient at best, so you should add blocksize=None to force one partition per file:

import dask.dataframe as dd

df = dd.read_csv('myfiles.*.7z', compression='xz',
                 blocksize=None)
mdurant