I'm manipulating several files over NFS due to security concerns, and processing is very painful because of slow file I/O. Here is a description of the issue:
- I use pandas in Python to do simple processing on the data, so I call read_csv() and to_csv() frequently.
- Currently, writing a 10 GB CSV file takes nearly 30 minutes, whereas reading it takes about 2 minutes.
- I have enough CPU cores (>20) and memory (50-100 GB).
- It is hard to ask for more network bandwidth.
- I frequently need to access the data in a column-oriented manner. For example, there might be 100M records with 20 columns (most of them numeric), and I frequently read all 100M records but only 3~4 columns' values (see the sketch after this list).
- I've tried HDF5, but it produces a larger file and takes a similar amount of time to write, and it does not provide column-oriented I/O, so I've discarded this option.
- I cannot store the files locally; that would violate many security requirements. I'm working on a virtual machine and the file system is mounted via NFS.
- I repeatedly read a few columns at a time, though not always the same ones. The task is something like data analysis.
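
To make the access pattern concrete, here is a minimal sketch of what I do today with pandas; the paths and column names are placeholders, not my real schema:

```python
import pandas as pd

# Placeholder NFS-mounted path and column names for illustration only.
df = pd.read_csv(
    "/mnt/nfs/data.csv",
    usecols=["price", "volume", "ts"],          # only the 3-4 columns I actually need
    dtype={"price": "float64", "volume": "int64"},
)

# ... simple processing with pandas ...

# Writing the full ~10 GB result back over NFS is the slow part (~30 min).
df.to_csv("/mnt/nfs/result.csv", index=False)
```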
Which approaches should I consider? In several cases, I use sqlite3 to manipulate the data in a simple way and export the results to CSV files. Can I accelerate these I/O tasks by using sqlite3 in Python? If it provides column-wise operations, it would be a good solution, I reckon.
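
For reference, this is roughly how I use sqlite3 today; the table name, column names, and paths are made up for illustration:

```python
import csv
import sqlite3

# Rough sketch of my current sqlite3 usage; "records", the column names,
# and the paths are placeholders.
con = sqlite3.connect("/mnt/nfs/data.sqlite3")

# Select only the columns I need for the export.
cur = con.execute("SELECT price, volume, ts FROM records")

with open("/mnt/nfs/export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur)                                 # stream rows to the CSV

con.close()
```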