
I'm manipulating several files over NFS due to security constraints, and processing is very painful because of the slow file I/O. Here is a description of the situation:

  • I use pandas in Python to do simple processing on the data, so I use read_csv() and to_csv() frequently.
  • Currently, writing a 10 GB CSV file takes nearly 30 minutes, whereas reading it takes about 2 minutes.
  • I have plenty of CPU cores (> 20) and memory (50-100 GB).
  • It is hard to ask for more network bandwidth.
  • I frequently need to access the data in a column-oriented manner. For example, there are about 100M records with 20 columns (most of them numeric), and I often read all 100M records but only 3-4 columns' values (a sketch of this access pattern follows this list).
  • I've tried HDF5, but it produced a larger file and took a similar time to write, and it does not provide column-oriented I/O, so I've discarded this option.
  • I cannot store the data locally; that would violate several security requirements. I'm working on a virtual machine whose file system is mounted via NFS.
  • Some columns I read repeatedly, others only once. The task is essentially data analysis.
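
To make the access pattern concrete, here is roughly what my current workflow looks like (paths and column names below are placeholders):

    import pandas as pd

    # Read only the 3-4 columns I need out of ~20 (names are placeholders).
    cols = ["user_id", "timestamp", "value"]
    df = pd.read_csv("/mnt/nfs/data.csv", usecols=cols)

    # ... simple processing ...
    result = df.groupby("user_id")["value"].mean().reset_index()

    # Writing back to the NFS mount is the slow part (~30 minutes for 10 GB).
    result.to_csv("/mnt/nfs/result.csv", index=False)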

Which approaches can I consider? In some cases I use sqlite3 to manipulate data in a simple way and export the results to CSV files. Can I accelerate these I/O tasks by using sqlite3 in Python? If it provides column-wise operations, it would be a good solution, I reckon.
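
A rough sketch of the sqlite3 variant I have in mind (the database path, table and column names are placeholders); the question is whether selecting only a few columns this way would actually be faster than parsing the whole CSV:

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("/mnt/nfs/data.db")

    # Pull just the columns I need instead of the full 20-column table.
    df = pd.read_sql_query(
        "SELECT user_id, timestamp, value FROM records", conn
    )

    # Export the result, as I do today.
    df.to_csv("/mnt/nfs/result.csv", index=False)
    conn.close()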

overfit
  • Do you process the data repeatedly? Could you store it on local disk temporarily for the duration of this processing? It's not clear what you would use SQLite for or why you think it could help. – tripleee Jan 29 '20 at 03:11
  • @tripleee Thanks for the comment. I've updated the question. – overfit Jan 29 '20 at 03:32
  • If you cannot use local storage then SQLite3 probably won't help much, though an in-memory database might be useful for the analysis itself if you are doing repeated joins or etc. If you have 10x the memory of the input file then just read it once and keep it in core. – tripleee Jan 29 '20 at 03:37
  • @tripleee Thanks a lot. But that seems like a slightly different story. In-memory-ness is not actually a concern in this situation. I can load all the data into memory; the problem is the slowness of file I/O... I want to store the result on the NFS mount. – overfit Jan 29 '20 at 03:46
  • The requirement to use NFS seems non-negotiable and if that's how long it takes, that's how long it takes. – tripleee Jan 29 '20 at 03:49
  • @tripleee If I can update data in a column-wise manner, it could reduce my writing time from 30 minutes to 1-2 minutes. That's what I'm considering. – overfit Jan 29 '20 at 06:53

1 Answer


Two options: pandas HDF5 (HDFStore) or Dask.

  1. You can revisit the HDF5 format with format='table'.

HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by format='table' or format='t' to append or put or to_hdf.
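
A minimal sketch of that idea (file path, key and column names below are made up; requires PyTables): write once with format='table' and data_columns, then read back only the columns you need:

    import numpy as np
    import pandas as pd

    # Toy frame standing in for the real 100M-row table (columns are made up).
    df = pd.DataFrame({
        "user_id": np.arange(1_000_000),
        "value": np.random.rand(1_000_000),
        "other": np.random.rand(1_000_000),
    })

    # format='table' allows appending and querying; data_columns makes those
    # columns individually queryable on disk.
    df.to_hdf("/mnt/nfs/data.h5", key="records", format="table",
              data_columns=["user_id", "value"])

    # Read back only the columns that are actually needed.
    subset = pd.read_hdf("/mnt/nfs/data.h5", key="records",
                         columns=["user_id", "value"])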

  2. You can use Dask's read_csv. It reads the data lazily, only when the computation is actually executed (e.g., via compute()).
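
Something along these lines (the path and column names are placeholders); nothing is read until compute() is called:

    import dask.dataframe as dd

    # Lazily define the read; no data is loaded yet.
    ddf = dd.read_csv("/mnt/nfs/data.csv")

    # Work on only the columns of interest; I/O happens at compute().
    result = ddf[["user_id", "value"]].groupby("user_id").mean().compute()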

Purely for improving I/O performance, I think HDF with a compressed format is best.
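
For example, one possible set of compression options (blosc at a high level; the best complib/complevel for an NFS mount is worth benchmarking):

    import numpy as np
    import pandas as pd

    # Small stand-in frame; in practice this would be the real data.
    df = pd.DataFrame({"user_id": np.arange(1000),
                       "value": np.random.rand(1000)})

    # Compressed, queryable table-format HDF5 file.
    df.to_hdf("/mnt/nfs/data_compressed.h5", key="records", format="table",
              complib="blosc", complevel=9, data_columns=["value"])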

Yong Wang
  • 1. I already tried HDFStore with format='table', but there was no improvement at all. 2. This solution also has no effect, since it still has to traverse every record row-wise. Maybe HDF with compression could be an option, but this time I decided not to go with it because the expected speed-up is not significant. Thanks a lot. – overfit Jan 30 '20 at 05:35
  • When the HDF format is "table", you can improve the I/O performance significantly. For writes, you can append in batches instead of writing the file all at once. For reads, you can select the desired columns or the desired rows (if you have declared the desired columns as data columns). https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#hdf5-pytables – Yong Wang Jan 30 '20 at 06:09
  • I've already profiled the performance of HDF with 'table', but it showed little improvement. If it were possible to append one column to an existing DataFrame in HDFStore, it would be great, though. https://stackoverflow.com/questions/20428355/appending-column-to-frame-of-hdf-file-in-pandas/20428786 However, the above link says that isn't the case. You need to recognize the difference between row-oriented and column-oriented data structures. – overfit Jan 31 '20 at 06:26
  • Writing (appending) is row-based. Reading can query on your conditions, per row or per column. Basically, that should be enough for most cases. Column-based appending is a key-value database solution; isn't that off topic? – Yong Wang Jan 31 '20 at 12:00