I have been working on a project where I need to read and process very large CSV files (millions of rows) as fast as possible.
I came across this post: https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/ where the author benchmarks different ways of reading a CSV file and the time each one takes. He uses a catDevNull function with the following code:
```python
def catDevNull():
    os.system('cat %s > /dev/null' % fn)
```
This takes the least time of all the methods. I believe the result is independent of the Python version, since the time to read the file stays the same. He then uses a warm-cache method:
```python
def wc():
    os.system('wc -l %s > /dev/null' % fn)
```
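To reproduce the comparison, I timed both functions myself. This is a minimal sketch: the file name fn points at a small throwaway temp file rather than the benchmark's large CSV, and the row count is arbitrary:

```python
import os
import tempfile
import time

# Throwaway sample file standing in for the benchmark's large CSV.
fd, fn = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(fn, "w") as f:
    for i in range(10_000):
        f.write(f"{i},value_{i}\n")

def catDevNull():
    # Pipes the file through cat; the bytes never enter Python.
    os.system('cat %s > /dev/null' % fn)

def wc():
    # Counts lines with wc; again, the data stays outside Python.
    os.system('wc -l %s > /dev/null' % fn)

for func in (catDevNull, wc):
    start = time.perf_counter()
    func()
    print(f"{func.__name__}: {time.perf_counter() - start:.4f}s")

os.remove(fn)
```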
The above two methods are the fastest. Using pandas.read_csv for the task takes less time than the other methods, but it is still slower than those two.
If I assign x = os.system('cat %s > /dev/null' % fn) and check its type, x is just the command's exit status, not the contents of the file.
How does os.system read the file so much faster? And is there a way to access the file's contents after os.system reads it, for further processing?
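To show what I mean, here is a minimal sketch of what I tried (fn is a throwaway temp file). If I understand correctly, subprocess would be the way to capture a command's output in Python, but that brings the data back into the interpreter:

```python
import os
import subprocess
import tempfile

# Throwaway file for illustration.
fd, fn = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(fn, "w") as f:
    f.write("a,b\n1,2\n")

# os.system only returns the command's exit status as an integer;
# the file contents go to /dev/null and never reach Python.
x = os.system('cat %s > /dev/null' % fn)
print(type(x))  # <class 'int'>

# To actually capture the output in Python, use subprocess instead.
data = subprocess.check_output(['cat', fn])
print(data)  # b'a,b\n1,2\n'

os.remove(fn)
```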
I am also curious why reading the file with pandas is so much faster than the other methods compared in the link above.