I have been working on a project where I need to read and process very large CSV files (millions of rows) as fast as possible.
I came across this post: https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/ where the author benchmarks different ways of reading a CSV file and the time each one takes. He uses a catDevNull function with the following code:
```python
def catDevNull():
    os.system('cat %s > /dev/null' % fn)
```
This takes the least time of all the methods. I believe the result is independent of the Python version, since the time to read the file stays the same. He then uses a warm-cache method:
```python
def wc():
    os.system('wc -l %s > /dev/null' % fn)
```
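To reproduce the comparison, I timed both functions myself. This is a minimal sketch: the file name fn points at a small throwaway temp file rather than the benchmark's large CSV, and the row count is arbitrary:

```python
import os
import tempfile
import time

# Throwaway sample file standing in for the benchmark's large CSV.
fd, fn = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(fn, "w") as f:
    for i in range(10_000):
        f.write(f"{i},value_{i}\n")

def catDevNull():
    # Pipes the file through cat; the bytes never enter Python.
    os.system('cat %s > /dev/null' % fn)

def wc():
    # Counts lines with wc; again, the data stays outside Python.
    os.system('wc -l %s > /dev/null' % fn)

for func in (catDevNull, wc):
    start = time.perf_counter()
    func()
    print(f"{func.__name__}: {time.perf_counter() - start:.4f}s")

os.remove(fn)
```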
The above two methods are the fastest. Using pandas.read_csv for the task takes less time than the other methods, but it is still slower than those two.
If I assign x = os.system('cat %s > /dev/null' % fn) and check its type, x is just the command's exit status, not the contents of the file.
How does os.system read the file so much faster? And is there a way to access the file's contents after os.system reads it, for further processing?
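To show what I mean, here is a minimal sketch of what I tried (fn is a throwaway temp file). If I understand correctly, subprocess would be the way to capture a command's output in Python, but that brings the data back into the interpreter:

```python
import os
import subprocess
import tempfile

# Throwaway file for illustration.
fd, fn = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(fn, "w") as f:
    f.write("a,b\n1,2\n")

# os.system only returns the command's exit status as an integer;
# the file contents go to /dev/null and never reach Python.
x = os.system('cat %s > /dev/null' % fn)
print(type(x))  # <class 'int'>

# To actually capture the output in Python, use subprocess instead.
data = subprocess.check_output(['cat', fn])
print(data)  # b'a,b\n1,2\n'

os.remove(fn)
```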
I am also curious why reading the file with pandas is so much faster than the other methods compared in the link above.