3

I have a CSV file that doesn't fit into my system's memory. Using Pandas, I want to read a small number of rows scattered all over the file.

I think that I can accomplish this without pandas following the steps here: How to read specific lines of a large csv file

In pandas, I am trying to use skiprows to select only the rows that I need.

# FILESIZE is the number of lines in the CSV file (~600M)
# rows2keep is an np.array with the line numbers that I want to read (~20)

rows2skip = (row for row in range(0,FILESIZE) if row not in rows2keep)
signal = pd.read_csv('train.csv', skiprows=rows2skip)

I would expect this code to return a small dataframe pretty fast. However, what it does is consume more and more memory over several minutes until the system becomes unresponsive. I'm guessing that it is reading the whole file first and will only get rid of the rows2skip rows afterwards.

Why is this implementation so inefficient? How can I efficiently create a dataframe with only the lines specified in rows2keep?

ontheway
  • 125
  • 2
  • 7

2 Answers

3

Try this:

train = pd.read_csv('file.csv', iterator=True, chunksize=150000)

If you only want to read the first n rows:

train = pd.read_csv(..., nrows=n)

If you only want to read the 100 rows starting at row n (note that an integer skiprows would also skip the header, so skip line numbers 1 through n instead):

train = pd.read_csv(..., skiprows=range(1, n + 1), nrows=100)
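Putting the chunked option together with the original question, one way is to iterate over the chunks and keep only the rows whose global line number is in rows2keep. This is a sketch, not a drop-in for the asker's code; the helper name and the chunksize are my own choices:

```python
import pandas as pd

def read_selected_rows(path, rows2keep, chunksize=150_000):
    """Stream the CSV in chunks and keep only the requested 0-based data rows."""
    wanted = set(rows2keep)                # set gives O(1) membership tests
    pieces = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # The default RangeIndex keeps counting across chunks, so
        # chunk.index holds global data-row numbers.
        hits = chunk[chunk.index.isin(wanted)]
        if not hits.empty:
            pieces.append(hits)
    return pd.concat(pieces)
```

Only one chunk is in memory at a time, so peak memory is bounded by chunksize rather than by the file size.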
Hrithik Puri
  • 286
  • 1
  • 3
  • 20
  • Thanks, this is actually what I am trying next :) But now I still need to filter the rows in the chunks somehow. I was wondering whether there is a more direct way? – ontheway May 14 '19 at 08:22
  • 1
    What I mean is apply the same rows2skip function to chunks i.e., train = pd.read_csv('file.csv', iterator=True, chunksize=150000, skiprows=rows2skip) – Hrithik Puri May 14 '19 at 10:52
  • I would expect this to work (a mixture of chunksize and skiprows), but it doesn't - it clogs memory the same. It seems that pandas is still somehow trying to load everything at once if we specify any rows to skip. – ontheway May 18 '19 at 10:18
0

chunksize should help in limiting the memory usage. Alternatively, if you only need a few lines, a possible way is to first read the required lines outside of pandas and then feed read_csv with only that subset. Code could be:

import io

lines = [line for i, line in enumerate(open('train.csv')) if i in lines_to_keep]
signal = pd.read_csv(io.StringIO(''.join(lines)))
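Another option worth sketching: read_csv also accepts a callable for skiprows, which pandas evaluates lazily per line number instead of materializing ~600M skip indices up front. The helper below is my own illustration (assuming a pandas version recent enough to support callable skiprows, 0.19+); note that the callable sees physical line numbers, so line 0 is the header and data row r lives on line r + 1:

```python
import pandas as pd

def read_rows(path, rows2keep):
    """Read only the given 0-based data rows via a callable skiprows."""
    # Keep line 0 (the header) plus each requested data row, shifted by one.
    keep_lines = {0} | {r + 1 for r in rows2keep}
    return pd.read_csv(path, skiprows=lambda i: i not in keep_lines)
```

Pandas still has to scan the whole file line by line, but memory stays flat because only the kept rows are ever parsed into the dataframe.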
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252