3

I have a CSV file that doesn't fit into my system's memory. Using Pandas, I want to read a small number of rows scattered all over the file.

I think that I can accomplish this without pandas following the steps here: How to read specific lines of a large csv file

In pandas, I am trying to use skiprows to select only the rows that I need.

# FILESIZE is the number of lines in the CSV file (~600M)
# rows2keep is an np.array with the line numbers that I want to read (~20)

rows2skip = (row for row in range(0,FILESIZE) if row not in rows2keep)
signal = pd.read_csv('train.csv', skiprows=rows2skip)

I would expect this code to return a small dataframe pretty fast. However, what it does is consume more and more memory over several minutes until the system becomes unresponsive. I'm guessing that it is reading the whole file first and will only get rid of the rows2skip rows afterwards.

Why is this implementation so inefficient? How can I efficiently create a dataframe with only the lines specified in rows2keep?

ontheway
  • 125
  • 2
  • 7

2 Answers

3

Try this:

train = pd.read_csv('file.csv', iterator=True, chunksize=150000)

If you only want to read the first n rows:

train = pd.read_csv(..., nrows=n)

If you only want to read the 100 rows starting at row n (note that an integer skiprows would also skip the header, so skip line numbers 1 through n instead):

train = pd.read_csv(..., skiprows=range(1, n + 1), nrows=100)
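Putting the chunked option together with the original question, one way is to iterate over the chunks and keep only the rows whose global line number is in rows2keep. This is a sketch, not a drop-in for the asker's code; the helper name and the chunksize are my own choices:

```python
import pandas as pd

def read_selected_rows(path, rows2keep, chunksize=150_000):
    """Stream the CSV in chunks and keep only the requested 0-based data rows."""
    wanted = set(rows2keep)                # set gives O(1) membership tests
    pieces = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # The default RangeIndex keeps counting across chunks, so
        # chunk.index holds global data-row numbers.
        hits = chunk[chunk.index.isin(wanted)]
        if not hits.empty:
            pieces.append(hits)
    return pd.concat(pieces)
```

Only one chunk is in memory at a time, so peak memory is bounded by chunksize rather than by the file size.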
Hrithik Puri
  • 286
  • 1
  • 3
  • 20
  • Thanks, this is actually what I am trying next :) But now I still need to filter the rows in the chunks somehow. I was wondering whether there is a more direct way? – ontheway May 14 '19 at 08:22
  • 1
    What I mean is apply the same rows2skip function to chunks i.e., train = pd.read_csv('file.csv', iterator=True, chunksize=150000, skiprows=rows2skip) – Hrithik Puri May 14 '19 at 10:52
  • I would expect this to work (a mixture of chunksize and skiprows), but it doesn't - it clogs memory the same. It seems that pandas is still somehow trying to load everything at once if we specify any rows to skip. – ontheway May 18 '19 at 10:18
0

chunksize should help in limiting the memory usage. Alternatively, if you only need a few lines, a possible way is to first read the required lines outside of pandas and then feed read_csv with only that subset. Code could be:

import io

lines = [line for i, line in enumerate(open('train.csv')) if i in lines_to_keep]
signal = pd.read_csv(io.StringIO(''.join(lines)))
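Another option worth sketching: read_csv also accepts a callable for skiprows, which pandas evaluates lazily per line number instead of materializing ~600M skip indices up front. The helper below is my own illustration (assuming a pandas version recent enough to support callable skiprows, 0.19+); note that the callable sees physical line numbers, so line 0 is the header and data row r lives on line r + 1:

```python
import pandas as pd

def read_rows(path, rows2keep):
    """Read only the given 0-based data rows via a callable skiprows."""
    # Keep line 0 (the header) plus each requested data row, shifted by one.
    keep_lines = {0} | {r + 1 for r in rows2keep}
    return pd.read_csv(path, skiprows=lambda i: i not in keep_lines)
```

Pandas still has to scan the whole file line by line, but memory stays flat because only the kept rows are ever parsed into the dataframe.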
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252