
If I have a csv file that's too large to load into memory with pandas (in this case 35 GB), I know it's possible to process the file in chunks with `chunksize`.

However, I want to know if it's possible to change the chunk size based on values in a column.

I have an ID column, and then several rows for each ID with information, like this:

ID,   Time,  x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
etc...

I don't want to separate IDs into different chunks. For example, chunks of size 4 would be processed as:

ID,   Time,  x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3   <-- this extra row is included in the chunk of 4

ID,   Time,  x, y
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
...

Is it possible?

If not, perhaps I could use the csv library with a for loop along the lines of:

import csv

x, curid, rows = 0, None, []
with open("my_file.csv", newline="") as f:  # path to the 35 GB csv
    for line in csv.reader(f):
        x += 1
        if x > 1000000 and curid != line[0]:
            break
        curid = line[0]
        rows.append(line)  # code to append line to a dataframe

although I know this would only create one chunk, and for loops take a long time to process.
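
Roughly, the behaviour I'm after would be something like this untested sketch using `itertools.groupby` (`iter_id_chunks` is just a name I made up, and it assumes the ID is the first column and the file has a header row):

import csv
import pandas as pd
from itertools import groupby

def iter_id_chunks(path, min_rows=1000000):
    # yield DataFrames of at least min_rows rows, never splitting an ID
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = []
        for _, id_rows in groupby(reader, key=lambda r: r[0]):
            rows.extend(id_rows)
            if len(rows) >= min_rows:
                yield pd.DataFrame(rows, columns=header)
                rows = []
        if rows:
            yield pd.DataFrame(rows, columns=header)
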

Josh Kidd
  • Where is your file stored? One way would be to filter on ID beforehand (maybe, if you're pulling a database extract, using `SELECT * FROM ... WHERE ID == BLAH`), creating a different file for each unique ID value. – blacksite Feb 14 '17 at 14:34
  • Unfortunately that's not possible, I don't have access to the database – Josh Kidd Feb 14 '17 at 14:51
  • This might be a PITA, but I think it'd work: what if you tried using `chunksize` right now, streaming through the entire 35 GB file, and creating an individual CSV for each unique value of ID (`set(df['ID'])`)? Then, for each row in your larger file, you write (read: append) that row to the existing ID file corresponding to that row's ID? It wouldn't be too hard to code (a rough sketch is below), although I'm doubtful about how long it might take... Just a thought! Otherwise, I'm afraid I can't help any further. – blacksite Feb 14 '17 at 14:59
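
A minimal sketch of the per-ID split described in the comment above (the ID column is assumed to be named `ID`, the 35 GB file is called `big_file.csv` purely for illustration, and `by_id/` is an arbitrary output directory):

import os
import pandas as pd

os.makedirs("by_id", exist_ok=True)

for chunk in pd.read_csv("big_file.csv", chunksize=1000000):
    for id_value, group in chunk.groupby("ID"):
        path = os.path.join("by_id", "{}.csv".format(id_value))
        # append to this ID's file; write the header only when the file is first created
        group.to_csv(path, mode="a", header=not os.path.exists(path), index=False)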

2 Answers


If you iterate through the csv file line by line, you can yield chunks with a generator dependent on any column.

Working example:

import pandas as pd

def iter_chunk_by_id(file):
    csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
    first_chunk = csv_reader.get_chunk()
    current_id = first_chunk.iloc[0, 0]
    chunk = first_chunk
    for l in csv_reader:
        if current_id == l.iloc[0, 0]:
            # same id as the current chunk: keep accumulating
            chunk = pd.concat([chunk, l])
            continue
        # id changed: hand back the finished chunk and start a new one
        current_id = l.iloc[0, 0]
        yield chunk
        chunk = l
    yield chunk

## data.csv ##
# 1, foo, bla
# 1, off, aff
# 2, roo, laa
# 2, jkl, xds
# 3, asd, fds
# 3, qwe, tre
# 3, tre, yxc   

chunk_iter = iter_chunk_by_id("data.csv")

for chunk in chunk_iter:
    print(chunk)
    print("_____")

Output:

   0     1     2
0  1   foo   bla
1  1   off   aff
_____
   0     1     2
2  2   roo   laa
3  2   jkl   xds
_____
   0     1     2
4  3   asd   fds
5  3   qwe   tre
6  3   tre   yxc
_____
elcombato
  • So this will create a chunk for each id. How would I instead create larger chunks of size 1000000, and then append the leftover rows of the same id, to ensure rows of the same id aren't split across chunks? – Josh Kidd Feb 17 '17 at 16:55
  • @JoshuaKidd did you follow the function `iter_chunk_by_id()`? You can modify it easily to match your new request: add an `or` condition to the line `if current_id == l.iloc[0, 0]:` so that lines keep being added to the chunk while its length is smaller than 1000000 (a sketch of this is below). (btw: why did you unaccept my answer?) – elcombato Feb 20 '17 at 14:09
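
A minimal sketch of that modification, with `min_chunk_size` as an assumed parameter name; rows of one id are still kept together, but a chunk is only yielded once it holds at least `min_chunk_size` rows:

import pandas as pd

def iter_chunk_by_id(file, min_chunk_size=1000000):
    csv_reader = pd.read_csv(file, iterator=True, chunksize=1, header=None)
    first_chunk = csv_reader.get_chunk()
    current_id = first_chunk.iloc[0, 0]
    chunk = first_chunk
    for l in csv_reader:
        # keep accumulating while the id is unchanged
        # or the chunk is still below the minimum size
        if current_id == l.iloc[0, 0] or len(chunk) < min_chunk_size:
            current_id = l.iloc[0, 0]
            chunk = pd.concat([chunk, l])
            continue
        current_id = l.iloc[0, 0]
        yield chunk
        chunk = l
    yield chunk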

I built on the answer provided by @elcombato to take any chunk size. I actually had a similar use case, and processing each line one by one made my program unbearably slow.

import numpy as np
import pandas as pd

def iter_chunk_by_id(file_name, chunk_size=10000):
    """Generator that reads the csv in blocks of chunk_size rows and
    yields a DataFrame holding all records of one id at a time."""
    csv_reader = pd.read_csv(file_name, compression='gzip', iterator=True,
                             chunksize=chunk_size, header=0, on_bad_lines='skip')
    chunk = pd.DataFrame()
    for l in csv_reader:
        # the id is the first '|'-separated field of 'col_name'
        l[['id', 'everything_else']] = l['col_name'].str.split('|', n=1, expand=True)
        # positions within this block where the id changes from one row to the next
        hits = np.nonzero(l['id'].astype(float).diff().dropna().to_numpy())[0]
        if not len(hits):
            # all ids in this block are the same: keep accumulating
            chunk = pd.concat([chunk, l[['col_name']]])
        else:
            start = 0
            for i in range(len(hits)):
                new_id = hits[i] + 1
                chunk = pd.concat([chunk, l[['col_name']].iloc[start:new_id, :]])
                yield chunk
                chunk = pd.DataFrame()
                start = new_id
            # the tail of this block belongs to the next id
            chunk = l[['col_name']].iloc[start:, :]

    yield chunk
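
A hypothetical call, assuming a gzipped file named `data.csv.gz` whose `col_name` values start with a numeric id followed by `|`:

for chunk in iter_chunk_by_id("data.csv.gz", chunk_size=100000):
    print(len(chunk))  # each yielded chunk holds all rows of one id
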
Fizi