
I have a CSV file with an id column. I want to read it in chunks, but I need to process all consecutive rows of an id at the same time. For example, if the chunk size were 2, as in df = pd.read_csv("data.csv", chunksize=2), I would only read the first two rows of A, whereas I need to process all 3 of them at the same time.

id feature1 feature2
A 1 2
A 2 2
A 0 0
B 0 0

In a case such as this, I'd want to increase my chunk size by 1 so that it catches the remaining rows of that id.

The data is ordered; there are no cases where I have 'A', 'B' and then 'A' again. I thought about running a script first just to calculate the chunk sizes, but I'm not sure if that's the way to go.
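
To make the mismatch concrete, here is a minimal sketch, assuming the four example rows above are saved as a comma-separated data.csv with the header row id,feature1,feature2:

import pandas as pd

# Read the example file in fixed-size chunks of 2 rows.
for chunk in pd.read_csv("data.csv", chunksize=2):
    print(chunk["id"].tolist())
# Prints ['A', 'A'] and then ['A', 'B']:
# the third 'A' row is cut off from its group and mixed in with 'B'.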

  • Does this answer your question? [Load pandas dataframe with chunksize determined by column variable](https://stackoverflow.com/questions/42228770/load-pandas-dataframe-with-chunksize-determined-by-column-variable) – Ignatius Reilly Aug 14 '22 at 15:40
  • It's close, but besides breaking the entire inner logic, it does so on a case-by-case basis - so there would be tens of thousands of chunk calls... it's way too slow. I wasn't able to get the other comment with a variable chunk size to work at all. – I M Aug 14 '22 at 16:11
  • Take a look at the discussion in the comment section of the [accepted answer](https://stackoverflow.com/a/42229904/15032126). They seem to have a solution for a minimum size of chunk. But yes, lines must be evaluated one at a time. – Ignatius Reilly Aug 14 '22 at 16:26
  • Will it be faster if you read the file twice? First with chunks as big as you can, just to make a list of id counts, and then a second time reading the file with chunks sized as ordered in the list for your consecutive processing. – Ze'ev Ben-Tsvi Aug 15 '22 at 07:12 (a rough sketch of this two-pass idea follows below)
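
The two-pass idea from the last comment could look roughly like this. It is only a sketch under a couple of assumptions: the id column is literally named id, and process is a placeholder for the real per-group work.

import pandas as pd

# First pass: count how many consecutive rows each id has.
# The file is already ordered by id, so each id forms one contiguous run.
sizes = []  # list of (id, row_count) in file order
for chunk in pd.read_csv("train_data.csv", usecols=["id"], chunksize=1_000_000):
    for value, n in chunk.groupby("id", sort=False).size().items():
        if sizes and sizes[-1][0] == value:
            sizes[-1] = (value, sizes[-1][1] + n)  # run continues across a chunk boundary
        else:
            sizes.append((value, n))

# Second pass: pull exactly one id group per get_chunk() call.
reader = pd.read_csv("train_data.csv", iterator=True)
for value, n in sizes:
    group = reader.get_chunk(n)
    # process(group)  # hypothetical per-id processing

Whether this beats a single pass depends on the number of distinct ids: with tens of thousands of tiny groups the per-group get_chunk calls can still dominate, although consecutive (id, count) entries could be merged so that each read covers at least a minimum number of rows.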

1 Answer


Based on the comments suggesting the accepted answer linked above, I slightly changed the code so it works with any chunk size, since the original was incredibly slow on large files, especially when manipulating large segments inside them.


import pandas as pd

csv_path = "train_data.csv"
csv_reader = pd.read_csv(csv_path, iterator=True, chunksize=1, header=None)
csv_reader.get_chunk()  # Discards the header line. Comment this out if there's no header.

size = 200000  # target chunk size in rows

def iter_chunk_by_id(csv_reader):
    # Read one large chunk first, then extend it row by row until the id changes.
    csv_reader.chunksize = size
    chunk = csv_reader.get_chunk()
    current_id = chunk.iloc[-1, 0]
    csv_reader.chunksize = 1

    for piece in csv_reader:
        csv_reader.chunksize = 1
        if current_id == piece.iloc[0, 0]:
            # Same id as the end of the current chunk: keep extending it.
            current_id = piece.iloc[-1, 0]
            chunk = pd.concat([chunk, piece])
            continue
        # The id changed, so the current chunk ends exactly on an id boundary.
        # Switch back to the large size for the next read and start a new chunk.
        current_id = piece.iloc[-1, 0]
        csv_reader.chunksize = size
        yield chunk
        chunk = piece
    yield chunk

chunk_iter = iter_chunk_by_id(csv_reader)

You then use this just like you would normally:

for chunk in chunk_iter:
    do_something(chunk)

This works by first taking one chunk of the arbitrary size and then adding rows one by one until the id of the next row no longer matches the end of the current chunk. A yielded chunk can therefore contain more than one complete id group, but an id is never split across two chunks.

After a chunk is yielded, the reader's chunk size is switched back to the arbitrary size and the process repeats.
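
As a quick sanity check (just a sketch; it assumes a fresh csv_reader built exactly as above, since the generator consumes the reader), you can verify that no id is ever split across two yielded chunks:

prev_last_id = None
for chunk in iter_chunk_by_id(csv_reader):
    # The first id of each chunk must differ from the last id of the previous chunk.
    assert chunk.iloc[0, 0] != prev_last_id, "an id group was split across two chunks"
    prev_last_id = chunk.iloc[-1, 0]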
