I have a CSV file with an id column. I want to read it, but I need to process all consecutive rows with the same id at the same time. For example, if the chunk size were 2, `df = pd.read_csv("data.csv", chunksize=2)`, I would only read the first two rows of A, whereas I need to process all three at once:
id | feature1 | feature2 |
---|---|---|
A | 1 | 2 |
A | 2 | 2 |
A | 0 | 0 |
B | 0 | 0 |
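To make it concrete, this is roughly the loop I have at the moment (`"data.csv"` stands in for my real file):

```python
import pandas as pd

# Naive fixed-size chunking: with chunksize=2 the first chunk only
# contains two of the three 'A' rows, so the group gets split.
for chunk in pd.read_csv("data.csv", chunksize=2):
    print(chunk)
```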
In a case like this, I'd want the chunk size to grow by 1 so that it catches the remaining row for that id.
The data is ordered; there are no cases where I have 'A', 'B', and then 'A' again. I thought about running a script beforehand just to calculate the chunk sizes, but I'm not sure that's the way to go.
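Something along these lines is what I had in mind; the file name `data.csv`, the column name `id`, and the `print` call are just placeholders for my actual file, column, and processing:

```python
import pandas as pd

# First pass: read only the id column and measure the length of each
# consecutive run of identical ids (the data is already ordered).
ids = pd.read_csv("data.csv", usecols=["id"])["id"]
run_lengths = ids.groupby((ids != ids.shift()).cumsum()).size()

# Second pass: pull exactly one run of rows at a time.
reader = pd.read_csv("data.csv", iterator=True)
for n in run_lengths:
    group = reader.get_chunk(n)  # all consecutive rows for a single id
    print(group)                 # placeholder for my actual processing
```

This reads the file twice, though, which is part of why I'm not sure it's the right approach.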