If I have a CSV file that's too large to load into memory with pandas (in this case 35 GB), I know it's possible to process the file in chunks using the chunksize parameter.
However, I want to know if it's possible to vary the chunk boundaries based on the values in a column.
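For reference, this is the kind of fixed-size chunking I mean (a minimal sketch; the filename data.csv and the process() function are just placeholders):

import pandas as pd

# read the file in fixed-size pieces; each chunk is a DataFrame of up to 1,000,000 rows
for chunk in pd.read_csv("data.csv", chunksize=1_000_000):
    process(chunk)  # placeholder for whatever per-chunk processing happens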
I have an ID column, and then several rows for each ID with information, like this:
ID, Time, x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
etc...
I don't want to split an ID across different chunks. For example, chunks of size 4 would be processed as:
ID, Time, x, y
sasd, 10:12, 1, 3
sasd, 10:14, 1, 4
sasd, 10:32, 1, 2
cgfb, 10:02, 1, 6
cgfb, 10:13, 1, 3 <-- this extra row is included in the chunk of 4 so the ID isn't split
ID, Time, x, y
aenr, 11:54, 2, 5
tory, 10:27, 1, 3
tory, 10:48, 3, 5
...
Is it possible?
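Something along these lines is what I'm hoping is achievable, carrying the rows for the last ID in each chunk over into the next one (a rough, untested sketch; data.csv and process() are placeholders again):

import pandas as pd

leftover = pd.DataFrame()
for chunk in pd.read_csv("data.csv", chunksize=1_000_000):
    chunk = pd.concat([leftover, chunk], ignore_index=True)
    last_id = chunk["ID"].iloc[-1]
    # hold back the rows for the final ID, since they may continue into the next chunk
    leftover = chunk[chunk["ID"] == last_id]
    process(chunk[chunk["ID"] != last_id])  # placeholder
if not leftover.empty:
    process(leftover)  # the held-back group is complete once the file ends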
If not, perhaps I could use the csv library with a for loop, along the lines of:
import csv

x = 0
curid = None
rows = []
with open("data.csv", newline="") as f:
    for line in csv.reader(f):
        x += 1
        if x > 1000000 and curid != line[0]:
            break
        curid = line[0]
        rows.append(line)  # code to append line to a dataframe
although I know this would only create one chunk, and plain Python for loops take a long time over a file this size.
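If that's the direction to go, I imagine the generalisation would be a generator that only cuts a chunk when the ID changes, roughly like this (untested sketch; the 1,000,000 threshold is arbitrary and id_aware_chunks is just a name I made up):

import csv
import pandas as pd

def id_aware_chunks(path, min_rows=1_000_000):
    # yield DataFrames of at least min_rows rows, never splitting an ID across chunks
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = []
        curid = None
        for line in reader:
            # once the chunk is big enough, only cut when the ID changes
            if len(rows) >= min_rows and line[0] != curid:
                yield pd.DataFrame(rows, columns=header)
                rows = []
            curid = line[0]
            rows.append(line)
        if rows:
            yield pd.DataFrame(rows, columns=header)

But I suspect this would be much slower than letting pandas parse the file itself, which is why I'm asking whether the chunksize route can be made ID-aware.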