I have a GroupedDataFrame GDF1 that I want to save as an Arrow.Table with each subdataframe as a separate partition. I am currently using Arrow.append to achieve this. However, I want to be able to append data to each partition after it is created without having to create a new Arrow.Table.
If I try to append a new GroupedDataFrame GDF2 to the existing Arrow.Table using Arrow.append, it creates a new Arrow.Table with additional partitions instead of appending the data to each original partition.
Here is the code I am currently using:
using Arrow, DataFrames, Dates

GDF1 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023, 4, 10), Date(2023, 4, 10)], Time = [3.85, 4.13]), :ID)
File = "oldfilepath"
# Append each sub-dataframe to the file as its own partition
for i in GDF1
    Arrow.append(File, i)
end
GDF2 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023, 4, 12), Date(2023, 4, 12)], Time = [3.87, 4.14]), :ID)
for i in GDF2
    Arrow.append(File, i)
end
Reading the file back then yields:
View = DataFrame(Arrow.Table(File))
4×3 DataFrame
 Row │ ID      Date        Time
     │ String  Date        Float64
─────┼─────────────────────────────
   1 │ Eng1    2023-04-10     3.85
   2 │ Eng2    2023-04-10     4.13
   3 │ Eng1    2023-04-12     3.87
   4 │ Eng2    2023-04-12     4.14
I would like the resulting Arrow.Table to retain the initial 2 partitions, with the new data appended to each partition. For example:
4×3 DataFrame
 Row │ ID      Date        Time
     │ String  Date        Float64
─────┼─────────────────────────────
   1 │ Eng1    2023-04-10     3.85
   2 │ Eng1    2023-04-12     3.87
   3 │ Eng2    2023-04-10     4.13
   4 │ Eng2    2023-04-12     4.14
Is there a way to update each partition of the Arrow.Table without creating a new Arrow.Table? I want to avoid repeatedly writing the combined DataFrame to a new file because of the file size. In my actual use case GDF1 is very large, so I cannot append GDF2 to GDF1 and then call sort!, because that would require loading the original DataFrame into RAM. Any thoughts on the best way to accomplish this would be greatly appreciated.