How can I append data to each partition of an Arrow.Table created from a GroupedDataFrame in Julia?

Question

I have a GroupedDataFrame GDF1 that I want to save as an Arrow.Table with each subdataframe as a separate partition. I am currently using Arrow.append to achieve this. However, I want to be able to append data to each partition after it is created without having to create a new Arrow.Table.

If I try to append a new GroupedDataFrame GDF2 to the existing Arrow.Table using Arrow.append, it creates a new Arrow.Table with additional partitions instead of appending the data to each original partition.

Here is the code I am currently using:

using Arrow, DataFrames, Dates

GDF1 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,10), Date(2023,4,10)], Time = [3.85, 4.13]), :ID)

File = "oldfilepath"

for i in GDF1
    Arrow.append(File, i)
end

GDF2 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,12), Date(2023,4,12)], Time = [3.87, 4.14]), :ID)

for i in GDF2
    Arrow.append(File, i)
end

Which yields

View = DataFrame(Arrow.Table(File))
4×3 DataFrame
 Row │ ID      Date        Time    
     │ String  Date        Float64 
─────┼─────────────────────────────
   1 │ Eng1    2023-04-10     3.85
   2 │ Eng2    2023-04-10     4.13
   3 │ Eng1    2023-04-12     3.87
   4 │ Eng2    2023-04-12     4.14

I would like the resulting Arrow.Table to retain the initial 2 partitions, with the new data appended to each partition. For example:

4×3 DataFrame
 Row │ ID      Date        Time    
     │ String  Date        Float64 
─────┼─────────────────────────────
   1 │ Eng1    2023-04-10     3.85
   2 │ Eng1    2023-04-12     3.87
   3 │ Eng2    2023-04-10     4.13
   4 │ Eng2    2023-04-12     4.14

Is there a way to update each partition of the Arrow.Table without creating a new Arrow.Table? I want to avoid repeatedly writing the combined DataFrame to a new file because of the file size. In my actual use case GDF1 is very large, so I cannot append GDF2 to GDF1 and then call sort!, since that would require loading the original DataFrame into RAM. Any thoughts on the best way to accomplish this would be greatly appreciated.

I do not think it is possible to achieve what you want using the same file. What I would do is save each group as a separate Arrow.jl file. Then you would be able to append to these individual files. — Bogumił Kamiński, Apr 21 '23 at 20:23
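The one-file-per-group idea from the comment above could look roughly like the sketch below. It is not a tested solution for the large-data case. It assumes each ID value is safe to use as a file name, and the helpers group_file, append_groups and read_group are hypothetical names introduced here for illustration. A new file is written in Arrow stream format (file=false) so that later calls to Arrow.append can add record batches to it.

```julia
using Arrow, DataFrames, Dates

# Hypothetical layout: one Arrow stream file per group, named after the group key.
group_file(dir, id) = joinpath(dir, string(id, ".arrow"))

# Append every subdataframe of `gdf` to the file belonging to its ID.
function append_groups(dir, gdf::GroupedDataFrame)
    mkpath(dir)
    for sdf in gdf
        path = group_file(dir, first(sdf.ID))
        if isfile(path)
            Arrow.append(path, sdf)             # adds a record batch to the existing stream
        else
            Arrow.write(path, sdf; file=false)  # stream format, so it can be appended to later
        end
    end
    return dir
end

# Read a single group back without touching the other groups' files.
read_group(dir, id) = DataFrame(Arrow.Table(group_file(dir, id)))

dir = mktempdir()
GDF1 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,10), Date(2023,4,10)], Time = [3.85, 4.13]), :ID)
GDF2 = groupby(DataFrame(ID = ["Eng1", "Eng2"], Date = [Date(2023,4,12), Date(2023,4,12)], Time = [3.87, 4.14]), :ID)

append_groups(dir, GDF1)
append_groups(dir, GDF2)

# Combined view, grouped by ID in the order the groups are listed here.
View = reduce(vcat, [read_group(dir, id) for id in ["Eng1", "Eng2"]])
```

With this layout only the file for the affected ID is touched on each append, and a single group can be loaded on its own, at the cost of managing several files instead of one.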
