
I am working with very large DataFrames in Julia, which results in out-of-memory errors when I do joins and other manipulations on the data. Fortunately the data can be partitioned on an identifier column. I want to persist the partitioned DataFrame using the record batches feature built into Arrow.jl, and then read and process each record batch in turn. I have managed to get the following to work, but I am unable to get the original DataFrame back when reading the data: each column of the DataFrame I read back contains an array of the data from the original partition. I don't know whether my problem is in how I am creating the partitions in the first place or in how I am reading the data back:

using Random
using DataFrames
using Arrow

# Identifier range [fromId, toId] covered by batch number i
function nextidrange(minId, maxId, batchsize, i)
    fromId = minId + batchsize * (i-1)
    toId = min(maxId, (minId + batchsize * i)-1)
    return fromId, toId
end

minId = 1
maxId = 1000
idrange = (maxId - minId) + 1
df = DataFrame(ID=minId:maxId, B=rand(idrange), C=randstring.(fill(5,idrange)));
batchsize = 100
batches = ceil(Int32, idrange / batchsize)
partitions = Array{SubDataFrame}(undef, 0)
for i = 1:batches
    fromId, toId = nextidrange(minId, maxId, batchsize, i)
    # Take a view of the rows whose ID falls within this batch's range
    push!(partitions, filter([:ID] => x -> fromId <= x <= toId, df; view = true))
end
io = IOBuffer()
# Write the array of views, then read it back as a stream of record batches
Arrow.write(io, partitions)
seekstart(io)
batches = Arrow.Stream(io)
for b in batches
    bt = b |> DataFrame
    println("Rows = $(nrow(bt))")
end

For each record batch I am expecting a DataFrame with three columns and 100 rows of data.

Implementation notes: in the actual data there may be gaps in the identifier values. I have considered using JuliaDB, but DataFrames appears to be much better maintained and supported.

Bogumił Kamiński
Kobus Herbst

2 Answers


I have resolved my problem like this:

using Random
using DataFrames
using Arrow
using Tables

function nextidrange(minId, maxId, batchsize, i)
    fromId = minId + batchsize * (i-1)
    toId = min(maxId, (minId + batchsize * i)-1)
    return fromId, toId
end

minId = 1
maxId = 1000
idrange = (maxId - minId) + 1
df = DataFrame(ID=minId:maxId, B=rand(idrange), C=randstring.(fill(5,idrange)));
batchsize = 100
numbatches = ceil(Int32, idrange / batchsize)
partitions = Array{SubDataFrame}(undef, 0)
for i = 1:numbatches 
    fromId, toId = nextidrange(minId, maxId, batchsize, i)
    push!(partitions, filter([:ID] => x -> fromId <= x <= toId, df; view = true))
end
io = IOBuffer()
# Wrapping the vector of views in Tables.partitioner makes Arrow.write emit
# one record batch per partition instead of a single combined table
Arrow.write(io, Tables.partitioner(partitions))
seekstart(io)
recordbatches = Arrow.Stream(io)
ab = Array{DataFrame}(undef, 0)
for b in recordbatches
    bt = b |> DataFrame
    println("Rows = $(nrow(bt))")
    push!(ab, bt)
end

The issue was that the array of DataFrame views had to be wrapped in a call to `Tables.partitioner`; without it, `Arrow.write` does not treat the array elements as separate record batches.
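
For the real workflow, where the whole table is too big for memory, the same fix can write to a file and stream the record batches back one at a time. A minimal sketch, assuming the illustrative file name partitioned.arrow and the partitions array from above:

Arrow.write("partitioned.arrow", Tables.partitioner(partitions))

for b in Arrow.Stream("partitioned.arrow")
    bt = DataFrame(b)              # materialise only the current record batch
    println("Rows = $(nrow(bt))")  # join / process this batch, then move on
end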

Kobus Herbst
  • Yeah, that's the right way to do it (primary Tables.jl and Arrow.jl author here). Did the Arrow.jl docs provide enough direction for your use-case here? I've been thinking we could probably include some more examples for people. – quinnj Jan 25 '21 at 18:02
  • To some extent; the statement "Note you can turn multiple table objects into partitions by doing `Tables.partitioner([tbl1, tbl2, ...])`" was really the only concrete clue. To read batches, I had to delve into the source code and test cases. If I have `n` Arrow files on disk, all with the same schema, how do I concatenate those into a single Arrow file where each file is a record batch? Great libraries btw, appreciated. Busy converting a Java-based data processing pipeline that used to take 10-12 days; it looks like I will get it down to a day using Julia. – Kobus Herbst Jan 26 '21 at 03:44
  • To concatenate existing arrow files w/ the same schema, I'd do something like: ```julia Arrow.write("concatenated.arrow", Tables.partitioner(x->Arrow.Table(x), arrow_files)) ``` where `arrow_files` is an array of arrow files as strings. Note that this "functional" form of `Tables.partitioner` applies the mapping function lazily as each partition is processed, to avoid having to load all the arrow tables at once in memory. – quinnj Jan 27 '21 at 04:31
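
A slightly expanded sketch of the approach described in that last comment, assuming `arrow_files` is a vector of paths to existing Arrow files that all share the same schema (the file names below are illustrative):

using Arrow, Tables

arrow_files = ["part1.arrow", "part2.arrow", "part3.arrow"]  # illustrative paths

# The functional form of Tables.partitioner applies Arrow.Table lazily, so only
# one input table is held in memory while its partition is written out
Arrow.write("concatenated.arrow", Tables.partitioner(x -> Arrow.Table(x), arrow_files))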

I think that beginning with DataFrames v1.5.0 you can partition a DataFrame directly and write it with `Arrow.write(filename, Iterators.partition(df, n))`, as noted by Bogumił Kamiński.
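
A minimal sketch of that approach, assuming DataFrames ≥ 1.5.0 and an illustrative file name and batch size:

using DataFrames, Arrow

df = DataFrame(ID = 1:1000, B = rand(1000))

# Iterators.partition splits the DataFrame into chunks of 100 rows;
# Arrow.write then stores each chunk as its own record batch
Arrow.write("partitioned.arrow", Iterators.partition(df, 100))

for b in Arrow.Stream("partitioned.arrow")
    println(nrow(DataFrame(b)))    # each record batch comes back as a 100-row table
end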

phntm