Unable to filter DataFrame created from Arrow table

Question

I have the following function in Julia, which uses Arrow.jl to read an Arrow file from disk and process it:

function getmembershipsdays(fromId, toId)
  memberships = Arrow.Table("HouseholdMemberships.arrow") |> DataFrame
  filter!([:IndividualId] => id -> id >= fromId && id <= toId, memberships)
  ...
end

> Error: ERROR: LoadError: MethodError: no method matching
> deleteat!(::Arrow.Primitive{Int64,Array{Int64,1}}, ::Array{Int64,1})

The DataFrame has the following structure:
123226x10 DataFrame
Row | MembershipId | IndividualId | HouseholdId | ...
    | Int64        | Int64        | Int64       |

The rest of the code in the function, which steps through the DataFrame, works, but I get this error as soon as I add the filter condition. It is as if the DataFrame columns are not converted to the underlying Julia types.

If I do

m = filter([:IndividualId] => id -> id >= fromId && id <= toId, memberships)

then it works. How do I filter in place?

score 6 · Accepted Answer · answered Jan 04 '21 at 15:47

You are using memory mapping, which means that you cannot resize in place a DataFrame created from an Arrow.jl source. This is the cost you pay for super-fast, zero-copy creation of data frames from an Arrow source.

Why was it designed this way?

Very often you only read data frames (without mutating them), in which case you might want to save the cost of copying data (especially for very large data sets).
It is easy enough to use copying functions in DataFrames.jl to perform a copy (like replacing filter! with filter in your example).
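As a minimal sketch of the copying approach, the function from the question could use filter and rebind the variable instead of resizing in place. The sample data, the temporary file path, and the extra path argument are assumptions made only so the snippet runs on its own; they are not from the original code.

```julia
using Arrow, DataFrames

# Hypothetical sample data standing in for HouseholdMemberships.arrow.
path = joinpath(mktempdir(), "HouseholdMemberships.arrow")
Arrow.write(path, DataFrame(MembershipId = 1:5,
                            IndividualId = [10, 20, 30, 40, 50]))

# `path` is passed in here only to keep the sketch self-contained.
function getmembershipsdays(path, fromId, toId)
    memberships = Arrow.Table(path) |> DataFrame
    # filter (without !) allocates a new data frame, so the Arrow-backed
    # columns never have to be resized in place.
    memberships = filter(:IndividualId => id -> fromId <= id <= toId, memberships)
    return memberships
end

result = getmembershipsdays(path, 20, 40)
```

Rebinding memberships to the filtered result keeps the rest of the function unchanged, at the cost of one copy of the rows that pass the filter.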

See https://bkamins.github.io/julialang/2020/11/06/arrow.html for some more examples (in particular, how to avoid memory mapping by using an IO source instead of a file name as the source).

PS. Note that id >= fromId && id <= toId can be written more simply as fromId <= id <= toId.
