I have the following function in Julia, which reads an Arrow file from disk (using Arrow.jl) and processes the data:

function getmembershipsdays(fromId, toId)
  memberships = Arrow.Table("HouseholdMemberships.arrow") |> DataFrame
  filter!([:IndividualId] => id -> id >= fromId && id <= toId, memberships)
  ...
end

> Error: ERROR: LoadError: MethodError: no method matching
> deleteat!(::Arrow.Primitive{Int64,Array{Int64,1}}, ::Array{Int64,1})

The DataFrame has the following structure:
123226x10 DataFrame
Row | MembershipId | IndividualId | HouseholdId | ...
    | Int64        | Int64        | Int64       |

The rest of the code in the function, which steps through the DataFrame, works, but I get this error as soon as I add the filter condition. It is as if the DataFrame columns are not converted to the underlying Julia types.

If I do

m = filter([:IndividualId] => id -> id >= fromId && id <= toId, memberships)

then it works. How do I filter in place?

Bogumił Kamiński
Kobus Herbst
1 Answer

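You are using memory-mapping, which means that the DataFrame created from an Arrow.jl source holds Arrow column types (Arrow.Primitive in your error) rather than ordinary Julia vectors. These columns do not support resizing operations such as deleteat!, so you cannot resize the DataFrame in place. This is the cost you pay for super-fast, zero-copy creation of data frames from an Arrow source.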

Why was it designed this way?

  1. Very often you only read data frames (without mutating them), in which case you want to avoid the cost of copying the data (especially for very large data sets).
  2. It is easy enough to use the copying functions in DataFrames.jl to make a copy (for example, replacing filter! with filter in your code).
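As a minimal sketch of the two copying routes, assuming reasonably recent Arrow.jl and DataFrames.jl versions: the temporary file, its two columns, and the names subset_copy and owned are made up for illustration and stand in for your HouseholdMemberships.arrow. The second option relies on copy of the data frame materializing the Arrow columns as ordinary vectors, which is what I would expect but is worth checking on your versions.

```julia
using Arrow, DataFrames

# Hypothetical stand-in for HouseholdMemberships.arrow, written to a temp directory.
path = joinpath(mktempdir(), "memberships.arrow")
Arrow.write(path, (MembershipId = [1, 2, 3, 4, 5, 6],
                   IndividualId = [10, 20, 30, 40, 50, 60]))

fromId, toId = 20, 40

# Zero-copy data frame backed by the Arrow file.
memberships = DataFrame(Arrow.Table(path))

# Option 1: the non-mutating filter allocates a new data frame.
subset_copy = filter(:IndividualId => id -> fromId <= id <= toId, memberships)

# Option 2: copy the data frame first, then filter the copy in place.
owned = copy(memberships)
filter!(:IndividualId => id -> fromId <= id <= toId, owned)
```

Option 1 leaves memberships untouched and is usually enough. Option 2 only makes sense if you want to keep mutating the same data frame afterwards.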

See https://bkamins.github.io/julialang/2020/11/06/arrow.html for some more examples (in particular, how to avoid memory mapping by passing an IO source instead of a file name).

PS. Note that id >= fromId && id <= toId can be written more simply as fromId <= id <= toId.

Bogumił Kamiński