
I have a dataframe that I'd like to save using Arrow.write().

I can save a subframe of it by omitting one column. But if I leave the column in, I get this error:

ArgumentError: type does not have a definite number of fields

The objects in this column are all 4-Tuples, and their elements are all either empty Tuples or 1- or 2-Tuples of Int64s. Typical examples would be ((1,), (), (2,), ()) and ((1, 2), (), (), ()). If I use Arrays of Arrays rather than Tuples of Tuples, it works just fine. I prefer to use tuples, and I would prefer not to have to process the data before writing and after reading it (note that this also rules out things like using four separate columns -- plus I suspect having 2-tuples, 1-tuples, and empty tuples in the same column would produce the same error).
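For comparison, the array-of-arrays form mentioned above could be produced and undone with a pair of small helpers. This is only a sketch: `to_vectors` and `to_tuples` are hypothetical names introduced here, not part of Arrow or DataFrames, and they are exactly the kind of pre- and post-processing the question hopes to avoid.

```julia
# Hypothetical helpers for illustration only (not from Arrow.jl or DataFrames.jl).
to_vectors(t) = [Int64[e...] for e in t]    # tuple of tuples -> Vector{Vector{Int64}}
to_tuples(v)  = Tuple(Tuple(e) for e in v)  # inverse: vectors back to nested tuples

x = ((1,), (1,), (), ())
y = ((1, 2), (), (), ())

# Every row now has the same concrete type, whatever the inner lengths are.
col = [to_vectors(x), to_vectors(y)]
@show eltype(col)

# The round trip recovers the original tuples.
@show to_tuples(col[1]) == x
@show to_tuples(col[2]) == y
```

The point of the comparison is that a `Vector{Int64}` has one type regardless of its length, whereas each tuple length is a different type.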

I don't really understand the meaning of the error here, so I'm not sure how to fix it. Is there an easy fix? Or do I need to use arrays instead?

Here is a minimal working example which gives me this error:

using Arrow, DataFrames

x = ((1,), (1,), (), ());
y = ((1, 2), (), (), ());
df = DataFrame(col = [x, y]);
Arrow.write("test.arrow", df)

If I use col=[x] or col=[y], it works, so the problem stems from having both tuple shapes in the same vector. Maybe this is a fundamental limitation of Arrow?

More details on the error message: The error message comes from reflection.jl on line 764, in fieldcount(@nospecialize t). This function is called by Arrow's arrowvector (in `arraytypes/struct.jl`). Here is the full function definition:

function arrowvector(::StructKind, x, i, nl, fi, de, ded, meta; kw...)
    len = length(x)
    validity = ValidityBitmap(x)
    T = Base.nonmissingtype(eltype(x))
    data = Tuple(arrowvector(ToStruct(x, j), i, nl + 1, j, de, ded, nothing; kw...) for j = 1:fieldcount(T))
    return Struct{withmissing(eltype(x), namedtupletype(T, data)), typeof(data)}(validity, data, len, meta)
end

fieldcount is called on line 5 of this function, but I don't know what T will be for my use case.
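One way to see what `T` might be is to inspect the element type of the mixed vector in plain Julia, without Arrow. The sketch below assumes that Julia joins the two tuple shapes into a single non-concrete type containing variable-length tuple fields. The exact type may differ between Julia versions, so it is printed rather than assumed. `fieldcount` throws the error from the question for any tuple type whose length is not fixed.

```julia
x = ((1,), (1,), (), ())
y = ((1, 2), (), (), ())

v = [x, y]
T = eltype(v)            # the element type Julia chose for the mixed vector
@show T
@show isconcretetype(T)  # a non-concrete eltype is the warning sign here

# A tuple type of fixed length has a definite number of fields.
@show fieldcount(Tuple{Int64, Int64})

# A tuple type with an unbounded Vararg does not, so fieldcount throws.
err = try
    fieldcount(Tuple{Vararg{Int64}})
    nothing
catch e
    e
end
@show err
```

If the printed `T` has fields such as `Tuple{Vararg{Int64}}`, that would explain why `col=[x]` and `col=[y]` work on their own (each has a concrete tuple type) while the mixed vector does not.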

2 Answers


Probably you need to update your packages, because your problem is not reproducible under the current versions of these packages.

P.S. It is hard to find a good reason to store such a structure in a data frame. Transform your data so that each column has a structure suited to data manipulation (e.g. Int, Float64, ...).

passerby
  • Updating packages didn't help, and using only the necessary packages for my application didn't help either. I will update the question with some traceback info. – Sort of Damocles Jan 22 '22 at 14:02
  • Can you check that my MWE in fact does not give an error on your computer? You claimed the problem was not reproducible before I posted it, and I worry that your answer, which did not solve the problem, is keeping others from considering the question. – Sort of Damocles Jan 22 '22 at 19:19
  • @passerby Nested structures are *very* valuable in many cases. For instance, if you have time-series data from many sources, it can be very nice to store meta-data for the source with a list of observations for that source. This would be handled in a relational table using a star schema, but the nested form is much handier in many cases because it preserves locality ... all of the observations for a single source are contiguous. That, in turn, can improve performance by 1000x both in time and space. – Ted Dunning Jan 22 '22 at 23:04

The problem is fixed by explicitly typing the array before constructing the DataFrame. Here is a fixed working example:

using Arrow, DataFrames

x = ((1,), (1,), (), ());
y = ((1, 2), (), (), ());
T = Union{
    Tuple{Tuple{Int64}, Tuple{Int64}, Tuple{}, Tuple{}},
    Tuple{Tuple{Int64, Int64}, Tuple{}, Tuple{}, Tuple{}}
};
C = T[x, y];
df = DataFrame(col = C);
Arrow.write("test.arrow", df)
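With more than two tuple shapes, writing the Union out by hand gets tedious. A possible generalisation is to build the Union from the concrete types actually present in the data. This is a minimal sketch: `union_typed_vector` is a hypothetical helper, not part of Arrow or DataFrames, and how Arrow behaves with a Union of many member types has not been checked here.

```julia
# Hypothetical helper for illustration only (not from Arrow.jl or DataFrames.jl).
# Builds a vector whose eltype is the Union of the concrete types of its elements.
function union_typed_vector(vals)
    T = Union{map(typeof, vals)...}  # duplicates collapse automatically in a Union
    return collect(T, vals)          # copy into a Vector{T}
end

x = ((1,), (1,), (), ())
y = ((1, 2), (), (), ())

C = union_typed_vector([x, y])
@show eltype(C)
```

The resulting `C` can then be passed to `DataFrame(col = C)` in the same way as the hand-written version above.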
  • So it seems that the problem was a highly non-specific type that the DataFrame had inferred? – Ted Dunning Jan 22 '22 at 23:05
  • Yes, I think that's right. I'm not sure if it was Julia or DataFrames, but that was the source of the problem. The inferred type was NTuple{4, Tuple{Vararg{T,N}}} – Sort of Damocles Jan 24 '22 at 21:59