How to read/write partitioned Apache Arrow or Parquet files into/out of Julia

Question

I am trying to read and write a trivial dataset into Julia. The dataset is mtcars, taken from R, with an arbitrarily added column bt with random Boolean values. The file/folder structure (below) was written out using the R arrow package.

The files are laid out as follows:

arr
|-- bt=false
|   `-- part-1.arrow
`-- bt=true
    `-- part-0.arrow

How can I faithfully reproduce the original table in Julia?

What I've tried so far:

Using the Parquet.jl package: the documentation suggests that it should automatically detect a partitioned folder structure for columns of Bool/String/Date type. When I read the data in with read_parquet(path; kwargs), the resulting table does not have the bt column. I also tried setting the column_generator keyword argument to the default, Parquet.dataset_column_generator, but that did not help.
Using Arrow.jl: I cannot find a documented way (unless I have misunderstood) to read a partitioned dataset directly.

R does not generate additional metadata files to store the schema, but I understand that this is optional and not part of the Arrow spec. Is that correct?

Unfortunately, your best bet might be to use `PyCall` and read this dataset with the Python parquet reader — BallpointBen, May 19 '21 at 18:56
That is unfortunate; I will give it a try nonetheless. Thank you. — tinker, May 20 '21 at 08:26
Could you provide link to download this dataset? It would help with trying out what works and what does not. — Matěj Račinský, May 21 '21 at 19:38
Thanks. This is the dataset in Arrow format: https://send.vis.ee/download/18cb5247bc34f898/#ZXfAhzog1OIeX4XhZit22Q — tinker, May 23 '21 at 08:29
There is an issue open for this at Parquet.jl: https://github.com/JuliaIO/Parquet.jl/issues/154 — Merlin, Jul 17 '21 at 20:58

score 1 · Answer 1 · answered Oct 01 '21 at 14:30

Try the following. The Parquet.jl documentation lists this method:

Partitions in a parquet file or dataset can also be iterated over using an iterator returned by the Tables.partitions method.

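using Parquet, DataFrames, Tables
# `path` is the root folder of the partitioned dataset
for partition in Tables.partitions(read_parquet(path))
    df = DataFrame(partition)  # materialize this partition as a DataFrame
    ...
end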

For further reference: https://github.com/JuliaIO/Parquet.jl
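
For the Arrow layout shown in the question, a fallback is to rebuild the partition column by hand from the folder names. The following is only a minimal sketch, not a dataset API provided by Arrow.jl: `partition_values`, `read_partitioned_arrow`, and `write_partitioned_arrow` are hypothetical helper names, and it assumes Hive-style `key=value` folder names, files ending in `.arrow`, and that only the values `true`/`false` should be parsed (to Bool) while everything else stays a String.

```julia
using Arrow, DataFrames

# Hypothetical helper: collect `key=value` folder names between `root` and `dir`.
function partition_values(dir::AbstractString, root::AbstractString)
    found = Pair{Symbol,String}[]
    for part in splitpath(relpath(dir, root))
        kv = split(part, '='; limit=2)
        length(kv) == 2 && push!(found, Symbol(kv[1]) => String(kv[2]))
    end
    return found
end

# Hypothetical helper: read every .arrow file under `root` and
# re-attach the partition columns encoded in the folder names.
function read_partitioned_arrow(root::AbstractString)
    frames = DataFrame[]
    for (dir, _, files) in walkdir(root)
        for file in files
            endswith(file, ".arrow") || continue
            df = DataFrame(Arrow.Table(joinpath(dir, file)))
            for (name, value) in partition_values(dir, root)
                # Assumption: only "true"/"false" are parsed; other values stay Strings.
                parsed = value in ("true", "false") ? parse(Bool, value) : value
                df[!, name] = fill(parsed, nrow(df))
            end
            push!(frames, df)
        end
    end
    isempty(frames) && return DataFrame()
    return reduce(vcat, frames; cols=:union)
end

# Hypothetical helper: write one folder per distinct value of `col`,
# dropping `col` from the files because the folder name carries it.
function write_partitioned_arrow(root::AbstractString, df::DataFrame, col::Symbol)
    for (key, sub) in pairs(groupby(df, col))
        dir = joinpath(root, "$(col)=$(key[1])")  # key[1] is the value of `col`
        mkpath(dir)
        Arrow.write(joinpath(dir, "part-0.arrow"), select(sub, Not(col)))
    end
    return root
end
```

With the layout from the question, `read_partitioned_arrow("arr")` should return a DataFrame that includes `bt`. Row and column order may differ from the original table, since the files are concatenated in directory order and the partition column is appended last.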
