I am trying to read a trivial partitioned dataset into Julia and write it back out. The dataset is mtcars, taken from R, with an arbitrarily added column bt holding random Boolean values. The folder structure below was written using the R arrow package.

The files are laid out as follows:

arr
|-- bt=false
|   `-- part-1.arrow
`-- bt=true
    `-- part-0.arrow

How can I faithfully reproduce the original table in Julia?

What I've tried so far:

  1. Using the Parquet.jl package. The documentation suggests that it should automatically detect a partitioned folder structure for columns of bool/string/date type. When I read the data with read_parquet(path; kwargs), the resulting table does not have the bt column. I also tried setting the column_generator keyword argument to its default, Parquet.dataset_column_generator, but this did not work.

  2. Using Arrow.jl. I cannot find a documented way (unless I have misunderstood something) to read a partitioned dataset directly.

R does not generate additional metadata files to store the schema, but I understand this is optional and not part of the arrow spec?

tinker

1 Answer

Try the following. The Parquet.jl documentation describes this approach:

Partitions in a parquet file or dataset can also be iterated over using an iterator returned by the Tables.partitions method.

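using Parquet, DataFrames
using Tables  # brings the Tables module into scope for Tables.partitions

# path is the location of the Parquet file or dataset directory
for partition in Tables.partitions(read_parquet(path))
    df = DataFrame(partition)  # materialize this partition as a DataFrame
    # ...
end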

For further reference: https://github.com/JuliaIO/Parquet.jl
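
Since the files in the question end in .arrow rather than .parquet, another option is to stitch the partitions together by hand with Arrow.jl. The following is only a minimal sketch and not a built-in Arrow.jl feature: the helper name read_partitioned_arrow is hypothetical, and it assumes that every partition directory is named key=value, that the data files end in .arrow, and that Arrow.jl can read them.

```julia
using Arrow, DataFrames

# Hypothetical helper (not part of Arrow.jl): rebuild one table from a
# directory tree such as arr/bt=false/part-1.arrow.
function read_partitioned_arrow(root::AbstractString)
    parts = DataFrame[]
    for (dir, _, files) in walkdir(root)
        for file in files
            endswith(file, ".arrow") || continue
            # copycols=true materializes ordinary Julia vectors
            df = DataFrame(Arrow.Table(joinpath(dir, file)); copycols=true)
            # Each "key=value" directory between root and the file becomes a column
            for segment in splitpath(relpath(dir, root))
                occursin("=", segment) || continue
                key, value = split(segment, "="; limit=2)
                # Assumption: values that parse as Bool become Bool,
                # everything else stays a String
                parsed = something(tryparse(Bool, value), String(value))
                df[!, Symbol(key)] = fill(parsed, nrow(df))
            end
            push!(parts, df)
        end
    end
    isempty(parts) && return DataFrame()
    # cols=:union tolerates partitions whose columns differ
    return vcat(parts...; cols=:union)
end
```

Calling read_partitioned_arrow("arr") on the layout from the question should then return a single DataFrame with bt appended as the last column. Rows come back grouped by partition, so the original row order of mtcars is not restored.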