
I could find many answers online using sparklyr or various Spark packages, but they all require spinning up a Spark cluster, which is an overhead. In Python this can be done with `pandas.read_parquet` or Apache Arrow — I am looking for something similar in R.

zero323
Gerg
  • You may also be able to use Apache Arrow for this in the future. There is a pull request to build R bindings for it: https://github.com/apache/arrow/pull/1815 Using them, you should be able to load Parquet files in R without Spark. – Uwe L. Korn May 13 '18 at 16:05
  • @xhochy Sounds great. But other than that, do you think there is anything we can use now? – Gerg May 13 '18 at 19:03
  • I was using the reticulate package in R to call Python's read_parquet. It actually works pretty well, and reading the file was very fast. The only problem was that it took about 10 times longer to convert the result from a pandas dataframe to an R data frame. So in the end, I can only recommend this approach if performance is not an issue. As a bonus, the files are pretty small if that's a concern (e.g. when loading from S3). It's hard to understand why R is so far behind here. – katsumi Aug 24 '18 at 10:26
  • 1
  • Something like this? https://github.com/elastacloud/parquetr – James Tobin Aug 28 '18 at 04:42

2 Answers


You can simply use the arrow package:

install.packages("arrow")
library(arrow)
read_parquet("myfile.parquet")
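If you only need part of a large file, arrow's `read_parquet` can also read a subset of columns via its `col_select` argument, which avoids materializing the whole file in memory (a minimal sketch; the file and column names are placeholders):

```r
library(arrow)

# Read only the columns you need instead of the whole file
df <- read_parquet("myfile.parquet", col_select = c("id", "value"))

# arrow also provides the inverse operation for writing
write_parquet(df, "subset.parquet")
```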
fc9.30

With reticulate you can use pandas from Python to read Parquet files, which saves you the hassle of running a Spark instance. You may lose some performance in the pandas-to-R conversion until Apache Arrow releases its R bindings, as the comment above mentions.

library(reticulate)
library(dplyr)

pandas <- import("pandas")

read_parquet <- function(path, columns = NULL) {

  # Expand ~ and resolve the path before handing it to Python
  path <- normalizePath(path.expand(path))

  # pandas expects a Python list of column names, not an R vector
  if (!is.null(columns)) columns <- as.list(columns)

  # Returns a pandas DataFrame, which reticulate converts to an R data.frame
  xdf <- pandas$read_parquet(path, columns = columns)

  xdf <- as.data.frame(xdf, stringsAsFactors = FALSE)

  dplyr::as_tibble(xdf)
}

read_parquet(PATH_TO_PARQUET_FILE)
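As a usage sketch (the file path and column names here are placeholders), the helper above can be called with or without a column subset:

```r
# Read the whole file
df <- read_parquet("~/data/myfile.parquet")

# Read only selected columns; pandas skips the rest of the file
df <- read_parquet("~/data/myfile.parquet", columns = c("id", "value"))
```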
Jonathan