
I need to read Parquet files stored on HDFS (the Hadoop cluster is Kerberos-protected) from my R program. I came across a couple of packages, but none of them fully meets my needs:

  • rhadoop: It looks like an abandoned project with no further development. Its rhdfs package supports neither Parquet files nor Kerberos.
  • arrow: It seems to be able to read Parquet files, but it offers no HDFS connectivity. (A minimal read sketch follows this list.)
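For reference, a minimal sketch of reading a Parquet file with `arrow` from a path the local filesystem can see; if HDFS were mounted on the client (e.g. via an NFS gateway or FUSE, as one commenter suggests below), the same call might work against the mount point. The `/mnt/hdfs` path is a hypothetical mount location.

```r
library(arrow)

# Read a Parquet file from the local filesystem. If HDFS is exposed
# through a local mount (NFS gateway / FUSE), a path under the mount
# point works the same way. "/mnt/hdfs" is a hypothetical mount.
df <- read_parquet("/mnt/hdfs/data/events/part-00000.parquet")
head(df)
```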

Is there any other library that lets me read Parquet files from HDFS in R?

I'm aware of sparklyr, but as I understand it I would need to install Spark on the machine that runs the Spark driver. Is that correct? My R client is a different machine.

  • Couldn't you just mount HDFS and read it in as a normal file with `arrow`? – thc Sep 25 '19 at 20:25
  • Do you already have Spark installed in your cluster with access to HDFS? If yes, you can use sparklyr and connect to it from a remote machine via Livy (see the sparklyr sketch after this list). – AEF Sep 26 '19 at 11:35
  • Spark is indeed a solution: either from a SparkR shell (i.e. R attached to the JVM running the "driver" of a Spark session), or the other way around as suggested by @AEF. In theory, you could also use a Python bridge such as `reticulate` (https://blog.rstudio.com/2018/03/26/reticulate-r-interface-to-python/) plus Python modules such as `hdfs3` + `fastparquet` (https://fastparquet.readthedocs.io/en/latest/). – Samson Scharfrichter Sep 27 '19 at 08:18
  • ... or `pyarrow` + `libhdfs` (https://arrow.apache.org/docs/python/filesystems.html?highlight=libhdfs%20jni); a `reticulate` + `pyarrow` sketch follows this list. – Samson Scharfrichter Sep 27 '19 at 08:23
  • Note that you can run a Spark driver on _any_ machine that has network access to the Hadoop cluster (HDFS + YARN network ports in "local" mode, plus random ports in "yarn-client" mode). But on a Windows PC it's a bit tricky to set up -- and _very_ tricky with Kerberos authentication because then the Hadoop client libs require some of the server libs which require some "native" DLLs that have no official build, cf. https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-tips-and-tricks-running-spark-windows.html – Samson Scharfrichter Sep 27 '19 at 08:36
  • All things told, the easiest way for you to read the data inside the Parquet files could be a Hive table and `RODBC` (sketch below)... – Samson Scharfrichter Sep 27 '19 at 08:45
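Following up on AEF's Livy suggestion: a minimal sparklyr sketch, assuming a Livy server is already running on the cluster (the endpoint URL, table name, and HDFS path below are placeholders). In this mode Spark runs on the cluster side, so no local Spark installation is needed on the R client; note that in a Kerberized cluster the Livy server itself must be configured for Kerberos/SPNEGO authentication.

```r
library(sparklyr)
library(dplyr)

# Connect through Livy instead of a local Spark installation.
# "http://livy-server:8998" is a hypothetical Livy endpoint.
sc <- spark_connect(master = "http://livy-server:8998", method = "livy")

# Read the Parquet data through Spark, then pull it into an R data frame.
events <- spark_read_parquet(sc, name = "events",
                             path = "hdfs:///data/events.parquet")
events_df <- collect(events)

spark_disconnect(sc)
```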
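Likewise, a rough sketch of the `reticulate` + `pyarrow` route from the comments above, assuming a Python environment with pyarrow and the native `libhdfs` library is reachable from R, and that a valid Kerberos ticket cache exists (the host, port, and paths are placeholders). `pyarrow.hdfs.connect` was the HDFS entry point around the time of this question; newer pyarrow releases deprecate it in favour of `pyarrow.fs.HadoopFileSystem`.

```r
library(reticulate)

# pyarrow reaches HDFS through the libhdfs JNI bridge, so HADOOP_HOME,
# CLASSPATH, and a kinit'ed Kerberos ticket must be in place beforehand.
hdfs <- import("pyarrow.hdfs")
pq   <- import("pyarrow.parquet")

# "namenode" and the ticket-cache path are placeholders for your cluster.
fs <- hdfs$connect(host = "namenode", port = 8020L,
                   kerb_ticket = "/tmp/krb5cc_1000")

# Open the remote file and read it into an R data frame via pandas.
f  <- fs$open("/data/events.parquet")
df <- pq$read_table(f)$to_pandas()
f$close()
```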
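And the Hive route from the last comment, sketched with `RODBC`: it assumes the Parquet files are exposed as a Hive table and that a HiveServer2 ODBC driver with a Kerberos-enabled DSN is configured on the client (the DSN name, database, and table are placeholders).

```r
library(RODBC)

# "HiveKerberosDSN" is a placeholder for an ODBC DSN pointing at
# HiveServer2 and configured for Kerberos authentication.
ch <- odbcConnect("HiveKerberosDSN")

# Query the Hive table that sits on top of the Parquet files.
df <- sqlQuery(ch, "SELECT * FROM mydb.events LIMIT 1000")

odbcClose(ch)
```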

0 Answers