
My need is to read other formats (JSON, binary, XML) and infer the schema dynamically within a transform in Code Repositories, using the Spark datasource API.

Example:

val df = spark.read.json(<hadoop_path>)

For that, I need an accessor to the Foundry file system path, which is something like:

foundry://...@url:port/datasets/ri.foundry.main.dataset.../views/ri.foundry.main.transaction.../startTransactionRid/ri.foundry.main.transaction...

This is possible with the PySpark API (Python):

filesystem = input_transform.filesystem()  # FileSystem object of the input dataset
hadoop_path = filesystem.hadoop_path       # underlying Hadoop path as a string

However, I didn't find a way to do this properly in Java/Scala.

  • Have you tried https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L402? – fmsf Dec 06 '21 at 12:26
  • Yes, I managed to read the files as a Dataset of String and then used the json method to infer the schema, and it works. For XML (with the com.databricks spark-xml library), it doesn't work as expected (maybe I need to add some options). However, my need is more general: how do I get the hadoop path with the Palantir Foundry API in Java? – Mehdi Dec 06 '21 at 13:25
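
For reference, the workaround described in the comment above relies on the DataFrameReader.json(Dataset<String>) overload that the linked source points to. Here is a minimal, self-contained Java sketch of that step; the sample documents are made up, and in a real transform the strings would come from the Foundry files API rather than from a hard-coded list:

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class JsonFromStrings {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("json-from-strings")
                .getOrCreate();

        // Stand-in for file contents read through the Foundry files API:
        // one JSON document per element.
        Dataset<String> raw = spark.createDataset(
                Arrays.asList("{\"id\": 1, \"name\": \"a\"}", "{\"id\": 2, \"name\": \"b\"}"),
                Encoders.STRING());

        // Spark infers the schema from the strings themselves.
        Dataset<Row> df = spark.read().json(raw);
        df.printSchema();

        spark.stop();
    }
}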

1 Answer


The getter for the Hadoop path has recently been added to the Foundry Java API. After upgrading the Java transforms version (transformsJavaVersion >= 1.188.0), you can get it:

val hadoopPath = myInput.asFiles().getFileSystem().hadoopPath()
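
For context, a complete transform built around this accessor might look like the sketch below. Only the hadoopPath() chain comes from the answer itself; the @Compute/@Input/@Output annotations, package and import names, dataset paths, and the getDataFrameWriter(...) call are assumptions following the usual transforms-java layout, so verify them against your transforms-java version:

package myproject.datasets;

import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class InferJsonSchemaTransform {

    @Compute
    public void compute(
            @Input("/Project/folder/raw_json_files") FoundryInput myInput,
            @Output("/Project/folder/parsed") FoundryOutput myOutput) {
        // Requires transformsJavaVersion >= 1.188.0.
        String hadoopPath = myInput.asFiles().getFileSystem().hadoopPath();

        // Reuse the SparkSession the transform is already running in.
        SparkSession spark = SparkSession.builder().getOrCreate();

        // Point the standard Spark datasource API at the dataset's files;
        // the schema is inferred dynamically from the data.
        Dataset<Row> df = spark.read().json(hadoopPath);

        myOutput.getDataFrameWriter(df).write();
    }
}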