
I have been trying to implement the pyarrow code below in Java but could not find anything. Can you please suggest whether it is even possible to implement this with Arrow in Java, or whether there is an alternative library that can achieve it?

import pyarrow.dataset as ds
import pyarrow.parquet as pq

table1 = pq.read_table('/Users/some-user/Downloads/' + file_name + '.parquet')

ds.write_dataset(table1, base_dir='/Users/some-user/hive', partitioning=['column'], partitioning_flavor='hive', max_partitions=10000, format='parquet', use_threads=True, existing_data_behavior='delete_matching')

1 Answer

On the Arrow Java side, you can use the Dataset module, which offers read capabilities for Parquet files. Write support is still under development, based on the PR that is currently open.

On the Spark side, you can use this GitHub example to see how this could be implemented. Based on that example, your code could look something like this:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkRecipe {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
        .builder()
        .appName("RW-with-partition")
        .config("spark.master", "local")
        .getOrCreate();
    // File at: https://github.com/apache/spark/blob/a92ef00145b264013e11de12f2c7cee62c28198d/examples/src/main/resources/users.parquet
    Dataset<Row> usersDF = spark.read().load("src/main/resources/parquet/users.parquet");
    usersDF.printSchema();
    /*
    root
     |-- name: string (nullable = true)
     |-- favorite_color: string (nullable = true)
     |-- favorite_numbers: array (nullable = true)
     |    |-- element: integer (containsNull = true)
     */
    usersDF.show();
    /*
    +------+--------------+----------------+
    |  name|favorite_color|favorite_numbers|
    +------+--------------+----------------+
    |Alyssa|          null|  [3, 9, 15, 20]|
    |   Ben|           red|              []|
    +------+--------------+----------------+
     */
    usersDF
        .write()
        .partitionBy("favorite_color")
        .format("parquet")
        .save("src/main/resources/parquet/partbycolo/names.parquet");
    spark.stop(); // release Spark resources once the write has finished
  }
}
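
For reference, both `partitioning_flavor='hive'` in pyarrow and `partitionBy` in Spark lay files out in `column=value` directories under the base path. Below is a minimal sketch, using only the JDK, of which directory a given value would land in. `HivePathSketch` and `partitionDir` are hypothetical names introduced here for illustration and are not part of Spark or Arrow. The sketch assumes the default directory name for null values, `__HIVE_DEFAULT_PARTITION__`, and does not escape special characters in values.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class HivePathSketch {
  // Directory name assumed for null partition values (the default fallback).
  static final String NULL_PARTITION = "__HIVE_DEFAULT_PARTITION__";

  // Hypothetical helper: builds <baseDir>/<column>=<value>.
  // Escaping of special characters in the value is not handled here.
  static Path partitionDir(String baseDir, String column, String value) {
    String dirValue = (value == null) ? NULL_PARTITION : value;
    return Paths.get(baseDir, column + "=" + dirValue);
  }

  public static void main(String[] args) {
    // Directory for rows where favorite_color is "red".
    System.out.println(partitionDir("/Users/some-user/hive", "favorite_color", "red"));
    // Directory for rows where favorite_color is null.
    System.out.println(partitionDir("/Users/some-user/hive", "favorite_color", null));
  }
}
```

With the `users.parquet` sample above, Ben's row would go under `favorite_color=red`, and Alyssa's row, whose `favorite_color` is null, would go under the null fallback directory.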

Please let us know if this works on your side.