I am able to write it into

  • ORC
  • PARQUET

directly, and into

  • TEXTFILE
  • AVRO

using the additional Databricks dependencies below:

    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>1.5.0</version>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.10</artifactId>
        <version>2.0.1</version>
    </dependency>

Sample code:

    SparkContext sc = new SparkContext(conf);
    HiveContext hc = new HiveContext(sc);

    // Load the Hive table and collapse it into a single output file.
    DataFrame df = hc.table(hiveTableName);
    df.printSchema();
    DataFrameWriter writer = df.repartition(1).write();

    if ("ORC".equalsIgnoreCase(hdfsFileFormat)) {
        writer.orc(outputHdfsFile);

    } else if ("PARQUET".equalsIgnoreCase(hdfsFileFormat)) {
        writer.parquet(outputHdfsFile);

    } else if ("TEXTFILE".equalsIgnoreCase(hdfsFileFormat)) {
        // Delimited text via the spark-csv package.
        writer.format("com.databricks.spark.csv").option("header", "true").save(outputHdfsFile);

    } else if ("AVRO".equalsIgnoreCase(hdfsFileFormat)) {
        // Avro via the spark-avro package.
        writer.format("com.databricks.spark.avro").save(outputHdfsFile);
    }

Is there any way to write a DataFrame into a Hadoop SequenceFile or RCFile?


1 Answer

You can use void saveAsObjectFile(String path) to save an RDD as a SequenceFile of serialized objects. So in your case you first have to retrieve the RDD from the DataFrame:

    JavaRDD<Row> rdd = df.javaRDD();
    rdd.saveAsObjectFile(outputHdfsFile);
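
Note that saveAsObjectFile writes Java-serialized objects, so the result is mostly useful for reading back with Spark. If you need a SequenceFile that other Hadoop tools can consume, a minimal sketch (my own addition, not part of the original answer) is to map each Row to Writable key/value types and use saveAsNewAPIHadoopFile; the tab-separated row encoding below is a placeholder assumption:

    // Sketch only: write rows as a standard key/value SequenceFile.
    // The tab-separated encoding of each Row is an assumption; replace
    // it with whatever suits your schema.
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    JavaPairRDD<NullWritable, Text> pairs = df.javaRDD()
            .mapToPair(row -> new Tuple2<NullWritable, Text>(
                    NullWritable.get(), new Text(row.mkString("\t"))));
    pairs.saveAsNewAPIHadoopFile(outputHdfsFile,
            NullWritable.class, Text.class, SequenceFileOutputFormat.class);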
  • I am not completely sure, but I don't think Spark supports writing into RCFiles out of the box, after skimming through the documentation. I suppose you have to use something like Parquet. – nicoring Oct 18 '16 at 15:51
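
Following up on the RCFile comment above: if, as the comment suggests, Spark cannot write RCFiles out of the box, one hedged workaround sketch (not from this thread) is to let Hive do the write, since Hive tables can be STORED AS RCFILE. The table names here are hypothetical placeholders:

    // Sketch only: route the write through Hive, which can create
    // RCFile-backed tables. "tmp_rows" and "rcfile_table" are
    // hypothetical names.
    df.registerTempTable("tmp_rows");
    hc.sql("CREATE TABLE rcfile_table STORED AS RCFILE AS SELECT * FROM tmp_rows");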