
I have a requirement in which I need to create a sequence file. Right now we have written a custom API on top of the Hadoop API, but since we are moving to Spark, we have to achieve the same thing using Spark. Can this be achieved using Spark DataFrames?

mahan07
  • I don't know if there is a function to write sequence files using the DataFrame API, but you can always get the RDD of a DataFrame and then use the rdd.saveAsSequenceFile method to achieve what you want. – gasparms Nov 27 '16 at 18:03
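
A minimal sketch of that suggestion, assuming an existing DataFrame `df` with a single string column (the output path is hypothetical):

import org.apache.hadoop.io.{NullWritable, Text}

// Convert the DataFrame to an RDD of (key, value) Writable pairs, then save.
// SequenceFiles always store key/value pairs, so NullWritable serves as a placeholder key.
df.rdd
  .map(row => (NullWritable.get(), new Text(row.getString(0))))
  .saveAsSequenceFile("/tmp/df-as-sequencefile") // hypothetical path

The implicit conversion that adds saveAsSequenceFile to pair RDDs is picked up automatically in Spark 1.3+.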

1 Answer


AFAIK there is no native API available directly on DataFrame, apart from the approach below.


Please try something like the example below (which works on the RDD underlying the DataFrame, and is inspired by SequenceFileRDDFunctions.scala and its saveAsSequenceFile method):

Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.SequenceFileRDDFunctions
import org.apache.hadoop.io.{NullWritable, Text}

object driver extends App {

   val conf = new SparkConf()
        .setAppName("HDFS writable test")
   val sc = new SparkContext(conf)

   // hypothetical stand-in for the original (undefined) Generator helper;
   // any function producing an Iterator of Writable-convertible values works here
   object Generator {
      def generate(rows: Iterator[Any]): Iterator[Text] =
         Iterator.tabulate(100)(i => new Text(s"record $i"))
   }

   // start from 10 empty partitions and fill each one via the generator
   val empty = sc.emptyRDD[Any].repartition(10)

   // pair every value with a NullWritable key: a SequenceFile stores (key, value) pairs
   val data = empty.mapPartitions(Generator.generate).map { (NullWritable.get(), _) }

   val seq = new SequenceFileRDDFunctions(data)

   // seq.saveAsSequenceFile("/tmp/s1", None)

   seq.saveAsSequenceFile(s"hdfs://localdomain/tmp/s1/${new scala.util.Random().nextInt()}", None)
   sc.stop()
}
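
Note that the NullWritable key means only the values carry data, and saveAsSequenceFile writes one part file per partition (ten here, because of repartition(10)).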

For further information, please see the SequenceFileRDDFunctions scaladoc.

Ram Ghadiyaram
  • I agree with gasparms, please check my answer. – Ram Ghadiyaram Nov 27 '16 at 18:47
  • Was it helpful to you? – Ram Ghadiyaram Nov 28 '16 at 05:41
  • Thanks Ram, my requirement is that I have to create a sequence file from DataFrameWriter; is there any way to achieve that? – mahan07 Nov 29 '16 at 07:30
  • AFAIK there is no direct `DataFrame` API available, which is what I said in the first line of my answer. – Ram Ghadiyaram Nov 29 '16 at 10:43
  • Hi, did you find anything apart from the RDD approach? If yes, please share your thoughts. – Ram Ghadiyaram Nov 30 '16 at 15:05
  • I did a bit of research to find out whether anything else exists in the DataFrame API, but found nothing native: write.format methods are available for Parquet and ORC, but not for sequence files. Please see [DataFrameWriter](https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/sql/DataFrameWriter.html) – Ram Ghadiyaram Dec 01 '16 at 04:02
  • Hi, right now I am doing it like this in ORC, but I need a file of keys and values; is there any way to add a key and a value in ORC files? `DataFrame Frame11 = hiveContext.createDataFrame(record, schema); String[] partitionColumns = {"time"}; Frame11.registerTempTable("tmp"); DataFrame frame = hiveContext.sql("select * from tmp order by time"); frame.explain(true); frame.write().partitionBy(partitionColumns).orc(output_Path);` – mahan07 Dec 01 '16 at 11:31
  • I don't have experience with ORC. The only thing I can see is that DataFrameWriter supports ORC and Parquet, but not sequence files natively. – Ram Ghadiyaram Dec 01 '16 at 11:49
  • Since you are using Hive, I would suggest Parquet (with Avro generic records); this works really well with Spark + Hive. For example, if you want to use Impala, Parquet has good compatibility with it (see the sketch below). – Ram Ghadiyaram Dec 01 '16 at 11:50
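
A minimal sketch of that last suggestion, rewriting the ORC snippet from the comment above to Parquet in Scala (`hiveContext`, `record`, `schema`, and `output_Path` are assumed from that snippet):

// Same pipeline as the ORC snippet in the comments, but writing Parquet,
// which DataFrameWriter supports natively.
val frame11 = hiveContext.createDataFrame(record, schema)
frame11.registerTempTable("tmp")
val frame = hiveContext.sql("select * from tmp order by time")
frame.write.partitionBy("time").parquet(output_Path)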