I have read a text file in Spark using the command

val data = sc.textFile("/path/to/my/file/part-0000[0-4]")

I would like to add a new line as a header of my file. Is there a way to do that without turning the RDD into an Array?

Thank you!

amarchin
  • Create a new RDD with `val header = sc.parallelize(List("\n"))` and add the two RDDs together: `header ++ data`. But that doesn't make sense; why do you need it? – Nikita Apr 28 '15 at 08:59
  • I'm sorry, by "new line" I meant a line with the names of some columns. My bad. Anyway, that's exactly what I need, thank you! – amarchin Apr 28 '15 at 09:12
  • I strongly recommend that you look at DataFrames. Put simply, a DataFrame is just an RDD with some meta-information about schema and types. And keep in mind that `header ++ data` will not preserve order for large RDDs. – Nikita Apr 28 '15 at 09:22

2 Answers

"Part" files are automatically handled as a file set.

val data = sc.textFile("/path/to/my/file") // Will read all parts.

Just add the header and write it out:

val header = sc.parallelize(Seq("...header..."))
val withHeader = header ++ data
withHeader.saveAsTextFile("/path/to/my/modified-file")
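
To see where the header ends up, here is a minimal sketch that runs against a local Spark context. The object name `HeaderSketch`, the helper `prependHeader`, the `id,name` header, and the sample rows are all made up for illustration; the small in-memory RDD stands in for `sc.textFile(...)`. It assumes the Spark core library is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object HeaderSketch {
  // Hypothetical helper: put the header in its own single-partition RDD
  // and union the data after it, so the header's partition comes first.
  def prependHeader(sc: SparkContext, header: String, data: RDD[String]): RDD[String] =
    sc.parallelize(Seq(header), numSlices = 1) ++ data

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("header-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for sc.textFile(...); rows and column names are invented.
      val data = sc.parallelize(Seq("1,alice", "2,bob", "3,carol"), numSlices = 2)
      val withHeader = prependHeader(sc, "id,name", data)
      // first() reads from the first partition of the union.
      println(withHeader.first())
    } finally sc.stop()
  }
}
```

Note that `saveAsTextFile` still writes one part file per partition, so the header would land in its own part file at the start of the output directory rather than inside the first data file.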

Note that because this has to read and write all the data, it will be quite a bit slower than what you may intuitively expect. (After all you're just adding a single new line!) For this reason and others, you may be better off not adding this header, and instead storing the metadata (list of columns) separately from the data.
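
One way to keep the column list apart from the data is a small sidecar file next to the output. The sketch below uses only the JDK file API; the object name `ColumnMetadata`, the comma-separated format, and the example column names are assumptions for illustration, not something Spark prescribes. It also assumes a local filesystem; on HDFS you would need the Hadoop filesystem API instead.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object ColumnMetadata {
  // Write the column names as one comma-separated line.
  // Assumes no column name contains a comma.
  def writeColumns(path: Path, columns: Seq[String]): Unit =
    Files.write(path, columns.mkString(",").getBytes(StandardCharsets.UTF_8))

  // Read the column names back in their original order.
  def readColumns(path: Path): List[String] =
    new String(Files.readAllBytes(path), StandardCharsets.UTF_8).split(",").toList
}
```

With this layout the data files stay untouched, and a reader calls something like `ColumnMetadata.readColumns(path)` once to learn what each field means.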

Daniel Darabos

You cannot actually control whether the new line will be first (header) or not, but you can create a new singleton RDD and merge it with the existing one:

val extendedData = data ++ sc.makeRDD(Seq("my precious new line"))

so

extendedData.filter(_ startsWith "my precious").first() 

will probably prove that your line was added.

Odomontois
  • You can actually control whether the new line will be first. In your example it will be last because you put it after the original RDD. And what do you mean by "probably"? And you don't even talk about files. – Daniel Darabos Apr 28 '15 at 21:04