I have read a text file in Spark using the command
val data = sc.textFile("/path/to/my/file/part-0000[0-4]")
I would like to add a new line as a header of my file. Is there a way to do that without turning the RDD into an Array?
Thank you!
"Part" files are automatically handled as a file set.
val data = sc.textFile("/path/to/my/file") // Will read all parts.
Just add the header and write it out:
val header = sc.parallelize(Seq("...header..."))
val withHeader = header ++ data
withHeader.saveAsTextFile("/path/to/my/modified-file")
Note that this has to read and write all of the data, so it will be much slower than you might intuitively expect. (After all, you're just adding a single line!) For this reason and others, you may be better off not adding the header at all, and instead storing the metadata (the list of columns) separately from the data.
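As a rough illustration of keeping the column list separate from the data, the sketch below writes it to a small sidecar file in the output directory. The object name `HeaderSidecar`, the file name `_header.txt`, and the comma-separated layout are all assumptions made for this example, and it uses the local filesystem through `java.nio`; on HDFS you would need the Hadoop `FileSystem` API instead.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper: stores the column list next to the data instead of inside it.
object HeaderSidecar {
  // Assumed file name; the leading underscore is a choice made for this sketch.
  private val FileName = "_header.txt"

  // Write the column names as one comma-separated line and return the file's path.
  def writeHeader(dir: Path, columns: Seq[String]): Path = {
    Files.createDirectories(dir)
    Files.write(dir.resolve(FileName), columns.mkString(",").getBytes(StandardCharsets.UTF_8))
  }

  // Read the column names back from the sidecar file.
  def readHeader(dir: Path): Seq[String] = {
    val bytes = Files.readAllBytes(dir.resolve(FileName))
    new String(bytes, StandardCharsets.UTF_8).trim.split(",").toSeq
  }
}
```

Hadoop's file input formats skip files whose names start with `_` or `.`, so a sidecar named this way should not be picked up as data when the directory is read back with sc.textFile.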
You cannot actually control whether the new line will come first (as a header) or not, but you can create a new singleton RDD and merge it with the existing one:
val extendedData = data ++ sc.makeRDD(Seq("my precious new line"))
You can then check that the line was added:
extendedData.filter(_.startsWith("my precious")).first()
This should return the new line, provided no existing line starts with the same prefix.
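If the position of the new line matters, one possible approach is to emit it at the start of the first partition using `mapPartitionsWithIndex`, which is part of the RDD API. The sketch below is only an illustration; the object name `PrependHeader` and the method name `withHeader` are made up for this example.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark itself.
object PrependHeader {
  // Put `header` in front of the elements of partition 0;
  // every other partition is passed through unchanged.
  def withHeader(data: RDD[String], header: String): RDD[String] =
    data.mapPartitionsWithIndex { (index, lines) =>
      if (index == 0) Iterator(header) ++ lines else lines
    }
}
```

Usage would look like PrependHeader.withHeader(data, "...header...").saveAsTextFile("/path/to/my/modified-file"). The header then ends up at the top of the first part file. An RDD with no partitions at all has no partition 0, so no header would be emitted in that case.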