10

I have a Spark program (in Scala) and a SparkContext. I am writing some files with RDD's saveAsTextFile. On my local machine I can use a local file path and it works with the local file system. On my cluster it works with HDFS.

I also want to write other arbitrary files as the result of processing. I'm writing them as regular files on my local machine, but want them to go into HDFS on the cluster.

SparkContext seems to have a few file-related methods, but they all seem to be for input, not output.

How do I do this?

Joe

4 Answers

15

Thanks to marios and kostya, here are the full steps for writing a text file into HDFS from Spark:

import java.io.BufferedOutputStream
import org.apache.hadoop.fs.{FileSystem, Path}

// The Hadoop configuration is accessible from the SparkContext
val fs = FileSystem.get(sparkContext.hadoopConfiguration)

// An output stream for the file can be created from the file system
val output = fs.create(new Path(filename))

// Wrap it in a BufferedOutputStream to write an actual text file
val os = new BufferedOutputStream(output)

os.write("Hello World".getBytes("UTF-8"))

os.close()

Note that FSDataOutputStream, which has been suggested, extends Java's DataOutputStream and is not a text output stream. Its writeUTF method appears to write plain text, but it actually uses a binary serialization format that adds extra length-prefix bytes.
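For illustration, here is a minimal local sketch of that behaviour (FSDataOutputStream extends DataOutputStream, so writeUTF acts the same way; the stream and buffer here are only for demonstration):

import java.io.{ByteArrayOutputStream, DataOutputStream}

// Sketch only: shows the extra bytes writeUTF adds compared with raw UTF-8.
val buf = new ByteArrayOutputStream()
val dos = new DataOutputStream(buf)
dos.writeUTF("Hello World")   // 2-byte length prefix + modified UTF-8 bytes
dos.flush()
println(buf.size())           // prints 13, not the 11 of "Hello World".getBytes("UTF-8")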

Joe
  • If you might be working with S3 or other file systems, it works better to get the FileSystem from the Path: `val path = new Path(fileName); val fs = path.getFileSystem(sparkContext.hadoopConfiguration)` – Matt Nov 06 '18 at 22:55
5

Here's what worked best for me (using Spark 2.0):

import java.io.BufferedOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val path = new Path("hdfs://namenode:8020/some/folder/myfile.txt")
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.setInt("dfs.blocksize", 16 * 1024 * 1024) // 16 MB HDFS block size
val fs = path.getFileSystem(conf)
if (fs.exists(path))
    fs.delete(path, true)
val out = new BufferedOutputStream(fs.create(path))
val txt = "Some text to output"
out.write(txt.getBytes("UTF-8"))
out.flush()
out.close()
fs.close()
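To double-check the result, a minimal sketch that reads the file back (assuming the same `path` and `spark` as above):

// Sketch only: re-open the file and print its contents.
import scala.io.Source

val fsRead = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val in = fsRead.open(path)
try Source.fromInputStream(in, "UTF-8").getLines().foreach(println)
finally in.close()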
Prasad Khode
Martin Tapp
2

Using the HDFS API (hadoop-hdfs.jar) you can create an InputStream/OutputStream for an HDFS path and read from/write to the file using regular java.io classes. For example:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
URI uri = URI.create("hdfs://host:port/file path");
Configuration conf = new Configuration();
FileSystem file = FileSystem.get(uri, conf);
FSDataInputStream in = file.open(new Path(uri));

This code will work with local files as well (change hdfs:// to file://).
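The write direction is symmetric. A rough Scala sketch, with the URI as a placeholder just like above:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: open an FSDataOutputStream and write raw UTF-8 bytes.
val uri = URI.create("hdfs://host:port/output.txt")
val fs = FileSystem.get(uri, new Configuration())
val out = fs.create(new Path(uri))
out.write("some text\n".getBytes("UTF-8"))
out.close()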

kostya
2

One simple way to write files to HDFS is to use SequenceFiles. Here you use the native Hadoop APIs and not the ones provided by Spark.

Here is a simple snippet (in Scala):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._

val conf = new Configuration() // Hadoop configuration
val sfwriter = SequenceFile.createWriter(conf,
              SequenceFile.Writer.file(new Path("hdfs://nn1.example.com/file1")),
              SequenceFile.Writer.keyClass(classOf[LongWritable]),
              SequenceFile.Writer.valueClass(classOf[Text]))
val lw = new LongWritable()
val txt = new Text()
lw.set(12)
txt.set("hello")
sfwriter.append(lw, txt)
sfwriter.close()
...

In case you don't have a key, you can use classOf[NullWritable] in its place:

SequenceFile.Writer.keyClass(classOf[NullWritable])
sfwriter.append(NullWritable.get(), new Text("12345"))
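To read the LongWritable/Text file back into Spark, a minimal sketch (assuming a SparkContext named `sc` and the same path as the writer example above):

// Sketch only: load the SequenceFile as an RDD and copy the Writables into plain values.
val rdd = sc.sequenceFile("hdfs://nn1.example.com/file1", classOf[LongWritable], classOf[Text])
rdd.map { case (k, v) => (k.get, v.toString) }.collect().foreach(println)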
marios
  • Thanks. Will this work for the local file system too? And can I pick up the hadoop config from Spark (e.g. hostname, port)? – Joe Oct 05 '15 at 16:59
  • Hi Joe, I am pretty sure that if you replace the path with a local file ("file://..") it will work. Also you can get the Hadoop configuration from Spark using `sc.hadoopConfiguration`. Let me know if you are having any issues. – marios Oct 05 '15 at 17:03
  • Thanks very much. It looks like, from the docs (and from my test) this doesn't give me a plain text file. A SequenceFile has a header and binary format. – Joe Oct 05 '15 at 19:57
  • Correct, it will not give a text file. Sequence files are binary and compressed. You didn't say you wanted a text file in your description :). In any case, if you want to quickly check the content of the file you can use: `hadoop fs -text hdfs://nn1.example.com/file1` to see its content. – marios Oct 05 '15 at 20:11
  • You're right, I didn't, but I did mention `saveAsTextFile`... There must be a way to write a text file directly. – Joe Oct 05 '15 at 20:17
  • Maybe check this out [http://stackoverflow.com/questions/16000840/write-a-file-in-hdfs-with-java](http://stackoverflow.com/questions/16000840/write-a-file-in-hdfs-with-java) – marios Oct 05 '15 at 20:25
  • Thanks for all your help. The missing link was getting the hadoop context from the sparkcontext (obvious but I missed it). – Joe Oct 06 '15 at 10:56