
I'm programming with Spark Streaming but am having some trouble with Scala. I'm trying to use the function StreamingContext.fileStream.

The definition of this function is like this:

def fileStream[K, V, F <: InputFormat[K, V]](directory: String)(implicit arg0: ClassManifest[K], arg1: ClassManifest[V], arg2: ClassManifest[F]): DStream[(K, V)]

Creates an input stream that monitors a Hadoop-compatible filesystem for new files and reads them using the given key-value types and input format. File names starting with . are ignored.

K – Key type for reading the HDFS files
V – Value type for reading the HDFS files
F – Input format for reading the HDFS files
directory – HDFS directory to monitor for new files

I don't know how to pass the Key and Value types. My code in Spark Streaming:

val ssc = new StreamingContext(args(0), "StreamingReceiver", Seconds(1),
  System.getenv("SPARK_HOME"), Seq("/home/mesos/StreamingReceiver.jar"))

// Create an input stream that monitors the given directory for new files
val lines = ssc.fileStream("/home/sequenceFile")

Java code to write the Hadoop sequence file:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MyDriver {

    private static final String[] DATA = { "One, two, buckle my shoe",
            "Three, four, shut the door", "Five, six, pick up sticks",
            "Seven, eight, lay them straight", "Nine, ten, a big fat hen" };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            // Write 100 (IntWritable, Text) records to the sequence file
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
                    value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key,
                        value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

user2384993
What issues are you seeing? Are you getting compilation errors? If so, what are they? Or are you seeing errors or unexpected behavior when you run your code? The more context you provide about the errors or unexpected behavior, the more likely you are to get helpful answers. – cmbaxter May 15 '13 at 11:52

2 Answers


If you want to use fileStream, you have to supply all three type parameters when calling it. You need to know your Key, Value and InputFormat types before the call. If your types were LongWritable, Text and TextInputFormat, you would call fileStream like so:

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/sequenceFile")
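
Those types also need to be in scope; a minimal sketch of the imports, assuming the newer org.apache.hadoop.mapreduce API (which the F type bound of fileStream expects):

// Hadoop Writable key/value types and the new-API text input format
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat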

If those three types do happen to be your types, you might want to use textFileStream instead, as it does not require any type parameters and delegates to fileStream using the three types mentioned above. Using it would look like this:

val lines = ssc.textFileStream("/home/sequenceFile")
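
Note, though, that the driver in your question writes a SequenceFile with IntWritable keys and Text values, so LongWritable/Text/TextInputFormat are probably not the right types for it. A sketch for reading that file, assuming the new-API SequenceFileInputFormat (the val name pairs is hypothetical):

// Hypothetical: match the (IntWritable, Text) pairs written by MyDriver above
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

val pairs = ssc.fileStream[IntWritable, Text, SequenceFileInputFormat[IntWritable, Text]]("/home/sequenceFile")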
cmbaxter
Hey, I am trying to do the same but with binary files; I have followed the instructions here, but unfortunately it does not work. Could you please suggest something? https://stackoverflow.com/questions/45778016/reading-binaryfile-with-spark-streaming – MaatDeamon Aug 20 '17 at 16:54
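fileStream also has an overload that takes a custom path filter and a newFilesOnly flag. The snippet below, for instance, only picks up files whose name ends in an underscore-separated timestamp that has already passed: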
import org.apache.hadoop.fs.Path

// Keep only files whose name ends in "_<timestamp>" with a timestamp in the past
val filterF = (path: Path) =>
  path.toString.split("/").last.split("_").last.toLong < System.currentTimeMillis

val streamed_rdd = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("/user/hdpprod/temp/spark_streaming_input", filterF, false)
  .map(_._2.toString)
  .map(u => u.split('\t'))
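
Passing false for the newFilesOnly argument should make the stream also consider files already present in the directory when the streaming context starts, rather than only files that appear afterwards.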
Vijay Krishna