
I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala.

When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C.

Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is, the number of words.

What am I missing?

Below is the unit test file, and after that I've also included the code snippet that shows the countWords method:

StarterAppTest.java

import com.google.common.io.Files;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;


import org.junit.*;

import java.io.*;

public class StarterAppTest {

  JavaStreamingContext ssc;
  File tempDir;

  @Before
  public void setUp() {
    ssc = new JavaStreamingContext("local", "test", new Duration(3000));
    tempDir = Files.createTempDir();
    tempDir.deleteOnExit();
  }

  @After
  public void tearDown() {
    ssc.stop();
    ssc = null;
  }

  @Test
  public void testInitialization() {
    Assert.assertNotNull(ssc.sc());
  }


  @Test
  public void testCountWords() {

    StarterApp starterApp = new StarterApp();

    try {
      JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
      JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);

      ssc.start();

      File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
      PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
      writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
      writer.close();

      System.err.println("===== Word Counts =======");
      wordCounts.print();
      System.err.println("===== Word Counts =======");

    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (UnsupportedEncodingException e) {
      e.printStackTrace();
    }


    Assert.assertTrue(true);

  }

}

This test compiles and starts to run; Spark Streaming prints a lot of diagnostic messages on the console, but the call to wordCounts.print() does not print anything, whereas in StarterApp.java itself it does.

I've also tried adding ssc.awaitTermination(); after ssc.start(), but nothing changed in that respect. After that I also tried to create a new file manually in the directory this Spark Streaming application was watching, but that time it gave an error.

For completeness, below is the countWords method:

public JavaPairDStream<String, Integer> countWords(JavaDStream<String> lines) {
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String x) { return Lists.newArrayList(SPACE.split(x)); }
    });

    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
            new PairFunction<String, String, Integer>() {
              @Override
              public Tuple2<String, Integer> call(String s) { return new Tuple2<>(s, 1); }
            }).reduceByKey((i1, i2) -> i1 + i2);

    return wordCounts;
  }
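
The countWords snippet assumes a few declarations that are not shown in the question. Based on the word-count examples the question's code mirrors, they would look roughly like this (the exact imports and the SPACE pattern are assumptions, not quoted from StarterApp.java):

import com.google.common.collect.Lists;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.regex.Pattern;

// Whitespace pattern used to split lines into words, as in Spark's
// own Java word-count examples (an assumption for this snippet).
private static final Pattern SPACE = Pattern.compile(" ");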
Emre Sevinç

1 Answer


A few pointers:

  • Give at least 2 cores to the Spark Streaming context: one for the streaming and one for the Spark processing. Change "local" to "local[2]".
  • Your streaming interval is 3000 ms, so somewhere in your program you need to wait at least that long before expecting any output.
  • Spark Streaming needs some time to set up its listeners. The file is being created immediately after ssc.start() is issued, and there's no guarantee that the filesystem listener is already in place. I'd add a short sleep after ssc.start(), as in the sketch below.

In Streaming, it's all about the right timing.
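
Putting these pointers together, the test might look like the sketch below. It also registers the output operation before starting the context, since adding outputs after ssc.start() is not supported (see the last comment below). The sleep durations are illustrative values taken from the comment thread, not tuned numbers:

@Before
public void setUp() {
  // "local[2]": at least two cores, one for the streaming and one for the processing
  ssc = new JavaStreamingContext("local[2]", "test", new Duration(500));
  tempDir = Files.createTempDir();
  tempDir.deleteOnExit();
}

@Test
public void testCountWords() throws Exception {
  StarterApp starterApp = new StarterApp();

  JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
  JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);

  // Register the output operation *before* starting the context.
  wordCounts.print();

  ssc.start();
  Thread.sleep(1000);  // give the filesystem listener time to come up

  File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
  PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
  writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
  writer.close();

  Thread.sleep(6000);  // wait for more than one batch interval so the batch actually runs
}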

maasg
  • Hello Gerard, I've made it use 2 cores, decreased the interval to 0.5 seconds, and added a Thread.sleep for 3 seconds: https://gist.github.com/emres/67b4eae86fa92df69f61 However, running this version still did not print any word counts as can be seen in the output at http://pastebin.com/AE2hTZbS Therefore I cannot understand whether it actually processed the file and counted the words. In the output there are lines such as "Finding new files took 82 ms", but I don't know whether this means it found and processed them. – Emre Sevinç Dec 08 '14 at 13:35
  • Hi Emre, try reducing the first sleep to 1 sec and adding another sleep(6s) after the file close. – maasg Dec 09 '14 at 09:10
  • I did what you've suggested ([see new version](https://gist.github.com/emres/67b4eae86fa92df69f61)) but I still don't see the words counted [in the output](http://pastebin.com/VKf6yzs0). Though your suggestion seems very similar to what is done at the end of [this example](https://github.com/databricks/spark-perf/blob/master/streaming-tests/src/main/scala/streaming/perf/HdfsRecoveryTest.scala). This really seems very very time-sensitive indeed. – Emre Sevinç Dec 09 '14 at 12:05
  • I think you have other issues. I see this error 'java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)', probably at the point where data is being processed. This is probably related to a Guava version conflict. – maasg Dec 09 '14 at 13:05
  • Hats off to your perseverance! Based on your feedback I've added [this Guava dependency explicitly to my pom.xml](https://gist.github.com/emres/ba7ad991406595c78db0), and ran the same unit test, this time to see [the output I've expected](http://pastebin.com/QTrdZ7Uz). Probably the Spark submit script was taking care of this Guava conflict for the application, but unit tests, not using the submit script, are prone to such issues. I will mark your answer as accepted. – Emre Sevinç Dec 09 '14 at 13:40
  • Thanks a lot. Saved me a lot of time! – nish Sep 10 '15 at 13:20
  • With this code I just get `java.lang.IllegalStateException: Adding new inputs, transformations, and output operations after starting a context is not supported` on the second call to `wordCounts.count();` after the streaming session has been started. Could you post a complete answer? – Brad Mar 03 '17 at 23:52
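
For reference, the explicit Guava dependency mentioned in the comments above would be a pom.xml fragment along these lines. The version shown is an assumption (Spark 1.x builds against Guava 14.0.1); the exact one Emre used is in the linked gist:

<!-- Hypothetical pom.xml fragment: pin the Guava version Spark expects so the
     unit-test classpath matches what the spark-submit script would provide. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>14.0.1</version>
</dependency>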