
Lately, as part of a research project, I've been developing an application that streams (or at least should stream) data from Travis CI and GitHub using their REST APIs. The goal is to get insight into the commit-build relationship, in order to perform further analyses.

For this, I've implemented the following custom Travis receiver:

// Spark 1.x imports; Build, BuildStream and BuildListener come from my custom Travis API library.
import org.apache.spark.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.receiver.Receiver

object TravisUtils {

  def createStream(ctx: StreamingContext, storageLevel: StorageLevel): ReceiverInputDStream[Build] =
    new TravisInputDStream(ctx, storageLevel)
}

private[streaming]
class TravisInputDStream(ctx: StreamingContext, storageLevel: StorageLevel)
  extends ReceiverInputDStream[Build](ctx) {

  def getReceiver(): Receiver[Build] = new TravisReceiver(storageLevel)
}

private[streaming]
class TravisReceiver(storageLevel: StorageLevel)
  extends Receiver[Build](storageLevel) with Logging {

  def onStart(): Unit = {
    // Register a listener that stores each incoming build into Spark.
    new BuildStream().addListener(new BuildListener {

      override def onBuildsReceived(numberOfBuilds: Int): Unit = ()

      override def onBuildRepositoryReceived(build: Build): Unit = {
        store(build)
      }

      override def onException(e: Exception): Unit = {
        reportError("Exception while streaming Travis", e)
      }
    })
  }

  def onStop(): Unit = ()
}
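For reference, onStop() is currently a no-op and the receiver keeps no handle on the underlying client. A more defensive sketch, which keeps the client around so onStop() can shut it down and asks Spark to restart the receiver on errors, could look like the following. Note that close() on BuildStream is a hypothetical shutdown call; the actual method on the library may differ.

// Sketch only: keeps a reference to the client so onStop() can clean up,
// and uses Receiver.restart() to have Spark tear down and re-create the
// receiver after an error. BuildStream.close() is a hypothetical shutdown
// method; substitute whatever the client library actually exposes.
private[streaming]
class RestartingTravisReceiver(storageLevel: StorageLevel)
  extends Receiver[Build](storageLevel) with Logging {

  @volatile private var buildStream: BuildStream = _

  def onStart(): Unit = {
    buildStream = new BuildStream()
    buildStream.addListener(new BuildListener {

      override def onBuildsReceived(numberOfBuilds: Int): Unit = ()

      override def onBuildRepositoryReceived(build: Build): Unit = store(build)

      override def onException(e: Exception): Unit =
        // restart() (unlike reportError()) asks Spark to stop and
        // re-create the receiver, re-establishing the connection.
        restart("Exception while streaming Travis, restarting receiver", e)
    })
  }

  def onStop(): Unit = {
    if (buildStream != null) buildStream.close() // hypothetical shutdown call
  }
}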

The receiver uses my custom-built Travis API library (developed in Java using the Apache Async Client). The problem is the following: the data I should be receiving is continuous and constantly changing, i.e. it is being pushed to Travis and GitHub all the time. As an example, consider that GitHub records approximately 350 events per second, including push events, commit comments, and the like.

But when streaming either GitHub or Travis, I do get data for the first two batches, but afterwards the RDDs that make up the DStream are empty, even though there is data to be streamed!

So far I've checked a couple of things, including the HttpClient used for issuing requests to the API, but none of them actually solved the problem.
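One way to isolate whether the fault is in the receiver or in the streaming pipeline would be a fake source that just loops and stores synthetic events. Here is a minimal sketch, independent of the Travis library; if this keeps producing non-empty RDDs past the first two batches, the pipeline is fine and the problem is in the real receiver/client:

// Sketch: a fake receiver that stores a synthetic string every 100 ms.
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class FakeReceiver(storageLevel: StorageLevel)
  extends Receiver[String](storageLevel) {

  @volatile private var running = true

  def onStart(): Unit = {
    running = true
    new Thread("fake-source") {
      override def run(): Unit = {
        var i = 0L
        while (running && !isStopped()) {
          store(s"fake-event-$i")
          i += 1
          Thread.sleep(100)
        }
      }
    }.start()
  }

  def onStop(): Unit = { running = false }
}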

Therefore, my question is: what could be going on? Why isn't Spark streaming the data after period x passes? Below you can find the streaming context and configuration:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val configuration = new SparkConf()
  .setAppName("StreamingSoftwareAnalytics")
  .setMaster("local[2]")

val ctx = new StreamingContext(configuration, Seconds(3))

val stream = GitHubUtils.createStream(ctx, StorageLevel.MEMORY_AND_DISK_SER)

// RDD IS EMPTY - that is what is happening!
stream.window(Seconds(9)).foreachRDD { rdd =>
  if (rdd.isEmpty()) {
    println("RDD IS EMPTY")
  } else {
    rdd.collect().foreach(event => println(event.getRepo.getName + " " + event.getId))
  }
}

ctx.start()
ctx.awaitTermination()
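(A simpler check, without window(), would show whether the raw stream itself dries up after the first two batches or whether the windowing is involved:)

// Sketch: drop the window and print per-batch counts.
stream.foreachRDD { rdd =>
  println(s"Batch contains ${rdd.count()} events")
}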

Thanks in advance!

  • What do you see in the streaming tab of the driver web page (http://localhost:4040 by default)? – Leandro Mar 07 '16 at 04:14
  • The number of events retrieved at the second batch interval was 25, which is indeed correct. However, from then on the size is equal to 0 events. Any clue what might be the problem? In the meantime, I even tried wrapping the onStart() code inside a new Thread, starting it, and terminating upon acknowledgement - but still the same. – dsafa Mar 07 '16 at 15:16
  • Do you see the same behavior if you use a fake source, like a loop that generates fake events and stores them in the RDD? Try that to see whether you can identify which part is not working - the streaming processing or the receiver itself. Also, take a look at this [spark issue](https://issues.apache.org/jira/browse/SPARK-10995), as it is related to windowing the stream. – Leandro Mar 07 '16 at 18:27
  • Well, after doing some research I found that the problem is in the Receiver, which does not stop processing the data, and hence a new receiver is not being created - which is quite odd, but it might happen for certain reasons such as API limitations/restrictions. To avoid this, is it possible to define a timeout for the receiver or something similar that would tell it to stop? – dsafa Mar 07 '16 at 21:26
  • So, your BuildStream class creates an Apache Async Client and waits for events, and in some cases it stops receiving them, making your Receiver stop storing them, right? If I understood correctly, I think you should change the way your BuildStream works to handle failures and restart the client, instead of creating a new Receiver when there is an error. – Leandro Mar 08 '16 at 02:36

0 Answers