
I want to do simple machine learning in Spark.

First, the application should learn from historical data in a file and train the machine learning model, then read input from Kafka to give predictions in real time. To do that I believe I should use Spark Streaming. However, I'm afraid I don't really understand how Spark Streaming works.

The code looks like this:

def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("test App")
    val sc = new SparkContext(conf)
    val fromFile = parse(sc, Source.fromFile("my_data_.csv").getLines.toArray)
    ML.train(fromFile)

    real_time(sc)
}

Where ML is a class with some machine learning things in it; train gives it data to train on, and there is also a classify method which computes predictions based on what it has learned. The first part seems to work fine, but real_time is a problem:

def real_time(sc: SparkContext) : Unit = {
    val ssc = new StreamingContext(new SparkConf(), Seconds(1))
    val topic = "my_topic".split(",").toSet
    val params = Map[String, String](("metadata.broker.list", "localhost:9092"))
    val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topic)

    var lin = dstream.map(_._2)
    val str_arr = new Array[String](0)
    lin.foreach {
        str_arr :+ _.collect()
    }
    val lines = parse(sc, str_arr).map(i => i.features)

    ML.classify(lines)
    ssc.start()
    ssc.awaitTermination()
}

What I would like it to do is check the Kafka stream and process any new lines as they arrive. That doesn't seem to happen: I added some prints and they are never printed.

How does Spark Streaming work, and how should it be used in my case?
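For context, here is a sketch of what I understand the streaming part should look like instead. This is an assumption on my part, not working code: it reuses the question's `parse`, `ML.train`/`ML.classify` helpers, assumes a Spark 1.x `spark-streaming-kafka` dependency and a local Kafka broker, and cannot run without that infrastructure. The key points are that a `StreamingContext` should wrap the existing `SparkContext` (two active contexts in one JVM fail), and that DStream operations are lazy descriptions of per-batch work, executed only after `ssc.start()`, so per-batch logic belongs inside `foreachRDD` rather than in driver-side code at setup time.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def real_time(sc: SparkContext): Unit = {
  // Reuse the existing SparkContext instead of building a second SparkConf.
  val ssc = new StreamingContext(sc, Seconds(1))
  val topics = Set("my_topic")
  val params = Map("metadata.broker.list" -> "localhost:9092")
  val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, params, topics)

  // Nothing executes here yet: this only registers work to run once per
  // batch after ssc.start(). Per-batch classification therefore goes
  // inside foreachRDD, not into a driver-side array at setup time.
  dstream.map(_._2).foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // collect() brings this batch's lines to the driver so they can be
      // fed to the question's parse(sc, Array[String]) helper.
      val lines = parse(sc, rdd.collect()).map(_.features)
      ML.classify(lines)
    }
  }

  ssc.start()
  ssc.awaitTermination()
}
```

If that understanding is wrong, corrections are welcome; in particular I am unsure whether collecting each batch to the driver is the idiomatic way to hand data to `ML.classify`.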


0 Answers