
Objective: continuously feed sniffed network packets into a Kafka producer, connect that to Spark Streaming so the packet data can be processed, and then use the preprocessed data in TensorFlow or Keras.

I'm processing continuous data in Spark Streaming (PySpark) that comes from Kafka, and now I want to send the processed data to TensorFlow. How can I use these transformed DStreams in TensorFlow with Python? Thanks.

No processing is applied in Spark Streaming yet, but it will be added later. Here's the Python code:

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.conf import SparkConf
from datetime import datetime

if __name__ == '__main__':
    sc = SparkContext(appName='Kafkas')
    ssc = StreamingContext(sc, 2)  # 2-second micro-batches
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic],
                                        {'metadata.broker.list': brokers})
    lines = kvs.map(lambda x: x[1])  # keep only the message value
    lines.pprint()
    ssc.start()
    ssc.awaitTermination()

I also use this to start Spark Streaming:

spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 \
    spark-kafka.py localhost:9092 topic
Burak
  • I can answer this, but I'll need more details. What is your current code? Where are you blocked? What do you need? Where do you want to go? – LaSul Dec 19 '18 at 09:58
    @LaSul I added more information to the question. I use tshark to sniff network packets and then feed the data into Kafka in real time. Kafka sends the data into Spark Streaming so it can be processed in real time. The overall objective is a machine learning pipeline that works in real time on big data. I'm stuck at using the processed data (DStreams) in TensorFlow in real time. – Burak Dec 19 '18 at 11:05

1 Answer


You have two ways to solve your problem:

  1. Once you have processed your data, you can save it and then run your model independently (in Keras?). Just create a Parquet file, or append to it if it already exists:

    import os

    # Append to the Parquet output if it already exists, otherwise create it
    if os.path.isdir(DATA_TREATED_PATH):
        data.write.mode('append').parquet(DATA_TREATED_PATH)
    else:
        data.write.parquet(DATA_TREATED_PATH)
    

Then you just create your model with Keras / TensorFlow and run it, say, every hour, or as often as you want it to be updated. In this case the model is retrained from scratch every time; a minimal sketch follows below.
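A rough sketch of that scheduled-retraining approach, assuming the processed rows were written to Parquet as above; the paths, feature columns and label column below are made-up names for illustration:

    import numpy as np
    from pyspark.sql import SparkSession
    from tensorflow import keras

    DATA_TREATED_PATH = '/tmp/treated_packets.parquet'  # assumed output of the streaming job
    MODEL_PATH = '/tmp/packet_model.h5'                 # assumed location for the saved model
    FEATURE_COLS = ['pkt_len', 'proto', 'src_port']     # hypothetical numeric feature columns

    spark = SparkSession.builder.appName('TrainFromParquet').getOrCreate()

    # Load everything written so far; .toPandas() is fine for moderate volumes
    df = spark.read.parquet(DATA_TREATED_PATH).toPandas()
    x = df[FEATURE_COLS].values.astype('float32')
    y = df['label'].values.astype('float32')            # hypothetical label column

    # Rebuild and train the model from scratch on the full dataset
    model = keras.Sequential([
        keras.layers.Dense(32, activation='relu', input_shape=(len(FEATURE_COLS),)),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(x, y, epochs=5, batch_size=64)
    model.save(MODEL_PATH)

You would schedule this script with cron (or similar) at whatever interval you want the model refreshed.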

  2. You process your data and save it as before, but then you load your existing model, train it on the new data / new batch, and save it again. This is called online learning, because you don't retrain the model from scratch each time. A minimal sketch follows below.
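A sketch of that online-learning variant, again with made-up paths; warm-starting the saved Keras model and calling fit on just the new batch stands in for whatever incremental scheme you prefer:

    import os
    from tensorflow import keras

    MODEL_PATH = '/tmp/packet_model.h5'  # assumed path, same as in the sketch above

    def update_model(x_new, y_new):
        """Load the saved model if it exists, train it on the new batch only, save it back."""
        if os.path.exists(MODEL_PATH):
            model = keras.models.load_model(MODEL_PATH)
        else:
            # First run: build a fresh model matching the feature width
            model = keras.Sequential([
                keras.layers.Dense(32, activation='relu', input_shape=(x_new.shape[1],)),
                keras.layers.Dense(1, activation='sigmoid'),
            ])
            model.compile(optimizer='adam', loss='binary_crossentropy')
        model.fit(x_new, y_new, epochs=1, batch_size=64)  # single pass over the new data
        model.save(MODEL_PATH)
        return model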
LaSul
  • Tell me if you need more detail, but you will easily find how to run / load Keras models – LaSul Dec 19 '18 at 12:58
  • I solved the issue with saveAsTextFiles and merging the files in another Spark job. Parquet doesn't work with Kafka DStreams in my application and gives an error – Burak Dec 25 '18 at 13:42
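For reference, a rough sketch of the workaround Burak describes in that last comment (paths are made up): write each micro-batch out with saveAsTextFiles from the streaming job, then merge the resulting part files in a separate Spark job.

    # In the streaming job: persist every micro-batch as a set of text files
    # (one directory per batch interval, named after the given prefix).
    lines.saveAsTextFiles('/tmp/packets/batch')

    # In a separate batch job: read all batch directories back and merge them.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('MergeBatches').getOrCreate()
    merged = spark.sparkContext.textFile('/tmp/packets/batch*')  # glob over all batch dirs
    print(merged.count())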