
I read data from a Kafka topic in Spark, create a DStream, process it with a couple of user-defined functions, and would like to save the result to a text file. I tried the saveRec function shown below, but it doesn't work: it writes weird characters into the text file.

However, it works fine when I just print the result to the console using pprint().

Output to the console using pprint():

[80 81]
[233 234]
[273 273]
[469 469]
[621 621]
[667 668]
[809 809]
[926 927]
[935 936]
[1001 1001]

import sys

import numpy as np
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# classic_sta_lta_py and trigger_onset are the user-defined
# processing functions mentioned above, defined elsewhere.

if __name__ == "__main__":
    print("hello spark")

    sc = SparkContext(appName="STALTA")
    ssc = StreamingContext(sc, 1)
    broker, topic = sys.argv[1:]
    # Connect to Kafka

    kvs = KafkaUtils.createStream(ssc, broker, "raw-event-streaming-consumer", {topic: 1})

    lines = kvs.map(lambda x: x[1])
    ds = lines.flatMap(lambda line: line.strip().split("\n")).map(lambda strelem: float(strelem))

    # Apply the user-defined processing functions to each partition.
    mapped = ds.mapPartitions(lambda i: classic_sta_lta_py(np.array(list(i))))
    mapped1 = mapped.mapPartitions(lambda j: trigger_onset(np.array(list(j))))

    # Append every record of each batch to a local text file.
    def saveRec(rdd):
        rdd.foreach(lambda rec: open("/Users/zeinab/kafka_2.11-1.1.0/outputFile.txt", "a").write(rec))

    mapped1.pprint()
    mapped1.foreachRDD(saveRec)

    ssc.start()
    ssc.awaitTermination()

Does anybody know what the problem is?

  • You're writing out binary data. You'll need to decode it to a String – OneCricketeer Jul 31 '18 at 04:26
  • @cricket_007 How can I do that? I tried these, but none of them worked: `def saveRec(rdd): rdd.foreach(lambda rec: open("/Users/zeinab/kafka_2.11-1.1.0/outputFile.txt", "a").write(np.char.decode(rec)))` or `def saveRec(rdd): rdd.foreach(lambda rec: open("/Users/zeinab/kafka_2.11-1.1.0/outputFile.txt", "a").write(rec.astype('U13')))` – Zeinab Akhavan Jul 31 '18 at 17:56
  • Something like `map(lambda x: x.decode('utf-8'))`, I guess. I typically use the standard Kafka APIs, not Spark – OneCricketeer Jul 31 '18 at 18:44
  • @cricket_007 Thank you. I tried that and I'm getting the following error: `AttributeError: 'numpy.ndarray' object has no attribute 'map' ` Function: `def saveRec(rdd): rdd.foreach(lambda rec: open("/Users/zeinab/kafka_2.11-1.1.0/outputFile.txt", "a").write(rec.map(lambda x: x.decode('utf-8'))))` – Zeinab Akhavan Jul 31 '18 at 21:45
  • No, before all that... `lines = kvs.map(lambda x: x[1].decode('utf-8'))` – OneCricketeer Jul 31 '18 at 22:37
  • If all you want to do is save to a file from Kafka, you know you can just use `kafka-console-consumer`, right? – OneCricketeer Jul 31 '18 at 22:39
  • @cricket_007 I want to save the processed DStream called "mapped1" to a text file. I tried `lines = kvs.map(lambda x: x[1].decode('utf-8'))`, but it still prints out weird characters, not strings. Thank you! – Zeinab Akhavan Aug 01 '18 at 16:59
  • I'm not sure I really understand what you need Spark for in order to do this. Just use a plain Kafka consumer. – OneCricketeer Aug 01 '18 at 17:10
  • @cricket_007 Could you explain in more detail how I would save mapped1 into a text file using a Kafka consumer? The Kafka consumer contains "kvs", not "mapped1". I'm new to Kafka and Spark. Thank you! – Zeinab Akhavan Aug 01 '18 at 17:16
  • @cricket_007 I want to do something like the link below, but it's not working in my case: (https://stackoverflow.com/questions/41325355/pyspark-processing-stream-data-and-saving-processed-data-to-file) – Zeinab Akhavan Aug 01 '18 at 17:40
  • @cricket_007 I think I solved it by adding these lines: `mapped2 = mapped1.map(lambda m: str(m))` `mapped2.foreachRDD(saveRec)`. But it's not printing out the whole DStream; it only writes part of it to the text file. – Zeinab Akhavan Aug 01 '18 at 18:03
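
Putting the comment thread together, here is a minimal sketch of the fixes being suggested: decode the Kafka payload to strings on read, convert the numpy results to strings before writing, and write each batch from the driver through a properly closed file handle (the per-record `open(...).write(...)` in saveRec is never flushed or closed, which would also explain the partial output). The path and the two user-defined functions are taken from the question; everything else is an assumption, not a verified fix.

    # Sketch only: assumes the same imports and the ssc/kvs setup from the question.
    lines = kvs.map(lambda x: x[1].decode("utf-8"))  # decode raw bytes to str first

    ds = lines.flatMap(lambda line: line.strip().split("\n")).map(lambda strelem: float(strelem))
    mapped = ds.mapPartitions(lambda i: classic_sta_lta_py(np.array(list(i))))
    mapped1 = mapped.mapPartitions(lambda j: trigger_onset(np.array(list(j))))

    # numpy arrays are not text; convert each record to a string before writing.
    mapped2 = mapped1.map(lambda m: str(m))

    def saveRec(rdd):
        # Collect the batch to the driver and append it with a context
        # manager so the file handle is flushed and closed.
        with open("/Users/zeinab/kafka_2.11-1.1.0/outputFile.txt", "a") as f:
            for rec in rdd.collect():
                f.write(rec + "\n")

    mapped2.foreachRDD(saveRec)

For batches too large to collect on the driver, `rdd.saveAsTextFile(...)` inside saveRec would be the safer choice, at the cost of writing one directory of part-files per batch.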

0 Answers