I read data from a Kafka topic in Spark, create a DStream, process it with a couple of user-defined functions, and then want to save the result to a text file. I tried the saveRec function shown below, but it doesn't work: it writes weird characters into the text file.
However, it works fine when I just print the result to the console using pprint().
Output to the console using pprint():
[80 81]
[233 234]
[273 273]
[469 469]
[621 621]
[667 668]
[809 809]
[926 927]
[935 936]
[1001 1001]
import sys

import numpy as np

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    print("hello spark")

    sc = SparkContext(appName="STALTA")
    ssc = StreamingContext(sc, 1)
    broker, topic = sys.argv[1:]

    # Connect to Kafka
    kvs = KafkaUtils.createStream(ssc, broker, "raw-event-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])

    # Split each message on newlines and parse every element as a float
    ds = lines.flatMap(lambda line: line.strip().split("\n")).map(lambda strelem: float(strelem))

    # Apply my two user-defined functions partition by partition
    mapped = ds.mapPartitions(lambda i: classic_sta_lta_py(np.array(list(i))))
    mapped1 = mapped.mapPartitions(lambda j: trigger_onset(np.array(list(j))))

    # Append every record of each RDD to the output file
    def saveRec(rdd):
        rdd.foreach(lambda rec: open("/Users/zeinab/kafka_2.11-1.1.0/outputFile.txt", "a").write(rec))

    mapped1.pprint()
    mapped1.foreachRDD(saveRec)

    ssc.start()
    ssc.awaitTermination()
Does anybody know what the problem is?
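Based on the pprint output above, each record seems to be a NumPy array rather than a string, so maybe write() needs an explicit conversion first. Here is a minimal, unverified sketch of the two approaches I'm considering (the str() conversion and the saveAsTextFiles prefix are my guesses, not something I have confirmed works):

def saveRecAsText(rdd):
    # Convert each record (a NumPy array like [80 81]) to its string
    # form and append a newline; note this appends to a local file on
    # whichever worker happens to run the task.
    rdd.foreach(
        lambda rec: open("/Users/zeinab/kafka_2.11-1.1.0/outputFile.txt", "a")
        .write(str(rec) + "\n"))

# Or use the built-in DStream method, which writes each batch interval
# to a new directory named after the given prefix and a timestamp:
# mapped1.map(lambda rec: str(rec)).saveAsTextFiles(
#     "/Users/zeinab/kafka_2.11-1.1.0/outputFile")

Is one of these the right direction, or is the problem somewhere else?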