I am hoping to read a Parquet file from S3 and write it to Kafka:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

object IngestFromS3ToKafka {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .master("local[*]")
      .appName("ingest-from-s3-to-kafka")
      .config("spark.ui.port", "4040")
      .getOrCreate()

    val filePath = "s3a://my-bucket/my.parquet"

    // The Kafka sink expects a string or binary "value" column,
    // so serialize each whole row into a JSON string.
    spark.read.parquet(filePath)
      .select(to_json(struct("*")).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "hm-kafka-kafka-bootstrap.hm-kafka.svc:9092")
      .option("topic", "my-topic")
      .save()

    spark.stop()
  }
}
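For completeness, the s3a:// path in the snippet above only works because the S3A connector is set up separately. A minimal sketch of that part, where the hadoop-aws version and the credential settings are assumptions about my environment:

// Assumed extra setup for the s3a:// scheme (sketch).
// build.sbt also includes: "org.apache.hadoop" % "hadoop-aws" % "3.3.4"  (version is a guess)
val spark: SparkSession = SparkSession
  .builder()
  .master("local[*]")
  .appName("ingest-from-s3-to-kafka")
  // Placeholder credentials read from the environment; any S3A credentials provider works here.
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()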
Based on the Structured Streaming + Kafka Integration Guide, it seems I should use the spark-sql-kafka-0-10 library, which can do both batch processing and streaming.
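To double-check the "both batch and streaming" part: reusing the session and imports from the snippet above, the streaming version of the same pipeline goes through the same format("kafka") sink and would look roughly like this (a sketch; parquetSchema and the checkpoint path are placeholders I made up):

// Streaming variant (sketch): readStream/writeStream instead of read/write,
// still via the spark-sql-kafka-0-10 artifact.
val query = spark.readStream
  .schema(parquetSchema) // streaming file sources require an explicit schema; placeholder
  .parquet("s3a://my-bucket/input/")
  .select(to_json(struct("*")).alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "hm-kafka-kafka-bootstrap.hm-kafka.svc:9092")
  .option("topic", "my-topic")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/my-topic/") // placeholder; required by the Kafka sink
  .start()
query.awaitTermination()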
Then I found these two libraries:
- spark-streaming-kafka-0-10: Spark Integration For Kafka 0.10 (usage sketched after this list)
- spark-sql-kafka-0-10: Kafka 0.10+ Source For Structured Streaming
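From what I can tell, spark-streaming-kafka-0-10 targets the older DStream API rather than DataFrames, and it only seems to cover the consumer side. To make the comparison concrete, its usage looks roughly like this (a sketch based on the KafkaUtils examples; the group id and batch interval are values I made up):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// DStream-based consumer via "org.apache.spark" %% "spark-streaming-kafka-0-10" (sketch)
val ssc = new StreamingContext(spark.sparkContext, Seconds(5)) // made-up batch interval
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "hm-kafka-kafka-bootstrap.hm-kafka.svc:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-consumer-group" // made-up value
)
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
)
stream.map(_.value).print() // records are ConsumerRecord[String, String]
ssc.start()
ssc.awaitTermination()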
In my case it is batch processing rather than streaming. However, based on their names and descriptions, both libraries seem related to streaming. What is the difference between these two libraries?
Is there any documentation about their differences? Thanks!