
A Spark DataFrame can be written to a MongoDB collection; see https://docs.mongodb.com/spark-connector/master/python/write-to-mongodb/

But when I try to write a Spark Structured Streaming DataFrame to a MongoDB collection the same way, it does not work.

Can you please suggest a better option to achieve this than using pymongo code in a UDF?
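For context, the attempt that fails presumably looks something like the sketch below (an assumption based on the question; `csvDF` is the streaming DataFrame and the URI is a placeholder). With the 3.x connector, the `mongo` data source implements only batch writes, so Spark rejects it as a streaming sink:

```python
# Sketch of the failing attempt (mongo-spark-connector 3.x):
# the "mongo" source has no streaming sink, so start() raises an
# error that the data source does not support streamed writing.
query = (csvDF.writeStream
         .format("mongo")
         .option("uri", "mongodb://XX.XX.XX.XX:27017/test.collection")
         .start())
```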

Jacek Laskowski
avikm

2 Answers


It is resolved using a foreachBatch sink. Please find below a working sample:

def write_mongo_row(df, epoch_id):
    # Called once per micro-batch: write the batch DataFrame to
    # MongoDB using the batch connector.
    mongoURL = "mongodb://XX.XX.XX.XX:27017/test.collection"
    df.write.format("mongo").mode("append").option("uri", mongoURL).save()

query = csvDF.writeStream.foreachBatch(write_mongo_row).start()
query.awaitTermination()

I got the idea from How to use foreach or foreachBatch in PySpark to write to database?

avikm
  • Is there another alternative, other than initializing the mongo connection each time the function is called? – Induraj PR Apr 08 '21 at 11:55

Sharing an alternative solution where the configuration is handled up front, when the SparkSession is built, rather than inside the save method (to separate configuration from logic).

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.streaming import StreamingQuery

def save(message: DataFrame):
    # Called once per micro-batch; the connector picks up the output
    # URI from the SparkSession configuration below.
    message.write \
        .format("mongo") \
        .mode("append") \
        .option("database", "db_name") \
        .option("collection", "collection_name") \
        .save()

spark: SparkSession = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017") \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .master("local") \
    .getOrCreate()

df: DataFrame = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

query: StreamingQuery = df\
    .writeStream \
    .foreachBatch(save) \
    .start()

query.awaitTermination()
Pardeep
  • Support for Apache Structured Streaming was added recently in version 10 of the MongoDB Spark Connector. See https://www.mongodb.com/blog/post/new-mongodb-spark-connector, https://www.mongodb.com/blog/post/introducing-mongodb-spark-connector-version-10-1 and https://www.mongodb.com/docs/spark-connector/current/structured-streaming/ – Robert Walters Jan 22 '23 at 21:29
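As the comment above notes, Mongo Spark Connector 10.x adds a native structured-streaming sink, so `foreachBatch` is no longer required. A minimal sketch under that assumption (the connector version, URI, database, collection, and checkpoint path are placeholders; adjust to your cluster):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("MyApp")
         # Connector 10.x artifact; match the Scala version to your Spark build.
         .config("spark.jars.packages",
                 "org.mongodb.spark:mongo-spark-connector_2.12:10.1.1")
         .master("local")
         .getOrCreate())

df = (spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load())

# With connector 10.x the data source short name is "mongodb"
# (not "mongo"), and streaming writes require a checkpoint location.
query = (df.writeStream
         .format("mongodb")
         .option("spark.mongodb.connection.uri", "mongodb://localhost:27017")
         .option("spark.mongodb.database", "db_name")
         .option("spark.mongodb.collection", "collection_name")
         .option("checkpointLocation", "/tmp/mongo-checkpoint")
         .outputMode("append")
         .start())

query.awaitTermination()
```

This removes the per-batch write function entirely and lets the connector manage the streaming write.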