
I am working on creating a pipeline from MongoDB to Databricks. Based on my research, there are two ways of doing it:

  1. MongoDB Change Streams
  2. MongoDB-Databricks Connector for Structured Streaming.

I am using PySpark.

I am doing this to get all the collections from the database:


    database = connection[database_name]
    collections_name_list = database.list_collection_names()
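
For reference, connection above is a MongoDB client. A minimal sketch of how it can be created with PyMongo (assuming pymongo is installed and reusing the same connection_string as the stream below):

    from pymongo import MongoClient

    # Sketch: build the client from the MongoDB connection URI
    connection = MongoClient(connection_string)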

And then this to ingest data:


    def doThis(batch_data, batch_id):
        # foreachBatch handler: append each micro-batch to the Delta table
        batch_data.write.mode("append").format("delta").saveAsTable("demo.practice_schema.demo_mongodb")

    dataStreamWriter = (spark.readStream
                            .format("mongodb")
                            .option("spark.mongodb.connection.uri", connection_string)
                            .option("spark.mongodb.database", database_name)
                            .option("spark.mongodb.collection", current_collection_name)
                            .option("spark.mongodb.change.stream.publish.full.document.only", "true")
                            .option("pipeline", "[{'$match':{'operationType':{'$in': ['insert', 'update', 'replace']} }},{'$project':{'fullDocument':1}}]")
                            .load()
                            .writeStream
                            .foreachBatch(doThis)
                            .start())
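
For context, current_collection_name comes from a loop over the collection list from earlier. A rough sketch of that driver loop (the start_stream helper and the queries list are illustrative names only, and the pipeline option is omitted here for brevity):

    def start_stream(collection_name):
        # Sketch: the same readStream -> writeStream pattern as above,
        # parameterized by the collection name
        return (spark.readStream
                    .format("mongodb")
                    .option("spark.mongodb.connection.uri", connection_string)
                    .option("spark.mongodb.database", database_name)
                    .option("spark.mongodb.collection", collection_name)
                    .option("spark.mongodb.change.stream.publish.full.document.only", "true")
                    .load()
                    .writeStream
                    .foreachBatch(doThis)
                    .start())

    # one streaming query per collection discovered earlier
    queries = [start_stream(name) for name in collections_name_list]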

The readStream works fine, but I get an error while writing to the Delta table:

py4j.protocol.Py4JJavaError: An error occurred while calling o703.saveAsTable.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13727.0 failed 4 times, most recent failure: Lost task 0.3 in stage 13727.0 (TID 24804) (10.103.4.5 executor 0): com.mongodb.spark.sql.connector.exceptions.MongoSparkException: Could not create the change stream cursor.
    at com.mongodb.spark.sql.connector.read.MongoMicroBatchPartitionReader.getCursor(MongoMicroBatchPartitionReader.java:184)
    at com.mongodb.spark.sql.connector.read.MongoMicroBatchPartitionReader.next(MongoMicroBatchPartitionReader.java:99)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:120)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:158)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
    at scala.Option.exists(Option.scala:376)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:761)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:761)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:464)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeWrite$3(FileFormatWriter.scala:316)
    at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:174)
    at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:142)
    at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:41)
    at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:99)
    at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:104)
    at scala.util.Using$.resource(Using.scala:269)
    at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:103)
    at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:142)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.Task.run(Task.scala:97)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:904)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1713)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:907)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:761)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: com.mongodb.MongoCommandException: Command failed with error 2 (BadValue): 'Change stream must be followed by a match and then a project stage' on server wd-dev-cosmosdb-serverless-eastus.mongo.cosmos.azure.com:10255. The full response is {"ok": 0.0, "errmsg": "Change stream must be followed by a match and then a project stage", "code": 2, "codeName": "BadValue"}

Can someone help me out?

toyota Supra

1 Answer


Based on https://www.mongodb.com/docs/spark-connector/current/configuration/read/, try changing

    .option("pipeline", "[{'$match':{ ...

to

    .option("aggregation.pipeline", "[{'$match':{ ...
Buzz Moschetti