
I have created an Azure Cosmos DB account using the MongoDB API. I need to connect Cosmos DB (MongoDB API) to an Azure Databricks cluster in order to read and write data from Cosmos.

How do I connect an Azure Databricks cluster to a Cosmos DB account?

Stennie
Chetan SP

3 Answers


Here is the PySpark code I use to connect to a Cosmos DB database through the MongoDB API from Azure Databricks (runtime 5.2 ML Beta, which includes Apache Spark 2.4.0 and Scala 2.11, with the MongoDB connector org.mongodb.spark:mongo-spark-connector_2.11:2.4.0):

from pyspark.sql import SparkSession

# Get (or create) the Spark session; on Databricks this returns the existing session
my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()

# Read the collection into a DataFrame via the MongoDB Spark connector
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", CONNECTION_STRING) \
    .load()

With a CONNECTION_STRING that looks like this: "mongodb://USERNAME:PASSWORD@testgp.documents.azure.com:10255/DATABASE_NAME.COLLECTION_NAME?ssl=true&replicaSet=globaldb"

I tried several other options (adding the database and collection names as options or as config on the SparkSession) without success. Tell me if it works for you...
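Since the question also asks about writing data back to Cosmos, here is a minimal write sketch with the same connector, assuming df is the DataFrame you want to persist (a sketch, not tested against every Cosmos DB configuration):

# Append a DataFrame to the collection named in CONNECTION_STRING
df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("uri", CONNECTION_STRING) \
    .save()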

5md

After adding the org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 package, this worked for me:

import json

# Single aggregation pipeline stage, serialized to JSON for the connector
query = {
  '$limit': 100,
}

query_config = {
  'uri': 'myConnectionString',
  'database': 'myDatabase',
  'collection': 'myCollection',
  'pipeline': json.dumps(query),
}

# Read the collection, applying the pipeline server-side
df = spark.read.format("com.mongodb.spark.sql") \
  .options(**query_config) \
  .load()
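For what it's worth, a pipeline with more than one stage can be passed the same way as a JSON array; the $match stage below is a purely hypothetical example:

# Sketch: multi-stage aggregation pipeline serialized as a JSON array
query = [
  {'$match': {'status': 'active'}},  # hypothetical filter field
  {'$limit': 100},
]
query_config['pipeline'] = json.dumps(query)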

I do, however, get this error with some collections:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.139.64.6, executor 0): com.mongodb.MongoInternalException: The reply message length 10168676 is less than the maximum message length 4194304
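That exception is the driver refusing a server reply larger than its 4 MB maximum message size (the "is less than" wording is a quirk of the driver's message text). A commonly suggested mitigation is to shrink the cursor batches so each reply stays under the cap, e.g. via the connector's batchSize read option (a hedged sketch; I have not verified this against Cosmos DB):

# Smaller cursor batches should keep each server reply under the 4 MB cap
query_config['batchSize'] = '100'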
Menzies

Answering here the same way I answered my own question.

Using Maven as the source, I installed the right library on my cluster using the coordinate

org.mongodb.spark:mongo-spark-connector_2.11:2.4.0

(for Spark 2.4).

An example of the code I used is as follows (for those who want to try it):

# Read configuration
readConfig = {
    "URI": "<URI>",
    "Database": "<database>",
    "Collection": "<collection>",
    "ReadingBatchSize": "<batchSize>"
}

# Aggregation pipeline: sort by account_contact
pipelineAccounts = "{'$sort' : {'account_contact': 1}}"

# Connect via the MongoDB Spark connector to create a Spark DataFrame
accountsTest = (spark.read
                .format("com.mongodb.spark.sql")
                .options(**readConfig)
                .option("pipeline", pipelineAccounts)
                .load())

accountsTest.select("account_id").show()
Gilmar Neves