
I wish to read documents from a MongoDB database into a PySpark DataFrame in a truly schema-less way, as part of the bronze layer of a data lake architecture on Databricks. This is important because I want no schema inference or assumptions to be made at that layer (see the architecture diagram below).

[architecture diagram]

This is an implementation of the idea presented in the following blog post: https://www.databricks.com/blog/2022/09/07/parsing-improperly-formatted-json-objects-databricks-lakehouse.html

There, raw JSON documents are read into a Delta table with the simple ["value", "time_stamp"] schema.


This allows documents with schema mismatches, duplicate columns, and heavy nesting to live together in the schema-less bronze table. That, in turn, lets us handle any backward-compatibility and schema-breaking changes in MongoDB, and enforce a meaningful schema, in the bronze-to-silver ETL logic rather than in the bronze ingestion itself (where we would lose information and break the principle of the bronze layer as raw data storage).
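
For reference, the file-based bronze ingestion from the blog post can be sketched roughly as follows (a minimal Scala sketch, assuming an Auto Loader stream; the landing path, checkpoint location, and table name are placeholders):

import org.apache.spark.sql.functions.current_timestamp

// Minimal sketch of the file-based pattern: each raw line lands as a single
// string in the "value" column, with no schema inference. Paths are placeholders.
val bronzeDf = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "text")              // read raw text, no schema inference
  .load("/mnt/landing/raw_json/")                   // placeholder landing path
  .withColumn("time_stamp", current_timestamp())

bronzeDf.writeStream
  .option("checkpointLocation", "/mnt/checkpoints/bronze_raw_json")  // placeholder
  .toTable("bronze.raw_documents")                                   // placeholder table name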

However, we have not been able to reproduce the same schema-less read logic, implemented there for JSON files, when reading from a MongoDB database with the PySpark MongoDB connector.


The option("cloudFiles.format", "text") option does not exist when using the "mongodb" read/stream format, and we do not know how to get the whole BSON document as a single string inserted into the "value" column of our created DataFrame (soon to be written to a DeltaTable).

Thank you for your help, Shay.


1 Answer


We elected to solve this by moving our bronze-level ingestion ETL to a Scala notebook that uses the MongoDB Spark connector's RDD API:

import java.time.LocalDate

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// define a custom MongoDB read configuration
// ("uri" is the MongoDB connection string, defined elsewhere in the notebook)
val readConfig = ReadConfig(Map("uri" -> uri))

// load data from MongoDB into an RDD
val mongoRDD = MongoSpark.load(sc, readConfig)

// get today's date
val today = LocalDate.now.toString

// create an RDD of tuples, where each tuple contains a JSON string and today's date
val jsonAndDateRDD = mongoRDD.map(document => (document.toJson(), today))

// define the schema for the DataFrame
val schema = StructType(List(
  StructField("Document", StringType, true),
  StructField("Date", StringType, true)
))

// convert the RDD to a DataFrame
val df = spark.createDataFrame(jsonAndDateRDD.map(Row.fromTuple(_)), schema)

Once we have a DataFrame with the chosen schema, the problem is solved; writing it to a Databricks Delta table from there is trivial.
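
For completeness, the final write can be sketched as follows (the table name and write mode here are placeholder choices):

// Minimal sketch of the final write; "bronze.mongo_raw" is a placeholder table name.
df.write
  .format("delta")
  .mode("append")
  .saveAsTable("bronze.mongo_raw")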