
I wish to read documents from a MongoDB database into a PySpark DataFrame in a truly schema-less way, as part of the bronze layer of a data lake architecture on Databricks. This is important because I want no schema inference or assumptions to be made at that layer (see the architecture diagram below).

[architecture diagram]

This is an implementation of the idea presented in the following blog post: https://www.databricks.com/blog/2022/09/07/parsing-improperly-formatted-json-objects-databricks-lakehouse.html

There, raw JSON documents are read into a Delta table with the simple ["value", "time_stamp"] schema.


This allows documents with schema mismatches, duplicate columns, and heavy nesting to live together in the schema-less bronze table. That, in turn, lets us handle any backward-compatibility and schema-breaking changes in MongoDB, and enforce a meaningful schema, in the bronze-to-silver ETL logic rather than in the bronze ingestion itself (where we would lose information and break the principle of the bronze layer as raw data storage).
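
For reference, the file-based bronze ingestion from the blog post can be sketched roughly as follows (a minimal Scala sketch, assuming an Auto Loader stream; the landing path, checkpoint location, and table name are placeholders):

import org.apache.spark.sql.functions.current_timestamp

// Minimal sketch of the file-based pattern: each raw line lands as a single
// string in the "value" column, with no schema inference. Paths are placeholders.
val bronzeDf = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "text")              // read raw text, no schema inference
  .load("/mnt/landing/raw_json/")                   // placeholder landing path
  .withColumn("time_stamp", current_timestamp())

bronzeDf.writeStream
  .option("checkpointLocation", "/mnt/checkpoints/bronze_raw_json")  // placeholder
  .toTable("bronze.raw_documents")                                   // placeholder table name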

However, we have not been able to reproduce the same schema-less read logic, implemented there for JSON files, when reading from a MongoDB database with the PySpark MongoDB connector.


The option("cloudFiles.format", "text") option does not exist when using the "mongodb" read/stream format, and we do not know how to get the whole BSON document as a single string inserted into the "value" column of our created DataFrame (soon to be written to a DeltaTable).

Thank you for your help, Shay.


1 Answer


We elected to solve this by moving our bronze-level ingestion ETL to a Scala notebook that uses the MongoDB Spark connector's RDD API:

import java.time.LocalDate

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// define a custom MongoDB read configuration
// ("uri" is the MongoDB connection string, defined elsewhere in the notebook)
val readConfig = ReadConfig(Map("uri" -> uri))

// load data from MongoDB into an RDD
val mongoRDD = MongoSpark.load(sc, readConfig)

// get today's date
val today = LocalDate.now.toString

// create an RDD of tuples, where each tuple contains a JSON string and today's date
val jsonAndDateRDD = mongoRDD.map(document => (document.toJson(), today))

// define the schema for the DataFrame
val schema = StructType(List(
  StructField("Document", StringType, true),
  StructField("Date", StringType, true)
))

// convert the RDD to a DataFrame
val df = spark.createDataFrame(jsonAndDateRDD.map(Row.fromTuple(_)), schema)

Once we have a DataFrame with the chosen schema, the problem is solved; writing it to a Databricks Delta table from there is trivial.
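
For completeness, the final write can be sketched as follows (the table name and write mode here are placeholder choices):

// Minimal sketch of the final write; "bronze.mongo_raw" is a placeholder table name.
df.write
  .format("delta")
  .mode("append")
  .saveAsTable("bronze.mongo_raw")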