I wish to read documents from a MongoDB database into a PySpark DataFrame in a truly schema-less way, as part of the bronze layer of a data lake architecture on Databricks. This is important since I want no schema inference or assumptions to be made at that layer (see the architecture described below).
This is an implementation of the idea presented in the following blog post: https://www.databricks.com/blog/2022/09/07/parsing-improperly-formatted-json-objects-databricks-lakehouse.html
There, raw JSON documents are read into a DeltaTable with the simple ["value", "time_stamp"] schema.
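Roughly, as we understand the blog post, the bronze ingestion looks something like this (the paths, checkpoint location and table name below are placeholders of ours, not taken from the post):

```python
from pyspark.sql import functions as F

# Schema-less bronze ingestion, as we understand the blog post:
# each raw JSON file is read as plain text (no schema inference at all)
# and lands as a single string in the "value" column.
raw_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "text")
    .option("wholeText", "true")  # keep each file as one string, even if it spans lines
    .load("/mnt/landing/mongo_exports/")
    .select(
        F.col("value"),
        F.current_timestamp().alias("time_stamp"),
    )
)

(
    raw_df.writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/raw_docs")
    .toTable("bronze.raw_docs")  # DeltaTable with the ["value", "time_stamp"] schema
)
```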
This allows schema mismatches, duplicate columns and heavily nested documents to live together in the schema-less bronze table. This, in turn, allows us to deal with any backward-compatibility and schema-breaking changes in MongoDB, and to enforce a meaningful schema, in the bronze-to-silver ETL logic rather than in the bronze ingestion itself (where doing so would lose information and violate the principle of the bronze layer as raw data storage).
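For instance, the bronze-to-silver step we have in mind would look roughly like this (the schema, column names and table name are purely illustrative):

```python
from pyspark.sql import functions as F, types as T

# Illustrative bronze-to-silver step: only here do we impose a meaningful schema,
# by parsing the raw JSON string stored in the bronze "value" column.
silver_schema = T.StructType([
    T.StructField("_id", T.StringType()),
    T.StructField("customer", T.StringType()),
    T.StructField("amount", T.DoubleType()),
])

silver_df = (
    spark.read.table("bronze.raw_docs")
    .select(F.from_json("value", silver_schema).alias("doc"), "time_stamp")
    .select("doc.*", "time_stamp")
)
```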
However, we have not been able to reproduce the same schema-less read logic when reading from a MongoDB database with the PySpark MongoDB connector instead of from JSON files. The option("cloudFiles.format", "text") option does not exist when using the "mongodb" read/stream format, and we do not know how to get the whole BSON document inserted as a single string into the "value" column of our created DataFrame (soon to be written to a DeltaTable).
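For reference, this is roughly what our current read attempt looks like (connection URI, database and collection names are placeholders):

```python
# Current attempt with the MongoDB Spark connector (v10-style options, placeholders throughout).
# The connector samples documents and infers a schema, which is exactly what we want to avoid;
# we have found no option equivalent to cloudFiles.format = "text" that would give us
# the whole BSON document as a single string in a "value" column.
mongo_df = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://<host>:27017")
    .option("database", "<database>")
    .option("collection", "<collection>")
    .load()
)
```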
Thank you for your help, Shay.