The accepted answer seems to be outdated as of today, but it did lead me to a working version, so thank you.
Here is my working version of the code:
import pyspark.sql.functions as sfunc
from pyspark.sql.types import StructType, StructField, StringType
# This user-defined function turns a string ID like "5b8f7fe430c49e04fdb91599"
# into the struct {"oid": "5b8f7fe430c49e04fdb91599"},
# which MongoDB will recognize as an ObjectId.
udf_struct_id = sfunc.udf(
    lambda x: (str(x),),
    StructType([StructField("oid", StringType(), True)])
)
df = df.withColumn('future_object_id_field', udf_struct_id('string_object_id_column'))
My setup: MongoDB 4.0, Docker image for Spark gettyimages/spark:2.3.1-hadoop-3.0, Python 3.6.
The documentation for the PySpark MongoDB connector gave me the idea to name the field oid, which is what MongoDB needs in order to recognize the field as an ObjectId.
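For completeness, here is a minimal sketch of how the resulting DataFrame could be written back to MongoDB, assuming the MongoDB Spark connector 2.x is on the classpath and spark.mongodb.output.uri is configured; the database and collection names are placeholders, not from my actual setup:

# Minimal write-back sketch (placeholder names "mydb" / "mycollection"):
# persists the DataFrame so 'future_object_id_field' is stored as an ObjectId.
df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("database", "mydb") \
    .option("collection", "mycollection") \
    .save()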