I am using mongo-spark-connector_2.12:10.1.1
and I'm trying to save a dataframe to MongoDB.
Here are my MongoDB write configuration and code:
def write_mongo_batches(self, df):
    return df.writeStream \
        .option('checkpointLocation', self.checkpointLocation) \
        .foreachBatch(self.write_mongo_collection) \
        .start()

def write_mongo_collection(self, df, epoch_id):
    df \
        .write \
        .format('mongodb') \
        .mode('append') \
        .option('database', self.database) \
        .option('collection', self.collection_name) \
        .option('idFieldList', self.shardkey) \
        .option('operationType', 'update') \
        .save()
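For reference, the stream is started roughly like this (simplified sketch; transformed_df stands in for my actual streaming DataFrame):

# Simplified wiring, for illustration only;
# transformed_df is a placeholder for my real streaming DataFrame
query = self.write_mongo_batches(transformed_df)
query.awaitTermination()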
My input is a Postgres table with two columns, collection_id and subscriber_jid.
My Spark transformation code looks like this:

from pyspark.sql import functions as F

def create_subscriptions_table(df):
    split_array = F.split(F.col('collection_id'), '\\.')
    return df \
        .filter(df.collection_id.isNotNull() & df.subscriber_jid.isNotNull()) \
        .withColumn('namespaceId', F.when(F.size(split_array) > 1, split_array.getItem(0)).otherwise('DEFAULT')) \
        .withColumn('collectionId', F.col('collection_id')) \
        .withColumn('subscriberIds', F.array(F.col('subscriber_jid'))) \
        .withColumn('_id', F.col('collection_id')) \
        .select(F.col('_id'), F.col('namespaceId'), F.col('collectionId'), F.col('subscriberIds'))
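To make the behaviour concrete, here is a minimal, self-contained run of the function above on two made-up rows that share a collection_id (the data is invented purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Two hypothetical rows for the same collection, different subscribers
df = spark.createDataFrame(
    [('ns1.col1', 'subscriberId1'), ('ns1.col1', 'subscriberId2')],
    ['collection_id', 'subscriber_jid'])

create_subscriptions_table(df).show(truncate=False)
# Each input row becomes its own output row with a one-element
# subscriberIds array; nothing in the transformation merges rows
# that share the same _id.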
My output MongoDB document looks like this, where subscriberIds is an array:

{
  "_id": "collectionId",
  "collectionId": "collectionId",
  "namespaceId": "DEFAULT",
  "subscriberIds": [
    "subscriberId1"
  ]
}
I am trying to take the subscriber_jid from each input row and append it to the subscriberIds array in the output document. My problem is that the append never happens: the write replaces the document as a whole, so the subscriberIds array always has exactly one entry. What am I doing wrong? How can I append another subscriber_jid to the subscriberIds array?
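If the connector can't express this natively, one direction I'm considering is bypassing it inside foreachBatch and issuing an explicit $addToSet per row with pymongo. A rough sketch of that idea (the mongo_uri attribute and the pymongo usage are hypothetical additions, not part of my current code):

from pymongo import MongoClient, UpdateOne

def upsert_subscribers(self, df, epoch_id):
    # Pull plain strings out of self so the closure below doesn't
    # capture the whole object (which may not be serializable).
    mongo_uri = self.mongo_uri  # hypothetical attribute
    database, collection = self.database, self.collection_name

    def process_partition(rows):
        client = MongoClient(mongo_uri)
        coll = client[database][collection]
        ops = [
            UpdateOne(
                {'_id': row['_id']},
                {'$setOnInsert': {'namespaceId': row['namespaceId'],
                                  'collectionId': row['collectionId']},
                 # $addToSet avoids both whole-document replacement
                 # and duplicate subscriber entries
                 '$addToSet': {'subscriberIds': {'$each': row['subscriberIds']}}},
                upsert=True)
            for row in rows
        ]
        if ops:
            coll.bulk_write(ops)
        client.close()

    df.foreachPartition(process_partition)

That said, I'd prefer to stay with the connector if it supports array appends directly.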
There is a similar question on SO, but it's unanswered.