I am using mongo-spark-connector_2.12:10.1.1
and I'm trying to save a dataframe to MongoDB.
Here are my MongoDB write configuration and code:
def write_mongo_batches(self, df):
    return df.writeStream \
        .option('checkpointLocation', self.checkpointLocation) \
        .foreachBatch(self.write_mongo_collection) \
        .start()

def write_mongo_collection(self, df, epoch_id):
    df \
        .write \
        .format('mongodb') \
        .mode('append') \
        .option('database', self.database) \
        .option('collection', self.collection_name) \
        .option('idFieldList', self.shardkey) \
        .option('operationType', 'update') \
        .save()
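For reference, the stream is started roughly like this (simplified sketch; transformed_df stands in for my actual streaming DataFrame):

# Simplified wiring, for illustration only;
# transformed_df is a placeholder for my real streaming DataFrame
query = self.write_mongo_batches(transformed_df)
query.awaitTermination()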
My input is a Postgres table with two columns, collection_id and subscriber_jid.
My Spark transformation code looks like this:

from pyspark.sql import functions as F

def create_subscriptions_table(df):
    split_array = F.split(F.col('collection_id'), '\\.')
    return df \
        .filter(df.collection_id.isNotNull() & df.subscriber_jid.isNotNull()) \
        .withColumn('namespaceId', F.when(F.size(split_array) > 1, split_array.getItem(0)).otherwise('DEFAULT')) \
        .withColumn('collectionId', F.col('collection_id')) \
        .withColumn('subscriberIds', F.array(F.col('subscriber_jid'))) \
        .withColumn('_id', F.col('collection_id')) \
        .select(F.col('_id'), F.col('namespaceId'), F.col('collectionId'), F.col('subscriberIds'))
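To make the behaviour concrete, here is a minimal, self-contained run of the function above on two made-up rows that share a collection_id (the data is invented purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Two hypothetical rows for the same collection, different subscribers
df = spark.createDataFrame(
    [('ns1.col1', 'subscriberId1'), ('ns1.col1', 'subscriberId2')],
    ['collection_id', 'subscriber_jid'])

create_subscriptions_table(df).show(truncate=False)
# Each input row becomes its own output row with a one-element
# subscriberIds array; nothing in the transformation merges rows
# that share the same _id.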
My output MongoDB document looks like this, where subscriberIds is an array:

{
  "_id": "collectionId",
  "collectionId": "collectionId",
  "namespaceId": "DEFAULT",
  "subscriberIds": [
    "subscriberId1"
  ]
}
I am trying to take the subscriber_jid from each input row and append it to the subscriberIds array in the output document. My problem is that the append never happens: the write replaces the document as a whole, so the subscriberIds array always has exactly one entry. What am I doing wrong? How can I append another subscriber_jid to the subscriberIds array?
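If the connector can't express this natively, one direction I'm considering is bypassing it inside foreachBatch and issuing an explicit $addToSet per row with pymongo. A rough sketch of that idea (the mongo_uri attribute and the pymongo usage are hypothetical additions, not part of my current code):

from pymongo import MongoClient, UpdateOne

def upsert_subscribers(self, df, epoch_id):
    # Pull plain strings out of self so the closure below doesn't
    # capture the whole object (which may not be serializable).
    mongo_uri = self.mongo_uri  # hypothetical attribute
    database, collection = self.database, self.collection_name

    def process_partition(rows):
        client = MongoClient(mongo_uri)
        coll = client[database][collection]
        ops = [
            UpdateOne(
                {'_id': row['_id']},
                {'$setOnInsert': {'namespaceId': row['namespaceId'],
                                  'collectionId': row['collectionId']},
                 # $addToSet avoids both whole-document replacement
                 # and duplicate subscriber entries
                 '$addToSet': {'subscriberIds': {'$each': row['subscriberIds']}}},
                upsert=True)
            for row in rows
        ]
        if ops:
            coll.bulk_write(ops)
        client.close()

    df.foreachPartition(process_partition)

That said, I'd prefer to stay with the connector if it supports array appends directly.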
There is a similar question on SO, but it's unanswered.