
I am trying to store an Apache Spark DataFrame in MongoDB using Scala, but while saving the DataFrame the write fails with: Caused by: org.bson.BsonMaximumSizeExceededException: Payload document size is larger than maximum of 16777216.

Code Snippet:

 val spark = SparkSession.builder()
   .appName("User Network Graph")
   .config("spark.mongodb.input.uri", "mongodb://mongo/socio.d3raw")
   .config("spark.mongodb.output.uri", "mongodb://mongo/socio.d3raw")
   .master("yarn")
   .getOrCreate()

 // Convert the graph sequence into a DataFrame
 val rawD3str = seqGraph.toDF()

 // Write the DataFrame to the socio.d3raw collection via the MongoDB Spark connector
 MongoSpark.write(rawD3str)
   .option("spark.mongodb.output.uri", "mongodb://mongo/socio")
   .option("collection", "d3raw")
   .mode("append")
   .save()

Error stack trace:

    ... 0 failed 4 times, most recent failure: Lost task 0.3 in stage 332.0 (TID 11617, hadoop-node022, executor 1): org.bson.BsonMaximumSizeExceededException: Payload document size is larger than maximum of 16777216.
        at com.mongodb.internal.connection.BsonWriterHelper.writePayload(BsonWriterHelper.java:68)
        at com.mongodb.internal.connection.CommandMessage.encodeMessageBodyWithMetadata(CommandMessage.java:147)
        at com.mongodb.internal.connection.RequestMessage.encode(RequestMessage.java:138)
        at com.mongodb.internal.connection.CommandMessage.encode(CommandMessage.java:61)
        at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:248)
        at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:99)
        at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:450)
        at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:72)
        at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:226)
        at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:269)
        at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131)
        at com.mongodb.operation.MixedBulkWriteOperation.executeCommand(MixedBulkWriteOperation.java:435)
        at com.mongodb.operation.MixedBulkWriteOperation.executeBulkWriteBatch(MixedBulkWriteOperation.java:261)
        at com.mongodb.operation.MixedBulkWriteOperation.access$700(MixedBulkWriteOperation.java:72)
        at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:205)
        at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:196)
        at com.mongodb.operation.OperationHelper.wi
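
The exception indicates that at least one document in the write batch serializes to more than 16 MB of BSON. As a rough, hedged diagnostic (not part of the original job), one can approximate each row's serialized size via its JSON representation to see whether a single row is the culprit; JSON length only approximates BSON size:

 // Diagnostic sketch: estimate per-row size to spot rows near the 16 MB limit
 import spark.implicits._

 val approxSizes = rawD3str.toJSON.map(_.getBytes("UTF-8").length)
 val maxBytes = approxSizes.reduce((a, b) => math.max(a, b))
 println(s"Largest row is roughly $maxBytes bytes (MongoDB's limit is 16777216 bytes)")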

ameen
  • How can I overcome this MongoDB size limit while working with Apache Spark? Can we use GridFS to store a DataFrame larger than 16 MB? – ameen Mar 05 '20 at 12:46

1 Answer


MongoDB has a 16MB document size limit. See https://docs.mongodb.com/manual/core/document/#document-size-limit for more details. It sounds like what you're trying to save is larger than 16 MB.
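
If the oversized payload comes from a single row carrying a very large array, one common workaround is to split that array into smaller chunks before writing, so each MongoDB document stays well under 16 MB. A minimal sketch, assuming a hypothetical DataFrame with a graphId column and a large edges array (the column names and chunk size are illustrative, not from the question):

 import org.apache.spark.sql.functions._
 import spark.implicits._

 // Hypothetical input: one logical graph whose edge list is too large for one document
 val big = Seq(("graph-1", (1 to 2000000).toArray)).toDF("graphId", "edges")

 val chunkSize = 10000 // tune so each chunk serializes well under 16 MB

 // posexplode flattens the array with positions; grouping positions into buckets
 // of chunkSize turns one oversized document into many small ones
 val chunked = big
   .select($"graphId", posexplode($"edges").as(Seq("pos", "edge")))
   .withColumn("chunkId", ($"pos" / chunkSize).cast("int"))
   .groupBy($"graphId", $"chunkId")
   .agg(collect_list($"edge").as("edges"))

The chunked DataFrame can then be written with MongoSpark exactly as in the question; each chunk becomes its own document, and the application reassembles the graph by querying all chunks for a given graphId.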

Lauren Schaefer
  • I'm looking for a way to overcome MongoDB's size limit, which I'm hitting while storing an Apache Spark DataFrame into MongoDB – ameen Mar 05 '20 at 12:53
  • I don't have experience with Apache Spark DataFrames, so I'm not sure exactly what you're trying to import. One thing you could try is using GridFS: https://docs.mongodb.com/manual/core/gridfs/. Another option is to store large files outside of the database in something like an S3 bucket: https://www.mongodb.com/blog/post/handling-files-using-mongodb-stitch-and-aws-s3 – Lauren Schaefer Mar 05 '20 at 16:40
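
Following up on the GridFS suggestion in the comment above, a minimal sketch using the MongoDB Java sync driver's GridFS API could look like the following. The bucket name, file name, and the step of serializing the DataFrame to JSON on the driver are assumptions for illustration, not something the Spark connector does for you; GridFS stores the payload as small chunks, so the 16 MB document limit does not apply, but the result is a file rather than queryable documents:

 import com.mongodb.client.MongoClients
 import com.mongodb.client.gridfs.GridFSBuckets
 import java.io.ByteArrayInputStream

 // Hedged sketch: serialize the DataFrame to a single JSON array on the driver.
 // This collects all rows to the driver, so it only suits moderately sized results.
 val jsonBytes = rawD3str.toJSON.collect().mkString("[", ",", "]").getBytes("UTF-8")

 val client = MongoClients.create("mongodb://mongo")
 val bucket = GridFSBuckets.create(client.getDatabase("socio"), "d3rawFiles") // bucket name is illustrative

 // GridFS splits the stream into chunks, sidestepping the 16 MB document limit
 val fileId = bucket.uploadFromStream("d3raw.json", new ByteArrayInputStream(jsonBytes))
 println(s"Stored DataFrame JSON in GridFS with id $fileId")

 client.close()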