
I'm using Spark to read from MongoDB with the MongoDB Connector for Spark, but I'm having issues with nested documents. For example, my MongoDB documents look like this:

/* 1 */
{
  "_id": "user001",
  "group": 100,
  "profile": {
    "age": 21,
    "fname": "John",
    "lname": "Doe",
    "email": "johndoe@example.com"
  }
}

/* 2 */
{
  "_id": "user002",
  "group": 400,
  "profile": {
    "age": 23,
    "fname": "Jane",
    "lname": "Doe",
    "email": "janedoe@example.com"
  }
}
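
(For anyone wanting to reproduce: the two documents can be seeded through the connector itself. A minimal sketch, assuming the SparkSession shown further down and hypothetical SeedUser/SeedProfile case classes; the uri/collection/database placeholders are the same ones as below. Note that age is a number in BSON.)

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig

case class SeedProfile(age: Int, fname: String, lname: String, email: String)
case class SeedUser(_id: String, group: Long, profile: SeedProfile)

// assumes the SparkSession and import spark.implicits._ from the snippet below
val writeConfig = WriteConfig(Map("uri" -> xxxx, "collection" -> xxx, "database" -> xxxx))
val seed = Seq(
  SeedUser("user001", 100L, SeedProfile(21, "John", "Doe", "johndoe@example.com")),
  SeedUser("user002", 400L, SeedProfile(23, "Jane", "Doe", "janedoe@example.com"))
).toDF()
MongoSpark.save(seed, writeConfig)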

And a class as follows:

case class User(_id: String, group: Option[Long], profile: Map[String, Option[String]])    

import org.apache.spark.sql.{DataFrame, SparkSession}
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

val spark = SparkSession
    .builder
    .appName("test-app")
    .enableHiveSupport()
    .getOrCreate()

import spark.implicits._

val readConfig = ReadConfig(Map("uri" -> xxxx, "collection" -> xxx, "database" -> xxxx))

val userMongoDF: DataFrame = MongoSpark.load[User](spark, readConfig)
val userDF: DataFrame = userMongoDF.filter(userMongoDF("group") > 50)

userDF.printSchema()
userDF.show()

which gives the following:

root
 |-- _id: string (nullable = true)
 |-- group: long (nullable = true)
 |-- profile: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



+-------+------+--------------------------------------------------------------------------+
|_id    |group |profile                                                                   |
+-------+------+--------------------------------------------------------------------------+
|user001|100   |Map(age -> 21, email -> johndoe@example.com, fname -> John, lname -> Doe) |
|user002|400   |Map(age -> 23, email -> janedoe@example.com, fname -> Jane, lname -> Doe) |
+-------+------+--------------------------------------------------------------------------+
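
Note that age is stored as a number in MongoDB but comes back as a string inside the map, because the case class declares the profile values as Option[String]. A nested case class might be a cleaner mapping; a minimal sketch (I haven't verified whether it avoids the error below):

case class Profile(age: Option[Int], fname: Option[String], lname: Option[String], email: Option[String])
case class TypedUser(_id: String, group: Option[Long], profile: Option[Profile])

val typedDF: DataFrame = MongoSpark.load[TypedUser](spark, readConfig)
typedDF.printSchema() // profile should now be a struct, making profile.email a plain string column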

But when I select the nested field profile, I encounter this error message:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

userDF.select(col("_id"), col("profile.email").cast(StringType) as "email").show()

ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 437, XXXXX, executor 7): org.bson.BsonInvalidOperationException: Invalid state INITIAL

By contrast, a select that does not touch profile works fine: userDF.select(col("_id")).show()
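
For reference, the dotted path on a map column is just a key lookup, so getItem should be an equivalent spelling; and loading without the type parameter makes the connector infer the schema by sampling, which I believe turns profile into a struct rather than a map. A sketch of both (untested against this error):

import org.apache.spark.sql.functions.col

// Same lookup, spelled with getItem on the MapType column
userDF.select(col("_id"), col("profile").getItem("email").as("email")).show()

// Infer the schema by sampling instead of supplying the case class;
// nested documents should then come back as structs, not maps
val inferredDF: DataFrame = MongoSpark.load(spark, readConfig)
inferredDF.printSchema()
inferredDF.select(col("_id"), col("profile.email")).show()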

Spark version 2.1.0 and MongoDB 3.6.19.

DevEx
  • Does this help: https://stackoverflow.com/questions/55878401/caused-by-org-bson-bsoninvalidoperationexception-invalid-state-initial – mck Dec 07 '20 at 11:13
  • @mck: I didn't quite get anything from the comments. – DevEx Dec 07 '20 at 21:34

0 Answers