
I have a collection `col` that contains

{
   '_id': ObjectId(...),
   'type': "a",
   'f1': data1
}

In the same collection I also have

{
   '_id': ObjectId(...),
   'f2': 222.234,
   'type': "b"
}

The Spark MongoDB connector is not working correctly: it reorders the data into the wrong fields.

For example, given these two documents:

{
   '_id': ObjectId(...),
   'type': "a",
   'f1': data1
}


{
   '_id': ObjectId(...),
   'f1': data2,
   'type': "a"
}

the resulting RDD will be:

------------------------
|  id  |  f1   | type  |
------------------------
| .... |  a    | data1 |
| .... | data2 | a     |
------------------------
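
For reference, a minimal sketch of the kind of read that shows this, assuming the MongoDB Spark Connector 2.x (the URI, database, and collection names are placeholders):

    # Minimal sketch of the read, assuming MongoDB Spark Connector 2.x;
    # the URI, database, and collection names are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("polymorphic-read")
             .config("spark.mongodb.input.uri", "mongodb://localhost/db.col")
             .getOrCreate())

    # The connector samples documents to infer a single flat schema for the
    # whole collection, which is where mixed document shapes cause trouble.
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df.show()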

Are there any suggestions for working with a polymorphic schema?


1 Answer


> Are there any suggestions for working with a polymorphic schema?

(Opinion alert) The best suggestion is not to have one in the first place. It is impossible to maintain in the long term, extremely error-prone, and requires complex compensation on the client side.

What to do if you have one:

  • You can try using the Aggregation Framework with `$project` to sanitize the data before it is fetched into Spark (sketched after this list). See the Aggregation section of the docs for an example.
  • Don't try to couple it with a structured format. Use RDDs, fetch the data as plain Python dicts, and deal with the problem manually; see the sketch after this list.
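
A minimal sketch combining both points, assuming `pymongo` is available (the connection details, field names, and the `by_type` mapping are illustrative only): a `$project` stage pins an explicit field layout server-side, the results come back as plain Python dicts, and only then are they turned into an RDD.

    # Sketch combining both suggestions (connection details are placeholders):
    # sanitize the documents server-side with $project, fetch them as plain
    # Python dicts via pymongo, and only then build an RDD.
    from pymongo import MongoClient
    from pyspark import SparkContext

    sc = SparkContext(appName="polymorphic-fetch")
    client = MongoClient("mongodb://localhost:27017")  # hypothetical host
    col = client["db"]["col"]                          # hypothetical names

    # $project pins an explicit field layout; $ifNull fills in fields a
    # document does not have, so every dict comes back with the same keys.
    pipeline = [
        {"$project": {
            "type": 1,
            "f1": {"$ifNull": ["$f1", None]},
            "f2": {"$ifNull": ["$f2", None]},
        }}
    ]
    docs = [
        {"_id": str(d["_id"]), "type": d["type"], "f1": d["f1"], "f2": d["f2"]}
        for d in col.aggregate(pipeline)
    ]

    # Plain dicts in an RDD: no schema inference, so nothing gets reordered.
    rdd = sc.parallelize(docs)
    by_type = rdd.map(lambda d: (d["type"], d))  # e.g. branch on 'type' manually
    print(by_type.take(2))

Note that pulling everything through a single `pymongo` cursor and `sc.parallelize` only makes sense for modest data sizes; for large collections, check whether your connector version can accept an aggregation pipeline directly so the work stays distributed.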
  • Aggregation is a good idea. I thought to use it in `PySpark` with `pyMongo` and create an `rdd` with a map process – Yehuda Dec 13 '17 at 17:09