
I have a collection `col` that contains

{
   '_id': ObjectId(...),
   'type': "a",
   'f1': data1
}

In the same collection I also have

{
   '_id': ObjectId(...),
   'f2': 222.234,
   'type': "b"
}

The Spark MongoDB connector is not working correctly: it reorders the data into the wrong fields.

For example, given these two documents:

{
   '_id': ObjectId(...),
   'type': "a",
   'f1': data1
}


{
   '_id': ObjectId(...),
   'f1': data2,
   'type': "a"
}

the resulting RDD will be:

------------------------
|  id  |  f1   | type  |
------------------------
| .... |  a    | data1 |
| .... | data2 | a     |
------------------------
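
For reference, a minimal sketch of the kind of read that shows this, assuming the MongoDB Spark Connector 2.x (the URI, database, and collection names are placeholders):

    # Minimal sketch of the read, assuming MongoDB Spark Connector 2.x;
    # the URI, database, and collection names are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("polymorphic-read")
             .config("spark.mongodb.input.uri", "mongodb://localhost/db.col")
             .getOrCreate())

    # The connector samples documents to infer a single flat schema for the
    # whole collection, which is where mixed document shapes cause trouble.
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df.show()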

Are there any suggestions for working with a polymorphic schema?


1 Answer


> Are there any suggestions for working with a polymorphic schema?

(Opinion alert) The best suggestion is not to have one in the first place. It is impossible to maintain in the long term, extremely error-prone, and requires complex compensation on the client side.

What to do if you have one:

  • You can try using the Aggregation Framework with `$project` to sanitize the data before it is fetched into Spark (sketched after this list). See the Aggregation section of the docs for an example.
  • Don't try to couple it with a structured format. Use RDDs, fetch the data as plain Python dicts, and deal with the problem manually; see the sketch after this list.
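
A minimal sketch combining both points, assuming `pymongo` is available (the connection details, field names, and the `by_type` mapping are illustrative only): a `$project` stage pins an explicit field layout server-side, the results come back as plain Python dicts, and only then are they turned into an RDD.

    # Sketch combining both suggestions (connection details are placeholders):
    # sanitize the documents server-side with $project, fetch them as plain
    # Python dicts via pymongo, and only then build an RDD.
    from pymongo import MongoClient
    from pyspark import SparkContext

    sc = SparkContext(appName="polymorphic-fetch")
    client = MongoClient("mongodb://localhost:27017")  # hypothetical host
    col = client["db"]["col"]                          # hypothetical names

    # $project pins an explicit field layout; $ifNull fills in fields a
    # document does not have, so every dict comes back with the same keys.
    pipeline = [
        {"$project": {
            "type": 1,
            "f1": {"$ifNull": ["$f1", None]},
            "f2": {"$ifNull": ["$f2", None]},
        }}
    ]
    docs = [
        {"_id": str(d["_id"]), "type": d["type"], "f1": d["f1"], "f2": d["f2"]}
        for d in col.aggregate(pipeline)
    ]

    # Plain dicts in an RDD: no schema inference, so nothing gets reordered.
    rdd = sc.parallelize(docs)
    by_type = rdd.map(lambda d: (d["type"], d))  # e.g. branch on 'type' manually
    print(by_type.take(2))

Note that pulling everything through a single `pymongo` cursor and `sc.parallelize` only makes sense for modest data sizes; for large collections, check whether your connector version can accept an aggregation pipeline directly so the work stays distributed.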
  • Aggregation is a good idea. I thought to use it in `PySpark` with `pyMongo` and create an `rdd` with a map process – Yehuda Dec 13 '17 at 17:09