
I need to collect a few key pieces of information from a large number of somewhat complex nested JSON messages whose schemas are evolving over time. Each message refers to the same type of event, but the messages are generated by several producers and come in two (and likely more in the future) schemas. The key information in each message is similar, but the mapping to those fields depends on the message type.

I can’t share the actual data but here is an example:

Message A
-header:
|-attribute1
|-attribute2
-typeA:
|-typeAStruct1:
||-property1
|-typeAStruct2:
||-property2


Message B
-attribute1
-attribute2
-contents:
|-message:
||-TypeB:
|||-property1
|||-TypeBStruct:
||||-property2
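Concretely, the two trees above might correspond to payloads like the following (a minimal sketch; all values are placeholders, since the real data can't be shared):

```python
# Hypothetical payloads matching the two trees above; values are placeholders.
message_a = {
    "header": {"attribute1": "a1", "attribute2": "a2"},
    "typeA": {
        "typeAStruct1": {"property1": "A1"},
        "typeAStruct2": {"property2": "A2"},
    },
}

message_b = {
    "attribute1": "a1",
    "attribute2": "a2",
    "contents": {
        "message": {
            "TypeB": {
                "property1": "B1",
                "TypeBStruct": {"property2": "B2"},
            }
        }
    },
}
```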

I want to produce a table of data which looks something like this regardless of message type:

| MessageSchema | Property1 | Property2 |
| :------------ | :-------- | :-------- |
| MessageA      | A1        | A2        |
| MessageB      | B1        | B2        |
| MessageA      | A3        | A4        |
| MessageB      | B3        | B4        |

My current strategy is to read the data with schema A and union it with the data read with schema B. Then I can filter out the nulls that result from parsing a type A message with the B schema and vice versa. This seems very inefficient, especially once a third or fourth schema emerges. I would like to parse each message correctly on the first pass and apply the correct schema.
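The current approach can be sketched in plain Python (rather than Spark) to show why it scales poorly: every message is parsed once per schema, and the mismatches are discarded afterwards. Field names follow the example trees; the function names are hypothetical.

```python
import json

def parse_as_a(msg):
    """Extract the common fields assuming schema A; None if the shape doesn't match."""
    try:
        return {
            "MessageSchema": "MessageA",
            "Property1": msg["typeA"]["typeAStruct1"]["property1"],
            "Property2": msg["typeA"]["typeAStruct2"]["property2"],
        }
    except KeyError:
        return None

def parse_as_b(msg):
    """Extract the common fields assuming schema B; None if the shape doesn't match."""
    try:
        inner = msg["contents"]["message"]["TypeB"]
        return {
            "MessageSchema": "MessageB",
            "Property1": inner["property1"],
            "Property2": inner["TypeBStruct"]["property2"],
        }
    except KeyError:
        return None

def union_then_filter(raw_messages):
    parsed = [json.loads(r) for r in raw_messages]
    # Every message is parsed with every schema (N passes for N schemas),
    # then the null rows from the mismatched parses are filtered out.
    rows = [parse_as_a(m) for m in parsed] + [parse_as_b(m) for m in parsed]
    return [r for r in rows if r is not None]
```

With N schemas this does N parse passes over the whole input, which is the inefficiency the question describes.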

  • If you want data that evolves over time, I might suggest using Avro rather than JSON. – OneCricketeer Mar 06 '18 at 04:09
  • will your `few key pieces` change over time or is it some constant schema that will be applicable to all future messages? – Vladislav Varslavans Mar 06 '18 at 12:39
  • @VladislavVarslavans the pieces of information might change as well. Also, different schemas could refer to the same value with a slightly different name. – gearlessSheave Mar 06 '18 at 18:46
  • I had a similar case and I manage to solve it by reading the json schema first through json rapture and then by populating columns of the dataframe dynamically. I have posted an answer also [here](https://stackoverflow.com/questions/49088401/spark-from-json-with-dynamic-schema/49222024#49222024) – abiratsis Mar 11 '18 at 16:07

1 Answer


As I see it, there is only one way:

  • For each message type, create an 'adapter' that builds a dataframe from the input and transforms it to the common-schema dataframe
  • Then union the outputs of the adapters

Obviously, if you change the 'common' schema, you will need to adjust your 'adapters' as well.
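The adapter idea above can be sketched in plain Python (the answer itself is about Spark dataframes, so this is only an illustration of the dispatch pattern; the predicate and adapter names are hypothetical, and field names follow the question's example trees):

```python
import json

def is_type_a(msg):
    return "typeA" in msg

def adapt_a(msg):
    return {
        "MessageSchema": "MessageA",
        "Property1": msg["typeA"]["typeAStruct1"]["property1"],
        "Property2": msg["typeA"]["typeAStruct2"]["property2"],
    }

def is_type_b(msg):
    return "TypeB" in msg.get("contents", {}).get("message", {})

def adapt_b(msg):
    inner = msg["contents"]["message"]["TypeB"]
    return {
        "MessageSchema": "MessageB",
        "Property1": inner["property1"],
        "Property2": inner["TypeBStruct"]["property2"],
    }

# Registry of (predicate, adapter) pairs; a new schema is one more entry.
ADAPTERS = [(is_type_a, adapt_a), (is_type_b, adapt_b)]

def to_common_schema(raw):
    msg = json.loads(raw)
    for matches, adapt in ADAPTERS:
        if matches(msg):
            # Single pass: only the matching adapter runs for each message.
            return adapt(msg)
    raise ValueError("unknown message schema")
```

Each message is inspected once and handed to exactly one adapter, so adding a third or fourth schema means adding one (predicate, adapter) pair rather than another full parse-and-filter pass.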

Vladislav Varslavans