
I'm currently integrating a MongoDB database into BigQuery through Data Fusion, and I'm facing an issue with array objects. It seems that Data Fusion doesn't understand or support such data types, yet arrays do appear to be an available feature, and that's where I get confused. Let me give you an example based on one row of data.

Order_number  items
43324         [{"name":"x","price":"10"},{"name":"y","price":"20"}]

From what I've been testing, the only thing that works is to flatten items and then parse it as JSON, which gives this result:

Order_number  items_name  items_price
43324         x           10
43324         y           20
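
For reference, that flattened result comes from a Wrangler recipe along these lines (a sketch only; exact directive syntax can vary by Wrangler version, and items is the column name from the example above):

    parse-as-json :items 1
    flatten :items
    parse-as-json :items 1

The first parse turns the raw string into a JSON array, flatten emits one record per array element, and the second parse expands each element into the items_name and items_price columns.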

However, I'm trying to avoid duplicating the order number, and I want to keep items as an array so the result fits in one row:

Order_number  items_name  items_price
43324         x           10
              y           20
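
Put differently, the goal is a single output record where items stays nested, conceptually something like this JSON (shaped from the example row above):

    {
      "Order_number": 43324,
      "items": [
        {"name": "x", "price": "10"},
        {"name": "y", "price": "20"}
      ]
    }

In BigQuery terms this means items would be a repeated record (an array of structs) rather than a set of flattened scalar columns.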

Here's my workflow in Data Fusion

[screenshot of the Data Fusion pipeline]

And here's my output schema generated in Wrangler (I've added only name for testing purposes): [screenshot of the Wrangler output schema]

I propagated this schema into BigQuery, and here's the error that I'm getting:

Spark program 'phase-1' failed with error: null.

So clearly there's an option to work with arrays in Data Fusion, but something seems to be off and I don't know what. I've tested every possibility and combination I could think of, and I wasn't able to achieve any positive result.

The only alternative approach that produced a meaningful error was this: [screenshot of the alternative pipeline]

And this returns the following error:

Spark program 'phase-1' failed with error: Unable to decode array 'items'

Maybe I need to add some directives or another plugin in Data Fusion, but honestly I'm running out of ideas. I'd appreciate any suggestions on the matter.

1 Answer


Wrangler has a problem processing nested arrays/records; I couldn't manage to make it work either. My recommendation: use a JavaScript transform to process the data and generate the right JSON structure to be loaded into BigQuery.

Send all the data from GCS in one line to the JS transform, split/parse the JSON there, and build the object exactly as you want to send it to the BigQuery sink. I haven't processed data this way myself, but I expect it to work around the Wrangler limitation.
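
A minimal sketch of such a transform, assuming the incoming record carries Order_number plus the raw items JSON array as a string (field names taken from the question), and that the transform's output schema declares items as an array of records with name and price fields:

    function transform(input, emitter, context) {
      // Parse the raw JSON array string into an array of objects.
      var items = JSON.parse(input.items);

      // Emit one record per order, keeping the items array nested.
      emitter.emit({
        'Order_number': input.Order_number,
        'items': items
      });
    }

The BigQuery sink schema then needs to declare items as a repeated record so the nested array loads as-is.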

[screenshot of the suggested pipeline with a JavaScript transform]

ewertonvsilva
  • Thank you for your answer. Should it look like this? function transform(input, emitter, context) { var items = input.items; items = JSON.parse(items); emitter.emit({ 'Order_number': input.Order_number, 'items': items }); } – Marcin Stańczak Mar 26 '22 at 07:30
  • Hi @MarcinStańczak, have you managed to make it work? If not, could you please send an example file hosted on GCS (as presented in your workflow) so we can debug your code. – Mikolaj S. Apr 07 '22 at 08:26