
Please help me with copying data from Google BigQuery to Azure Data Lake Storage Gen2 with Serverless SQL Pool.

I am using Azure Synapse's Copy data pipeline. The issue is that I cannot figure out how to handle a source table from BigQuery that has a hierarchical schema. This results in missing columns and inaccurate datetime values at the sink.

The source is a Google BigQuery table made from the Google Cloud Billing export of a project's standard usage cost. The source table's schema is hierarchical with nested columns, such as service.id, service.description, sku.id, sku.description, Project.labels.key, Project.labels.value, etc.
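For context, a rough DDL-style sketch of how these nested columns are shaped is below. This is an assumption based on the column names above and Google's standard usage cost export; the table name is a placeholder and most fields are omitted:

```sql
-- Sketch only: approximate shape of the nested billing export columns.
-- STRUCT fields nest scalar values; project.labels is an ARRAY of STRUCTs.
CREATE TABLE `my_dataset.billing_export_sketch` (
  service STRUCT<id STRING, description STRING>,
  sku     STRUCT<id STRING, description STRING>,
  project STRUCT<id STRING,
                 name STRING,
                 labels ARRAY<STRUCT<key STRING, value STRING>>>,
  usage_start_time TIMESTAMP,
  cost NUMERIC
);
```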

When I click on Preview data in the Source tab of the Copy data pipeline, it only gives me the top of the column hierarchy. For example, it would only show the column name of [service] with a value of {"v":{"f":[{"v":"[service.id]"},{"v":"[service.description]"}]}}. (image: source with nested columns causing issues with the Synapse Copy Data pipeline)

I have tried to configure the Copy pipeline with the following:
Source Tab: Use query - I think the solution lies here, but I cannot figure out the syntax for selecting the proper columns. I watched the YouTube video from TechBrothersIT, How to Pass Parameters to SQL query in Azure Data Factory - ADF Tutorial 2021, but I am still unable to do it.

Sink Tab:

1. Sink dataset in various formats (CSV, JSON, and Parquet) - CSV and Parquet give similar results, and the JSON format failed.

2. Sink dataset to Azure SQL Database - failed because it is not supported with the Serverless SQL pool.

3. Mapping Tab (note: edited on Jan 22 with a screenshot to show the issue):

  1. Tried Import schemas with the Sink tab copy behavior set to None, Flatten hierarchy, and Preserve hierarchy, but I am still unable to get the source column recognized as hierarchical. The Collection reference and Advanced editor options do not show up. Ref: screenshot of source columns not detected as hierarchical; MS doc on schema and data type mapping in copy activity.

I have also tried the Data flow pipeline, but it does not support Google BigQuery as a source yet. (img: Data flow pipeline source does not support BigQuery)

Here are the steps to reproduce / get to my situation:

  1. Register with Google Cloud and set up billing export (of standard usage cost) to BigQuery.
  2. In Azure Synapse Analytics, create a Linked service with user authentication. Please follow Data Tech's YouTube video "Google BigQuery connection (or linked service) in Azure Synapse analytics".
  3. In Azure Synapse Analytics, under Integrate, click the "+" sign -> Copy Data tool.

I believe the answer lies in the Source tab with Query and Functions. Please help me figure this out, or point me in the right direction. (img: Source tab)

Looking forward to your input. Thanks in advance!

dwssc2023
  • **This results in missing columns and inaccurate datetime values at the sink.** --- 1. You are showing this in Excel. Check whether changing the date format of the columns gives you the same result as the source. 2. What columns are missing? From your screenshot, I am not able to find the missing columns. – Aswin Jan 21 '23 at 08:07
  • @Aswin. Thank you for the comment. I've changed the pipeline's Mapping tab Destination column to DateTime with the formats `"YYYY-MM-DD"`, `"YYYY-MM-DD hh:mm:ss[.nnnnnnn]"` and `"YYYY-MM-DD HH:MI:SS"`. They all gave different results: `2022-00-15`, `YYYY-12-DD 01:00:00[.nnnnnnn]` and `YYYY-12-DD 06:12I:SS`. I've tried the debug, but the output doesn't give details on how it did the formatting. As for the missing columns: for the later columns (towards the right-hand side), if they were null, the entire column would be missing. I will have to play with the Sink tab's Copy behavior to find out more. – dwssc2023 Jan 21 '23 at 09:04
  • @Aswin, with regards to the missing columns: it happens in situations where both columns `label.key` and `label.value` are null; then the result would be combined into one column instead of two. This causes column shifts in the result. For example, one row would have two columns `{"v":[{"v":{"f":[{"v":"goog-resource-type"}` `{"v":"bigquery_dataset"}]}}]}`, the row below only has one `{"v":[]}`, and then the rest of the data gets shifted to the left by one column. – dwssc2023 Jan 21 '23 at 09:58
  • Refer this [MS doc](https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#hierarchical-source-to-tabular-sink) to convert hierarchical columns to tabular columns. – Aswin Jan 21 '23 at 10:02
  • In copy activity mapping settings: 1) Import schema, 2) Toggle the advanced editor, 3) Give the array in Collection reference so that the data will be iterated and extracted at that value, 4) Map the columns from source to sink. – Aswin Jan 21 '23 at 10:04
  • @Aswin. The Mapping Import schemas option is unable to recognize the source as an Array; it is treating it as a String. The Collection reference and Advanced editor options do not show up. I will edit my question with a screenshot to reflect this issue. – dwssc2023 Jan 22 '23 at 19:10
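One way to sidestep the datetime format mismatches described in the comments above is to format the timestamp in the BigQuery source query itself rather than in the copy activity mapping. A minimal sketch, assuming the relevant column is usage_start_time; the table name is the one that appears later in this thread:

```sql
-- Format the timestamp as a plain string in the source query so the sink
-- receives an unambiguous value; FORMAT_TIMESTAMP is standard BigQuery SQL.
SELECT
  FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', usage_start_time) AS usage_start_time,
  cost
FROM `std_usage_cost.gcp_billing_export_v1_01AB70_A377D7_444D9B`;
```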

1 Answer


ADF allows you to write a query in the Google BigQuery source dataset. Therefore, write a query that unnests the nested columns using the UNNEST operator and then map the result to the sink.

I tried to repro this with a sample nested table. (img 1: nested table)

(img 2: sample data of the nested table)

Script to flatten the nested table:

select
  user_id,
  a.post_id,
  a.creation_date
from `ds1.stackoverflow_nested`
cross join unnest(comments) a

(img 3: flattened table)

  • Use this query in the copy activity source dataset. (img 4: source settings of the copy activity)
  • Then take the sink dataset, do the mapping, and execute the ADF pipeline.
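Applied to the billing export table from the question, a query along the following lines could work: STRUCT columns such as service and sku are read with dot notation, and only the ARRAY column project.labels needs UNNEST. This is a sketch only; the table name comes from the question's comments, and the scalar columns (usage_start_time, cost) are assumptions about the export schema.

```sql
-- Dot notation flattens STRUCT fields; LEFT JOIN UNNEST flattens the labels
-- array while keeping rows whose labels array is empty, which avoids the
-- column-shift problem described in the question's comments.
SELECT
  service.id          AS service_id,
  service.description AS service_description,
  sku.id              AS sku_id,
  sku.description     AS sku_description,
  label.key           AS label_key,
  label.value         AS label_value,
  usage_start_time,
  cost
FROM `std_usage_cost.gcp_billing_export_v1_01AB70_A377D7_444D9B`
LEFT JOIN UNNEST(project.labels) AS label;
```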

Reference:

  1. MS document on Google BigQuery as a source - ADF
  2. GC document on unnest operator
Aswin
  • Thank you, I tried it and got the error "ERROR [42000] [Microsoft][BigQuery] (70) Invalid query: Values referenced in UNNEST must be arrays. UNNEST contains expression of type STRUCT at [5:20] Activity ID: 8547d1bf-5d76-45f9-ade9-598c035f8f77". I tried it with `SELECT svc.id from \`std_usage_cost.gcp_billing_export_v1_01AB70_A377D7_444D9B\` cross join unnest (service) svc;`. When I select the entire [service] column, it gives me `{"v":{"f":[{"v":"24E6-581D-38E5"},{"v":"BigQuery"}]}}`. – dwssc2023 Jan 23 '23 at 15:07
  • The error statement indicates that it saw the service.id and service.description, but what does it mean by saying "UNNEST contains expression of type STRUCT"? – dwssc2023 Jan 23 '23 at 15:56
  • For struct data, it is not necessary to unnest. If your data is of struct of array type, check this [SO thread](https://stackoverflow.com/questions/58920800/unnesting-structs-in-bigquery). – Aswin Jan 23 '23 at 16:14
  • Thank you so much @Aswin, I had been stuck on this (getting data from BigQuery to Azure) for more than a week. – dwssc2023 Jan 23 '23 at 17:05