
We are designing a new ingestion framework (Cloud Storage -> BigQuery) using Cloud Functions. However, we receive some files (JSON, CSV) that are corrupted and cannot be inserted as-is (bad field names, missing columns, etc.), not even as external tables. Therefore, we would like to ingest every row into a single cell as a JSON string and deal with the issues when we cleanse the data in BigQuery.

Is there a way to do that natively and efficiently, with as little processing as possible (so the Cloud Function wouldn't time out)? I wrote a function that processes the files and wraps the lines one by one, but for bigger files that won't be an option. We would prefer to stay with Cloud Functions to keep this as lightweight as possible.

a54i

1 Answer


My option in that case is to ingest the CSV with a dummy separator, for instance # or |. I know that those characters will never appear in my files, and that's why I chose them.

That way, schema autodetect detects only 1 column and creates a table with a single string column.

If you can pick a character like that, it's the easiest solution, but without any guarantee of course (these are corrupted files, so it's hard to know in advance which characters will be unused). A sketch of what that load could look like is below.
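As an illustration only, here is a minimal sketch of a load job a Cloud Function (1st gen, GCS finalize trigger) could run for this; the `my_project.staging.raw_lines` table and the `raw_line` column are made-up placeholders, and the schema is pinned to a single STRING column instead of relying on autodetect:

```python
from google.cloud import bigquery

def load_raw_lines(event, context):
    """Load every line of the uploaded file into one STRING column."""
    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        field_delimiter="|",   # a character assumed to never appear in the data
        quote_character="",    # treat quotes as ordinary characters, keep lines verbatim
        schema=[bigquery.SchemaField("raw_line", "STRING")],  # one column per row
    )

    # The GCS trigger payload contains the bucket and object name.
    uri = f"gs://{event['bucket']}/{event['name']}"
    load_job = client.load_table_from_uri(
        uri, "my_project.staging.raw_lines", job_config=job_config
    )
    load_job.result()  # wait for the load job to finish
```

Since BigQuery itself does the loading, the function only submits the job, which keeps it lightweight even for large files. If the dummy delimiter does show up in a line, that row will split into extra columns and the load will report an error, which is the caveat mentioned above.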

guillaume blaquiere
  • Hi Guillaume, thanks, that's a great tip I hadn't thought of. This is perfect for JSON files, as we can ignore the wrong columns and only get the good ones out with JSON_VALUE(). For the files that are CSVs, I guess there is no other way but to transform them to JSON if I want to achieve the same? (field as {'col1': 'val', 'col2': 'val'}) – a54i Mar 07 '22 at 10:12
  • You can insert CSV lines as raw strings and then use string functions to parse them, but it's less easy than with JSON. For JSON, you can also use the new JSON type to easily browse the JSON tree – guillaume blaquiere Mar 07 '22 at 12:51
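To illustrate the cleansing step discussed in the comments, here is a rough sketch of how the raw JSON strings could be queried afterwards; the `raw_line` column, the `my_project.staging.raw_lines` table and the `col1`/`col2` field names are assumptions carried over from the sketch above:

```python
from google.cloud import bigquery

client = bigquery.Client()

# JSON_VALUE pulls individual fields out of the raw JSON string,
# and PARSE_JSON converts it to the native JSON type for easier browsing.
query = """
SELECT
  JSON_VALUE(raw_line, '$.col1') AS col1,
  JSON_VALUE(raw_line, '$.col2') AS col2,
  PARSE_JSON(raw_line)           AS payload
FROM `my_project.staging.raw_lines`
WHERE raw_line IS NOT NULL
"""

for row in client.query(query).result():
    print(row.col1, row.col2)
```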