
I have an ML model (a text embedding model) that outputs a 1024-element vector of floats, which I want to persist in a BigQuery table.

The individual values in the vector don't mean anything on their own; the entire vector is the feature of interest. Hence, I want to store these lists in a single column in BigQuery as opposed to one column for each float. Besides, adding 1024 extra columns to a table that originally has just 4 or 5 columns seems like a bad idea.

Is there a way of storing a Python list or an np.array in a single column in BigQuery (maybe by converting them to JSON first, or something along those lines)?

Alex Kinman
  • Why don't you use an array? https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays – Joaquim Feb 27 '20 at 08:42

1 Answer


Maybe it's not exactly what you were looking for, but the following options are the closest workarounds to what you're trying to achieve.

First of all, you can save your data locally in a CSV file with a single column and then load that file into BigQuery. There are also other file formats that can be loaded into BigQuery from a local machine that might interest you; I personally would go with CSV.

I tried this myself by creating an empty table in my dataset, without adding any fields. Then I used the code mentioned in the first link, after saving a column of random data to a CSV file.
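
A rough sketch of that flow (the project, dataset, table and file names below are placeholders, and it assumes the google-cloud-bigquery client library is installed and credentials are configured):

    import numpy as np
    from google.cloud import bigquery

    # Save a 1024-element embedding as a single-column CSV (one float per row).
    vec = np.random.rand(1024)
    np.savetxt("embedding.csv", vec.reshape(-1, 1), delimiter=",")

    client = bigquery.Client()
    table_id = "my-project.my_dataset.random_data"  # placeholder table reference

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,  # let BigQuery infer the single FLOAT column
    )

    with open("embedding.csv", "rb") as source_file:
        load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish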

If you encounter the following error regarding the permissions, see this solution. It uses an authentication key instead.

google.api_core.exceptions.Forbidden: 403 GET https://bigquery.googleapis.com/bigquery/v2/projects/project-name/jobs/job-id?location=EU: Request had insufficient authentication scopes.
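
In that case, one workaround is to build the client from a service-account key file instead of the application-default credentials. A minimal sketch, assuming you have downloaded a key to key.json (the path is a placeholder):

    from google.cloud import bigquery

    # Authenticate with a service-account key file rather than the default
    # credentials that lack the required BigQuery scopes.
    client = bigquery.Client.from_service_account_json("key.json")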

Also, you might find this link useful, in case you get the following error:

google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table my-project:my_dataset.random_data. Cannot add fields (field: double_field_0)
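
One way to get past that error (my assumption about what the linked fix boils down to, not a quote from it) is to declare the schema explicitly in the load job, or to allow the load job to add the missing field:

    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        # Either declare the column explicitly...
        schema=[bigquery.SchemaField("double_field_0", "FLOAT64")],
        # ...or let the load job add fields to the existing (empty) table.
        schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )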

Besides loading your data from a local file, you can upload the data file to Google Cloud Storage and load it from there. Many file formats are supported, such as Avro, Parquet, ORC, CSV and newline-delimited JSON.
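
A minimal sketch of the Cloud Storage route, assuming the file has already been uploaded to a bucket (the bucket and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
    )

    # Load directly from a Cloud Storage URI instead of a local file.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/embedding.csv",        # placeholder URI
        "my-project.my_dataset.random_data",   # placeholder table
        job_config=job_config,
    )
    load_job.result()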

Finally, there is an option for streaming the data directly into a BigQuery table by using the API, but it is not available on the free tier.
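
For completeness, a sketch of what a streaming insert could look like with a REPEATED FLOAT column (the table and field names are placeholders, the table is assumed to already exist with an ARRAY<FLOAT64> column as suggested in the comment above, and streaming inserts are billed):

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.embeddings"  # placeholder table with an ARRAY<FLOAT64> column

    # Stream one row whose "embedding" field holds the whole vector.
    rows = [{"embedding": [0.12, 0.34, 0.56]}]  # truncated example vector
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        print("Encountered errors while inserting rows:", errors)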

milia