I am looking for a way to store embeddings generated by a language model (such as T5) in Google BigQuery.

The embeddings are NumPy arrays or tensors.

I found 3 approaches:

  1. TFRecord: write the embeddings to a TFRecord file and store it in Cloud Storage.
  2. Convert the NumPy array to a string and store it in a STRING column of a table.
  3. Store it in a column with mode REPEATED. (I am not sure whether the order of the embedding vector entries is preserved this way.)

I would appreciate any suggestions or other approaches.

Many thanks

Luke Mao

1 Answer

Arrays are first-class citizens in BigQuery - see https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays

The mode REPEATED means that the column is an array.

E.g. a column of type STRING in mode REPEATED can only contain arrays of strings. For embeddings, the matching choice is a column of type FLOAT64 in mode REPEATED.

The order of elements is preserved. So I guess you just want to directly store your arrays as arrays in BQ.
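
To make this concrete, here is a minimal sketch using the `google-cloud-bigquery` Python client. The column names `id` and `embedding`, the helper names, and the table ID in the usage note are hypothetical; it also assumes default credentials are configured and that a tensor is converted to a NumPy array (or anything with `.tolist()`) first.

```python
from typing import Any, Dict, List


def embedding_to_row(doc_id: str, embedding: Any) -> Dict[str, Any]:
    """Turn one embedding into a JSON-serialisable row (hypothetical helper)."""
    # NumPy arrays expose .tolist(); plain sequences are copied with list().
    values = embedding.tolist() if hasattr(embedding, "tolist") else list(embedding)
    # Cast to built-in floats so the row can be serialised as JSON.
    return {"id": doc_id, "embedding": [float(v) for v in values]}


def load_embeddings(table_id: str, rows: List[Dict[str, Any]]) -> None:
    """Load rows into a table whose `embedding` column is FLOAT64 REPEATED."""
    # Imported here so the helper above works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    schema = [
        bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
        # REPEATED mode makes this column an array of FLOAT64 values.
        bigquery.SchemaField("embedding", "FLOAT64", mode="REPEATED"),
    ]
    job_config = bigquery.LoadJobConfig(schema=schema)
    job = client.load_table_from_json(rows, table_id, job_config=job_config)
    job.result()  # Wait for the load job to finish; raises on failure.
```

Usage would look something like `load_embeddings("my-project.my_dataset.embeddings", [embedding_to_row("doc-1", vec)])`, where the table ID is a placeholder for your own.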

In case you want to operate on those arrays later using SQL, have a look at UNNEST(<array>), which turns an array into a table so you can run SQL directly on its elements (using lateral joins or just a subquery).
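
As an illustration of UNNEST on an embedding column, the sketch below computes a dot product between each stored vector and a query vector passed as an array parameter. The table layout (`id`, `embedding`), the parameter name `query_vec`, and both function names are assumptions for the example, not something BigQuery prescribes.

```python
from typing import List


def build_dot_product_query(table_id: str) -> str:
    """Build SQL that ranks rows by dot product with @query_vec (hypothetical helper)."""
    # Table names cannot be query parameters, so the ID is interpolated here;
    # only do this with trusted table IDs.
    return f"""
        SELECT
          t.id,
          (SELECT SUM(e * q)
           FROM UNNEST(t.embedding) AS e WITH OFFSET i
           JOIN UNNEST(@query_vec) AS q WITH OFFSET j
           ON i = j) AS dot_product
        FROM `{table_id}` AS t
        ORDER BY dot_product DESC
        LIMIT 10
    """


def top_matches(table_id: str, query_vector: List[float]):
    """Run the query with the vector bound as an ARRAY<FLOAT64> parameter."""
    # Imported here so the SQL builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter("query_vec", "FLOAT64", query_vector)
        ]
    )
    return list(client.query(build_dot_product_query(table_id), job_config=job_config).result())
```

WITH OFFSET exposes each element's position, which is what lets the two arrays be joined element by element.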

Martin Weitzmann
  • How do you actually implement this in BigQuery? I think this is the original poster's question. – sbecon Jun 27 '21 at 23:10