0

I am trying to store messages sent from an IoT device in a BigQuery table.

The cloud architecture is as follows:

Local Device -> json_message -> mqtt_client -> GC IoT device -> Device Registry -> Pub/Sub Topic -> Dataflow with Pub/Sub Topic to BigQuery Template -> BigQuery Table

I have gotten this system working with a non-nested JSON message that is constructed like this

json_dict = {"instrument": instrument,
 "spectrum": str(self.spectrum),
 "spectrum_creation_time": self.creation_dt.timestamp(),
 "messenger_creation_time": time.time()}

return json.dumps(json_dict)

The table in BigQuery, which successfully stores this data has the following schema:

   Last modified                  Schema                 Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Labels
 ----------------- ------------------------------------ ------------ ------------- ------------ ------------------- ------------------ --------
  04 Sep 00:24:22   |- instrument: string                1277         81897474
                    |- spectrum: string
                    |- spectrum_creation_time: string
                    |- messenger_creation_time: string

Now I am trying to get this same system working with a nested JSON message, which is constructed like this:

json_dict = {'timestamp': 'AUTO',
             'values': [
                        {'id': instrument + '.Time',
                         'v': time.time(),
                         't': time.time()},
                        {'id': instrument + 'Intensity',
                         'v': str(self.spectrum),
                         't': self.creation_dt.timestamp()}
                        ]}

return json.dumps(json_dict)

I am trying to store it in a BigQuery table with the following schema:

   Last modified               Schema              Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Labels
 ----------------- ------------------------------ ------------ ------------- ------------ ------------------- ------------------ --------
  09 Sep 23:56:20   |- timestamp: timestamp        0            0
                    +- values: record (repeated)
                    |  +- time: record
                    |  |  |- id: string
                    |  |  |- v: string
                    |  |  |- t: timestamp
                    |  +- spectrum: record
                    |  |  |- id: string
                    |  |  |- v: string
                    |  |  |- t: timestamp

Unfortunately, when I try this approach I get the following error, which is output to an error table in BigQuery by DataFlow.

{"errors":[{"debugInfo":"","location":"values[0].v","message":"no such field: v.","reason":"invalid"}],"index":0}
null

What is the best way to solve this issue? I cannot change the nested JSON structure, because I am building a test suite and this is the required format.

dda2120
  • 41
  • 1
  • 5
  • There are at least two possible problems I think. First time.time() serializes to json as a number for the first value. Second, I think you the schema doesn't seem to match the struct hierarchy, it seems that you are need keys for "time" and "spectrum" to match the hierarchy. You mention the structure isn't changeable, but I don't think bigquery can do the type of inference you need here without the keys. – Micah Kornfield Sep 11 '21 at 18:49
  • @Miach Kornfield I have been assuming that the types of the elements in the json shouldn't matter since the json.dumps function turns everything into a string. Maybe this assumption is incorrect. Is str the prefered datatype for timestamps? I agree that the schema doesn't appear to match the struct hierarchy completely. I guess it is impossible to make a BigQuery schema that matches completely. If that is true, I will need to try another, completely different, approach. – dda2120 Sep 11 '21 at 22:14
  • Given the unnested column works bq might be lenient when parsing JSON. Given the architecture you have it seems there is a lot of flexibility for either modifying the struct or modifying the schema. Given the nested JSON I think "values" `repeated` (with the additional time and spectrum fields) would be the closest schema but it is a little hard to tell from the details of the question – Micah Kornfield Sep 12 '21 at 03:19
  • Thank you! The following schema timestamp, repeated worked! – dda2120 Sep 14 '21 at 21:28

1 Answers1

0

I was able to solve my issue with help from @Miach Kornfield who commented on my original question. Here is my solution.

The JSON data I sent to GCP looked like

json_dict = {"timestamp": "1631554378.2955232",
             'values': [
                        {"id":"testA.Time",
                         "v": "1631554378.2955232",
                         "t": "1631554378.2955232"},
                        {"id": "testA.Time.Intensity",
                         "v": "[1, 43, 4..]",
                         't': "1631554378.2955232"}
                         ]
            }

The original schema for my bigquery table was

Original schema for bigquery

or in text form

            Schema
------------------------------
 |- timestamp: timestamp
 +- values: record (repeated)
 |  +- time: record
 |  |  |- id: string
 |  |  |- v: string
 |  |  |- t: timestamp
 |  +- spectrum: record
 |  |  |- id: string
 |  |  |- v: string
 |  |  |- t: timestamp

The schema that worked was

schema that worked

or in text form

            Schema
------------------------------
 |- timestamp: timestamp
 +- values: record (repeated)
 |  |- id: string
 |  |- v: string
 |  |- t: timestamp

By indicating that values is of type record (repeated) it means that it is an array of structs, with the structure of the structs specified by the subcolumns. The detail (that the structure of the structs was specified by the subcolumns) was not obvious to me, and why I had so much trouble resolving this issue. I am not sure if it is possible to have a record (repeated) with heterogeneous schema.

dda2120
  • 41
  • 1
  • 5