
Following a link I found on Google, I'm trying to do a sample setup that publishes a message to Pub/Sub and loads it into a BigQuery table using Dataflow SQL.

But when I create the Dataflow job, I get the error below:

Invalid/unsupported arguments for SQL job launch: Invalid table specification in Data Catalog: Unsupported schema specified for Pubsub source in CREATE TABLE.CREATE TABLE for Pubsub topic must include at least 'event_timestamp' field of type 'TIMESTAMP'"

Kindly help me fix this and clarify the doubts below:

  1. Is it mandatory to keep the event_timestamp field in the Pub/Sub schema / Dataflow SQL / BigQuery table?
  2. When I create a Pub/Sub topic with a schema, the schema does not show up in Dataflow SQL, whereas when I assign it manually from Cloud Shell using gcloud data-catalog entries update, it does: searching for the topic name in Dataflow SQL then shows the schema. So which is the right method to assign a schema to a Pub/Sub topic?
  3. Data Catalog is also not showing the schema assigned to the Pub/Sub topic.

Let me know if any more details are required.

  • Following this [documentation](https://cloud.google.com/dataflow/docs/tutorials/join-streaming-data-with-sql#assign-pubsub-schema), did you remove the `event_timestamp` column? – Anjela B Jul 08 '22 at 03:26
  • @AnjelaB Nowhere in the documentation is it mentioned to remove the event_timestamp column. The error message says event_timestamp is a must. – Vanaja Jayaraman Jul 08 '22 at 03:51

1 Answer


I was able to follow the documentation and it yielded successful results. Responses to your questions:

1. Is it mandatory to keep the event_timestamp field in the Pub/Sub schema / Dataflow SQL / BigQuery table?

Yes, it is necessary in this scenario, because the following query uses the TUMBLE function and the event_timestamp column is its DESCRIPTOR. Note: for a Pub/Sub source, you must specify the event_timestamp field as the timestamp_column:

 SELECT
   sr.sales_region,
   TUMBLE_START("INTERVAL 15 SECOND") AS period_start,
   SUM(tr.amount) as amount
 FROM pubsub.topic.`project-id`.transactions AS tr
   INNER JOIN bigquery.table.`project-id`.dataflow_sql_tutorial.us_state_salesregions AS sr
   ON tr.state = sr.state_code
 GROUP BY
   sr.sales_region,
   TUMBLE(tr.event_timestamp, "INTERVAL 15 SECOND")
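
For reference, here is a rough sketch of assigning such a schema through Data Catalog so that the query above can find `event_timestamp`. The topic name `transactions`, the file name `transactions_schema.yaml`, and the extra columns are only illustrative; the hard requirement from the error message is an `event_timestamp` column of type `TIMESTAMP`:

    # Sketch: write a Data Catalog schema file. event_timestamp of type
    # TIMESTAMP is required for a Pub/Sub source; the other columns are
    # illustrative placeholders matching the query above.
    printf '%s\n' \
      '- column: event_timestamp' \
      '  description: Pub/Sub event timestamp' \
      '  mode: REQUIRED' \
      '  type: TIMESTAMP' \
      '- column: state' \
      '  description: State code' \
      '  mode: NULLABLE' \
      '  type: STRING' \
      '- column: amount' \
      '  description: Transaction amount' \
      '  mode: NULLABLE' \
      '  type: FLOAT64' \
      > transactions_schema.yaml

    # Attach the schema to the topic's Data Catalog entry so Dataflow SQL
    # can see it (replace project-id and the topic name with your own).
    gcloud data-catalog entries update \
      --lookup-entry='pubsub.topic.`project-id`.transactions' \
      --schema-from-file=transactions_schema.yaml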

2. When I create a Pub/Sub topic with a schema, it does not show up in Dataflow SQL, whereas when I assign it manually from Cloud Shell using gcloud data-catalog entries update, it does. So which is the right method to assign a schema to a Pub/Sub topic?

You can use the console or gcloud to assign a schema. However, when using the console or a gcloud command, the following limitations apply (see the sketch after this list):

  • You cannot add schemas to existing topics.
  • You can specify a schema only when you create a topic.
  • After a schema is associated with a topic, you cannot update the schema or remove its association with that topic.
  • You can apply the same schema to other new topics.
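
A minimal sketch of that create-time path, assuming an Avro schema definition in a file named `transactions.avsc` and illustrative schema and topic names:

    # Sketch: with the console/gcloud path, the schema must be supplied when
    # the topic is created; it cannot be attached to an existing topic.
    gcloud pubsub schemas create transactions-schema \
      --type=avro \
      --definition-file=transactions.avsc

    gcloud pubsub topics create transactions \
      --schema=transactions-schema \
      --message-encoding=json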

You can use `gcloud data-catalog entries update` when updating an existing schema (as in the sketch above).

3. Data Catalog is also not showing the schema assigned to the Pub/Sub topic.

You may use `gcloud data-catalog entries lookup` to check what Data Catalog has recorded for the topic, and let me know what it returns.
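
For example (the project and topic names are placeholders; this uses the topic's linked resource name):

    # Sketch: look up the Data Catalog entry for the topic and inspect the
    # schema field in the output (replace the project and topic names).
    gcloud data-catalog entries lookup \
      '//pubsub.googleapis.com/projects/project-id/topics/transactions'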
