
I want to write three fields of a Pub/Sub message (data, attributes and publish time) to BigQuery in a flattened way, so that all elements are written into a single row, for example:

data[0]    data[1]    attr[0]    attr[1]    key    publishTime
data       data       attr       attr       key    publishTime

I'm currently using the following piece of code for decoding and parsing the message, but it only handles the data part of the Pub/Sub message:

import json

class decodeMessage:
    def decode_base64(self, element):
        """Decode the raw message payload bytes into a UTF-8 string."""
        return element.data.decode("utf-8")

class parseMessage:
    def parseJsonMessage(self, element):
        """Parse the decoded payload string into a Python object."""
        return json.loads(element)

I've also tried merging the two JSON objects after dumping them to JSON strings, but it didn't go as planned. My ultimate goal is to bring all columns together into a single JSON with the schema retained.
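For reference, merging two already-parsed JSON objects in Python amounts to a dict merge; a minimal sketch with made-up values (the field and attribute names here are purely illustrative):

import json

# Illustrative stand-ins for the parsed message payload and attributes.
data = json.loads('{"field1": "a", "field2": "b"}')
attributes = {"attr1": "x", "attr2": "y"}

# Later keys win on collision; the result is one flat JSON object.
merged = {**data, **attributes}
print(json.dumps(merged))  # {"field1": "a", "field2": "b", "attr1": "x", "attr2": "y"}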

I hope my question remains clear to you! Thanks!

  • Data being the payloads, correct? So you want to convert multiple Pub/Sub messages into a single row on BigQuery (assuming that from data[0], data[1], etc.)? I am just confused about how you can make sure the schema would fit a dynamic number of records. Have you looked at this: [Writing to BigQuery](https://beam.apache.org/documentation/io/built-in/google-bigquery/#writing-to-bigquery)? – Bruno Volpato Oct 07 '22 at 12:05
  • Yes, data being the payload, or the Pub/Sub object we get when reading from a subscription using an IO connector in Apache Beam. I just want a way to append the Pub/Sub message payload and attributes into a single JSON, so that it can be written to BigQuery, as BigQuery can interpret a single JSON. – Mihir Sharma Oct 07 '22 at 12:11
  • I think if you can post the pipeline code instead of only decodeMessage and parseMessage, it will be beneficial. You are choosing to use only the `element.data` part; I don't see why you couldn't use `element.attributes` and concatenate it together before doing `json.dumps`. – Bruno Volpato Oct 07 '22 at 12:27
  • I've added the code for you and I hope it helps you to explain further! – Mihir Sharma Oct 07 '22 at 12:44
  • I think it's better to convert your PubSubMessage to a `Dict` and then write the result to `BigQuery`. What do you think about that? – Mazlum Tosun Oct 07 '22 at 22:49
  • Thanks for bringing me solutions again and again and keeping my motivation up! – Mihir Sharma Oct 08 '22 at 05:51

1 Answer


The solution to this problem is simply to build a Python dictionary and add all of the fields to it.

Example:

    import json

    def to_payload(element):
        """Flatten a Pub/Sub message into a single dict for BigQuery.

        `element` is the PubsubMessage read from the subscription.
        """
        payload = dict()
        # The raw payload, decoded and stored as a JSON string.
        data = json.dumps(element.data.decode('utf-8'))
        # The message attributes, serialized to a JSON string.
        attributes = json.dumps(element.attributes)
        messageKey = element.message_id
        # Publish time converted to epoch milliseconds.
        publish_time = element.publish_time.timestamp() * 1000

        payload['et'] = publish_time
        payload['data'] = data
        payload['attributes'] = attributes
        payload['key'] = messageKey

        return payload
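For context, here is a rough sketch of how this function might be used in a streaming pipeline. The subscription and table names below are placeholders, `with_attributes=True` is needed so that `ReadFromPubSub` yields `PubsubMessage` objects with attributes and publish time populated, and the target table is assumed to already exist with a matching schema:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(
               subscription="projects/my-project/subscriptions/my-sub",
               with_attributes=True)
         | "Flatten" >> beam.Map(to_payload)
         | "Write" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.my_table",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))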