I have a piece of Apache Beam pipeline code that reads a file from a GCS bucket and prints its contents. It works perfectly with the DirectRunner and prints the file output, but with the DataflowRunner it prints nothing and raises no errors either.

Do we need to do anything special/different for the Dataflow runner?

The code looks like this:

  p = beam.Pipeline(options=pipeline_options)
  read_file_pipe = (
      p
      | "Create {}".format(file_name) >> beam.Create(["Start"])
      | "Read File {}".format(file_name)
      >> ReadFromTextWithFilename(file_path, skip_header_lines=1)
      | beam.Map(print)
  )

  p.run().wait_until_finish()

The command I use to launch it is: python3 Test_Pipe.py --region us-central1 --output_project= --runner=DataflowRunner --project= --temp_location= --service_account_email= --experiments=use_network_tags=default-uscentral1 --subnetwork --no_use_public_ips

– HKS
  • Did you look in the "Worker logs" tab? Did you change the severity level of the logs (INFO, DEBUG, etc.)? Did you try using the `logging` package instead of `print`? – Dev Yns Oct 16 '22 at 21:45
  • If the job fails, it will definitely log errors in the Google Cloud console. But `print` will not write anything to the Dataflow logs. You need to set up `logging` in Python – Travis Webb Oct 24 '22 at 19:52

1 Answer


You can use `logging` instead of `print` to solve your issue. Here is your code snippet adapted to use `logging`:

import logging


def log_element(element):
    # Worker-side logging: these messages are sent to the Dataflow worker logs.
    logging.info(element)
    return element


p = beam.Pipeline(options=pipeline_options)
read_file_pipe = (
    p
    | "Create {}".format(file_name) >> beam.Create(["Start"])
    | "Read File {}".format(file_name)
    >> ReadFromTextWithFilename(file_path, skip_header_lines=1)
    | beam.Map(log_element)
)

p.run().wait_until_finish()
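On Dataflow, these `logging` calls are forwarded to Cloud Logging and show up under the job's "Worker logs" tab, filtered by severity. One thing that is easy to miss is the logger level: if the root logger stays at its default WARNING level, `logging.info` messages can be filtered out. A minimal sketch of raising the level in the launcher, following the pattern the Beam example pipelines use (the `run()` call is just a placeholder for wherever you build and run the pipeline above):

import logging


if __name__ == "__main__":
    # Raise the root logger level so INFO messages are not filtered out.
    logging.getLogger().setLevel(logging.INFO)
    run()  # placeholder for the function that builds and runs the pipeline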
– Mazlum Tosun