
I'm just getting started with Apache Beam. I plan to run Dataflow jobs on GCP; I was originally running them with Dataprep, but I quickly outgrew its functionality. For context, I have been programming in Python 2/3 for 2 years now, so I think I've moved on from novice to amateur. Here is my problem: I wrote some Apache Beam code (version 2.6) in my IDE, but I couldn't get anything to actually work. Even after reading a CSV file into a PCollection, I couldn't SEE that it had worked. Printing it just shows "PCollection Object at 0xf3a6..."

While I was feverishly googling, I saw another person's post on this, and they said you should use the "with" statement so that Python will automatically open and close the pipeline. Once I did this, I was at least able to write what I had just read in to a file, so I could see that SOMETHING happened. First off, I find it really strange that the SAME code I had written before didn't do anything until I put it into the with statement. What's up with that? Do I need to do everything for the pipeline in a with statement? And other defs are just for normal Python stuff? Here is the code:

def run(self, argv=None):

    #p = beam.Pipeline()
    with beam.Pipeline(options=PipelineOptions()) as p:
        left_side = p | 'Read_Left_Side' >> beam.io.ReadFromText('/me/left_side_table.csv')
        left_side | 'Write' >> beam.io.WriteToText('/me/', file_name_suffix='purple_nurple.csv')
        right_side = p | 'Read_Right_Side' >> beam.io.ReadFromText('/me/right_side_table.csv')
    # left_side = p | 'Read_Left_Side' >> beam.io.ReadFromText('gs://path/to/left_side.csv')
    # right_side = p | 'Read_Right_Side' >> beam.io.ReadFromText('gs://path/to/right_side.csv')

    hello=[1,2,3,4,5,6]|beam.Map(lambda x: 3**x)

    left_side = p | 'Read' >> beam.io.ReadFromText('/me/left_side_table.csv')
    left_side | 'Write' >> beam.io.WriteToText('/me/', file_name_suffix='purple_nurple.csv')
    print(left_side)
    right_side = p | 'Read' >> beam.io.ReadFromText('/me//right_side_table.csv')
    howdy= left_side|beam.Map(lambda x: x/2)
    pass
DMan

1 Answer


You need to call pipeline.run() to execute the pipeline. A Beam Pipeline also supports the context manager idiom described here https://docs.python.org/2.7/reference/compound_stmts.html#the-with-statement, so when you build the pipeline inside a 'with' block, you don't need to call pipeline.run() yourself; it is called for you when the block exits. You can use either approach in your code. To answer your questions:

So, first off, I find it really strange that SAME code I had written before didn't do anything until I put it into the with statement...what's up with that?

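Applying transforms only describes the pipeline; nothing executes until the pipeline is run. Beam's Pipeline follows the idiom described here https://docs.python.org/2.7/reference/compound_stmts.html#the-with-statement: when the 'with' block exits without an error, the pipeline is run for you. Your earlier code never called run(), so nothing happened.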
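To illustrate the idea without Beam itself, here is a minimal pure-Python sketch of a deferred pipeline that runs on exit from a 'with' block. ToyPipeline and its add() method are hypothetical names made up for this example; they are not part of the Beam API, and the real Pipeline class is considerably more involved.

```python
class ToyPipeline(object):
    """Hypothetical stand-in for a pipeline: records steps and runs them later."""

    def __init__(self):
        self.steps = []
        self.results = []

    def add(self, func, value):
        # Only records the step; nothing is computed yet.
        self.steps.append((func, value))

    def run(self):
        # Executes every recorded step.
        self.results = [func(value) for func, value in self.steps]
        return self.results

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Run the recorded steps only if the block finished without an error.
        if exc_type is None:
            self.run()
        return False


# Without 'with': the step is recorded but nothing happens until run().
manual = ToyPipeline()
manual.add(lambda x: 3 ** x, 2)
results_before_run = list(manual.results)
manual.run()
results_after_run = list(manual.results)

# With 'with': run() is triggered automatically when the block exits.
with ToyPipeline() as managed:
    managed.add(lambda x: 3 ** x, 2)
managed_results = managed.results
```

In this sketch results_before_run is still empty, which mirrors what you saw when you built the pipeline but never ran it.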

Do I need to do everything for the pipeline in a with statement?

If you use the 'with' idiom, then yes: every step has to be added inside the block. If you call pipeline.run() yourself, you don't need the 'with' statement at all. In your code you are using 'with', so the pipeline has already run by the time the block ends, and the steps you add after the block are not applied to the job.

And other defs are just for normal Python stuff?

Which defs?

Ankur
  • Thanks for answering Ankur. What I mean by "other defs" is that if there is functionality outside of AB functions, I could call a separate function within the class. I think I understand now, thank you again. – DMan Aug 26 '18 at 10:17