
My Apache Beam pipeline looks like this:

    vids = (p
            | 'Read input' >> beam.io.ReadFromText(known_args.input)
            | 'Parse input' >> beam.Map(lambda line: next(csv.reader([line])))
            | 'Run DeepMeerkat' >> beam.ParDo(PredictDoFn(pipeline_args)))

I am inputting a CSV with a list of videos to analyze. In this test run there were 4 videos.
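For context, here is a minimal local sketch of the Read/Parse stages that I can run with the DirectRunner. I swapped ReadFromText for beam.Create and dropped the DoFn so it runs standalone; the bucket paths are made-up stand-ins for my real CSV rows:

    import csv

    import apache_beam as beam


    def show(row):
        # Each element is the list of CSV fields for one line,
        # e.g. ['gs://my-bucket/videos/clip1.avi']
        print(row)


    # Hypothetical stand-ins for the 4 video rows in my real CSV.
    sample_lines = [
        'gs://my-bucket/videos/clip1.avi',
        'gs://my-bucket/videos/clip2.avi',
    ]

    with beam.Pipeline() as p:  # DirectRunner by default
        (p
         | 'Create input' >> beam.Create(sample_lines)
         | 'Parse input' >> beam.Map(lambda line: next(csv.reader([line])))
         | 'Show' >> beam.Map(show))

With the real input file, ReadFromText emits one element per CSV line, so this job has 4 elements going into PredictDoFn.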

The pipeline runs fine, but I don't understand the autoscaling feature.

The job currently identifies 4 elements (right side):

[screenshot: Dataflow job page showing 4 elements]

but the console shows the number of workers rising to 15.

How can there be more workers than elements?

[screenshot: Dataflow autoscaling chart rising to 15 workers]
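In case it helps frame the question: my understanding is that the Dataflow runner's standard worker options can cap or disable autoscaling. A minimal sketch of the pipeline options I could pass (the project id and bucket below are placeholders, not my real values):

    # Sketch only: hypothetical project id and bucket; my real values come
    # from the argparse setup that builds pipeline_args.
    pipeline_args = [
        '--runner=DataflowRunner',
        '--project=my-project',
        '--temp_location=gs://my-bucket/tmp',
        # Cap autoscaling so it cannot exceed the element count...
        '--max_num_workers=4',
        # ...or disable autoscaling entirely and pin the worker count:
        # '--autoscaling_algorithm=NONE',
        # '--num_workers=4',
    ]

As I understand it, --max_num_workers sets an upper bound for the autoscaler, while --autoscaling_algorithm=NONE together with --num_workers fixes the pool size. What I'd like to understand is why the autoscaler picks 15 workers for 4 elements in the first place.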

bw4sz
  • Can you provide a job id so I can see how your job ran? – Pablo Aug 19 '17 at 00:19
  • Job id: 2017-08-18_14_54_43-17157689853131491224 – bw4sz Aug 19 '17 at 03:03
  • Pablo, any updates here? – jkff Aug 24 '17 at 16:25
  • @bw4sz this looks like a bug in autoscaling, but it may have been worked out recently. Have you rerun the job and encountered the issue again? – Pablo Aug 25 '17 at 18:24
  • I haven't. I've been trying to get a handle on how to pass large objects to the local worker using gcsio (here https://stackoverflow.com/questions/44423769/how-to-use-google-cloud-storage-in-dataflow-pipeline-run-from-datalab/44636227#comment78690786_44636227 and here https://stackoverflow.com/questions/45216308/can-we-access-gsutil-from-google-cloud-dataflow-if-yes-then-could-someone-plea/45219722#comment78615429_45219722). I may need to open a new question. Once I get a better handle on that, I'll report back. – bw4sz Aug 25 '17 at 19:07

0 Answers