
Hi everyone, I have tried hard to understand what happens when I create a custom template in Google Cloud Dataflow, but I went through the GCP documentation and still couldn't figure it out. Below is what I am trying to achieve.

  1. Read Data from Google cloud Bucket
  2. Pre-Process it
  3. Load Deeplearning models (1 GB each) and get the predictions
  4. Dump the results in BigQuery.
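The steps above can be sketched roughly as follows (plain Python with hypothetical placeholder functions, just to show the data flow; in the real job each stage would be a Beam transform):

```python
# Conceptual sketch of the four stages. All names here are hypothetical
# stand-ins, not the actual pipeline code.

def read_records(lines):
    """Stand-in for reading raw records from a GCS bucket."""
    return [line.strip() for line in lines]

def preprocess(record):
    """Stand-in for the pre-processing step."""
    return record.lower()

def predict(record):
    """Stand-in for running the deep-learning models on a record."""
    return {"input": record, "prediction": len(record)}

def run_pipeline(lines):
    """Chains the stages; the real job would write results to BigQuery."""
    return [predict(preprocess(r)) for r in read_records(lines)]
```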

I successfully created the template and I am able to execute the job, but I have the following questions.

  1. When I execute the job, do the models (5 models, 1 GB each) get downloaded every time during execution, or are they loaded and baked into the template (execution graph) so that execution reuses the already-loaded ones?
  2. If the models are loaded only during job execution, doesn't that impact the execution time, since it has to load GBs of model files every time the job is triggered?
  3. Can multiple users trigger the same template at the same time? I want to productionize this, and I am not sure how it will handle concurrent requests.

Can anyone please share some information on it?

Sources I referred to but that didn't answer my questions:

  - https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#pipeline-lifecycle-from-pipeline-code-to-dataflow-job
  - http://alumni.media.mit.edu/~wad/magiceight/isa/node3.html
  - https://cloud.google.com/dataflow/docs/guides/setting-pipeline-options#configuring-pipelineoptions-for-local-execution
  - https://beam.apache.org/documentation/basics/
  - https://beam.apache.org/documentation/runtime/model/
  - https://mehmandarov.com/apache-beam-pipeline-graph/

Chaitanya Patil

1 Answer


This depends on where the models are being loaded from. If they're loaded in the DoFns (most likely), then it will happen in the workers (during job execution).

As for your other question, there should be no issues with multiple users triggering a template job simultaneously.
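One common way to keep model loading from dominating execution time is to load each model once per worker in the DoFn's `setup()` method rather than per element in `process()`. A rough sketch of that lifecycle (plain Python standing in for `apache_beam.DoFn` so it is self-contained; the class, path, and model are hypothetical):

```python
# Sketch of the Beam DoFn lifecycle for expensive model loading.
# In a real pipeline this class would subclass apache_beam.DoFn.

class PredictDoFn:
    def __init__(self, model_path):
        # __init__ runs at pipeline-construction time; keep it cheap
        # and picklable -- do NOT load the model here.
        self.model_path = model_path
        self.model = None
        self.load_count = 0

    def setup(self):
        # setup() runs once per DoFn instance on the worker, so a
        # 1 GB model is downloaded/loaded once, not once per element.
        self.load_count += 1
        self.model = lambda x: len(x)  # stand-in for a real model

    def process(self, element):
        # process() runs per element and reuses the loaded model.
        yield {"input": element, "prediction": self.model(element)}

# Simulated worker lifecycle: one setup() call, many elements.
fn = PredictDoFn("gs://bucket/models/model-1")
fn.setup()
results = [out for e in ["a", "bb", "ccc"] for out in fn.process(e)]
```

With this pattern, the per-job cost is one load per worker rather than one per element, though the first bundle on each worker still pays the load time.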

robertwb
  • Hi @robertwb, thank you so much. Yes, the models are loaded in DoFns. Does loading the huge model files stored in GCS impact the execution time? Is there any way to avoid it, like staging the model files? – Chaitanya Patil Aug 16 '21 at 19:09
  • You could possibly build custom containers that contain the models. https://cloud.google.com/dataflow/docs/guides/using-custom-containers – robertwb Aug 16 '21 at 19:46
  • Hi @robertwb, I tried using custom containers. I created a Docker image with the model files in it and tried to run the pipeline with `python3 main.py --input=dsss --experiment=use_runner_v2 --sdk_container_image=$IMAGE_URI`, but I get an `SDK harness sdk-0-0 disconnected` error. – Chaitanya Patil Aug 17 '21 at 12:50
  • This sounds like something easier to debug on the users@beam.apache.org list rather than here (though we could come back with the answer). – robertwb Aug 17 '21 at 17:28