
I'm writing a custom sink with the Python SDK that stores data to AWS S3. Connecting to S3 requires credentials (a secret key), but for security reasons it's not good to put them in the code. I would like these values to reach the Dataflow workers as environment variables. How can I do that?

tk421
Tadayasu Yotsu

1 Answer


Generally, for transmitting information to workers that you don't want to hard-code, you should use PipelineOptions - please see Creating Custom Options. Then, when constructing the pipeline, just extract the parameters from your PipelineOptions object and put them into your transform (e.g. into your DoFn or a sink).
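For instance, here is a minimal sketch of that pattern. The option name `--s3_credentials_file` and the `WriteToS3Fn` DoFn are illustrative stand-ins for your own sink, not anything from the question:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class S3Options(PipelineOptions):
    # Custom options, per "Creating Custom Options"; the option name is an assumption.
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument(
            '--s3_credentials_file',
            help='GCS path of a file holding the S3 credentials')


class WriteToS3Fn(beam.DoFn):
    # Hypothetical DoFn standing in for the custom S3 sink.
    def __init__(self, credentials_file):
        self._credentials_file = credentials_file

    def process(self, element):
        # ... load the credentials and write `element` to S3 here ...
        yield element


def run(argv=None):
    options = PipelineOptions(argv)          # picks up --s3_credentials_file plus the usual Dataflow flags
    s3_options = options.view_as(S3Options)  # typed view of the custom options
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(['record-1', 'record-2'])
         | beam.ParDo(WriteToS3Fn(s3_options.s3_credentials_file)))


if __name__ == '__main__':
    run()
```

You would then launch the job with something like `--s3_credentials_file=gs://my-bucket/creds.json` alongside the usual `--runner`, `--project`, and so on.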

However, for something as sensitive as a credential, passing it in a command-line argument might not be a great idea. I would recommend a more secure approach: put the credential into a file on GCS, and pass the name of that file as a PipelineOption. Then programmatically read the file from GCS whenever you need the credential, using GcsIO.
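A rough sketch of that worker-side read, assuming the credential file is a small JSON document at the GCS path passed via the custom option (the path, file layout, and helper name are all assumptions):

```python
import json

from apache_beam.io.gcp.gcsio import GcsIO


def load_s3_credentials(gcs_path):
    # Read the credential file from GCS on the worker that needs it.
    # Assumed layout: {"aws_access_key_id": "...", "aws_secret_access_key": "..."}
    with GcsIO().open(gcs_path, 'rb') as f:
        return json.loads(f.read())
```

Calling something like this from your DoFn's or sink's setup code keeps the secret itself out of the command line and the job submission, while still being readable from every worker.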

jkff
  • so there is no way to set PipelineOptions via environment variables? – Andrew Cassidy Jun 21 '18 at 21:54
  • To update... I'm definitely setting PipelineOptions via environment variables... I just access them via PipelineOptions in the actual Dataflow job, as opposed to expecting them to be environment variables there – Andrew Cassidy Jun 22 '18 at 15:33
  • @AndrewCassidy, could you please elaborate - how exactly do you pass/set an environment variable to a worker node on Dataflow, and how do you access it in the code? – Tim Dec 08 '18 at 10:19
  • @Timur you pass variables into your Dataflow job as CLI args and use `argparse` (https://docs.python.org/2.7/library/argparse.html) in your code, which makes it easy to access and use the CLI arguments in `sys.argv`. You don't have any way to set environment variables inside Dataflow containers, but the PipelineOptions are available to all worker nodes, so you don't have to do anything special there. Example of adding your own CLI args: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/windowed_wordcount.py#L58 – Davos Dec 19 '18 at 05:59
  • A good way to separate the pipeline args that Beam is expecting from your own custom "known args" is `known_args, pipeline_args = parser.parse_known_args(argv)`. Also see this list of all the built-in Beam args that are parsed via `PipelineOptions(pipeline_args)`: https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-pipeline-options – Davos Dec 19 '18 at 06:46
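A minimal sketch of the split described in the comments above, using a made-up `--s3_credentials_file` argument as the custom option:

```python
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    # Your own arguments (the name here is illustrative).
    parser.add_argument('--s3_credentials_file',
                        help='GCS path of a file holding the S3 credentials')
    # Anything you don't declare (e.g. --runner, --project, --temp_location)
    # stays in pipeline_args and is handed to Beam untouched.
    known_args, pipeline_args = parser.parse_known_args(argv)

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(['record-1'])
         | beam.Map(lambda x: (x, known_args.s3_credentials_file)))


if __name__ == '__main__':
    run()
```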