Does anyone know how to escape characters in YAML?

I am currently creating pipelines with the StreamSets SDK for Python and am now introducing Hydra to store the configs in .yaml files, so that we can tweak or add certain params with Compose and Overrides without changing the config file itself.

One parameter in particular is causing me issues, as I cannot work out how to escape it properly.

In the Cluster tab, spark.home has the value ${runtime:conf('spark_home')} but, written as is, this raises an error.

The closest I can get without an error is $\{runtime:conf('spark_home')} (unquoted), which outputs $\\{runtime:conf('spark_home')}.

I have read through various docs on this, but none cover having curly braces, brackets and a quoted string within the parameter, and I have tried what feels like hundreds of combinations.

I have wrapped the whole value and spark_home in single and double quotes, tried it unquoted with the special/illegal characters escaped, and deconstructed the value to build it up bit by bit to pinpoint where it errors.

The main issue, I think, is the quoted string 'spark_home'. If I remove the single quotes I have no issues. As a side note, 'spark_home' can be in either single or double quotes, but the quotes cannot be removed.

UPDATE: Hi, thanks for the answers so far. I didn't want to overload the initial question. I am at the initial stages of looking at this and testing whether YAML will provide a cleaner solution. As stated, I want to be able to increase the memory & drivers for some pipelines, so I am using initialize & compose to override certain parameters. Here is a simplified version of my setup. In main.py I have:

from hydra import compose, initialize

initialize(config_path="conf", job_name="test_pipeline")
cfg = compose(config_name="config", overrides=[])

And in my config.yaml file:

defaults:
  - streamsets: dev

cluster:
  spark.driver.memory: 8G
  spark.driver.cores: 2
  spark.executor.memory: 8G
  spark.executor.cores: 2
  spark.home: "${runtime:conf('spark_home')}"

If I print(cfg) with the current setup, I get a lengthy traceback, all from python3.7/site-packages/... (not sure where I can post this), but the end part has:

    raise GrammarParseError(str(e) if msg is None else msg) from e
hydra.errors.ConfigCompositionException

If I add a backslash before the opening curly brace and remove the double quotes (spark.home: $\{runtime:conf('spark_home')}), it prints, but I get a double escape in the spark.home parameter. Again, I have tried a lot of combinations and read various docs, none of which cover this complexity:

{'spark.driver.memory': '8G', 'spark.driver.cores': 2, 'spark.executor.memory': '8G', 'spark.executor.cores': 2, 'spark.home': "$\\{runtime:conf('spark_home')}"}
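
A quick sanity check on that double backslash, as a minimal sketch in plain Python with no Hydra involved. The dict `cfg_like` is a hypothetical stand-in for the composed config, and I am assuming the value is stored exactly as written in the YAML. Printing a dict shows each string through repr(), which doubles backslashes, so the two backslashes in the output above may correspond to a single backslash in the stored value:

```python
# Hypothetical stand-in for the composed config: a plain dict holding the
# value exactly as typed in the YAML file (one backslash before the brace).
value = r"$\{runtime:conf('spark_home')}"
cfg_like = {"spark.home": value}

# str() of a dict renders each string with repr(), which escapes backslashes.
dict_view = str(cfg_like)

print(dict_view)          # dict rendering: the backslash appears doubled
print(value)              # the string itself: a single backslash
print(value.count("\\"))  # number of backslash characters actually stored
```

If the same holds for the real cfg, printing the spark.home value on its own rather than the whole config should show one backslash. That would still leave the question of how to get rid of the backslash altogether.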

I am using PyCharm (which is what my company uses for Python) and it doesn't highlight any issue when I create the YAML file. I have also re-created a simple version on my own machine using VS Code with the YAML extension, and this doesn't highlight any issue either. Thanks

  • Any decent YAML dumper, including the Python builtin `yaml`, should do this for you. How are you generating these? What does the resulting YAML look like? – tadman Jan 18 '23 at 17:25
  • Are you hand-writing the YAML file right now? As tadman indicated, the built-in module for generating YAML in Python will do all of this for you, and then some. – Silvio Mayolo Jan 18 '23 at 17:26
  • I agree with both comments, except: what built-in YAML module? As far as I'm aware, the common Python options are [ruamel.yaml](https://pypi.org/project/ruamel.yaml/) (which supports the current 1.2 YAML spec) and [PyYAML](https://pypi.org/project/PyYAML/) (which doesn't, but otherwise works well). Neither is included in Python. – CrazyChucky Jan 18 '23 at 22:44
  • FYI Hydra uses [`pyyaml`](https://github.com/yaml/pyyaml) as a backend. – Jasha Jan 19 '23 at 21:14
