
I am working on a data pipeline that follows a structure like this:

-- src/
---- etl.py
---- scripts/
------ moduleA.py
------ moduleB.py

I want to parametrise the scripts with Hydra. I have already done it for moduleA, which can be run independently:

import os

import hydra
from omegaconf import DictConfig

@hydra.main(config_path=os.getcwd(), config_name="config")
def main(cfg: DictConfig) -> None:

    # Parse args
    input_file: str = cfg.params.input_file

    # do_stuff holds the module's actual processing logic
    do_stuff(input_file)

if __name__ == "__main__":
    main()
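
For context, the config.yaml it loads looks roughly like this (the path is just an example; the only part the code relies on is the params.input_file key):

# config.yaml -- assumed layout; only params.input_file is read above
params:
  input_file: data/raw/input.csv  # example path, replace with your own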

I would like to have the same approach for moduleB et al. and be able to invoke these main() functions from etl.py, which will act as the orchestrator.

TLDR: Is it possible to parametrise a function that reads from a config file without having to give up Hydra? I would like etl.py to look something like this:

import os

import hydra
from omegaconf import DictConfig

from scripts.moduleA import main as process_moduleA

@hydra.main(config_path=os.getcwd(), config_name="config")
def main(cfg: DictConfig) -> None:

    # Parse args
    input_file: str = cfg.params.input_file

    process_moduleA(input_file)

if __name__ == "__main__":
    main()

Many thanks in advance!!

irenels

1 Answer


The typical pattern is to use an if __name__ == "__main__" guard in your moduleA.py and moduleB.py files:

# moduleA.py
import os

import hydra
from omegaconf import DictConfig

@hydra.main(config_path=os.getcwd(), config_name="config")
def main(cfg: DictConfig) -> None:
    ...

if __name__ == "__main__":
    main()

This way, importing moduleA from your etl.py script will not trigger the Hydra machinery in moduleA.py; main() only runs when moduleA.py is executed directly.
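
If you also want etl.py to pass its own config values into moduleA, one option is to keep the actual work in a plain, undecorated function and make the decorated main() a thin wrapper around it. This is only a sketch under that assumption; run and do_stuff are hypothetical names, not anything your code already defines:

# moduleA.py -- sketch; `run` is a made-up name for the undecorated logic
import os

import hydra
from omegaconf import DictConfig

def do_stuff(input_file: str) -> None:
    # Stand-in for the module's actual processing logic
    print(f"Processing {input_file}")

def run(input_file: str) -> None:
    # Plain function: no Hydra involved, safe to call from anywhere
    do_stuff(input_file)

@hydra.main(config_path=os.getcwd(), config_name="config")
def main(cfg: DictConfig) -> None:
    # Thin Hydra wrapper, used only when the module is run directly
    run(cfg.params.input_file)

if __name__ == "__main__":
    main()

The orchestrator then imports the plain function instead of the decorated one, so only etl.py's own @hydra.main is ever active:

# etl.py -- sketch; imports the undecorated `run`, not the Hydra entry point
import os

import hydra
from omegaconf import DictConfig

from scripts.moduleA import run as process_moduleA

@hydra.main(config_path=os.getcwd(), config_name="config")
def main(cfg: DictConfig) -> None:
    process_moduleA(cfg.params.input_file)

if __name__ == "__main__":
    main()

With this split, moduleA can still be run on its own, and etl.py can drive it without two @hydra.main decorators colliding.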

Jasha