I have built a pipeline of Tasks in Luigi. Because this pipeline is going to be used in different contexts, it may need additional tasks at the beginning or end of the pipeline, or even entirely different dependencies between the tasks.

That's when I thought: "Hey, why not declare the dependencies between the tasks in my config file?", so I added something like this to my config.py:

PIPELINE_DEPENDENCIES = {
    "TaskA": [],
    "TaskB": ["TaskA"],
    "TaskC": ["TaskA"],
    "TaskD": ["TaskB", "TaskC"],
}
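
As a sanity check on a mapping like this, here is a minimal sketch that uses the standard library's graphlib module (Python 3.9+) to confirm the declared dependencies are acyclic and to derive one valid execution order. The helper name execution_order is hypothetical and not part of Luigi or of my pipeline:

```python
from graphlib import TopologicalSorter

# Same shape as the config entry: task name -> names of the tasks it depends on.
PIPELINE_DEPENDENCIES = {
    "TaskA": [],
    "TaskB": ["TaskA"],
    "TaskC": ["TaskA"],
    "TaskD": ["TaskB", "TaskC"],
}


def execution_order(dependencies):
    """Return task names ordered so each task follows its dependencies.

    Hypothetical helper for illustration only. graphlib raises
    graphlib.CycleError if the mapping contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(PIPELINE_DEPENDENCIES)
print(order)
```

This only validates the config itself; it does not involve the Luigi scheduler.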

I was annoyed by parameters stacking up throughout the tasks, so at some point I introduced a single parameter, task_config, that every Task has and that stores all the information or data run() needs. So I put PIPELINE_DEPENDENCIES right in there.

Finally, I had every Task I defined inherit from both luigi.Task and a custom mixin class that implements the dynamic requires(), which looks something like this:

class TaskRequirementsFromConfigMixin(object):
    task_config = luigi.DictParameter()

    def requires(self):
        required_tasks = self.task_config["PIPELINE_DEPENDENCIES"]
        requirements = [
            self._get_task_cls_from_str(required_task)(task_config=self.task_config)
            for required_task in required_tasks
        ]
        return requirements

    def _get_task_cls_from_str(self, cls_str):
        ...

Unfortunately, that doesn't work, as running the pipeline gives me the following:

===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 4 were left pending, among these:
    * 4 was not granted run permission by the scheduler:
        - 1 TaskA(...)
        - 1 TaskB(...)
        - 1 TaskC(...)
        - 1 TaskD(...)

Did not run any tasks
This progress looks :| because there were tasks that were not granted run permission by the scheduler

===== Luigi Execution Summary =====

and a lot of

DEBUG: Not all parameter values are hashable so instance isn't coming from the cache

Although I am not sure if that's relevant.

So: 1. What's my mistake? Is it fixable? 2. Is there another way to achieve this?

Kaleidophon
  • The DEBUG logs are due to the DictParameter not being hashable, no relation to the permission issues. – Vikas Tikoo May 30 '17 at 23:27
  • You're not instantiating the Tasks in the requires method, is that intentional? is it somehow done in the helper method? – Veltzer Doron Jun 26 '18 at 15:25
  • I am, actually (as far as I remember): ``_get_task_cls_from_str`` returns a class given a string like those in ``PIPELINE_DEPENDENCIES`` (so ``"TaskA"`` of type string is turned into ``TaskA`` of type class), which is then given a ``task_config`` argument inside the list comprehension in ``requires`` and thus turned into an object of type ``TaskA``. – Kaleidophon Jul 03 '18 at 15:31

1 Answer

I realize this is an old question, but I recently learned how to enable dynamic dependencies. I was able to accomplish this by using a WrapperTask and yielding a dict comprehension (though you could do a list too if you want) with the parameters I wanted to pass to the other tasks in the requires method.

Something like this:

import datetime
import luigi

class WrapperTaskToPopulateParameters(luigi.WrapperTask):
    date = luigi.DateMinuteParameter(interval=30, default=datetime.datetime.today())

    def requires(self):
        base_params = ['string', 'string', 'string', 'string', 'string', 'string']  # placeholders
        # Map each base parameter to its modified value.
        modded_params = {base: 'mod' + base for base in base_params}
        # SomeTask is defined elsewhere; build one instance per (key, value) pair.
        yield [SomeTask(param1=key, param2=value) for key, value in modded_params.items()]

I can post an example using a list comprehension too if there's interest.
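
Regarding the mixin in the question: iterating over a dict yields its keys, so the list comprehension in requires() appears to make every task require all four tasks, itself included, instead of only its own entry. That may be why nothing was granted run permission, though I have not verified this against Luigi. Below is a minimal Luigi-free sketch of the difference; the function names all_keys and own_dependencies are hypothetical, and in the mixin the task name would presumably come from something like type(self).__name__:

```python
PIPELINE_DEPENDENCIES = {
    "TaskA": [],
    "TaskB": ["TaskA"],
    "TaskC": ["TaskA"],
    "TaskD": ["TaskB", "TaskC"],
}


def all_keys(task_name, dependencies):
    """Mimic the question's loop: iterating the mapping yields every key,
    regardless of which task is asking."""
    return [name for name in dependencies]


def own_dependencies(task_name, dependencies):
    """Look up only the entry that belongs to the given task."""
    return list(dependencies[task_name])


# TaskA would end up requiring itself and all other tasks:
print(all_keys("TaskA", PIPELINE_DEPENDENCIES))
# Looking up the task's own entry gives the intended dependencies:
print(own_dependencies("TaskD", PIPELINE_DEPENDENCIES))
```

With the second approach, requires() would instantiate only the classes named in the current task's own list.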