4

Let’s say I have a task that downloads some meteo data for a given date (just for the sake of this example) and saves it in a CSV file.

Let’s say for the first iteration I can only download that data from an API

class DownloadMetoDataTask(luigi.Task):
     date = luigi.Parameter()
        def __init__(self, *args, **kwargs):
          super().__init__(*args, **kwargs)
          self.downloader = MetoAPI()
      
     def output(self):
        return LocalTarget(f"meteo_data_{self.date}.csv")

       def run(self):
        data = self.downloader.get_data(self.date)
        self.output().write(data)

Later on, we find that this data can be downloaded from local storage/some DB/FTP server or anything really.

To handle this we can:

  1. inherit from DownloadMetoDataTask
class DownloadMetoDataTaskFromAPI(DownloadMetoDataTask):
#...
class DownloadMetoDataTaskFromDB(DownloadMetoDataTask):
#... 
  1. or use composition instead of inheritance by making self.downloader an instance attribute that can be provided in the __init__ method

downloader is a complex object which purpose is to download meteo data. The specific details are encapsulated in every class.

Clients know that they can call get_data(date) and have meteo data for a given date. They don’t worry about how the data is retrieved (API, local storage …)

class APIMetoDownloader:
     def get_data(self, date):
         # ...

class FTPMeteoDownloader:
      def get_data(self, date)
          # ....

I would like to go with 2 because 1 leads to class proliferation and a complex hierarchy of classes.

This is what I tried to do:

  1. Use an instance attribute which is not a Luigi parameter
class MeteoDownloaderTask(luigi.Task):
    date = luigi.Parameter()
    def __init__(self, downloader, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.downloader = downloader

t = MeteoDownloaderTask(downloader= APIMetoDownloader(), date = "2021_11_15")

This is failing with

---------------------------------------------------------------------------
UnknownParameterException                 Traceback (most recent call last)
<ipython-input-44-a0a196a3cb71> in <module>
----> 1 t = MeteoDownloaderTask(downloader= APIMetoDownloader(), date = "2021_11_15")

~/sandbox/luigi-sandbox/env/lib/python3.9/site-packages/luigi/task_register.py in __call__(cls, *args, **kwargs)
     85
     86         params = cls.get_params()
---> 87         param_values = cls.get_param_values(params, args, kwargs)
     88
     89         k = (cls, tuple(param_values))

~/sandbox/luigi-sandbox/env/lib/python3.9/site-packages/luigi/task.py in get_param_values(cls, params, args, kwargs)
    410                 raise parameter.DuplicateParameterException('%s: parameter %s was already set as a positional parameter' % (exc_desc, param_name))
    411             if param_name not in params_dict:
--> 412                 raise parameter.UnknownParameterException('%s: unknown parameter %s' % (exc_desc, param_name))
    413             result[param_name] = params_dict[param_name].normalize(arg)
    414

UnknownParameterException: MeteoDownloaderTask[args=(), kwargs={'downloader': <__main__.APIMetoDownloader object at 0x10ba3ca30>, 'date': '2021_11_15'}]: unknown parameter downloader
  1. Make downloader an insignificant parameter
class DownloadMetoDataTask(luigi.Task):
     date = luigi.Parameter()
     downloader = luigi.Parameter(significant=False)

t = DownloadMetoDataTask(downloader=APIDownloader(), date = "2021_11_15")

This is printing a warning. In addition, I find it confusing to add a “fake” Luigi parameter just to add an instance attribute for the task.

/Users/mbo/sandbox/luigi-sandbox/env/lib/python3.9/site-packages/luigi/parameter.py:279: UserWarning: Parameter "downloader" with value "<__main__.APIDownloader object at 0x10aa52520>" is not of type string.

Questions

  • For solution 2: How does this impact Luigi instance caching ? (the task_id is the same for two instances of DownloadMetoDataTask with the same set of significant parameters)
  • For solution 2: What are the implications with regards to the interface of self.downloader must it comply to the Parameter interface ?

I could also redefine the class methods of Luigi.Task (one is get_param_values ) to only get parameters that are defined as Luigi ones and ignore regular instance attributes. Is it safe/advised to do so ?

Is there a better a way to achieve this ? To use instance attributes that are note necessarily Luigi Task arguments.

Things I don't want to do

  • have a fake downloader string Luigi Parameter that can be used to get a downloader
downloaders_mapping = {"api": APIMetoDownloader(), "ftp": FTPMetoDownloader()}

class MeteoDownloaderTask(luigi.Task):
    date = luigi.Parameter()
    downloader_key = luigi.Parameter()
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.downloader = downloaders_mapping[kwargs["self.downloader_key"]]

This adds the additional overhead of maintaining the downloaders_mapping dict

  • I dont' want to create specific classes per downloader used because it can lead to class proliferation and a complex long hierarchy of classes
MassyB
  • 1,124
  • 4
  • 15
  • 28
  • Luigi has no problems with instance attributes. The problem is that `downloader` supplied to `__init__`. A Luigi Task is designed to be called by the user. Hence if the user doesn't need to know where the data is coming from, the `DownloadMetaDataTask` shouldn't have a `downloader` argument and the `run` method should find out by itself which `downloader` to attach. If you set the `downloader` too early, you get a massive race condition there. – Joooeey Dec 30 '21 at 11:26

1 Answers1

0

The important question is: do you need/want/should the downloader to be a task parameter or not? That will affect the options you have. Keep in mind that it affects the task class instance caching, and potentially how the downloader is shared etc. It's an important decision.

Assuming the downloader should be a parameter. How about a custom parameter class? afair Luigi parameters need to serialisable, one of the reason being that you need to be able to supply them via CLI, serde etc. You could handle all of that (in any way you want, including not handling CLI args, in the parse/serialize methods) in the DownloaderParameter class.

class DownloaderParameter(luigi.Parameter):
  ...

class DownloadMetoDataTask(luigi.Task):
  date = luigi.Parameter()
  downloader = DownloaderParameter(default=...)
  ...

Keep in mind that normalized value of parameters is used to hash key for the task instance caching.


Assuming downloader should not be a task parameter, then it depends on the UX of your task classes, your options:

  • Mixin downloader classes - one for each downloader NOT task, and downstream tasks can mixin the downloaders as they wish
  • (abstract) property/field

I could also redefine the class methods of Luigi.Task (one is get_param_values ) to only get parameters that are defined as Luigi ones and ignore regular instance attributes. Is it safe/advised to do so ?

I would be careful with this one, you can of course make it work safe and sound, but you are touching internals which may have unexpected consequences (e.g. instance caching).


I would suggest starting with either DownloaderParameter, or Mixin approach.

rav
  • 3,579
  • 1
  • 18
  • 18
  • Thanks @rav for the answer, that really helps. Indeed I don't want to make the `downloader` a Luigi Parameter. I want to implement the Strategy pattern here in order to avoid a complex class hierarchy which tends to be the case with Luigi. The Mixin seems appealing but it also demands the creation of specific classes like `APIMeteoDownloaderTask `, `FTPMeteoDownloaderTask`. Please correct me if I'm wrong :) – MassyB Nov 24 '21 at 09:53
  • could you please elaborate on the mixin solution you are thinking about ? – MassyB Nov 24 '21 at 12:34
  • @MassyB in the context of Mixin, I was assuming you provide: 1. base task class (with default implementation of run etc) 2. downloader Mixins, and your "users"/downstream tasks choose whichever Mixin they want to use per task: https://gist.github.com/ravwojdyla/583763e7e72c51eff5c4949a1829d9d3 – rav Nov 24 '21 at 17:54
  • The question then is: what's the UX of those task classes? If you assume canonical Luigi task UX via CLI - you will need a parameter for that - task instances are created via constructor call, so unless you have a parameter of some kind or specific tasks classes - you won't be able to control the downloader, right? If you somehow assume the task instances are created programmatically (which is possible, but afair isn't the canonical use of Luigi), you could always have a static factory method like `FooTask.fromDownload(downloader)`? – rav Nov 24 '21 at 17:57
  • @MassyB just curious - did you find a solution? – rav Dec 22 '21 at 09:53