Let’s say I have a task
that downloads some meteo data for a given date (just for the sake of this example) and saves it in a CSV file.
Let’s say for the first iteration I can only download that data from an API
class DownloadMetoDataTask(luigi.Task):
date = luigi.Parameter()
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.downloader = MetoAPI()
def output(self):
return LocalTarget(f"meteo_data_{self.date}.csv")
def run(self):
data = self.downloader.get_data(self.date)
self.output().write(data)
Later on, we find that this data can be downloaded from local storage/some DB/FTP server or anything really.
To handle this we can:
- inherit from
DownloadMetoDataTask
class DownloadMetoDataTaskFromAPI(DownloadMetoDataTask):
#...
class DownloadMetoDataTaskFromDB(DownloadMetoDataTask):
#...
- or use composition instead of inheritance by making
self.downloader
an instance attribute that can be provided in the__init__
method
downloader
is a complex object which purpose is to download meteo data. The specific details are encapsulated in every class.
Clients know that they can call get_data(date)
and have meteo data for a given date. They don’t worry about how the data is retrieved (API, local storage …)
class APIMetoDownloader:
def get_data(self, date):
# ...
class FTPMeteoDownloader:
def get_data(self, date)
# ....
I would like to go with 2 because 1 leads to class proliferation and a complex hierarchy of classes.
This is what I tried to do:
- Use an instance attribute which is not a Luigi parameter
class MeteoDownloaderTask(luigi.Task):
date = luigi.Parameter()
def __init__(self, downloader, *args, **kwargs):
super().__init__(*args, **kwargs)
self.downloader = downloader
t = MeteoDownloaderTask(downloader= APIMetoDownloader(), date = "2021_11_15")
This is failing with
---------------------------------------------------------------------------
UnknownParameterException Traceback (most recent call last)
<ipython-input-44-a0a196a3cb71> in <module>
----> 1 t = MeteoDownloaderTask(downloader= APIMetoDownloader(), date = "2021_11_15")
~/sandbox/luigi-sandbox/env/lib/python3.9/site-packages/luigi/task_register.py in __call__(cls, *args, **kwargs)
85
86 params = cls.get_params()
---> 87 param_values = cls.get_param_values(params, args, kwargs)
88
89 k = (cls, tuple(param_values))
~/sandbox/luigi-sandbox/env/lib/python3.9/site-packages/luigi/task.py in get_param_values(cls, params, args, kwargs)
410 raise parameter.DuplicateParameterException('%s: parameter %s was already set as a positional parameter' % (exc_desc, param_name))
411 if param_name not in params_dict:
--> 412 raise parameter.UnknownParameterException('%s: unknown parameter %s' % (exc_desc, param_name))
413 result[param_name] = params_dict[param_name].normalize(arg)
414
UnknownParameterException: MeteoDownloaderTask[args=(), kwargs={'downloader': <__main__.APIMetoDownloader object at 0x10ba3ca30>, 'date': '2021_11_15'}]: unknown parameter downloader
- Make downloader an insignificant parameter
class DownloadMetoDataTask(luigi.Task):
date = luigi.Parameter()
downloader = luigi.Parameter(significant=False)
t = DownloadMetoDataTask(downloader=APIDownloader(), date = "2021_11_15")
This is printing a warning. In addition, I find it confusing to add a “fake” Luigi parameter just to add an instance attribute for the task.
/Users/mbo/sandbox/luigi-sandbox/env/lib/python3.9/site-packages/luigi/parameter.py:279: UserWarning: Parameter "downloader" with value "<__main__.APIDownloader object at 0x10aa52520>" is not of type string.
Questions
- For solution 2: How does this impact Luigi instance caching ? (the task_id is the same for two instances of
DownloadMetoDataTask
with the same set of significant parameters) - For solution 2: What are the implications with regards to the interface of
self.downloader
must it comply to theParameter
interface ?
I could also redefine the class methods of Luigi.Task
(one is get_param_values
) to only get parameters that are defined as Luigi ones and ignore regular instance attributes. Is it safe/advised to do so ?
Is there a better a way to achieve this ? To use instance attributes that are note necessarily Luigi Task arguments.
Things I don't want to do
- have a fake
downloader
string Luigi Parameter that can be used to get a downloader
downloaders_mapping = {"api": APIMetoDownloader(), "ftp": FTPMetoDownloader()}
class MeteoDownloaderTask(luigi.Task):
date = luigi.Parameter()
downloader_key = luigi.Parameter()
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.downloader = downloaders_mapping[kwargs["self.downloader_key"]]
This adds the additional overhead of maintaining the downloaders_mapping
dict
- I dont' want to create specific classes per downloader used because it can lead to class proliferation and a complex long hierarchy of classes