I have a nice, straightforward working pipeline: the task I run via luigi on the command line triggers all the required upstream data fetching and processing in its proper sequence, until it trickles out into my database.
class IMAP_Fetch(luigi.Task):
    """fetch a bunch of email messages with data in them"""
    date = luigi.DateParameter()
    uid = luigi.Parameter()
    …

    def output(self):
        loc = os.path.join(self.data_drop, str(self.date))
        # target for requested message
        yield LocalTarget(os.path.join(loc, self.uid + ".msg"))

    def run(self):
        # code to connect to IMAP server and run FETCH on given UID;
        # the message gets written to self.output()
        …
class RecordData(luigi.contrib.postgres.CopyToTable):
    """copy the data in one email message to the database table"""
    uid = luigi.Parameter()
    date = luigi.DateParameter()
    table = 'msg_data'
    columns = [('id', 'INT'), …]

    def requires(self):
        # a task (not shown) that extracts data from one message,
        # which in turn requires the IMAP_Fetch to pull down the message
        return MsgData(self.date, self.uid)

    def rows(self):
        # code to read self.input() and yield lists of data values
Great stuff. Unfortunately that first data fetch talks to a remote IMAP server, and every fetch is a new connection and a new query: very slow. I know how to get all the individual message files in one session (one task instance). What I don't understand is how to keep the downstream tasks just as they are, working on one message at a time, given that the task that requires one message triggers a fetch of just that one message, not a fetch of all the messages available.

I apologize in advance if I'm missing an obvious solution, but it has stumped me so far: how do I keep my nice simple stupid pipe mostly the way it is, yet have the funnel at the top suck in all the data in one call? Thanks for your help.