4

My TaskB requires TaskA, and on completion TaskA writes to a MySQL table, and then TaskB is to take in this output to the table as its input.

I cannot seem to figure out how to do this in Luigi. Can someone point me to an example or give me a quick example here?

Rijo Simon
  • 777
  • 3
  • 15
  • 35

1 Answers1

11

The existing MySqlTarget in luigi uses a separate marker table to indicate when the task is complete. Here's the rough approach I would take...but your question is very abstract, so it is likely to be more complicated in reality.

import luigi
from datetime import datetime
from luigi.contrib.mysqldb import MySqlTarget


class TaskA(luigi.Task):
    rundate = luigi.DateParameter(default=datetime.now().date())
    target_table = "table_to_update"
    host = "localhost:3306"
    db = "db_to_use"
    user = "user_to_use"
    pw = "pw_to_use"

    def get_target(self):
        return MySqlTarget(host=self.host, database=self.db, user=self.user, password=self.pw, table=self.target_table,
                           update_id=str(self.rundate))

    def requires(self):
        return []

    def output(self):
        return self.get_target()

    def run(self):
        #update table
        self.get_target().touch()


class TaskB(luigi.Task):
    def requires(self):
        return [TaskA()]

    def run(self):
        # reading from target_table
MattMcKnight
  • 8,185
  • 28
  • 35
  • Thanks for this Matt. This really helps. I had one question though? does this mean that MySqlTarget keeps track of what row is getting updated using update_id, which is the primary id of the row. And in which case if my primary ids are autoincrement, what do i do? – Rijo Simon Nov 07 '16 at 19:52
  • Oh, that's tricky. I think you'll have to use another unique value beyond the autoincrement id as the update_id. It's literally running `"""INSERT INTO {marker_table} (update_id, target_table) VALUES (%s, %s) ON DUPLICATE KEY UPDATE update_id = VALUES(update_id) """.format(marker_table=self.marker_table), (self.update_id, self.table)` – MattMcKnight Nov 07 '16 at 21:01
  • So the solution seems, have an update table where the workflow updates are logged? But I don't want to maintain a table for each task (and I have a lot of tasks.) So from your sql interpretation of what is happening, seems like I should do this `return MySqlTarget(host=self.host, database=self.db, user=self.user, password=self.pw, table=self.target_table, update_id=str(self.rundate), update_task_type = TASK_TYPE_A)` – Rijo Simon Nov 07 '16 at 21:23
  • Yes, that's what the MySQLTarget does by default, it creates that "marker_table". One of the columns in that table is "target_table", which is presumably the table updated by your job. – MattMcKnight Nov 08 '16 at 12:17