8

So basically I have quite a complex workflow, which looks similar to this:

>>> res = (add.si(2, 2) | add.s(4) | add.s(8))()
>>> res.get()
16

Afterwards it's rather trivial for me to walk up the result chain and collect all individual results:

>>> res.parent.get()
8

>>> res.parent.parent.get()
4
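
For example, collecting them all in a loop:

>>> node, collected = res, []
>>> while node is not None:
...     collected.append(node.get())
...     node = node.parent
...
>>> collected
[16, 8, 4]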

My problem is, what if my third task depends on knowing the result of the first one, but in this example only receives the result of the second?

Also, the chains are quite long and the results aren't that small, so just passing the input through as part of each result would unnecessarily pollute the result store. The result backend is Redis, so limitations that apply when using RabbitMQ, ZeroMQ, etc. don't apply here.


3 Answers

6

Maybe your setup is too complex for this, but I like to use group combined with a noop task to accomplish something similar. I do it this way because I want to highlight the areas of my pipeline that are still synchronous (usually so they can be removed).

Using something similar to your example, I start with a set of tasks which look like this:

tasks.py:

from celery import Celery

app = Celery('tasks', backend="redis", broker='redis://localhost')


@app.task
def add(x, y):
    return x + y


@app.task
def xsum(elements):
    return sum(elements)


@app.task
def noop(ignored):
    return ignored

With these tasks I create a chain using a group to control the results which depend on synchronous results:

In [1]: from tasks import add, xsum, noop
In [2]: from celery import group

# First I run the task which I need the value of later, then I send that result to a group where the first task does nothing and the other tasks are my pipeline.
In [3]: ~(add.si(2, 2) | group(noop.s(),  add.s(4) | add.s(8)))
Out[3]: [4, 16]

# At this point I have a list where the first element is the result of my original task and the second element has the result of my workflow.
In [4]: ~(add.si(2, 2) | group(noop.s(),  add.s(4) | add.s(8)) | xsum.s())
Out[4]: 20

# From here, things can go back to a normal chain
In [5]: ~(add.si(2, 2) | group(noop.s(),  add.s(4) | add.s(8)) | xsum.s() | add.s(1) | add.s(1))
Out[5]: 22
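
Outside the IPython shell, the `~` operator is just shorthand for calling the signature and waiting on its result, so the same chord can be driven explicitly. A small sketch, assuming the tasks.py module above:

from celery import group
from tasks import add, noop, xsum

# Same pipeline as In [4]: run the first task, fan out into the group,
# then sum the two results with xsum.
res = (add.si(2, 2) | group(noop.s(), add.s(4) | add.s(8)) | xsum.s()).apply_async()
print(res.get())  # 20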

I hope this is useful!

erik-e
    This is brilliant! Thought out of the box and the smallest footprint of all the answers. Unfortunatly the is an open celery bug, which prevents nested groups, so I can't use it at the moment, but I'm l'm looking forward to switch to this one day! – WhatIsName Apr 19 '15 at 15:54
  • Oh, interesting. Do you by chance have a link to the bug? I use nested groups on occasion and would like to find some more information. – erik-e Apr 19 '15 at 22:58
  • @erik-e is there any way to get the last task's output and create a group on it within a chain? Like `(return_range_task.s() | group(add.s(I, 4) for I in range_task_output) | filter_things.s())`, where `return_range_task` returns the array, and the `add` task takes each array element as its first argument and processes it? – Nilesh Mar 24 '21 at 20:35
2

A simple workaround is to store the results of tasks in a list and use them in your tasks.

from celery import Celery, chain
from celery.signals import task_success

results = []

app = Celery('tasks', backend='amqp', broker='amqp://')


@task_success.connect()
def store_result(**kwargs):
    sender = kwargs.pop('sender')
    result = kwargs.pop('result')
    results.append((sender.name, result))


@app.task
def add(x, y):
    print("previous results", results)
    return x + y

Now, in your chain, all the previous results can be accessed from any task, in any order.
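
A quick way to exercise this (assuming the module above is saved as tasks.py and a worker is running it; note that `results` is a plain module-level list, so it is only shared within a single worker process):

from celery import chain
from tasks import add

# Each add() call prints the (task name, result) pairs collected so far
# in its worker process before returning its own sum.
res = chain(add.s(2, 2), add.s(4), add.s(8)).apply_async()
print(res.get())  # 16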

Chillar Anand
  • Is there any way to do `(return_range_task.s() | group(add.s(I, 4) for I in range_task_output) | filter_things.s())`? – Nilesh Mar 24 '21 at 20:37
  • You can use celery signals and then trigger the group in task_success signal. @Nilesh – Chillar Anand Mar 25 '21 at 03:45
  • It's the same approach as https://stackoverflow.com/a/14995090/243031; I can do that in another task, but I'm looking for something where I can access it in the same task chain. – Nilesh Mar 25 '21 at 12:30
2

I assign a job id to every chain and track the job by saving its data in a database.

Launching the queue

if __name__ == "__main__":
  # Generate unique id for the job
  job_id = uuid.uuid4().hex
  # This is the root parent
  parent_level = 1
  # Pack the data. The last value is your value to add
  parameters = job_id, parent_level, 2
  # Build the chain. I added a clean task that removes the data
  # created during the process (if you want it)
  add_chain = add.s(parameters, 2) | add.s(4) | add.s(8) | clean.s()
  add_chain.apply_async()

Now the tasks

# Function to store a result. I used SQLAlchemy (MySQL) but you can
# change it to whatever you want (a distributed file system, for example)
@inject.params(entity_manager=EntityManager)
def save_result(job_id, level, result, entity_manager):
  r = Result()
  r.job_id = job_id
  r.level = level
  r.result = result
  entity_manager.add(r)
  entity_manager.commit()

#Restore a result from one parent
@inject.params(entity_manager=EntityManager)
def get_result(job_id, level, entity_manager):
  result = entity_manager.query(Result).filter_by(job_id=job_id, level=level).one()
  return result.result

#Clear the data or do something with the final result
@inject.params(entity_manager=EntityManager)
def clear(job_id, entity_manager):
  entity_manager.query(Result).filter_by(job_id=job_id).delete()
  entity_manager.commit()  # commit so the delete actually persists

@app.task()
def add(parameters, number):
  # Extract data from parameters list
  job_id, level, other_number = parameters

  #Load result from your second parent (level - 2)
  #For level 3 parent level - 3 and so on
  #second_parent_result = get_result(job_id, level - 2)

  # do your stuff, I guess you want to add numbers
  result = number + other_number
  save_result(job_id, level, result)

  # Return the result of the sum (or anything you want), but you have to return
  # something, because the next "add" task expects a 3-tuple as its parameters
  # Of course you should return the current job_id and increment the parent level
  return job_id, level + 1, result

@app.task()
def clean(parameters):
  job_id, level, result = parameters
  #Do something with final result or not
  #Clear the data
  clear(job_id)
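
The commented-out `get_result` call above is where a dependency on an earlier task goes. For example, a hypothetical variant of `add` that also consumes the result stored two levels up the chain could look like this:

@app.task()
def add_with_grandparent(parameters, number):
  # Hypothetical variant of "add": besides the usual sum, it also reads
  # the result that the task two levels up the chain stored for this job.
  job_id, level, other_number = parameters

  # level - 2 is the "grandparent" task in the chain
  grandparent_result = get_result(job_id, level - 2)

  result = number + other_number + grandparent_result
  save_result(job_id, level, result)

  # Keep the same return contract so the next task in the chain still works
  return job_id, level + 1, result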

I use an entity manager to handle the database operations. My entity manager uses SQLAlchemy and MySQL. I also use a `result` table to store the partial results. This part should be swapped for whatever storage system suits you best (or kept as-is if MySQL is fine for you).

import os

import inject
import yaml
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

# Declarative base used by the Result model below
Base = declarative_base()

class EntityManager():

  session = None

  @inject.params(config=Configuration)
  def __init__(self, config):
    conf = config['persistence']
    uri = conf['driver'] + "://" + conf['username'] + ":@" + conf['host'] + "/" + conf['database']

    engine = create_engine(uri, echo=conf['debug'])

    Session = sessionmaker(bind=engine)
    self.session = Session()

  def query(self, entity_type):
    return self.session.query(entity_type)

  def add(self, entity):
    return self.session.add(entity)

  def flush(self):
    return self.session.flush()

  def commit(self):
    return self.session.commit()

class Configuration:
  def __init__(self, params):
    f = open(os.environ.get('PYTHONPATH') + '/conf/config.yml')
    self.configMap = yaml.safe_load(f)
    f.close()

  def __getitem__(self, key: str):
    return self.configMap[key]

class Result(Base):
  __tablename__ = 'result'

  id = Column(Integer, primary_key=True)
  job_id = Column(String(255))
  level = Column(Integer)
  result = Column(Integer)

  def __repr__(self):
    return "<Result (job='%s', level='%s', result='%s')>" % (self.job_id, str(self.level), str(self.result))

I use the `inject` package as a dependency injector. It reuses the injected objects, so you can inject database access wherever you need it without worrying about the connection.

The Configuration class loads the database access data from a config file. You can replace it with static data (a hardcoded map) for testing.
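
For instance, a hard-coded stand-in (hypothetical, only for testing) just needs to expose the same `persistence` keys that `EntityManager.__init__` reads:

class StaticConfiguration:
  # Hypothetical drop-in replacement for Configuration, handy for quick tests.
  configMap = {
    'persistence': {
      'driver': 'mysql',            # SQLAlchemy driver prefix
      'username': 'celery_user',    # placeholder credentials
      'host': 'localhost',
      'database': 'celery_results',
      'debug': False,               # passed to create_engine(echo=...)
    }
  }

  def __getitem__(self, key):
    return self.configMap[key]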

Replace the dependency injection with anything else that suits you; this is just my solution, and I only added it for a quick test.

The key here is to save the partial results somewhere outside the queue system and, in the tasks, return the data needed to access those results (the job_id and the parent level). The extra data you pass through the chain is small: just an address (job_id + parent level) that points to the real data (the big stuff).

This is the solution I'm using in my own software.

Álvaro García
  • Thank you! Honestly I think all three answers are great and deserve the bounty. I went for yours because it produces the least headache by keeping previous results out of my actual result storage. – WhatIsName Apr 19 '15 at 15:48
  • Is it possible to read the previous task's output in a chain and create a group like `(return_range_task.s() | group(add.s(I, 4) for I in range_task_output) | filter_things.s())`? – Nilesh Mar 24 '21 at 20:38