
We have a script that periodically downloads documents from various sources. I'm moving it over to Celery, and while doing so I'd like to take advantage of connection pooling, but I'm not sure how to go about it.

My current thought is to do something like this using Requests:

import celery
import requests

s = requests.session()

@celery.task(retry=2)
def get_doc(url):
    doc = s.get(url)
    #do stuff with doc

But I'm concerned that the connections will stay open indefinitely.

I really only need the connections to stay open so long as I'm processing new documents.
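To make that lifetime concern concrete, here is a minimal sketch (not tied to Celery) of a Session whose pool size is capped and whose connections are released as soon as a batch finishes. The helper names `make_session` and `fetch_batch`, the pool size, and the timeout are assumptions for illustration only.

```python
import requests
from requests.adapters import HTTPAdapter


def make_session(pool_size=4):
    """Build a Session whose connection pools are capped (hypothetical helper)."""
    session = requests.Session()
    # One adapter serves both schemes; it keeps at most pool_size idle
    # connections per host and caches pools for at most pool_size hosts.
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def fetch_batch(urls):
    """Fetch every URL over one Session, then release its pooled connections."""
    bodies = []
    # Leaving the with-block calls session.close(), which closes the adapters
    # and the pooled connections they hold.
    with make_session() as session:
        for url in urls:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            bodies.append(response.content)
    return bodies
```

With this shape the connections live only for the duration of one `fetch_batch` call, at the cost of doing the whole batch inside a single function (or a single task) rather than one task per document.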

So is something like this possible?

import celery
import requests


def get_all_docs():
    docs = Doc.objects.filter(some_filter=True)
    s = requests.session()
    for doc in docs:
        # pass the shared session along with each task
        get_doc.delay(doc.url, s)

@celery.task(retry=2)
def get_doc(url, s):
    doc = s.get(url)
    # do stuff with doc

However, in this case, I'm not certain that the connection sessions will persist across instances, or if Requests will create new connections once the pickling / unpickling is complete.
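The pickling question can be checked locally without Celery. The sketch below round-trips a Session through `pickle` and compares the pool managers before and after; the User-Agent value and the example URL are placeholders, and no request is actually sent.

```python
import pickle

import requests

original = requests.Session()
original.headers["User-Agent"] = "doc-fetcher/0.1"  # hypothetical value

# Roughly what a pickle-based task serializer would do to the argument.
clone = pickle.loads(pickle.dumps(original))

original_adapter = original.get_adapter("https://example.com/")
clone_adapter = clone.get_adapter("https://example.com/")

# Configuration such as headers survives the round trip...
same_headers = clone.headers["User-Agent"] == original.headers["User-Agent"]
# ...but the clone is given its own adapter with a freshly built pool manager,
# so it does not share any pooled connections with the original.
shares_pool = clone_adapter.poolmanager is original_adapter.poolmanager

print(same_headers, shares_pool)
```

If this holds, passing the session as a task argument would carry its settings to the worker but not its open connections. Note also that Celery has used JSON as its default serializer since version 4.0, and a Session is not JSON-serializable, so this approach would additionally require switching the serializer to pickle.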

Lastly, I could try the experimental support for task decorators on a class method, so something like this:

import celery
import requests


class GetDoc(object):
    def __init__(self):
        self.s = requests.session()

    @celery.task(retry=2)
    def get_doc(self, url):
        doc = self.s.get(url)
        # do stuff with doc

The last one seems like the best approach, and I'm going to test it; however, I was wondering whether anyone here has already done something similar, or has a better approach than the ones above.
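Another option worth comparing is one lazily created session per worker process, sketched here in plain Python under the assumption that each worker process handles many tasks in a row. The names `get_session` and `close_session` are hypothetical. The Celery documentation describes a similar idea using a custom Task base class that caches a resource on the task instance.

```python
import os

import requests

_session = None
_session_pid = None


def get_session():
    """Return a Session owned by the current process, creating it on first use."""
    global _session, _session_pid
    pid = os.getpid()
    if _session is None or _session_pid != pid:
        # First call in this process (or the process was forked after the
        # session was created): start over with a fresh connection pool.
        _session = requests.Session()
        _session_pid = pid
    return _session


def close_session():
    """Close the pooled connections, e.g. when a batch of documents is done."""
    global _session, _session_pid
    if _session is not None:
        _session.close()
    _session = None
    _session_pid = None
```

A task body would then call `get_session().get(url)` instead of using a module-level `s`, and connections would be reused only among tasks that happen to run in the same process.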

Jeremy
James R
  • I suspect you're right. I'm not an expert in the inner workings of Celery, but from what I understand, each job is run by separate workers, and you have no guarantee that worker A performing a request to google.com will also perform the next request to google.com. I imagine that resource sharing across tasks is inherently against what Celery does, unless there is a specific Celery design feature to support this. – shazow Sep 07 '12 at 17:12
  • I'm thinking about this exact same thing. Did you ever come up with a solution? – mlissner Nov 19 '16 at 06:20
  • I would love to know the answer to this as well – digitaldavenyc Jan 02 '17 at 18:40

0 Answers