
After some research, I found more problems. Below is a more detailed example:

  1. Upload a list of URLs and assign a job_id to all of them (a queue name needs to be generated dynamically so the job can be purged later).
  2. Use Celery tasks to crawl each URL, e.g. extract.delay(job_id, url), and save the result to the DB.
  3. (There may be many jobs here: job1, job2, job3.) All tasks in all jobs are the same extract task, and just one worker should process all the queues. (How? I cannot tell the worker every queue name; see the sketch after this list.)
  4. Check the DB with select count(id) from xxx where job_id = yyy and compare it to len(urls), or find some other way for Celery to tell me that job yyy is done.
  5. Show the job's status (running or complete) on the website, and allow purging a job's queue from the web.
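
To make steps 1, 3 and 5 concrete, here is a rough sketch of what I imagine per-job queues could look like (this is just an assumption on my side; the app name, broker URL and the job_<id> queue-naming convention are made up):

from celery import Celery

app = Celery('crawler', broker='redis://localhost:6379/0')  # broker URL is just an example

@app.task
def extract(job_id, url):
    pass  # crawl the url and save the result to the DB here

def submit_job(job_id, urls):
    queue_name = 'job_%s' % job_id  # one dynamically named queue per job
    # ask the already-running worker to also consume the new queue
    app.control.add_consumer(queue_name, reply=True)
    for url in urls:
        extract.apply_async(args=(job_id, url), queue=queue_name)

def purge_job(job_id):
    # drop any tasks of this job that have not been picked up yet;
    # exact behavior depends on the broker
    with app.connection() as conn:
        conn.default_channel.queue_purge('job_%s' % job_id)

The part I am unsure about is whether add_consumer is the right way to have a single worker pick up queues that are created after it has started.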

I have never met this situation before. Does Celery have an easy way to solve my problem?

I need to add jobs dynamically, and each job contains a lot of tasks. All tasks are the same. How can I give different jobs different queue names, and have just one worker process all the queues, programmatically?

Mithril
  • Do you want to know how many are done, or just whether all of them are done? If you just want to know whether all of them are done, have you considered using a group? – user2097159 Jun 23 '15 at 12:33

1 Answer


I don't know the details of your web app, but this can be pretty straightforward.

(Using Django syntax)

You could make two models/DB tables: one to represent your batch, and one to represent each URL job.

from django.db import models

class ScrapeBatch(models.Model):
    id = models.AutoField(primary_key=True)  # Django would add this automatically if omitted

class ScrapeJob(models.Model):
    batch = models.ForeignKey(ScrapeBatch, on_delete=models.CASCADE)
    url = models.CharField(max_length=100)  # for example
    done = models.BooleanField(default=False)

Then, when you run your Celery tasks, use the ScrapeJob model as your reference:

from celery import shared_task

@shared_task
def scrape_url_celery_task(job_id):
    job = ScrapeJob.objects.get(id=job_id)
    scrape_url(job)  # your actual crawling/extraction logic
    job.done = True
    job.save()
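
To tie this together, a rough sketch of how a batch could be created and dispatched (start_batch is just a name I made up for illustration):

def start_batch(urls):
    # hypothetical helper: create the batch, one ScrapeJob per URL,
    # then queue a Celery task for each job
    batch = ScrapeBatch.objects.create()
    for url in urls:
        job = ScrapeJob.objects.create(batch=batch, url=url)
        scrape_url_celery_task.delay(job.id)
    return batch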

So in your web view, you could simply check whether all of your batch's jobs are done:

def batch_done(batch):
    return not batch.scrapejob_set.filter(done=False).exists()
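
And, if it helps, a hypothetical view exposing that status to your website (URL routing omitted):

from django.http import JsonResponse

def batch_status(request, batch_id):
    batch = ScrapeBatch.objects.get(id=batch_id)
    return JsonResponse({'complete': batch_done(batch)})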

So in summary:

  • A DB table that holds your URLs
  • A DB table to hold something like a batch number (with foreign-key relations to your URL table)
  • Celery marks URLs as scraped in the DB after each task completes
  • A simple query on the URL table tells you whether the job is done; you can show this value on the website
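
If you'd rather be notified than poll the DB (as the comments suggest), Celery's group/chord primitive can call a callback once every task in a batch has finished. A rough sketch, assuming a done flag is added to ScrapeBatch (not in the models above) and that a result backend is configured:

from celery import shared_task, chord

@shared_task
def mark_batch_done(results, batch_id):
    # called by Celery after every task in the chord header has finished;
    # assumes a `done` BooleanField has been added to ScrapeBatch
    ScrapeBatch.objects.filter(id=batch_id).update(done=True)

def launch_batch_as_chord(batch):
    header = [scrape_url_celery_task.s(job.id) for job in batch.scrapejob_set.all()]
    chord(header)(mark_batch_done.s(batch.id))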
rtpg
  • My current code is just like yours, but I think checking the DB every time is not very good. That's why I want to know whether there is a better way, such as a notification or something else from Celery, to deal with this problem. – Mithril Jun 23 '15 at 05:36