I'm running into an odd situation where Celery reprocesses a task that has already been completed. The overall design looks like this:
Celery Beat: pulls files periodically; when a file is pulled, it creates a new entry in the DB and delegates processing of that file to another Celery task on a one-worker queue (that way only one file gets processed at a time).
Celery task: processes the file; once it's done it's done, no retries, no loops.
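For reference, the beat side is wired up roughly like this in settings (a sketch; the entry name and interval here are illustrative, not the real values):

from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'periodic-pull-file': {
        'task': 'periodic_pull_file',
        'schedule': timedelta(minutes=5),  # illustrative interval
    },
}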
@app.task(name='periodic_pull_file')
def periodic_pull_file():
    for f in get_files_from_some_dir(...):
        ingested_file = IngestedFile(filename=filename)
        ingested_file.document.save(filename, File(f))
        ingested_file.save()
        process_import(ingested_file.id)
        # deletes the file from the dir source
        os.remove(....somepath)
def process_import(ingested_file_id):
    ingested_file = IngestedFile.objects.get(id=ingested_file_id)
    if 'foo' in ingested_file.filename.lower():
        f = process_foo
    else:
        f = process_real_stuff
    f.apply_async(args=[ingested_file_id], queue='import')
@app.task(name='process_real_stuff')
def process_real_stuff(file_id):
    #dostuff
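The body is essentially a single pass over the file that bumps a progress value as it goes; a simplified sketch (the progress field and per-line helper are illustrative, not the real code):

def run_import(ingested_file):
    # Sketch only: walk the file once and persist a rough percentage.
    total = ingested_file.document.size                   # file size in bytes
    done = 0
    ingested_file.document.open('rb')
    for line in ingested_file.document:
        handle_line(line)                                 # assumed helper
        done += len(line)
        ingested_file.progress = int(100 * done / total)  # assumed model field
        ingested_file.save(update_fields=['progress'])
    ingested_file.document.close()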
process_foo and process_real_stuff are just functions that loop over the file once, and once they're done, they're done. I can actually track the percentage of progress, and the interesting thing I noticed is that the same file kept getting processed over and over again (note that these are large files and processing is slow, taking hours per file). I started wondering whether duplicate tasks were being created in the queue, so I checked my Redis queue when I had 13 pending files to import:
-bash-4.1$ redis-cli -p 6380 llen import
(integer) 13
And aha, 13. I checked the content of each queued task to see whether it was just repeated ingested_file_ids, using:
redis-cli -p 6380 lrange import 0 -1
And they're all unique tasks with unique ingested_file_ids. Am I overlooking something? Is there any reason why it would finish a task and then loop over the same task again and again? This only started happening recently, with no code changes; before, things were pretty snappy and seamless. I also know it's not a "failed" task that somehow magically retries itself, because it isn't moving down in the queue: the worker receives the same task in the same order again and again, so it never gets to the other 13 files it should have processed.
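For anyone who wants to repeat that check, this is roughly how I decoded the queued payloads (a sketch assuming the default JSON task serializer and the redis-py client; the exact envelope layout depends on the Celery/kombu version):

import base64
import json

import redis

r = redis.StrictRedis(port=6380)
for raw in r.lrange('import', 0, -1):
    envelope = json.loads(raw)
    # kombu's Redis transport base64-encodes the serialized task body
    # inside the JSON message envelope.
    body = json.loads(base64.b64decode(envelope['body']))
    print(body['id'], body['task'], body['args'])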
Note, this is my worker:
python manage.py celery worker -A myapp -l info -c 1 -Q import
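To confirm it really is the same message being re-delivered (and not the beat task queuing duplicates), I'm also thinking of logging the request id inside the task, roughly like this (bind=True is only there for the diagnostic):

import logging

logger = logging.getLogger(__name__)

@app.task(name='process_real_stuff', bind=True)
def process_real_stuff(self, file_id):
    # If the same UUID shows up run after run, the broker is re-delivering
    # one message; if the UUIDs differ, duplicates are being queued.
    logger.info('starting task %s for file %s', self.request.id, file_id)
    #dostuff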