
I've been using Heroku to host my application for several years and just started running into issues with the worker queue getting backlogged. I was hoping I could fix this by increasing the number of workers running so queued jobs could be completed in parallel, but whenever I scale up my number of workers, all but one crash.
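For context, I'm scaling the workers with the standard Heroku CLI, something like:

    heroku ps:scale worker=2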

Here's my Procfile:

    web: vendor/bin/heroku-php-apache2 public
    worker: php /app/artisan queue:restart && php /app/artisan queue:work redis --tries=3 --timeout=30

Here's the output from my server logs when I scale my workers to anything greater than 1 (in this example, scaling to 2 workers):

    Mar 16 06:04:51 heroku/worker.1 Starting process with command `php /app/artisan queue:restart && php /app/artisan queue:work redis --tries=3 --timeout=30`
    Mar 16 06:04:52 heroku/worker.1 State changed from starting to up
    Mar 16 06:04:54 app/worker.1 Broadcasting queue restart signal.
    Mar 16 06:04:58 heroku/worker.2 Process exited with status 0
    Mar 16 06:04:58 heroku/worker.2 State changed from up to crashed
    Mar 16 06:04:58 heroku/worker.2 State changed from crashed to starting
    Mar 16 06:05:09 heroku/worker.2 Starting process with command `php /app/artisan queue:restart && php /app/artisan queue:work redis --tries=3 --timeout=30`
    Mar 16 06:05:10 heroku/worker.2 State changed from starting to up
    Mar 16 06:05:14 app/worker.2 Broadcasting queue restart signal.
    Mar 16 06:05:19 heroku/worker.1 Process exited with status 0
    Mar 16 06:05:19 heroku/worker.1 State changed from up to crashed

As you can see, both workers start, but only worker.2 stays in the up state.

The crashed workers try restarting every 10 minutes with the same result as above.

When I run heroku ps, here's what I see:

    === worker (Standard-1X): php /app/artisan queue:restart && php /app/artisan queue:work redis --tries=3 --timeout=30 (2)
    worker.1: crashed 2021/03/16 06:05:19 -0600 (~ 20m ago)
    worker.2: up 2021/03/16 06:05:10 -0600 (~ 20m ago)

(My normal web dynos scale up and down just fine, so I'm not showing that here.)

Any thoughts as to what could be happening? My first thought was that there was an issue on Heroku's end, but I ruled that out. My second thought is that my worker's Procfile entry could be causing problems, but I don't know enough about that entry to pinpoint the cause.

Again, this has worked fine with 1 worker for a long time, and the crashing only happens when I try to scale to more than 1 worker. No matter how many workers I scale to, only one stays up, able to receive and process jobs.

Misc info:

  • Heroku stack: Heroku-18
  • Laravel version: 8.*
  • Queue driver: Redis

Update: I scaled up the dynos on my staging environment and was able to scale the workers up and down without any crashes. Now I'm thinking there might be some kind of add-on conflict or something else going on. I'll update this if I find anything else out (I've already reached out to Heroku support).

BrandonO
  • Maybe these links could help you https://stackoverflow.com/questions/9659907/heroku-dyno-worker-crashes-at-start or this https://stackoverflow.com/questions/9964303/rails-3-1-heroku-workers-crashing – Basharmal Mar 16 '21 at 19:36
  • I saw those links when I was researching this earlier. Unfortunately they don't seem to apply to my problem exactly and don't explain why having one worker runs just fine, but becomes problematic when trying to scale up. – BrandonO Mar 16 '21 at 19:47

1 Answer


The problem was the php /app/artisan queue:restart command in the Procfile. Every time a new worker started, it broadcast the restart signal, which told every other running worker to exit, so all but one of the workers ended up crashed.

I took out that command and I can scale my workers without issue now.
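For reference, the worker entry in the Procfile is now just the queue:work command (the same command heroku ps reports below):

    worker: php /app/artisan queue:work redis --queue=high,default,sync,emails,cron --tries=3 --timeout=30

Running heroku ps now shows both workers up: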

    === worker (Standard-1X): php /app/artisan queue:work redis --queue=high,default,sync,emails,cron --tries=3 --timeout=30 (2)
    worker.1: up 2021/03/17 17:29:32 -0600 (~ 8m ago)
    worker.2: up 2021/03/17 17:35:58 -0600 (~ 2m ago)

When a deployment is made to Heroku, the running dynos receive a SIGTERM (followed by a SIGKILL for anything still lingering) and fresh dynos are started, so every queue worker is already restarted on deploy. That makes the php /app/artisan queue:restart command redundant and unnecessary here.

The main confusion came from the way Laravel words the note about queue workers needing a restart after deployment: https://laravel.com/docs/8.x/queues#queue-workers-and-deployment. That step is necessary on servers that don't restart worker processes on deploy the way Heroku restarts its dynos.
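For anyone curious what queue:restart actually does: as I understand it, it never restarts processes itself; it just writes a timestamp to the cache, and each queue:work daemon exits cleanly when it notices that value has changed between jobs, assuming a supervisor will bring it back up. A simplified sketch of the idea (not Laravel's actual classes; $cache stands in for the framework's cache repository):

    <?php
    // Rough sketch of the queue:restart mechanism, not Laravel's real implementation.

    // "php artisan queue:restart" effectively just records a timestamp:
    function broadcastRestart($cache): void
    {
        $cache->forever('illuminate:queue:restart', time());
    }

    // Each "queue:work" daemon remembers the value from when it booted and
    // exits with status 0 as soon as it changes, expecting to be restarted:
    function workerLoop($cache): void
    {
        $lastRestart = $cache->get('illuminate:queue:restart');

        while (true) {
            // ... pop and process the next job here ...

            if ($cache->get('illuminate:queue:restart') !== $lastRestart) {
                exit(0); // matches the "Process exited with status 0" lines in the logs
            }
        }
    }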

BrandonO
  • Good job spotting the issue. The `queue:restart` command name is misleading - all it does is set a flag in the cache that all queue workers will read before starting a new job. If the flag has a new value, they exit and assume a supervisor will restart them. So when you were scaling up, each time heroku was adding a new dyno it was also killing all others. It does restart the process but there's a [backoff policy](https://devcenter.heroku.com/articles/dynos#dyno-crash-restart-policy) that makes it unreliable. – Robin Apr 27 '21 at 08:52