
The Sidekiq wiki talks about the need for jobs to be idempotent and transactional. Conceptually this makes sense to me, and this SO answer looks like an effective approach at a small scale. But it's not perfect. Jobs can disappear in the middle of running: we've noticed certain work left incomplete, and the logs cut off mid-work as if the job just evaporated. That's probably due to a server restart or something similar, but the job often doesn't find its way back into the queue. super_fetch tries to address this, but it errs on the side of duplicating jobs, and with it we see a lot of jobs that end up running twice simultaneously. A database transaction cannot protect us from duplicate work if both transactions start at the same time; we'd need locking to prevent that.
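
To illustrate what I mean by locking, something like this sketch, where a row lock plus a flag guards against the duplicate copy (the `email_sent` flag and job names are hypothetical):

    class SendEmailJob
      include Sidekiq::Worker

      def perform(user_id)
        User.transaction do
          # SELECT ... FOR UPDATE: a duplicate copy of this job blocks here
          # until the first commits, then sees the flag and skips the send
          user = User.lock.find(user_id)
          next if user.email_sent?

          UserMailer.notification(user).deliver_now
          user.update!(email_sent: true)
        end
      end
    end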

Besides the transaction, though, I haven't been able to figure out a graceful solution when we want to do things in bulk. For example, let's say I need to send out 1000 emails. Options I can think of:

  1. Spawn 1000 jobs, each of which individually starts a transaction, updates a record, and sends an email. This seems to be the default, and it's pretty good in terms of idempotency. But it has the side effect of creating a distributed N+1 query, spamming the database and causing user-facing slowdowns and timeouts.

  2. Handle all of the emails in one large transaction and accept that emails may be sent more than once, or not at all, depending on the structure. For example:

    User.transaction do
      users.update_all(email_sent: true)
      users.each { |user| UserMailer.notification(user).deliver_now }
    end
    

    In the above scenario, if the UserMailer loop halts partway through due to an error or a server restart, the transaction rolls back and the job goes back into the queue. But any emails that have already been sent can't be recalled, since they're independent of the transaction. So a subset of the emails will be re-sent, potentially multiple times if there's a code error and the job keeps requeueing.

  3. Handle the emails in small batches of, say, 100, and accept that up to 100 may be sent more than once, or not at all, depending on the structure, as above (see the sketch after this list).
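
For concreteness, the batched version might look something like this, with each batch in its own transaction (using Rails' `in_batches`; the batch is loaded before the flag flips so the mutated scope isn't re-queried):

    users.in_batches(of: 100) do |batch|
      batch_users = batch.to_a # materialize before mutating the flag
      User.transaction do
        User.where(id: batch_users.map(&:id)).update_all(email_sent: true)
        # If this loop dies partway, only this batch of 100 is affected
        batch_users.each { |user| UserMailer.notification(user).deliver_now }
      end
    end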

What alternatives am I missing?

One additional problem with any transaction-based approach is the risk of deadlocks in PostgreSQL. When a user does something in our system, we may spawn several processes that need to update the record in different ways. In the past, the more we used transactions, the more deadlock errors we saw. It's been a couple of years since we went down that path, so maybe more recent versions of PostgreSQL handle deadlocks better. We tried going one step further and locking the record, but then we started getting timeouts on the user side as web processes competed with background jobs for locks.

Is there any systematic way of handling jobs that gracefully copes with these issues? Do I just need to accept the distributed N+1s and layer in more caching to deal with it? Given that we need to use the database to ensure idempotency anyway, it makes me wonder if we should instead be using delayed_job with active_record, since that handles its own locking internally.

lobati

2 Answers


This is a really complicated/loaded question: the right architecture depends on more factors than can be concisely described in a simple question/answer format. However, I can give a general recommendation.

Separate Processing From Delivery

start a transaction, update a record, and send an email

Separate these steps out. It's better to avoid doing both a DB update and an email send inside one transaction, batched or not.

Do all your logic and record updates inside transactions, separately from email sends. Do them individually or in bulk, or perhaps even in the original web request if it's fast enough. If you save results to the DB, you can use transactions to roll back failures. If you save results as args to email send jobs, make sure processing for the entire batch succeeds before enqueuing the batch. You have flexibility now because it's a pure data transform.

Enqueue email send jobs for each of those data transforms. These jobs must do little to no logic and processing! Keep them dead simple, with no DB writes -- all processing should already have been done. Only pass values to an email template and send. This is critical because this external effect can't be wrapped in a transaction. Making email send jobs read-only for your system (they "write" to email, which is external to your system) also gives you flexibility -- you can cache, read from replicas, etc.
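
As a rough sketch of that shape (class names, the mailer method, and the `needs_notification` flag are made up for illustration):

# Phase 1: pure data transform -- all logic and DB writes inside a transaction
class NotificationProcessingJob
  include Sidekiq::Worker

  def perform
    payloads = User.transaction do
      users = User.where(needs_notification: true).to_a # load before mutating
      User.where(id: users.map(&:id)).update_all(needs_notification: false)
      users.map { |u| [u.email, u.name] }
    end

    # enqueue sends only after the entire batch has committed
    payloads.each { |email, name| NotificationSendJob.perform_async(email, name) }
  end
end

# Phase 2: dead simple -- no logic, no DB writes, just values into a template
class NotificationSendJob
  include Sidekiq::Worker

  def perform(email, name)
    UserMailer.notification_email(email, name).deliver_now
  end
end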

By doing this, you separate the DB load of email processing from that of email sending, and the two can be dealt with independently. Bugs in your email processing won't affect email sends, and email send failures won't affect email processing.

Regarding Row Locking & Deadlocks

There shouldn't be any need to lock rows at all anymore -- the transaction around processing is enough to let the DB engine handle it. There also shouldn't be any deadlocks, since no two jobs are reading and writing the same rows.

Response: Jobs that die in the middle

Say the job is killed just after the transaction completes but before the emails go out.

I've reduced the possibility of that happening as much as possible by processing in a transaction separately from email sending, and making email sending as dead simple as possible. Once the transaction commits, there is no more processing to be done, and the only things left to fail are systems generally outside your control (Redis, Sidekiq, the DB, your hosting service, the internet connection, etc).

Response: Duplicate jobs

Two copies of the same job might get pulled off the queue, both checking some flag before it has been set to "processing"

You're using Sidekiq rather than writing your own async job system, so you should treat job system failures as out of your scope. What remains are your jobs' performance characteristics and your job system configuration. If you're getting duplicate jobs, my guess is that your jobs take longer to complete than the configured job timeout: Sidekiq thinks the job died (it hasn't reported success or failure yet) and spawns another attempt. Speed up or break up the job so it succeeds or fails within the configured timeout, and this will stop happening (99.99% of the time).

Unlike web requests, there's no human on the other side of an async job system deciding whether or not to retry. That's why your jobs' performance profiles need to be predictable. Once a system gets large enough, I'd expect completely separate job queues and workers based on differences like the following (see the sketch after the list):

  • expected job run time
  • expected job CPU/mem/disk usage
  • expected job DB or other I/O usage
  • job read only? write only? both?
  • jobs hitting external services
  • jobs users are actively waiting on
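
With Sidekiq, that separation is usually expressed with `sidekiq_options` plus a dedicated worker process per queue (queue names and classes here are illustrative):

class BulkReportJob
  include Sidekiq::Worker
  # heavy, long-running: isolate on its own queue and worker pool,
  # e.g. a process started with `sidekiq -q bulk -c 5`
  sidekiq_options queue: "bulk"

  def perform(report_id)
    # ...
  end
end

class PasswordResetEmailJob
  include Sidekiq::Worker
  # short and user-facing: its own queue with more concurrency,
  # e.g. `sidekiq -q critical -c 25`
  sidekiq_options queue: "critical"

  def perform(user_id)
    # ...
  end
end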
Kache
  • Your answer doesn't really account for jobs that die in the middle or duplicate jobs. Say the job is killed just after the transaction completes but before the emails go out. Then the next time it runs (if it runs) it will skip emailing all the records that were updated. – lobati Jun 28 '20 at 05:08
  • Similarly, suppose a job gets duplicated and runs simultaneously. In cases where it's important that a job runs only once, a transaction is not going to be enough. Two copies of the same job might get pulled off the queue, both checking some flag before it has been set to "processing", and both assume they should go ahead and do the work. We've seen this and the case above where the job dies in the middle, and both are very messy to clean up. – lobati Jun 28 '20 at 05:11
  • It's impossible to make those kinds of 100% guarantees. For any async job system with retries, you must choose between fundamental options: potentially zero or one times OR potentially 1+ times. All you can do is use good engineering to reduce the possibility of failures to practically zero. My suggestion aims to do just that. I'll respond to some of your example cases in my answer above. – Kache Jun 28 '20 at 05:49
  • "If you're getting duplicate jobs, my guess is your jobs are taking longer to complete than the configured job timeout." In this case there are often jobs that haven't even started, yet. Sidekiq boots up and sees a job that just went into the queue, then immediately duplicates it. We end up with two of the same job running simultaneously, as in starting in the same second, not just overlapping. – lobati Jun 29 '20 at 00:18
  • Okay... if that's true, you can't even be thinking about async job strategy right now. The underlying base system is messed up, and your underlying assumptions have been broken. You're gonna have to dig and figure out if these job duplications are a strange Sidekiq edge case or something else from your own systems. Even if you can't find the culprit, perhaps you can spin up a new isolated job system that won't inherit the problem. By the way, if I've been helpful at all, would appreciate an upvote. – Kache Jun 29 '20 at 01:40

This is a super interesting question but I'm afraid it's nearly impossible to give a "one size fits all" kind of answer that is anything but rather generic. What I can try to answer is your question of individual jobs vs. all jobs at once vs. batching.

In my experience, the approach of having a scheduling job that then schedules individual jobs generally works best. So in a full-blown system I have a schedule defined in clockwork that runs a scheduler job, which then enqueues the individual jobs:

# in config/clock.rb
every(1.day, 'user.usage_report', at: '00:00') do
  # runs inline in the clockwork process; it's quick since it only enqueues
  UserUsageReportSchedulerJob.perform_now
end

# in app/jobs/user_usage_report_scheduler_job.rb
class UserUsageReportSchedulerJob < ApplicationJob
  def perform
    # need_usage_report is a scope to determine the list of users who need a report.
    # This could, of course, also be "all".
    User.need_usage_report.each(&UserUsageReportJob.method(:perform_later))
  end
end

# in app/jobs/user_usage_report_job.rb
class UserUsageReportJob < ApplicationJob
  def perform(user)
    # the actual report generation
  end
end

If you're worried about concurrency here, tweak Sidekiq's concurrency settings and potentially the connection settings of your PostgreSQL server to allow for the desired level of concurrency. I've had projects where schedulers enqueued tens of thousands of individual (small) jobs, which Sidekiq then happily worked through 10 or 20 at a time on a low-priority queue and processed over a couple of hours, with no issues whatsoever for Sidekiq itself, the server, the database, etc.
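
In ActiveJob terms (which the example above uses), routing the per-user jobs to a low-priority queue is a one-liner, with concurrency then capped on the worker process (the queue name is illustrative):

# in app/jobs/user_usage_report_job.rb
class UserUsageReportJob < ApplicationJob
  # processed by a dedicated worker, e.g. `sidekiq -q low_priority -c 10`
  queue_as :low_priority

  def perform(user)
    # the actual report generation
  end
end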

Clemens Kofler
  • This works if you have some regular task that doesn't depend on user input, but most of ours are in response to some user action. Generating a PDF on request or sending an email. – lobati Jun 29 '20 at 00:16
  • Gotcha. However, I don't think this changes much about my answer: You should still keep your jobs atomic and trust the system to eventually reach consistency. If your queue fills up so much that users need to wait longer than a few seconds, you can always take the approach that big boys like Dropbox take and tell the user "We'll send you an email once the report is ready" and then send a link to the report via email once it's available. – Clemens Kofler Jun 30 '20 at 18:09