4

I have a database that has a list of rows that need to be operated on. It looks something like this:

id       remaining        delivered   locked
============================================
1        10               24          f 
2        6                0           f
3        0                14          f

I am using DataMapper with Ruby, but really I think this is a general programming question that isn't specific to the exact implementation I'm using...

I am creating a bunch of worker threads that do something like this (pseudo-ruby-code):

while true do
  t = any_row_in_database_where_remaining_greater_than_zero_and_unlocked
  t.lock   # update database to set locked = true
  t.do_some_stuff
  t.delivered += 1
  t.remaining -= 1
  t.unlock
end

Of course, the problem is, these threads compete with each other and the whole thing isn't really thread safe. The first line in the while loop can easily pull out the same row in multiple threads before they get a chance to get locked.

I need to make sure one thread is only working on one row at the same time.

What is the best way to do this?

MikeC8
  • 3,783
  • 4
  • 27
  • 33
  • Since Ruby has a global interpreter lock, you're already guaranteed that only one thread will be working at a time. – robbrit Apr 14 '12 at 17:36
  • 1
    I do not believe this is correct...the first line of the thread could be executed in multiple threads.. – MikeC8 Apr 14 '12 at 17:37
  • Ah yeah, you're right, in Ruby 1.9 it is possible to have race conditions. Why not instead of pulling a single row at a time, pull N rows where N is the number of threads and then have each thread operate on one of the returned rows? Then when all the threads are done, pull another N, until you're done. – robbrit Apr 14 '12 at 17:40
  • @robbrit: race conditions between threads are still possible in interpreters with a GIL. – Confusion Apr 14 '12 at 17:47
  • @robrit this is just not efficient enough. why add that extra delay? – MikeC8 Apr 14 '12 at 17:49
  • 1
    @robbrit Not being able to do parallel computing has nothing to do with concurrency `:)`. – Andrew Marshall Apr 14 '12 at 18:37

2 Answers2

6

The key step is when you select an unlocked row from the database and mark it as locked. If you can do that safely then everything else will be fine.

2 ways I know of that can make this safe are pessimistic and optimistic locking. They both rely on your database as the ultimate guarantor when it comes to concurrency.

Pessimistic Locking

Pessimistic locking means acquiring a lock upfront when you select the rows you want to work with, so that no one else can read them. Something like

SELECT * from some_table WHERE ... FOR UPDATE

works with mysql and postgres (and possibly others) and will prevent any other connection to the database from reading the rows returned to you (how granular that lock is depends on the engine used, indexes etc - check your database's documentation). It's called pessimistic because you are assuming that a concurrency problem will occur and acquire the lock preventatively. It does mean that you bear the cost of locking even when not necessary and may reduce your concurrency depending on the granularity of the lock you have.

Optimistic Locking

Optimistic locking refers to a technique where you don't want the burden of a pessimistic lock because most of the time there won't be concurrent updates (if you update the row setting the locked flag to true as soon as you have read the row, the window is relatively small). AFAIK this only works when updating one row at a time

First add an integer column lock_version to the table. Whenever you update the table, increment lock_version by 1 alongside the other updates you are making. Assume the current lock_version is 3. When you update, change the update query to

update some_table set ... where id=12345 and lock_version = 3

and check the number of rows updated (the db driver returns this). if this updates 1 row then you know everything was ok. If this updates 0 rows then either the row you wanted was deleted or its lock version has changed, so you go back to step 1 in your process and search for a new row to work on.

I'm not a datamapper user so I don't know whether it / plugins for it provide support for these approaches. Active Record supports both so you can look there for inspiration if data mapper doesn't.

Frederick Cheung
  • 83,189
  • 8
  • 152
  • 174
1

I would use a Mutex:

# outside your threads
worker_updater = Mutex.new

# inside each thread's updater
while true
  worker_updater.synchronize do
    # your code here
  end
  sleep 0.1 # Slow down there, mister!
end

This guarantees that only one thread at a time can enter the code in the synchronize. For optimal performance, consider what portion of your code needs to be thread-safe (first two lines?) and only wrap that portion in the Mutex.

Phrogz
  • 296,393
  • 112
  • 651
  • 745
  • One thread, yes.. but databases are often accessed by different apps running on different system. This doesn't play into that well.. – baash05 Dec 16 '12 at 22:45
  • @daveatflow Consider the subject of this question and repeated use of the word "thread" inside it. – Phrogz Dec 16 '12 at 23:53