2

I've got an OS X app (Yosemite, 10.10) that performs long-running jobs involving heavy use of sqlite across multiple threads. And I've hit a deadlock across 8 threads, all of which are stuck in sqlite code connecting to different database files. There is no apparent resource-related connection between them. I'm debugging it on a new Mac Pro (late 2013).

Four of them are at this stack. Of those, three are operating on the same table (again, different database files); three are updating and one is querying.

__psynch_mutexwait
_pthread_mutex_lock
unixLock
sqlite3PagerSharedLock
sqlite3BtreeBeginTrans
sqlite3VdbeExec
sqlite3_step

One is at this stack, updating the same table as three of the ones at the stack above.

guarded_close_np
nolockClose
pager_end_transaction
sqlite3BtreeCommitPhaseTwo
sqlite3VdbeHalt
sqlite3VdbeExec
sqlite3_step

Two are at this stack, opening database files with the same name in different locations. The open mode is SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX.

__psynch_mutexwait
_pthread_mutex_lock
sqlite3ParseUri
openDatabase

One is at this stack, ending a transaction.

__psynch_mutexwait
_pthread_mutex_lock
unixLock
sqlite3VdbeHalt
sqlite3VdbeExec
sqlite3_step

So, the question: what could cause sqlite to deadlock without involving any shared resources?

Update: I've now got seven threads locked calling sqlite3_open_v2 and one on sqlite3_close, all on different database files (several with the same name, but in different folders). The stack is:

__psynch_mutexwait
_pthread_mutex_lock
sqlite3ParseUri
openDatabase

The close stack is:

guarded_close_np
unixClose
sqlite3PagerClose
sqlite3BtreeClose
sqlite3LeaveMutexAndCloseZombie
sqlite3Close

I was able to get it to go longer before locking up by fixing some memory leaks (this isn't running with ARC) and removing some transaction statements.

Update 2: I've hooked up SQLITE_LOG via sqlite3_config (documentation) and I'm seeing a fair amount of logging of code 28 (sqlite_warning) with the message "file renamed while open:".

Update 3: I wiped the machine and did a fresh install of Yosemite in an attempt to rule out file system issues. I'm still locking up the same way. It'll run for several minutes and then one by one the threads lock up. There's one at guarded_close_np, stuck in assembly at an instruction jae <address here> where the address being jumped to has the instruction retq. I've asked a separate question about the sqlite log message about files being renamed in hopes that it's related.

Community
  • 1
  • 1
Tom Hamming
  • 10,577
  • 11
  • 71
  • 145
  • I suspect that you are mistaken, and that some are accessing the same file. But note that SQLite creates a "journal" (WAL) file in some configurations, and it's possible that even though the SQLite files are distinct they're attempting to use the same journal file. – Hot Licks Jan 06 '15 at 21:11
  • No, I just went through all the threads again and inspected all the paths being passed to the sqlite functions. If indeed there are multiple threads hitting the same file(s), it's happening in a really, really weird way. And the DB files are created with WAL journaling and then converted to DELETE (they're only going to be used as read-only in the future). – Tom Hamming Jan 06 '15 at 21:52
  • I'm not really into OSx stuff, but is there a tool that will show you the open files in a process? – Hot Licks Jan 06 '15 at 21:56
  • [Yes](http://stackoverflow.com/a/26862998/412107). I see a lot of sqlite-shm file handles open, but none that correspond exactly to the particular files that are locked up. [SQLite docs](https://www.sqlite.org/tempfiles.html) state that -shm files should be removed when wal files are removed, but that's not happening in my case. When I convert to DELETE (`PRAGMA journal_mode=DELETE`) the wal files go away but the shm files stay. – Tom Hamming Jan 06 '15 at 22:15
  • I don't have an answer for this, but I wanted to register a "me too" - I'm seeing something very similar with three threads all stuck in `sqlite3_step` on different databases. One of the threads is in `guarded_close_np`, and the others are in `__psynch_mutexwait`. – craig65535 Jan 07 '15 at 07:27

2 Answers2

2

It sounds like you're getting stuck on the UNIX master mutex, which needs to be acquired before file closing:

/*
** Close a file.
*/
static int unixClose(sqlite3_file *id){  
  int rc = SQLITE_OK;
  unixFile *pFile = (unixFile *)id;
  verifyDbFile(pFile);
  unixUnlock(id, NO_LOCK);
  unixEnterMutex(); <- HERE
...

This mutex is mostly held during low-level file operations. You'll have to find the thread that holds the mutex and see what it's doing. Perhaps it's operating on a slow or broken filesystem.

David Schwartz
  • 179,497
  • 17
  • 214
  • 278
  • I have had some other seemingly file-related issues (Finder locking up) on this machine that I thought I had fixed. I'll run some diagnostic tools. – Tom Hamming Jan 06 '15 at 23:49
  • I debugged the app on a coworker's machine (also a new Mac Pro) and it locked up the same way, except that no threads were in `guarded_close_np`. So it's not an issue specific to my machine. And when it does lock on my machine, the assembly of `guarded_close_np` is locked at `jae
    `, [jumping](http://en.wikibooks.org/wiki/X86_Assembly/Control_Flow#Jump_on_Inequality) to an address whose instruction is `retq`.
    – Tom Hamming Jan 07 '15 at 16:58
  • My guess is this is a bug specific to the version of sqlite that shipped with Yosemite. Unfortunately I can't find the source code. `sqlite --version` says 3.8.5 and `c8ade949d4a2eb3bba4702a4a0e17b405e9b6ace` but that commit does not exist on sqlite.org – craig65535 Jan 07 '15 at 21:00
  • @craig65535 - that's the version I have too. The sqlite warning message I mentioned is not generated by the same code running on iOS under sqlite version 3.7.13. – Tom Hamming Jan 07 '15 at 22:30
1

My app was using sqlite by linking against the dynamic library. I downloaded the latest SQLite source amalgamation from here (3.8.7.4 as of writing) and compiled it right into the app, and everything started working. So perhaps it was a bug in 3.8.5. Apparently compiling the source into the app directly is the recommended way to use sqlite anyway.

I still don't know exactly what caused the issue. The only thing I can think of is that it's something to do with how I'm creating the database files: I'm creating an empty file using NSFileManager createFileAtPath and then passing it to sqlite3_open_v2 with SQLITE_OPEN_CREATE as part of the flags parameter. So it's writing a database into an existing file instead of creating a database file at the specified location.

Tom Hamming
  • 10,577
  • 11
  • 71
  • 145