Virtual SCSI disk device driver on OS X causes panics on when device formatted using HFS+ only

Question

I’m currently developing a virtual SCSI device driver using IOKit. My driver loads fine and I’m able to format my device using ExFat, FAT32 or HFS+.

I am also able to copy files to and from my virtual device using a CLI with any of these file systems. Using either ExFat or FAT32, I can also copy or view files from my virtual device using Finder, but not when the drive is formatted HFS+ (Journaled).

When using HFS+, I get kernel panics associated with either Finder, Spotlight or Preview. This occurs, for example, when the system tries to preview the file after it has been copied to the disk. Here's a sample log from a kernel panic:

Mon Jun 30 22:47:51 2014
panic(cpu 4 caller 0xffffff8005dd1401): "cluster_read_copy: failed to get pagelist"@/SourceCache/xnu/xnu-2422.100.13/bsd/vfs/vfs_cluster.c:3732
Backtrace (CPU 4), Frame : Return Address
0xffffff80e9aa3810 : 0xffffff8005c22fa9
0xffffff80e9aa3890 : 0xffffff8005dd1401
0xffffff80e9aa3a30 : 0xffffff8005dd0a85
0xffffff80e9aa3cf0 : 0xffffff8005f6b876
0xffffff80e9aa3d80 : 0xffffff8005dfdd21
0xffffff80e9aa3e00 : 0xffffff8005df38d5
0xffffff80e9aa3e50 : 0xffffff8005ff1cfe
0xffffff80e9aa3ef0 : 0xffffff8005ff1e79
0xffffff80e9aa3f50 : 0xffffff8006040653
0xffffff80e9aa3fb0 : 0xffffff8005cf3c56

BSD process name corresponding to current thread: Finder
Boot args: debug=0x144

Mac OS version:
13D65

Kernel version:
Darwin Kernel Version 13.2.0: Thu Apr 17 23:03:13 PDT 2014; root:xnu-2422.100.13~1/RELEASE_X86_64

Here's the stack trace for the above panic log:

mach_kernel`panic + 201 at debug.c:353
mach_kernel`cluster_read_copy + 2001 at vfs_cluster.c:3732
mach_kernel`cluster_read_ext + 5045 at vfs_cluster.c:3441
mach_kernel`hfs_vnop_read + 566 at hfs_readwrite.c
mach_kernel`VNOP_READ + 225 at kpi_vfs.c:3247
mach_kernel`vn_read + 245 at vfs_vnops.c:939
mach_kernel`dofileread + 174 at sys_generic.c:377
mach_kernel`pread_nocancel + 137 at sys_generic.c:266
mach_kernel`unix_syscall64 + 499 at systemcalls.c:370
mach_kernel`hndl_unix_scall64 + 22

I've had difficulty determining the problem but as implied above have narrowed down to HFS+. Is there something special about HFS+ that could cause this?

Since the problem has to do with the VFS and paging(?) I’m beginning to wonder whether I’m handling my IOMemoryDescriptors correctly (I believe I am, based on what I've read in Apple docs). Anyone experience similar problems? Is this typically indicative of a particular type of problem?

UPDATE: I've verified that disabling journaling in HFS+ does not have an effect on the above results.

UPDATE2: The panic logs are consistent each time. The panic occurs only when ALL of the following conditions have been met:

HFS+ is used
File is copied to the device (for a single file > a few MB)
File is copied using Finder (but not terminal, for example)

The kernel consistently panics right as the first file finishes copying to the device. Many small files equal to or larger than several MB can be copied without causing a panic. I can even copy files large files (> 30 MB) to the volume using terminal and read / open them in Finder.

Surely there's some way to get symbols for your stack trace, no? — Gabe, Jul 08 '14 at 01:41
@Gabe I'm not aware of debug symbols for Finder, so I'm not sure where that will get me (I also don't at the moment have a two-machine setup which is required for live debugging or core dump analysis). However, the panic text pinpoints the kernel source that's inducing the panic (vfs_cluster.c:3732) which I've looked at to no avail. [ — user3814483, Jul 08 '14 at 23:43
The addresses starting with `0xf` tell me that's a kernel stack trace, so Finder isn't represented in it. You need kernel symbols, which I'd expect to be available in a "kernel debug kit". If you don't have two machines to debug with, can you install your driver in a VM and debug the VM guest from the host? — Gabe, Jul 09 '14 at 00:29
@Gabe good call. I didn't realize you could resolve the addresses without live debugging. Kernel slide address is key. — user3814483, Jul 09 '14 at 01:14
Another key thing to mention: is the stack trace consistent between crashes, or does it vary somewhat? — Gabe, Jul 09 '14 at 22:46
@Gabe the stack trace is consistent. The panic occurs consistently, not randomly. That is, it always occurs for HFS+ only, and only when Finder is used (using terminal to copy the same files doesn't cause a panic, which is bizarre), and ONLY occurs for large files (hard to put a threshold but somewhere around 10 MB) and consistently happens as the firs file finishes copying. Copying many small files of the same total size is not an issue. — user3814483, Jul 09 '14 at 23:47

score 1 · Answer 1 · answered Jul 09 '14 at 08:58

The kernel source suggests that the panic is due to a failing call to ubc_create_upl(). The comments in the source for that function in turn suggest that the return modes are as follows:

 * Returns: KERN_SUCCESS        The requested upl has been created
 *      KERN_INVALID_ARGUMENT   The bufsize argument is not an even
 *                  multiple of the page size
 *      KERN_INVALID_ARGUMENT   There is no ubc_info associated with
 *                  the vnode, or there is no memory object
 *                  control associated with the ubc_info 
 *  memory_object_upl_request:KERN_INVALID_VALUE
 *                  The supplied upl_flags argument is
 *                  invalid

From the disassembly and your panic log's register contents (which you haven't provided in the question) it might be possible to determine the actual return code. Not that this seems like it will help much, as your code will not have been invoked yet on this particular syscall; I strongly suspect you're causing some kernel resource to be depleted, and this just happens to be the point where a pool of some kind runs dry completely. Quite possibly memory descriptors as you say, as the UPL created here will be wrapped in a memory descriptor and then passed to the mount's block device's IOKit back-end.

You could check ioclasscount and maybe zprint before you load your driver and then periodically dump their output again until the crash, and then compare outputs for anything that's running away. Just as good an idea will be to audit your code for matching prepare()/complete() and retain()/release() calls, particularly on the request memory descriptors.

2-machine debugging may help too; for one you can check what file the vnode used for the VNOP_READ corresponds to. If any crashes occur on a file that isn't even on your device's volume, then you've almost certainly got some resource depletion or memory corruption. You'll be able to track down the exact reason the ubc_create_upl() call is failing. And you can also instrument your code to collect data you can then inspect in the debugger.

Thanks, I'm working on getting a two-machine debugger set up. Using an OS X VM on OS X is unusably slow on my machine with VirtualBox, for some reason. Note: I'm subclassing IOSCSIParallelInterfaceController, which automatically does a prepare() and a complete() for me. It may very well be a resource exhaustion issue, but my concern there is why would my design/implementation bug only affect HFS+, and only when a file is copied using Finder? (HFS+ and using terminal to copy the same files). I've updated the original post above with these details. — user3814483, Jul 09 '14 at 23:51
The involvement of Finder and HFS+ makes me suspect a concurrency issue. Finder does extra work on HFS volumes to manage metadata that isn't available on FAT. Could your driver be holding a resource that blocks a concurrent read attempt? — Variable Length Coder, Jul 10 '14 at 01:26
@VariableLengthCoder Interesting. As SCSI tasks come in, I queue them. Only one task is processed at a time. So if there's a second read operation (concurrent) it would be queued and processed after the first operation is completed. The driver stack is notified that the request is in progress when this occurs using standard methods built into IOKit's SCSI family. There are mutex locks on the queue and on read and write operations. I'll pour over my code to see if theres any possibility of deadlock for corner cases. — user3814483, Jul 10 '14 at 01:45
Yeah, I'm thinking the HFS/Finder connection is purely that the access pattern exposes a particular flaw in your code. On the plus side, it's nice to have a reliable repro! — pmdj, Jul 11 '14 at 10:06

Virtual SCSI disk device driver on OS X causes panics on when device formatted using HFS+ only

1 Answers1