
On Linux (say, Ubuntu/Debian) I would like to create a virtual block device (let's say /dev/mapper/myvbd) that is backed by a bunch of files in the user's home directory (say /home/myuser/myvbdfiles/file[1...100]).

If it were a single file I could do it trivially using losetup, but what I would like to do is write an application or a kernel module that, while running, creates the virtual block device and maps I/O requests made by the user on that device to arbitrary positions in any of the files on disk (according to an algorithm that I want to write, maybe provided by a library).

I have written a proof of concept using FUSE and Python, but I would like to do it in C. What do you think is the best way to do this? Any hints or resources I can look at?

zambowalla
  • Note that FUSE operates at the filesystem level, while you seem to need to operate at the block-device level, so FUSE is not the appropriate interface for your project. – Ctx Jul 08 '20 at 12:58
  • The way I do it in my PoC is: first I use FUSE to map a bunch of files into a single (virtual) file in a different directory, then I mount that big file with losetup. It works, but it's slow. – zambowalla Jul 09 '20 at 11:20

1 Answer


If the mapping is fixed, i.e. 512-byte sector i is always at sector b_i of file f_i, then you can construct the device with a device-mapper table via dmsetup.

Each line in the table file specifies one mapping. In your case, each line could simply be

    i   1   linear   f_i   b_i
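
The fields are the logical start sector, the length in sectors, the target type, the backing device, and the offset on that device. Note that the linear target maps block devices, not regular files, so each backing file first needs to be attached to a loop device with losetup. As an illustration (the loop-device names and sizes here are made up), a table concatenating two 1 MiB files would be

    0     2048  linear  /dev/loop1  0
    2048  2048  linear  /dev/loop2  0

and dmsetup create myvbd mytable.txt (the table can also come from standard input) would then create /dev/mapper/myvbd from it.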


It might be interesting to create a derivative of dm-switch, say dm-chunkmap, that redefines region_table as a per-chunk mapping (with a configurable 2^n 512-byte sectors per chunk): the low bits of each entry specify the device/file, and the high bits the target chunk on that device.

Using 32 bits per table entry, the maximum device size would be 2 TiB (but 8 MiB of kernel RAM would be needed per gigabyte of device for the mapping); using chunks of eight sectors, 16 TiB (with only a megabyte of kernel RAM needed per gigabyte of block device).
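
To sanity-check those numbers (ignoring the few bits reserved for the device index):

    2^32 entries x 512 B/sector    = 2 TiB maximum device size
    (1 GiB / 512 B) x 4 B/entry    = 8 MiB of table per GiB mapped
    2^32 entries x 4 KiB/chunk     = 16 TiB maximum device size
    (1 GiB / 4 KiB) x 4 B/entry    = 1 MiB of table per GiB mapped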

Something along the lines of

/* Each chunk_table entry has the device index in its low bits,
   and the target chunk number on that device in its upper bits. */
typedef unsigned chunk_t;

/*
 * Context block for a dm chunk-mapping device.
 */
struct chunkmap_ctx {
    struct dm_target *ti;
    unsigned nr_paths;       /* Number of devices in dev_list */
    unsigned dev_shift;      /* nr_paths <= 1 << dev_shift */
    unsigned dev_mask;       /* (1 << dev_shift) - 1 */
    unsigned chunk_shift;    /* 1 << chunk_shift sectors per chunk */
    unsigned chunk_mask;     /* (1 << chunk_shift) - 1 */
    unsigned long nr_chunks; /* Number of chunks making up the device */
    chunk_t *chunk_table;    /* One entry per chunk */
    struct dm_dev *dev_list[]; /* Target devices */
};

static int chunkmap_map(struct dm_target *ti, struct bio *bio)
{
    struct chunkmap_ctx *rctx = ti->private;
    sector_t offset = dm_target_offset(ti, bio->bi_iter.bi_sector);
    chunk_t chunk = READ_ONCE(rctx->chunk_table[offset >> rctx->chunk_shift]);

    /* Redirect the bio to the device and sector the table entry names. */
    bio_set_dev(bio, rctx->dev_list[chunk & rctx->dev_mask]->bdev);
    bio->bi_iter.bi_sector = ((sector_t)(chunk >> rctx->dev_shift) << rctx->chunk_shift)
                           + (offset & rctx->chunk_mask);

    return DM_MAPIO_REMAPPED;
}

static struct target_type chunkmap_target = {
    .name = "chunkmap",
    .version = {1, 1, 0},
    .module = THIS_MODULE,
    .ctr = /* chunkmap_ctr */,
    .dtr = /* chunkmap_dtr */,
    .map = chunkmap_map,
    .message = /* chunkmap_message */,
    .status = /* chunkmap_status */,
    .prepare_ioctl = /* chunkmap_prepare_ioctl */,
    .iterate_devices = /* chunkmap_iterate_devices */,
};

which allows chunks of 2^chunk_shift sectors (2^(9 + chunk_shift) bytes) and up to 2^dev_shift target devices/paths, with the chunks laid out on the target devices completely arbitrarily.
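
The constructor and the other omitted hooks are left as an exercise; registering the target itself is standard boilerplate (the init/exit names below are mine, not from an existing module):

static int __init dm_chunkmap_init(void)
{
    return dm_register_target(&chunkmap_target);
}

static void __exit dm_chunkmap_exit(void)
{
    dm_unregister_target(&chunkmap_target);
}

module_init(dm_chunkmap_init);
module_exit(dm_chunkmap_exit);

MODULE_DESCRIPTION("device-mapper per-chunk remapping target (sketch)");
MODULE_LICENSE("GPL");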

It should be noted that the structure does allow mapping two different chunks to the same target chunk, but this should be treated as an error, since writes through one mapping would corrupt data read through the other. In other words, it is a good idea to require chunk_table entries to be unique.
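
A straightforward way to enforce that in the constructor is to sort a copy of the table and reject adjacent duplicates; a sketch (chunkmap_entries_unique and chunkmap_cmp are my names, not existing kernel helpers):

#include <linux/sort.h>

static int chunkmap_cmp(const void *a, const void *b)
{
    chunk_t x = *(const chunk_t *)a, y = *(const chunk_t *)b;
    return x < y ? -1 : x > y;
}

/* Returns true if no two chunk_table entries are equal. */
static bool chunkmap_entries_unique(const chunk_t *table, unsigned long n)
{
    chunk_t *copy;
    unsigned long i;
    bool unique = true;

    copy = kvmalloc_array(n, sizeof(*copy), GFP_KERNEL);
    if (!copy)
        return false;
    memcpy(copy, table, n * sizeof(*copy));
    sort(copy, n, sizeof(*copy), chunkmap_cmp, NULL);
    for (i = 1; i < n; i++) {
        if (copy[i] == copy[i - 1]) {
            unique = false;
            break;
        }
    }
    kvfree(copy);
    return unique;
}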

If there is a use case for something like this beyond a one-off experiment, I'm pretty sure it could be pushed into the mainline kernel via the dm-devel mailing list.

  • Thanks for the reply. Unfortunately in my case the mapping should be dynamic, depending on some secret key owned by the user. Basically I want to implement an [ORAM](https://en.wikipedia.org/wiki/Oblivious_RAM) and I want to have the nodes of the ORAM mapped to single files. – zambowalla Jul 09 '20 at 12:39
  • @zambowalla: Block devices are accessed in 512-byte units, so a block-device approach is not suitable for that case. What you *can* do, however, is have your file mapper create and memory-map a file in /dev/shm (or any tmpfs) with MAP_NORESERVE, and mlock() it in memory to avoid it being swapped out. Then your program populates it from the files, and you do a loopback mount on that memory-mapped file. Then your program monitors for changes in the memory map and saves them to the proper files. This allows an arbitrary mapping at byte granularity (see the sketch after these comments). – None Jul 09 '20 at 15:55
  • @zambowalla: The downside is that there is currently no easy way to tell which pages have been written to. However, the program can just store the current state to the files periodically, or just when exiting. If the mapping program runs as root, you can have a user agent process running as the user, with a pipe between them. If the user agent process gets killed, the root program can detect that and save the data; the user agent process is then just a "canary" that tells the root process when to tear down the mount. – None Jul 09 '20 at 15:59
  • @zambowalla: If you want, I can describe this with some example code as another answer. – None Jul 09 '20 at 16:00
  • That would be great, thanks, I'm quite confused :) One thing I didn't understand is why the 512-byte limit should be a problem: the way I thought about it, when a user wants to read, say, 1024 bytes, my mapper would split this into two separate 512-byte blocks and then perform two separate read operations on the virtual block device. Do you think this would slow down I/O considerably? – zambowalla Jul 10 '20 at 17:20
  • @zambowalla: No, that's exactly what device-mapper does. Do note that the mapping is fixed only for the duration of the mount, and you set the mapping just before mounting the file system. So, in that sense it is dynamic; you just can't change it while the filesystem on top of it is mounted. (If you did change it, how would the contents stay consistent? The block device does not know which sectors are used and which are not, and changing how a used sector is mapped would replace its contents.) – None Jul 10 '20 at 19:39
  • @zambowalla: In other words, I do suspect that dm-chunkmap would be perfect for your needs. The mapping, once set up, is only stored in-kernel, and kept "secret". The mapping can be different for each mount, so basically the user password/key generates the mapping. Is this dynamic enough, or do you need to change the mapping while the filesystem is being used (mounted) on top of it? If it is dynamic enough, then dm-chunkmap would work; otherwise, use device mapper to concatenate all the files and write your own FUSE file system on top of that. – None Jul 10 '20 at 19:50
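
To make the comments above concrete, here is a minimal userspace sketch of the /dev/shm approach (the path, size, and error handling are illustrative only):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = (size_t)64 << 20;   /* 64 MiB image, made-up size */

    /* Create the backing image on tmpfs so it lives in RAM. */
    int fd = open("/dev/shm/myvbd.img", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0 || ftruncate(fd, size) < 0) {
        perror("open/ftruncate");
        return EXIT_FAILURE;
    }

    /* Map it without swap reservation, then lock it in memory. */
    void *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (map == MAP_FAILED || mlock(map, size) < 0) {
        perror("mmap/mlock");
        return EXIT_FAILURE;
    }

    /*
     * Populate `map` from /home/myuser/myvbdfiles/file1..100 using the
     * secret per-user mapping, then attach the image with losetup and
     * mount it. Before teardown (or periodically), write the modified
     * regions back to the individual files.
     */

    munlock(map, size);
    munmap(map, size);
    close(fd);
    return EXIT_SUCCESS;
}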