How to allocate buffer for C library call

Question

The question is not new and there are two approaches, as far as I can tell:

Use Vec<T>, as suggested here
Manage the heap memory yourself, using std::alloc::alloc, as shown here

My question is whether these are indeed the two (good) alternatives.

Just to make this perfectly clear: Both approaches work. The question is whether there is another, maybe preferred way. The example below is introduced to identify where use of Vec is not good, and where other approaches therefore may be better.

Let's state the problem: Suppose there's a C library that requires some buffer to write into. This could be a compression library, for example. It is easiest to have Rust allocate the heap memory and manage it instead of allocating in C/C++ with malloc/new and then somehow passing ownership to Rust.

Let's go with the compression example. If the library allows incremental (streaming) compression, then I would need a buffer that keeps track of some offset.

Following approach 1 (that is: "abuse" Vec<T>) I would wrap Vec and use len and capacity for my purposes:

/// `Buffer` is basically a Vec
pub struct Buffer<T>(Vec<T>);

impl<T> Buffer<T> {
    /// Create a new, empty buffer with capacity for at least `len` elements
    pub fn new(len: usize) -> Self {
        Buffer(Vec::with_capacity(len))
    }
    /// Return the number of elements written so far
    pub fn len(&self) -> usize {
        self.0.len()
    }
    /// Return total allocated size of `Buffer`, in elements
    pub fn capacity(&self) -> usize {
        self.0.capacity()
    }
    /// Return the number of elements that can still be written
    pub fn remaining(&self) -> usize {
        self.0.capacity() - self.len()
    }
    /// Increment the offset
    pub fn increment(&mut self, by: usize) {
        unsafe { self.0.set_len(self.0.len() + by); }
    }
    /// Returns a raw mutable pointer to the first unused slot of the buffer
    pub fn as_mut_ptr(&mut self) -> *mut T {
        unsafe { self.0.as_mut_ptr().add(self.0.len()) }
    }
    /// Returns ref to `Vec<T>` inside `Buffer`
    pub fn as_vec(&self) -> &Vec<T> {
        &self.0
    }
}

The only interesting functions are increment and as_mut_ptr.

Buffer would be used like this:

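fn main() {
    // allocate buffer for compressed data
    let mut buf: Buffer<u8> = Buffer::new(1024);
    // `compress` (the C function) and `some_input` are elided here
    while buf.remaining() > 0 {
        // perform C function call
        let compressed_len: usize = compress(some_input, buf.as_mut_ptr(), buf.remaining());
        // assumption for this sketch: 0 means there is nothing more to write
        if compressed_len == 0 {
            break;
        }
        // increment
        buf.increment(compressed_len);
    }
    // get Vec inside buf
    let compressed_data = buf.as_vec();
}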

Buffer<T> as shown here is clearly dangerous, for example if any reference type is used. Even T=bool may result in undefined behaviour. But the problems with uninitialised instance of T can be avoided by introducing a trait that limits the possible types T.
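The trait idea could be sketched roughly like this. The name `AnyBitPattern` and the set of implementing types are assumptions made for illustration, not something from an existing library. Such a trait only rules out types with invalid bit patterns (such as `bool` or references); it does not make it acceptable to read slots that the C function never wrote.

```rust
/// Hypothetical marker trait: implementors promise that every bit pattern
/// is a valid value of the type.
///
/// # Safety
/// Only implement this for types without invalid bit patterns or padding.
pub unsafe trait AnyBitPattern: Copy {}

unsafe impl AnyBitPattern for u8 {}
unsafe impl AnyBitPattern for u16 {}
unsafe impl AnyBitPattern for u32 {}
unsafe impl AnyBitPattern for i32 {}
// Deliberately no impl for `bool`, `char`, or reference types.

/// Same wrapper as in the question, but `T` is restricted by the trait.
pub struct Buffer<T: AnyBitPattern>(Vec<T>);

impl<T: AnyBitPattern> Buffer<T> {
    /// Create an empty buffer with room for at least `capacity` elements.
    pub fn new(capacity: usize) -> Self {
        Buffer(Vec::with_capacity(capacity))
    }

    /// Number of elements that can still be written.
    pub fn remaining(&self) -> usize {
        self.0.capacity() - self.0.len()
    }

    /// The elements written so far.
    pub fn as_slice(&self) -> &[T] {
        &self.0
    }
}
```

With this bound, `Buffer::<u8>::new(16)` compiles, while `Buffer::<bool>::new(16)` is rejected at compile time because `bool` does not implement the trait.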

Also, if alignment matters, then Buffer<T> is not a good idea.

But otherwise, is such a Buffer<T> really the best way to do this?

There doesn't seem to be an out-of-the-box solution. The bytes crate comes close: it offers a "container for storing and operating on contiguous slices of memory", but its interface is not flexible enough.

@ChayimFriedman Oh yes, indeed. The rust-lang docs say "Notice that the rules around uninitialized integers are not finalized yet, but until they are, it is advisable to avoid them." — mcmayer, Oct 02 '22 at 15:09
IIRC they were just recently finalized to be UB. But I'm not sure the FCP is done yet. — Chayim Friedman, Oct 02 '22 at 15:11
Is there a reason why you don't initialize the entire vector, to avoid the uninitialized memory problem, and store the amount of data that is written as an offset value? And then you can return a slice to the filled part of the buffer. Or do you want to avoid the overhead of writing zeros in the rest of the buffer? — Finomnis, Oct 02 '22 at 15:28
@Finomnis I just find it wasteful. In C I wouldn't have to initialize. In a way, the data is initialized by calling compress. — mcmayer, Oct 02 '22 at 15:30
It sounds like what you are doing is sound. See [this](https://stackoverflow.com/questions/30979334/safety-of-set-len-operation-on-vec-with-predefined-capacity), [this](https://stackoverflow.com/questions/70085309/how-can-i-fill-an-uninitialized-rust-vector-using-a-c-function) and [this](https://github.com/rust-lang/rust-clippy/issues/4483). — Finomnis, Oct 02 '22 at 15:33
Does this answer your question? [How can I fill an uninitialized Rust vector using a C function?](https://stackoverflow.com/questions/70085309/how-can-i-fill-an-uninitialized-rust-vector-using-a-c-function) — Finomnis, Oct 02 '22 at 15:35
@mcmayer Don't worry about waste, memset is *very* fast compared to your compression algorithm. — Finomnis, Oct 02 '22 at 15:36
Quick note, `increment` is unsound as written, either add a bounds check or mark it unsafe. — Aiden4, Oct 02 '22 at 15:37
@Finomnis This is not even memset. Rust requests zeroed memory from the OS, and if you immediately write into it, it never actually gets zeroed. — Chayim Friedman, Oct 02 '22 at 15:43
@ChayimFriedman So the MMU zeroes it on read? Or how do I interpret what you are saying? *EDIT*: probably off-topic — Finomnis, Oct 02 '22 at 15:44
@Finomnis I _think_ the OS defines the page to be write-only, and if it is being read there is a page trap and the OS catches it and zeroes the page. But I'm not sure. — Chayim Friedman, Oct 02 '22 at 16:00
Silly question: do you actually need a heap-allocated buffer? Maybe you simply need a `MaybeUninit<[u8; 1024]>`? — user4815162342, Oct 02 '22 at 16:26
@user4815162342 a `[MaybeUninit; 1024]` would probably be better. But yes, if you are using a smallish buffer with a size known at compile time that would be the way to go. — Aiden4, Oct 02 '22 at 16:43
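The fixed-size suggestion from the last two comments could look roughly like the sketch below. The `compress` function here is only a stand-in for the real C call (it writes at most five bytes), and `compress_on_stack` is a hypothetical helper name.

```rust
use std::mem::MaybeUninit;

/// Stand-in for the C function: writes up to 5 bytes and returns the count.
unsafe fn compress(_input: i32, ptr: *mut u8, len: usize) -> usize {
    let written = usize::min(5, len);
    for i in 0..written {
        // SAFETY: the caller guarantees `ptr` is valid for `len` writes.
        unsafe { ptr.add(i).write(i as u8) };
    }
    written
}

/// Hypothetical helper: compress into a stack buffer, then copy out the
/// initialized prefix.
fn compress_on_stack(input: i32) -> Vec<u8> {
    // An array of uninitialized bytes; no heap allocation, no zeroing.
    let mut buf = [MaybeUninit::<u8>::uninit(); 1024];

    // SAFETY: the pointer is valid for `buf.len()` byte writes.
    let written = unsafe { compress(input, buf.as_mut_ptr().cast::<u8>(), buf.len()) };
    assert!(written <= buf.len(), "compress reported more than the buffer holds");

    // SAFETY: `compress` initialized the first `written` bytes.
    let init: &[u8] = unsafe { std::slice::from_raw_parts(buf.as_ptr().cast::<u8>(), written) };
    init.to_vec()
}
```

This only fits when the buffer is small and its size is known at compile time. The final `to_vec()` is there just to hand the result out of the function; code that consumes the slice in place would not need it.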

kmdreko · Accepted Answer · 2022-10-02T15:55:13.393

You absolutely can write into a Vec's spare capacity manually. That is why .set_len() is available. However, compress() must know that the given pointer points to uninitialized memory and thus is not allowed to read from it (unless it has written to it first), and you must guarantee that the returned length is the number of bytes initialized. I think these rules are roughly the same in Rust, C, and C++.

Writing this in Rust would look like this:

pub struct Buffer<T>(Vec<T>);

impl<T> Buffer<T> {
    pub fn new(len: usize) -> Self {
        Buffer(Vec::with_capacity(len))
    }

    /// SAFETY: `by` must be less than or equal to `space_len()` and the bytes at
    /// `space_ptr_mut()` to `space_ptr_mut() + by` must be initialized
    pub unsafe fn increment(&mut self, by: usize) {
        self.0.set_len(self.0.len() + by);
    }

    pub fn space_len(&self) -> usize {
        self.0.capacity() - self.0.len()
    }

    pub fn space_ptr_mut(&mut self) -> *mut T {
        unsafe { self.0.as_mut_ptr().add(self.0.len()) }
    }

    pub fn as_vec(&self) -> &Vec<T> {
        &self.0
    }
}

unsafe fn compress(_input: i32, ptr: *mut u8, len: usize) -> usize {
    // right now just writes 5 bytes if there's space for them

    let written = usize::min(5, len);
    for i in 0..written {
        ptr.add(i).write(0);
    }
    written
}

fn main() {
    let mut buf: Buffer<u8> = Buffer::new(1024);
    let some_input = 5i32;

    unsafe {
        let compressed_len: usize = compress(some_input, buf.space_ptr_mut(), buf.space_len());
        buf.increment(compressed_len);
    }

    let compressed_data = buf.as_vec();
    println!("{:?}", compressed_data);
}

You can see it on the playground. If you run it through Miri, you'll see it picks up no undefined behavior, but if you over-advertise how much you've written (say return written + 10) then it does produce an error that reading uninitialized memory was detected.
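A cheap guard against an over-advertised length is to check the returned count against the spare capacity before calling `set_len`. The helper below is a hypothetical sketch (`fill_spare` is not a standard API). It catches a count that exceeds the available space, but it cannot detect a count that is in range yet larger than what was really written, so it stays `unsafe`.

```rust
/// Hypothetical helper: let `fill` write into the spare capacity of `buf`,
/// then extend the length by the number of bytes it reports.
///
/// # Safety
/// `fill` must initialize the first `n` bytes at the pointer it receives,
/// where `n` is the value it returns.
unsafe fn fill_spare(buf: &mut Vec<u8>, fill: impl FnOnce(*mut u8, usize) -> usize) -> usize {
    let spare = buf.spare_capacity_mut();
    let spare_len = spare.len();
    let written = fill(spare.as_mut_ptr().cast::<u8>(), spare_len);

    // Panics instead of corrupting the Vec if the callee over-reports.
    assert!(written <= spare_len, "callee reported more bytes than fit");

    // SAFETY: `written <= spare_len`, and the caller promises that those
    // bytes were initialized by `fill`.
    unsafe { buf.set_len(buf.len() + written) };
    written
}
```

The closure would typically wrap the FFI call, for example `|ptr, len| unsafe { compress(some_input, ptr, len) }`.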

One of the reasons there isn't an out-of-the-box type for this is that Vec is that type:

fn main() {
    let mut buf: Vec<u8> = Vec::with_capacity(1024);
    let some_input = 5i32;

    let spare_capacity = buf.spare_capacity_mut();
    unsafe {
        let compressed_len: usize = compress(
            some_input,
            spare_capacity.as_mut_ptr().cast(),
            spare_capacity.len(),
        );

        buf.set_len(buf.len() + compressed_len);
    }

    println!("{:?}", buf);
}

Your Buffer type doesn't really add any convenience or safety and a third-party crate can't do so because it relies on the correctness of compress().

Is such a Buffer really the best way to do this?

Yes, this is pretty much the lowest-cost way to provide a buffer for writing. Looking at the generated release assembly, it is just one call to allocate and that's it. You can get tricky by using a special allocator, or simply pre-allocate and reuse allocations if you're doing this many times (but be sure to measure, since the built-in allocator will do this anyway, just more generally).
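Reusing one allocation across many calls might look like the following sketch. Again, `compress` is only a stand-in for the C function, and `compress_chunks` is a hypothetical name; whether this beats allocating a fresh Vec each time is something to measure.

```rust
/// Stand-in for the C function: writes up to 5 bytes and returns the count.
unsafe fn compress(_input: i32, ptr: *mut u8, len: usize) -> usize {
    let written = usize::min(5, len);
    for i in 0..written {
        // SAFETY: the caller guarantees `ptr` is valid for `len` writes.
        unsafe { ptr.add(i).write(0) };
    }
    written
}

/// Compress each input into the same buffer and record the output sizes.
fn compress_chunks(inputs: &[i32]) -> Vec<usize> {
    // Allocated once, outside the loop.
    let mut buf: Vec<u8> = Vec::with_capacity(1024);
    let mut sizes = Vec::with_capacity(inputs.len());

    for &input in inputs {
        // `clear` resets the length to 0 but keeps the allocation.
        buf.clear();
        let spare = buf.spare_capacity_mut();
        // SAFETY: the pointer is valid for `spare.len()` byte writes.
        let written = unsafe { compress(input, spare.as_mut_ptr().cast::<u8>(), spare.len()) };
        assert!(written <= buf.capacity());
        // SAFETY: `compress` initialized the first `written` bytes.
        unsafe { buf.set_len(written) };

        // `buf` now holds this chunk's output; here only its size is kept.
        sizes.push(buf.len());
    }
    sizes
}
```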

Nit: Might need to be `compressed_len * sizeof(element)`, depending on what `compress` thinks the array type is. — Finomnis, Oct 02 '22 at 15:41
By the way, I know I can do this with Vec. My question is whether this is good, or whether there is a better way. — mcmayer, Oct 02 '22 at 15:45
