
Consider, for the sake of simplicity, that I want to implement an indexable vector v with n consecutive elements 0,1,...,n-1, i.e. v[i] = i. This vector is supposed to be filled on demand, that is, if v[i] is used and the vector currently contains n < i+1 elements, the values n,n+1,...,i are first pushed onto v, and then the reference to v[i] is returned.

The code below works fine.

struct LazyVector {
    data: Vec<usize>
}

impl LazyVector {
    fn new() -> LazyVector {
        LazyVector{
            data: vec![] 
        }
    }
    fn get(&mut self, i:usize) -> &usize {
        for x in self.data.len()..=i {
            self.data.push(x);
        }
        &self.data[i]
    }
}


pub fn main() {
    let mut v = LazyVector::new();
    println!("v[5]={}",v.get(5)); // prints v[5]=5
}

However, the code above is just a mock-up of the actual structure I'm trying to implement. In addition to that, (1) I'd like to be able to use the index operator and, (2) although the vector may actually be modified when accessing a position, I'd like that to be transparent to the user, that is, I'd like to be able to index any position even if I had an immutable reference to v. Immutable references are preferred to prevent other unwanted modifications.

Requirement (1) could be achieved by implementing the Index trait, like so

impl std::ops::Index<usize> for LazyVector {
    type Output = usize;
    fn index(&self, i: usize) -> &Self::Output {
        self.get(i)
    }
}

However, this does not compile since we need a mutable reference in order to be able to call LazyVector::get. Because of requirement (2) we do not want to make this reference mutable, and even if we did, we couldn't do that since it would violate the interface of the Index trait. I figured that this would make the case for the interior mutability pattern through the RefCell smart pointer (as in Chapter 15 of The Rust Book). So I came up with something like

struct LazyVector {
    data: std::cell::RefCell<Vec<usize>>
}

impl LazyVector {
    fn new() -> LazyVector {
        LazyVector{
            data: std::cell::RefCell::new(vec![]) 
        }
    }

    fn get(&self, i:usize) -> &usize {
        let mut mutref = self.data.borrow_mut();
        for x in mutref.len()..=i {
            mutref.push(x)
        }
        &self.data.borrow()[i] // error: cannot return value referencing a temporary value
    }
}

However this doesn't work because it tries to return a value referencing the Ref struct returned by borrow() that goes out of scope at the end of LazyVector::get. Finally, to circumvent that, I did something like

struct LazyVector {
    data: std::cell::RefCell<Vec<usize>>
}


impl LazyVector {
    fn new() -> LazyVector {
        LazyVector{
            data: std::cell::RefCell::new(vec![]) 
        }
    }

    fn get(&self, i:usize) -> &usize {
        let mut mutref = self.data.borrow_mut();
        for x in mutref.len()..=i {
            mutref.push(x)
        }
        unsafe { // Argh!
            let ptr = self.data.as_ptr();
            &std::ops::Deref::deref(&*ptr)[i]
        }
    }
}


impl std::ops::Index<usize> for LazyVector {
    type Output = usize;
    fn index(&self, i: usize) -> &Self::Output {
        self.get(i)
    }
}

pub fn main() {
    let v = LazyVector::new();    // Immutable!
    println!("v[5]={}",v.get(5)); // prints v[5]=5
}

Now it works as required but, as a newbie, I am not so sure about the unsafe block! I think I am effectively wrapping it with a safe interface, but I'm not sure. So my question is whether that is OK or if there is a better, totally safe way to achieve that.

Thanks for any help.

Paulo
  • Since you are returning a reference to `usize`, if your code worked as-is, it would extend the vector and reallocate the memory in the vector while the reference to the `usize` exists, which would lead to an invalid memory access. If you want to do this, you'd need to return a `usize` instead of a reference, which means you can't use the `Index` trait. – loganfsmyth Sep 10 '19 at 03:16
  • The unsafe block is not sound. Adding to a vector could cause it to reallocate, so the reference could end up as a dangling pointer. This is one of the things that Rust protects you from when methods that mutate take `&mut self`. – Peter Hall Sep 10 '19 at 09:26
  • Whatever you do here, it's going to get very complicated. This should be a hint that you are trying to do something strange, and you should rethink why you even need this. – Peter Hall Sep 10 '19 at 09:48
  • Oh man! Duh! So obvious now that you point it out. I was so focused on the way this is supposed to be used in the real scenario that I missed this obvious problem. (See comments to next answer) – Paulo Sep 10 '19 at 14:07
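For reference, the by-value variant suggested in the first comment can be sketched as follows. This is only a minimal sketch of that suggestion, not a replacement for the `Index` trait: `get` returns a copy of the element, so no reference into the vector survives a later reallocation and no `unsafe` is needed.

```rust
use std::cell::RefCell;

struct LazyVector {
    data: RefCell<Vec<usize>>,
}

impl LazyVector {
    fn new() -> LazyVector {
        LazyVector { data: RefCell::new(Vec::new()) }
    }

    // Returns a copy of the element instead of a reference, so nothing
    // points into the Vec's buffer once the RefCell borrow ends.
    fn get(&self, i: usize) -> usize {
        let mut data = self.data.borrow_mut();
        for x in data.len()..=i {
            data.push(x);
        }
        data[i]
    }
}

fn main() {
    let v = LazyVector::new(); // no `mut` needed
    println!("v[5]={}", v.get(5)); // prints v[5]=5
    println!("len={}", v.data.borrow().len()); // prints len=6
}
```

The price is that `v[i]` syntax is not available, because `Index::index` must return a reference.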

1 Answer

EDIT: Since you provided more information on your goal (lazy access to chunks of a huge file that lies on disk), I have updated my answer.

You can use (as you tried) cells. I quote the doc:

Since cell types enable mutation where it would otherwise be disallowed though, there are occasions when interior mutability might be appropriate, or even must be used, e.g. [...] Implementation details of logically-immutable methods. [...]

Here's a piece of code that does the job (note that's very close to what you wrote):

use std::cell::RefCell;
use std::ops::Index;

// This is your file
const DATA: &str = "Rust. A language empowering everyone to build reliable and efficient software.";

#[derive(Debug)]
struct LazyVector<'a, 'b> {
    ref_v: RefCell<&'a mut Vec<&'b str>>
}

impl<'a, 'b> LazyVector<'a, 'b> {
    fn new(v: &'a mut Vec<&'b str>) -> LazyVector<'a, 'b> {
        LazyVector {
            ref_v: RefCell::new(v)
        }
    }

    /// get or load a chunk of two letters
    fn get_or_load(&self, i: usize) -> &'b str {
        let mut v = self.ref_v.borrow_mut();
        for k in v.len()..=i {
            v.push(&DATA[k * 2..k * 2 + 2]);
        }
        v[i]
    }
}

impl<'a, 'b> Index<usize> for LazyVector<'a, 'b> {
    type Output = str;
    fn index(&self, i: usize) -> &Self::Output {
        self.get_or_load(i)
    }
}

pub fn main() {
    let mut v = vec![];
    let lv = LazyVector::new(&mut v);
    println!("v[5]={}", &lv[5]); // v[5]=ng
    println!("{:?}", lv); // LazyVector { ref_v: RefCell { value: ["Ru", "st", ". ", "A ", "la", "ng"] } }
    println!("v[10]={}", &lv[10]); // v[10]=ow
    println!("{:?}", lv); // LazyVector { ref_v: RefCell { value: ["Ru", "st", ". ", "A ", "la", "ng", "ua", "ge", " e", "mp", "ow"] } }
}

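The main difference from your attempt is that the underlying Vec is an external mutable vector, and LazyVector only holds a (mutable) reference to it. Note also that the elements are `&'b str` values that point into DATA, not into the vector's buffer: get_or_load copies such a reference out of the vector, so the returned value stays valid even if the vector reallocates later. A RwLock should be the way to handle concurrent access.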
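To make the RwLock remark a bit more concrete, here is a rough sketch under my own assumptions: the type `SharedLazyVector` is hypothetical, it owns its vector instead of borrowing it, and it hands out `Arc<str>` chunks by value, so it does not implement `Index`. Readers that find the chunk already loaded only take the shared read lock.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Stand-in for the file on disk.
const DATA: &str = "Rust. A language empowering everyone to build reliable and efficient software.";

// Hypothetical thread-safe variant: chunks are reference-counted.
struct SharedLazyVector {
    chunks: RwLock<Vec<Arc<str>>>,
}

impl SharedLazyVector {
    fn new() -> Self {
        SharedLazyVector { chunks: RwLock::new(Vec::new()) }
    }

    /// get or load a chunk of two letters
    fn get_or_load(&self, i: usize) -> Arc<str> {
        // Fast path: the chunk is already loaded, a read lock is enough.
        {
            let chunks = self.chunks.read().unwrap();
            if let Some(chunk) = chunks.get(i) {
                return Arc::clone(chunk);
            }
        }
        // Slow path: take the write lock and load the missing chunks.
        // Another thread may have extended the vector in the meantime,
        // in which case the range below is simply empty.
        let mut chunks = self.chunks.write().unwrap();
        for k in chunks.len()..=i {
            chunks.push(Arc::from(&DATA[k * 2..k * 2 + 2]));
        }
        Arc::clone(&chunks[i])
    }
}

fn main() {
    let lv = Arc::new(SharedLazyVector::new());
    let handles: Vec<_> = (0..4)
        .map(|t| {
            let lv = Arc::clone(&lv);
            thread::spawn(move || lv.get_or_load(5 + t).to_string())
        })
        .collect();
    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```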

However, I wouldn't recommend that solution:

First, your underlying Vec will rapidly grow and become as huge as the file on disk. Hence, you'll need a map instead of a vector, and you'll have to keep the number of chunks in that map under a given bound. If you ask for a chunk that is not in memory, you'll have to choose a chunk to remove. That's simply paging, and the OS is generally better at this game than you (see page replacement algorithm). As I wrote in a comment, memory-mapped files (and maybe shared memory in case of "heavy" processes) would be more efficient: the OS handles the lazy loading of the file and the sharing of the read-only data. R. Sedgewick's remark in Algorithms in C, first edition, chapter 13, section "An Easier Way", explains why sorting a huge file (bigger than memory) may be easier than one might think:

In a good virtual-memory system, the programmer can address a very large amount of data, leaving to the system the responsibility of making sure that the addressed data is transferred from external to internal storage when needed.
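To illustrate what the bounded map of chunks mentioned above would involve, here is a minimal sketch. Everything in it is an assumption made for illustration: the `ChunkCache` name, the two-byte chunks, and the first-in-first-out eviction, which is only the simplest possible replacement policy and not a recommendation.

```rust
use std::collections::{HashMap, VecDeque};

// Stand-in for the file on disk.
const DATA: &str = "Rust. A language empowering everyone to build reliable and efficient software.";
const CHUNK_LEN: usize = 2;

// Hypothetical cache holding at most `capacity` chunks in memory.
struct ChunkCache {
    capacity: usize,
    chunks: HashMap<usize, String>,
    order: VecDeque<usize>, // chunk indices in load order, oldest first
}

impl ChunkCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "the cache must hold at least one chunk");
        ChunkCache { capacity, chunks: HashMap::new(), order: VecDeque::new() }
    }

    /// get or load chunk `i`, evicting the oldest chunk when the cache is full
    fn get_or_load(&mut self, i: usize) -> &str {
        if !self.chunks.contains_key(&i) {
            if self.chunks.len() == self.capacity {
                if let Some(oldest) = self.order.pop_front() {
                    self.chunks.remove(&oldest);
                }
            }
            // Simulated disk read: copy the chunk out of DATA.
            let chunk = DATA[i * CHUNK_LEN..(i + 1) * CHUNK_LEN].to_string();
            self.chunks.insert(i, chunk);
            self.order.push_back(i);
        }
        &self.chunks[&i]
    }
}

fn main() {
    let mut cache = ChunkCache::new(2);
    println!("{}", cache.get_or_load(0)); // Ru
    println!("{}", cache.get_or_load(1)); // st
    println!("{}", cache.get_or_load(5)); // ng (chunk 0 is evicted)
    println!("chunk 0 in memory: {}", cache.chunks.contains_key(&0)); // false
}
```

Note that eviction forces `&mut self` here: a chunk handed out through `&self` could be removed while the caller still holds a reference to it.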

Second, see my previous answer below.

PREVIOUS ANSWER

I coded this kind of vector once... in Java. The use case was to represent a very sparse grid (many of the rows were only a few cells wide, but the grid was supposed to have a width of 1024). To avoid having to add cells manually when needed, I created a "list" that did roughly what you are trying to achieve (but there was only one default value).

At first, I made my list implement the List interface, but I quickly realized that I had to write a lot of useless (and slow) code so as not to break the Liskov substitution principle. Worse, the behavior of some methods was misleading compared with the usual lists (ArrayList, LinkedList, ...).

It seems you are in the same situation: you would like your LazyVector to look like a usual Vec, and that's why you want to implement the Index and maybe IndexMut traits. But you are looking for workarounds to achieve this (e.g. unsafe code to match the trait method signatures).

My advice is: do not try to make LazyVector look like a usual vector, but make it clear that the LazyVector is not a usual vector. This is the Principle of least astonishment. E.g. replace get (which a user in good faith expects only to read the data) by get_or_extend, which makes clear that either you get something or you create it. If you add a get_or_extend_mut function, you have something that is not very attractive but efficient and predictable:

impl LazyVector {
    fn new() -> LazyVector { ... }

    fn get_or_extend(&mut self, i: usize) -> &usize { ... }

    fn get_or_extend_mut(&mut self, i: usize) -> &mut usize { ... }
}
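For completeness, the elided bodies could be filled in along these lines for the `v[i] = i` mock-up from the question. This is only a sketch: `extend_to` is a hypothetical private helper that I introduce here to share the extension logic between the two accessors.

```rust
struct LazyVector {
    data: Vec<usize>,
}

impl LazyVector {
    fn new() -> LazyVector {
        LazyVector { data: Vec::new() }
    }

    // Hypothetical helper: make sure positions 0..=i exist, with v[k] = k.
    fn extend_to(&mut self, i: usize) {
        let len = self.data.len();
        self.data.extend(len..=i); // empty range if i is already present
    }

    fn get_or_extend(&mut self, i: usize) -> &usize {
        self.extend_to(i);
        &self.data[i]
    }

    fn get_or_extend_mut(&mut self, i: usize) -> &mut usize {
        self.extend_to(i);
        &mut self.data[i]
    }
}

fn main() {
    let mut v = LazyVector::new();
    println!("v[5]={}", v.get_or_extend(5)); // prints v[5]=5
    *v.get_or_extend_mut(7) = 42;
    println!("{:?}", v.data); // prints [0, 1, 2, 3, 4, 5, 6, 42]
}
```

Because both methods take `&mut self`, the borrow checker guarantees that no earlier reference into the vector is still alive when it grows.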
jferard
  • Thanks all for the comments. Very sensible advice indeed. However, this is actually me trying to explain the problem in simpler terms. In reality this should be a proxy to large text data that sit on disk and don't need to be loaded in memory at once. I only need read-only access to a portion of it at any given time and I wasn't thinking about concurrent access (yet). Contrary to my simplification, in general I want references to slices of the text, not individual letters, and that's why I didn't want to return an owned copy of the elements as (rightfully) suggested in the first comment. – Paulo Sep 10 '19 at 14:12
  • And very importantly, the whole point of my endeavour was to be able to use the Index trait and the syntactic sugar that comes with it. This would simplify things enormously and make the code sooo much nicer to read :( I begin to fear that this won't be possible... Cheers! – Paulo Sep 10 '19 at 14:13
  • @Paulo So you want read access to slices of a big text by index? 1. Have you thought of memory-mapped files (see https://en.wikipedia.org/wiki/Memory-mapped_file and https://stackoverflow.com/questions/28516996/how-to-create-and-write-to-memory-mapped-files)? The OS would take care of the lazy loading for you. 2. Can you elaborate on "simplify things enormously"? Because it does not seem to me that the difference is so important (`[i]` vs `get_or_load(i)`). – jferard Sep 10 '19 at 14:57
  • @jferad OK, "enormously" might be exaggerated but I mean being able to use the same code on any "indexable" text with subscript syntax would be great. I come from C where this would be like lazy_vec_get_or_load(v, i), so using simply v[i] instead would be neat. I think your suggested get_or_load is basically the get in my first version, which is OK if there is no better way. As for the memory mapped file, I'd seen it, but it seems it doesn't implement the Index trait which, as I said, was very much the point. So I thought I could craft my own thing but apparently I was wrong. Thanks again! – Paulo Sep 10 '19 at 18:17
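One safe way to keep the `v[i]` syntax, under an extra assumption that the thread does not state: the number of chunks is known when the structure is created (e.g. from the file size). The sketch below pre-allocates one `std::cell::OnceCell` slot per chunk (stable since Rust 1.70), so the outer vector never grows and references to already loaded chunks stay valid. The names `LazyChunks` and `loads` are hypothetical, and unlike the code in the question only chunk `i` is loaded, not every chunk up to `i`.

```rust
use std::cell::{Cell, OnceCell};
use std::ops::Index;

// Stand-in for the file on disk.
const DATA: &str = "Rust. A language empowering everyone to build reliable and efficient software.";
const CHUNK_LEN: usize = 2;

// Hypothetical structure: one slot per chunk, filled on first access.
struct LazyChunks {
    slots: Vec<OnceCell<String>>, // fixed length, never resized after new()
    loads: Cell<usize>,           // number of simulated disk reads so far
}

impl LazyChunks {
    fn new() -> Self {
        let n = DATA.len() / CHUNK_LEN;
        LazyChunks {
            slots: (0..n).map(|_| OnceCell::new()).collect(),
            loads: Cell::new(0),
        }
    }
}

impl Index<usize> for LazyChunks {
    type Output = str;

    // Panics if `i` is out of range, like indexing a Vec.
    fn index(&self, i: usize) -> &str {
        self.slots[i].get_or_init(|| {
            // Simulated disk read, executed at most once per slot.
            self.loads.set(self.loads.get() + 1);
            DATA[i * CHUNK_LEN..(i + 1) * CHUNK_LEN].to_string()
        })
    }
}

fn main() {
    let lv = LazyChunks::new(); // immutable
    println!("v[5]={}", &lv[5]); // v[5]=ng
    println!("v[5]={}", &lv[5]); // served from the slot, no second load
    println!("loads={}", lv.loads.get()); // loads=1
}
```

This stays within safe Rust because a `OnceCell` can be written only once, so a reference returned by `get_or_init` is never invalidated by a later access.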