3

As an educational exercise, I'm looking at porting cvs-fast-export to Rust.

Its basic mode of operation is to parse a number of CVS master files into a intermediate form, and then to analyse the intermediate form with the goal of transforming it into a git fast-export stream.

One of the things that is done when parsing is to convert common parts of the intermediate form into a canonical representation. A motivating example is commit authors. A CVS repository may have hundreds of thousands of individual file commits, but probably less than a thousand authors. So an interning table is used when parsing where you input the author as you parse it from the file and it will give you a pointer to a canonical version, creating a new one if it hasn't seen it before. (I've heard this called atomizing or interning too). This pointer then gets stored on the intermediate object.

My first attempt to do something similar in Rust attempted to use a HashSet as the interning table. Note this uses CVS version numbers rather than authors, this is just a sequence of digits such as 1.2.3.4, represented as a Vec.

use std::collections::HashSet;
use std::hash::Hash;

#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Vec<u16>);

fn intern<T:Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> &T {
    let dupe = item.clone();
    if !set.contains(&item) {
        set.insert(item);
    }
    set.get(&dupe).unwrap()
}

fn main() {
    let mut set: HashSet<CvsNumber> = HashSet::new();
    let c1 = CvsNumber(vec![1, 2]);
    let c2 = intern(&mut set, c1);
    let c3 = CvsNumber(vec![1, 2]);
    let c4 = intern(&mut set, c3);
}

This fails with error[E0499]: cannot borrow 'set' as mutable more than once at a time. This is fair enough, HashSet doesn't guarantee references to its keys will be valid if you add more items after you have obtained a reference. The C version is careful to guarantee this. To get this guarantee, I think the HashSet should be over Box<T>. However I can't explain the lifetimes for this to the borrow checker.

The ownership model I am going for here is that the interning table owns the canonical versions of the data, and hands out references. The references should be valid as long the interning table exists. We should be able to add new things to the interning table without invalidating the old references. I think the root of my problem is that I'm confused how to write the interface for this contract in a way consistent with the Rust ownership model.

Solutions I see with my limited Rust knowledge are:

  1. Do two passes, build a HashSet on the first pass, then freeze it and use references on the second pass. This means additional temporary storage (sometimes substantial).
  2. Unsafe

Does anyone have a better idea?

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
Laurence
  • 10,896
  • 1
  • 25
  • 34

2 Answers2

5

I somewhat disagree with @Shepmaster on the use of unsafe here.

While right now it does not cause issue, should someone decide in the future to change the use of HashSet to include some pruning (for example, to only ever keep a hundred authors in there), then unsafe will bite you sternly.

In the absence of a strong performance reason, I would simply use a Rc<XXX>. You can alias it easily enough: type InternedXXX = Rc<XXX>;.

use std::collections::HashSet;
use std::hash::Hash;
use std::rc::Rc;

#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Rc<Vec<u16>>);

fn intern<T:Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> T {
    if !set.contains(&item) {
        let dupe = item.clone();
        set.insert(dupe);
        item
    } else {
        set.get(&item).unwrap().clone()
    }
}

fn main() {
    let mut set: HashSet<CvsNumber> = HashSet::new();
    let c1 = CvsNumber(Rc::new(vec![1, 2]));
    let c2 = intern(&mut set, c1);
    let c3 = CvsNumber(Rc::new(vec![1, 2]));
    let c4 = intern(&mut set, c3);
}
Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • Could you clarify on *who* might change `HashSet`? I don't suppose you mean that the implementer of `HashSet` would change (there's not much we can do about that, anyway). If you mean the *use* of `HashSet` inside the interner, then yes. That's why I always advocate the use of comments next to each `unsafe` to document it. I thought about using `Rc`, but didn't like the interaction with the forced allocation. You can't intern `&[u16]`, for example. – Shepmaster Oct 22 '16 at 14:23
  • @Shepmaster: Note that I say *change the use of `HashSet`*, by which I mean that either the OP or some other future contributor might break the invariant that data is never removed from the set. As for commenting the use of the `unsafe`, this is a commendable idea, however one might not think of looking into the `intern` method when adding another method... and one might misunderstand or misinterpret comments, too. In *this specific* example, it is simple enough certainly, but I would prefer avoiding cultivating a culture of "just use `unsafe` here". – Matthieu M. Oct 22 '16 at 14:34
  • To clarify my previous comment for future readers, I dislike this solution in *this* case because it requires *extraneous allocations*. Specifically, every string that wants to be interned must first be allocated, then a hashmap checked, then either stored or deallocated. The total number of allocations over the lifetime of the program is the *same*, but duplicates are deallocated much earlier, so the maximum memory usage at any given time is lower. There is also some amount of runtime overhead to increment / decrement the refcount. If those limitations are OK, this is a perfect solution! – Shepmaster Oct 22 '16 at 18:52
  • I narrowly missed writing this myself. I missed the point of `Rc` is that you can return `T` instead of `&T`. I've marked this answer as correct, even though @Shepmaster has more strictly answered my question as you are right that this should be my first solution, and I should only go for the unsafe version after profiling. It is nice to know both options exist. – Laurence Oct 22 '16 at 18:53
4

Your analysis is correct. The ultimate issue is that when modifying the HashSet, the compiler cannot guarantee that the mutations will not affect the existing allocations. Indeed, in general they might affect them, unless you add another layer of indirection, as you have identified.

This is a prime example of a place that unsafe is useful. You, the programmer, can assert that the code will only ever be used in a particular way, and that particular way will allow the variable to be stable through any mutations. You can use the type system and module visibility to help enforce these conditions.

Note that String already introduces a heap allocation. So long as you don't change the String once it's allocated, you don't need an extra Box.

Something like this seems like an OK start:

use std::{cell::RefCell, collections::HashSet, mem};

struct EasyInterner(RefCell<HashSet<String>>);

impl EasyInterner {
    fn new() -> Self {
        EasyInterner(RefCell::new(HashSet::new()))
    }

    fn intern<'a>(&'a self, s: &str) -> &'a str {
        let mut set = self.0.borrow_mut();

        if !set.contains(s) {
            set.insert(s.into());
        }

        let interned = set.get(s).expect("Impossible missing string");

        // TODO: Document the pre- and post-conditions that the code must
        // uphold to make this unsafe code valid instead of copying this
        // from Stack Overflow without reading it
        unsafe { mem::transmute(interned.as_str()) }
    }
}

fn main() {
    let i = EasyInterner::new();

    let a = i.intern("hello");
    let b = i.intern("world");
    let c = i.intern("hello");

    // Still strings
    assert_eq!(a, "hello");
    assert_eq!(a, c);
    assert_eq!(b, "world");

    // But with the same address
    assert_eq!(a.as_ptr(), c.as_ptr());
    assert!(a.as_ptr() != b.as_ptr());

    // This shouldn't compile; a cannot outlive the interner
    // let x = {
    //     let i = EasyInterner::new();
    //     let a = i.intern("hello");
    //     a
    // };

    let the_pointer;
    let i = {
        let i = EasyInterner::new();
        {
            // Introduce a scope to contstrain the borrow of `i` for `s`
            let s = i.intern("inner");
            the_pointer = s.as_ptr();
        }
        i // moving i to a new location
          // All outstanding borrows are invalidated
    };

    // but the data is still allocated
    let s = i.intern("inner");
    assert_eq!(the_pointer, s.as_ptr());
}

However, it may be much more expedient to use a crate like:

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
  • I'm torn as to which of yours and Matthieu's answers to tick. I think you have more strictly answered the question I asked, but Matthieu has given a more idiomatic rust solution, albeit with with additional overheads and safety. Both answers make valuable contributions, I wish I could tick both. – Laurence Oct 22 '16 at 18:46
  • @Laurence no worries! You are supposed to accept whichever answer best helps *you*. Upvotes are for questions and answers that were useful to *someone*. Over time, people may upvote a non-accepted answer more or less than the accepted one. – Shepmaster Oct 22 '16 at 18:55
  • 1
    @Laurence Although I **would** say that string_cache is probably the better answer than either version of rolling your own. – Shepmaster Oct 22 '16 at 18:56