
I'm trying to wrap my head around Rust's generics. I'm writing something to extract HTML from different web sites. What I want is something like this:

trait CanGetTitle {
    fn get_title(&self) -> String;
}

struct Spider<T: CanGetTitle> {
    pub parser: T
}

struct GoogleParser;
impl CanGetTitle for GoogleParser {
    fn get_title(&self) -> String {
        "title from H1".to_string()
    }
}

struct YahooParser;
impl CanGetTitle for YahooParser {
    fn get_title(&self) -> String {
        "title from H2".to_string()
    }
}

enum SiteName {
    Google,
    Yahoo,
}

impl SiteName {
    fn from_url(url: &str) -> SiteName {
        SiteName::Google
    }
}

fn main() {
    let url = "http://www.google.com";
    let site_name = SiteName::from_url(&url);
    let spider: Spider<_> = match site_name {
        SiteName::Google => Spider { parser: GoogleParser },
        SiteName::Yahoo => Spider { parser: YahooParser }
    };

    spider.parser.get_title();    // fails
}

I'm getting an error about the match returning Spiders parameterised over two different types. It expects it to return Spider<GoogleParser> because that's the return type of the first arm of the pattern match.

How can I declare that spider should be any Spider<T: CanGetTitle>?

Peter Hall
jbrown

2 Answers


How can I declare that spider should be any Spider<T: CanGetTitle>?

Just to add a little to what @Shepmaster already said: spider cannot be any Spider<T>, because it has to be exactly one Spider<T>. Rust implements generics using monomorphization, which means it compiles a separate version of your polymorphic code for each concrete type that is used. If the compiler cannot deduce a unique T at a particular use site, that is a compile error. In your case, the compiler deduced from the first match arm that the type must be Spider<GoogleParser>, but the next arm then tries to produce a Spider<YahooParser>.
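
To make the "exactly one Spider<T>" point concrete, here is a minimal sketch that asks the compiler which T it picked for each value. The `parser_type` helper is a hypothetical name introduced for illustration, and the exact strings returned by `std::any::type_name` are only a best-effort description, so treat the printed text as indicative:

```rust
use std::any::type_name;

trait CanGetTitle {
    fn get_title(&self) -> String;
}

struct Spider<T: CanGetTitle> {
    parser: T,
}

struct GoogleParser;
impl CanGetTitle for GoogleParser {
    fn get_title(&self) -> String {
        "title from H1".to_string()
    }
}

struct YahooParser;
impl CanGetTitle for YahooParser {
    fn get_title(&self) -> String {
        "title from H2".to_string()
    }
}

// Hypothetical helper: reports the concrete type chosen for T.
// The compiler generates one copy of this function per distinct T.
fn parser_type<T: CanGetTitle>(_spider: &Spider<T>) -> &'static str {
    type_name::<T>()
}

fn main() {
    let google = Spider { parser: GoogleParser };
    let yahoo = Spider { parser: YahooParser };

    // Two different types, so they cannot share one variable or one match result.
    println!("{} -> {}", parser_type(&google), google.parser.get_title());
    println!("{} -> {}", parser_type(&yahoo), yahoo.parser.get_title());
}
```

Because `Spider<GoogleParser>` and `Spider<YahooParser>` are unrelated types as far as the compiler is concerned, a single `let spider = match ...` has no one type to settle on.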

Using a trait object lets you defer all of that to runtime. Because the actual object lives on the heap behind a Box, the compiler knows how much stack space is needed (just the size of the Box, whatever the concrete parser is). But this comes with performance costs: there is extra pointer indirection when the data needs to be accessed and, more significantly, the optimising compiler generally cannot inline virtual calls unless it manages to devirtualize them.
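
As a rough illustration of the size argument, the sketch below compares the sizes of two parsers with the size of a boxed trait object. It makes two assumptions that are not in the question: `YahooParser` is given a `selector` field so the two parsers differ in size, and `make_parser` is a hypothetical helper. The code uses the `dyn` syntax that current Rust editions require. On current compilers the box is two pointers wide (data pointer plus vtable pointer), though that layout is an implementation detail rather than something to rely on:

```rust
use std::mem::size_of;

trait CanGetTitle {
    fn get_title(&self) -> String;
}

// Zero-sized parser, as in the question.
struct GoogleParser;
impl CanGetTitle for GoogleParser {
    fn get_title(&self) -> String {
        "title from H1".to_string()
    }
}

// Assumption for illustration: this parser carries some state,
// so its size differs from GoogleParser's.
struct YahooParser {
    selector: String,
}
impl CanGetTitle for YahooParser {
    fn get_title(&self) -> String {
        format!("title from {}", self.selector)
    }
}

// Hypothetical helper: both branches have the same type, because the
// concrete parser is hidden behind the box.
fn make_parser(use_google: bool) -> Box<dyn CanGetTitle> {
    if use_google {
        Box::new(GoogleParser)
    } else {
        Box::new(YahooParser { selector: "H2".to_string() })
    }
}

fn main() {
    println!("GoogleParser: {} bytes", size_of::<GoogleParser>());
    println!("YahooParser: {} bytes", size_of::<YahooParser>());
    println!("Box<dyn CanGetTitle>: {} bytes", size_of::<Box<dyn CanGetTitle>>());
    println!("{}", make_parser(false).get_title());
}
```

The box has the same size regardless of which parser it holds, which is what lets the compiler reserve stack space for it without knowing the concrete type.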

It is often possible to rejig things so you can work with a monomorphic type anyway. One way to do that in your case is to avoid the temporary assignment to a polymorphic variable, and use the value only at a place where you know its concrete type:

fn do_stuff<T: CanGetTitle>(spider: Spider<T>) {
    println!("{:?}", spider.parser.get_title());
}

fn main() {
    let url = "http://www.google.com";
    let site_name = SiteName::from_url(&url);
    match site_name {
        SiteName::Google => do_stuff(Spider { parser: GoogleParser }),
        SiteName::Yahoo => do_stuff(Spider { parser: YahooParser })
    };
}

Notice that each time do_stuff is called, T resolves to a different type. You only write one implementation of do_stuff, but the compiler monomorphizes it twice - once for each type that you called it with.

If you use a boxed trait object then each call to parser.get_title() has to go through the trait object's vtable. This version will usually be faster, because it avoids that lookup and gives the compiler the chance to inline the body of get_title() in each case.

Peter Hall
  • Hmm interesting. I think in this case though there'll be a lot of commonality for what I want to do between sites, with the only differences things like exactly which HTML selectors to use to extract the data I need depending on the site, etc. – jbrown Dec 30 '16 at 09:28
  • *at the cost of extra pointer indirection when the data needs to be accessed* => Actually, that's the least cost you pay for it. The greater cost is that, barring an optimizer smart enough to devirtualize the call, this inhibits inlining, which is a key enabler for optimizations. So while the cost of an extra pointer dereference/virtual call is very small, the loss of inlining and optimizations can (in tight loops) be very costly indeed. – Matthieu M. Dec 30 '16 at 12:21
  • @MatthieuM. Thanks, made a tweak to make that clear. – Peter Hall Dec 30 '16 at 12:26

How can I declare that spider should be any Spider<T: CanGetTitle>?

You cannot. Simply put, the compiler would have no idea how much space to allocate to store spider on the stack.

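Instead, you will want to use a trait object, Box<dyn CanGetTitle> (written Box<CanGetTitle> before the dyn keyword was introduced):

// Forward the trait through Box so that a boxed parser, including a
// boxed trait object, can itself be used as Spider's T.
impl<T: ?Sized> CanGetTitle for Box<T>
where
    T: CanGetTitle,
{
    fn get_title(&self) -> String {
        (**self).get_title()
    }
}

fn main() {
    let innards: Box<dyn CanGetTitle> = match SiteName::Google {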
        SiteName::Google => Box::new(GoogleParser),
        SiteName::Yahoo => Box::new(YahooParser),
    };
    let spider = Spider { parser: innards };
}
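
Pulling in the definitions from the question, a self-contained sketch of this approach might look like the following. The `spider_for` helper is a hypothetical name introduced here so the match lives in one place, and the code uses the `dyn` syntax that current Rust editions require:

```rust
trait CanGetTitle {
    fn get_title(&self) -> String;
}

struct Spider<T: CanGetTitle> {
    pub parser: T,
}

struct GoogleParser;
impl CanGetTitle for GoogleParser {
    fn get_title(&self) -> String {
        "title from H1".to_string()
    }
}

struct YahooParser;
impl CanGetTitle for YahooParser {
    fn get_title(&self) -> String {
        "title from H2".to_string()
    }
}

enum SiteName {
    Google,
    Yahoo,
}

// Forward the trait through Box so Box<dyn CanGetTitle> satisfies
// Spider's `T: CanGetTitle` bound.
impl<T: ?Sized + CanGetTitle> CanGetTitle for Box<T> {
    fn get_title(&self) -> String {
        (**self).get_title()
    }
}

// Hypothetical helper: every arm yields the same type, Box<dyn CanGetTitle>,
// so the match type-checks and the parser is chosen at runtime.
fn spider_for(site: SiteName) -> Spider<Box<dyn CanGetTitle>> {
    let parser: Box<dyn CanGetTitle> = match site {
        SiteName::Google => Box::new(GoogleParser),
        SiteName::Yahoo => Box::new(YahooParser),
    };
    Spider { parser }
}

fn main() {
    println!("{}", spider_for(SiteName::Google).parser.get_title());
    println!("{}", spider_for(SiteName::Yahoo).parser.get_title());
}
```

Here `Spider` stays generic and sized; only the choice of parser is deferred to runtime.
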
Shepmaster
  • I'm still struggling with this. Will it work with multiple traits though? I'll need things like `ParsePage`, `GetQuery`, etc. and will need something that I can extend to cover all the traits that need implementing. – jbrown Dec 29 '16 at 17:32
  • @jbrown why do you believe it won't work with multiple traits? – Shepmaster Dec 29 '16 at 18:01
  • For some reason I needed to add `?Sized` into `Spider` as well, as in `struct Spider`. This is great to know though, thanks a lot. – jbrown Dec 29 '16 at 19:32
  • @jbrown: The `?Sized` should not be necessary for a concrete `T`. – Matthieu M. Dec 30 '16 at 12:19