0

I am designing a class in C++ that extracts URLs from an HTML page. I am using Boost's Regex library to do the heavy lifting for me. I started designing a class and realized that I didn't want to tie down how the URLs are stored. One option would be to accept a std::vector<Url> by reference and just call push_back on it. I'd like to avoid forcing consumers of my class to use std::vector. So, I created a member template that took a destination iterator. It looks like this:

template <typename TForwardIterator, typename TOutputIterator>
TOutputIterator UrlExtractor::get_urls(
    TForwardIterator begin,
    TForwardIterator end, 
    TOutputIterator dest);

I feel like I am overcomplicating things. I like to write fairly generic code in C++, and I struggle to lock down my interfaces. But then I get into these predicaments where I am trying to templatize everything. At this point, someone reading the code doesn't realize that TForwardIterator is iterating over a std::string.

In my particular situation, I am wondering if being this generic is a good thing. At what point do you start making code more explicit? Is there a standard approach to getting values out of a function generically?

Travis Parks
  • 8,435
  • 12
  • 52
  • 85
  • 1
    First of all, you should consider parsing the HTML properly using an existing library: http://stackoverflow.com/questions/489522/library-recommendation-c-html-parser – piokuc Sep 25 '12 at 15:25
  • 1
    Regarding not tying the user to a particular container, your design is good because you can use `back_inserter` as the third argument and be container-agnostic. It's how a lot of standard library algorithms work. – Seth Carnegie Sep 25 '12 at 15:26
  • I don't really need to parse HTML. I just need to detect the `href`s in the anchor tags. – Travis Parks Sep 25 '12 at 15:29
  • 1
    HTML is too complex to support parsing by regular expression. Any semi complex page will have script tags (or one of another dozen tags) and once inside there all hell well break loose. You need to use an SAX parser designed to specifically cope with html (especially all the misinformed pages (which are the majority)). A simple technique to get you rolling is described here: http://stackoverflow.com/a/11820174/14065 – Martin York Sep 25 '12 at 15:45

4 Answers4

3

Yes, it's not only fine but a very nice design. Templating that way is how most of the standard library algorithms work, like std::fill or std::copy; they are made to work with iterators so that you can fill a container that already has elements in it, or you can take an empty container and fill it up with data by using std::back_inserter.

This is a very good design IMO, and takes advantage of the power of templates and the iterator concept.

You can use it like this (but you already know this):

std::list<Url> l;
std::vector<Url> v;

x.get_urls(begin(dat1), end(dat1), std::back_inserter(l));
y.get_urls(begin(dat2), end(dat2), std::back_inserter(v));

I get the feeling that you are afraid of using templates, that they are not "normal" C++, or that they should be avoided and are bloated or something. I assure you, they are very normal and a powerful language feature that no other language (that I know of) has, so whenever it is appropriate to use them, USE THEM. And here, it is very appropriate.

Seth Carnegie
  • 73,875
  • 22
  • 181
  • 249
2

Looks to me that you have the wrong interface.

There are already algorithms for copying from iterators in-to containers. Seems to me that your class is providing a stream of urls (without relying modifying its source). So all you really need is a way to expose you internal data via iterators (forward iterators) and thus all you need to provide begin() and end().

UrlExtractor             page(/* Some way of constructing page */);
std::vector<std::string> data;

std::copy(page.begin(), page.end(), std::back_inserter(data));

I would just provide the following interface:

class UrlExtractor
{
    ...... STUFF
    iterator begin(); 
    iterator end();
};
Martin York
  • 257,169
  • 86
  • 333
  • 562
  • Off-topic, but I can't tell if your avatar is an intense left eye or a sad right eye :) – Seth Carnegie Sep 25 '12 at 16:03
  • I have thought about this interface. My concern was explicitly specifying the type of the iterator. For instance, I am not sure if I will handle unicode or not. I suppose I could make the entire class a template on the input char type. – Travis Parks Sep 25 '12 at 19:53
0

Yes, you are being too general. The point of a template is that you can generate multiple copies of the function that behave differently. You probably don't want that because you should pick one way of representing a URL and use that in your entire program.

How about you just do:

typedef std::string url;

That allows you to change the class you use for urls in the future.

Maybe std::vector implements some interface with push_back() in it and your method can take a reference to that interface (back_inserter?).

David Grayson
  • 84,103
  • 24
  • 152
  • 189
  • 1
    I think he is wanting to figure out how to store a _collection_ of urls, not a single one. – Seth Carnegie Sep 25 '12 at 15:29
  • Right. I want to know how to get `URL`s out of a function. I already have a URL class. I want to know if iterator templates are overkill and whether I should just make the collection type explicit, such as `std::vector`. – Travis Parks Sep 25 '12 at 15:34
  • @TravisParks no, you are doing fine as it is :) Take a look at `std::fill`, `std::copy`, and basically everything else that takes iterators; they are meant to work with the magical `back_inserter` _or_ a normal iterator for a container that already has space. – Seth Carnegie Sep 25 '12 at 15:37
  • You don't think a member template is too much? I feel like the generic nature of the STL and Boost are leaking into the rest of my code base. Perhaps this is normal for C++ - I don't know. – Travis Parks Sep 25 '12 at 15:39
  • @TravisParks no, I think it is beautiful, see my answer. – Seth Carnegie Sep 25 '12 at 15:42
0

It's hard to say without knowing the actual use case scenarios, but in general, it's better to avoid templates (or any other unnecessary complexity) unless it actually buys you something. The most obvious signature here would be:

std::vector<Url> UrlExtractor::get_urls( std::string const& source );

Is there really any likely scenario where you'll have to parse anything but an std::string? (There might be if you also supported input iterators. But in practice, if you're parsing, the sources will be either a std::string or an std::istream&. Unless you really want to support the latter, just use std::string.) And of course, client code can do whatever it wants with the returned vector, including appending it to another type of collection.

If the cost of returning a std::vector does become an issue, then you could take an std::vector<Url>& as an argument. I can't see any reasonable scenario where any additional flexibility would buy you much, and a function like get_urls is likely to be fairly complicated, and not the sort of thing you'd want to put in a header.

James Kanze
  • 150,581
  • 18
  • 184
  • 329