I am trying to parse a simple sentence structure with Boost. This is my first time using Boost, so I could be doing this completely wrong. What I want to do is only accept strings in this format:

  • Must start with a letter (case insensitive)
  • May contain:
    • Alphabetic characters
    • Numeric characters
    • Underscores
    • Hyphens
  • All other characters serve as delimiters

Since I don't know what characters are my delimiters (there could be tons), I have tried to make a regex that is sensitive to that. The only problem is, I am only getting the last letter of each word. This leads me to believe that my regex is correct, but my use of boost is not. Here's my code:

boost::regex regexp("[A-Za-z]([A-Za-z]|[0-9]|_|-)*", boost::regex::normal | boost::regbase::icase);
boost::sregex_token_iterator i(text.begin(), text.end(), regexp, 1);
boost::sregex_token_iterator j;
while(i != j){
    cout << *i++ << std::endl;
}

I modeled this after what I found on the Boost website, using the last example (at the bottom of the page) as a template for my code. In this instance, text is an object of type std::string.

Is my regex correct? Am I using boost correctly?

beatgammit

2 Answers

Change your regex to: ([A-Za-z][-A-Za-z0-9_]*)

By putting the parentheses around the whole expression, the entire word becomes capture group 1, so the whole match is captured and not just the last character matched. Putting the - first inside the character class makes it a literal hyphen rather than a range specifier.
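To make the suggestion concrete, here is a minimal sketch of the corrected pattern in use. It is written against the standard library's std::regex rather than Boost so that it builds without extra dependencies; std::sregex_token_iterator has the same shape as boost::sregex_token_iterator, so swapping in the boost:: names and <boost/regex.hpp> should work similarly, though that is an assumption and not something tested here. The helper name extract_words is hypothetical and does not appear in the original post.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): collects every word that
// starts with a letter and continues with letters, digits, '_' or '-'.
std::vector<std::string> extract_words(const std::string& text) {
    // The outer parentheses make the whole word capture group 1.
    // The '-' comes first in the character class, so it is a literal hyphen.
    static const std::regex word_re("([A-Za-z][-A-Za-z0-9_]*)");

    std::vector<std::string> words;
    // Submatch index 1 selects capture group 1, which here spans the whole word.
    std::sregex_token_iterator it(text.begin(), text.end(), word_re, 1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

Calling extract_words("foo_bar, baz-1!") should yield foo_bar and baz-1, with the comma, space, and exclamation mark acting as delimiters.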

Ferruccio
  • Thanks for the quick response!! I removed the trailing * because I don't want 0-length words. – beatgammit Mar 09 '11 at 13:01
  • You definitely want the trailing *. Without it, the regex will only match two-character words. No need to worry about zero-length words: the initial [A-Za-z] must match exactly one character. – Ferruccio Mar 09 '11 at 13:23

You're requesting the first submatch for each RE match. That refers to this subexpression: ([A-Za-z]|[0-9]|_|-) and you're getting the last thing that matched (notice that it's qualified by a *) for each match. Hence, the last character. I think you should pass 0 for the submatch number, or just omit that parameter. When I modify your code to do that, it does what I think you're wanting it to do.
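The difference between the two submatch numbers can be sketched as follows. This keeps the question's original pattern unchanged and only varies the last constructor argument. It uses std::regex as a stand-in for Boost.Regex (an assumption made so the snippet is self-contained; the iterator interface is the same shape), and the helper name tokens_for_submatch is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): runs the question's
// original pattern and returns the requested submatch of every match.
std::vector<std::string> tokens_for_submatch(const std::string& text, int submatch) {
    // Same pattern as in the question; group 1 is the repeated single-character
    // group, so after a match it holds only the last character it consumed.
    static const std::regex word_re("[A-Za-z]([A-Za-z]|[0-9]|_|-)*",
                                    std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> out;
    // submatch == 0 selects the whole match; submatch == 1 selects group 1.
    std::sregex_token_iterator it(text.begin(), text.end(), word_re, submatch);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

For the input "foo bar", passing 0 should give the full words foo and bar, while passing 1 gives only the trailing characters o and r, which matches the symptom described in the question.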

Gareth McCaughan
  • I tried using 0 for the submatch number, but that didn't fix it. While your answer was technically correct, I went with Ferruccio because it was simpler. Thanks though! I wish I could have two correct answers!! – beatgammit Mar 09 '11 at 12:59
  • Well, you could always upvote both of us :-). (Using 0 for the submatch number definitely seems to do the job for me; I wonder why it doesn't for you.) – Gareth McCaughan Mar 09 '11 at 13:04
  • There, I upvoted both of you. I'm not sure what it was, but it works now. I'm sure I had some weird typo that was fixed by Ferruccio. But when I just used 0, I got even less (only single characters that were not in my regex, like whitespace and periods). – beatgammit Mar 09 '11 at 13:16