Specify a charset without interpreting ranges

Question

I'm puzzled about how to define a character set in a rule when a minus sign has to be just a minus character and not a range of characters between two endpoints.

For example, to write a rule that percent-encodes a string of characters you would normally write

*(bk::char_("a-zA-Z0-9-_.~") | '%' << bk::right_align(2, 0)[bk::upper[bk::hex]]);

This is meant as "lowercase letters, capital letters, digits, minus sign, underscore, dot and tilde", but the third minus sign creates a range between 9 and underscore (or something like that), so you have to put the minus at the end: bk::char_("a-zA-Z0-9_.~-").

That solves the immediate problem, but what should one do when the input is dynamic, such as user input, and a minus sign just means the minus character?

How do I prevent Spirit from assigning a special meaning to any of the possible characters?
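
To make the pitfall concrete, here is a small standard-C++ model of the range rule described above. It is only a sketch: `expand_definition` and `model_examples` are hypothetical names, and the rule it encodes (a pair `x-y` is an inclusive range, the end of one range may begin the next, and a leading or trailing `-` is literal) is an assumption drawn from the behaviour described in this question, not Spirit's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Simplified model (an assumption, not Spirit's source): "x-y" denotes the
// inclusive range x..y, the end of a range may start the next range, and a
// '-' that is the first or last character is taken literally.
std::set<char> expand_definition(const std::string& def)
{
    std::set<char> out;
    if (def.empty()) return out;

    std::size_t i = 0;
    char ch = def[i++];
    for (;;) {
        if (i == def.size()) {        // nothing follows: ch is a literal
            out.insert(ch);
            break;
        }
        char next = def[i++];
        if (next == '-') {
            if (i == def.size()) {    // trailing '-': ch and '-' are literals
                out.insert(ch);
                out.insert('-');
                break;
            }
            next = def[i++];          // range ch..next, both ends included
            for (int c = static_cast<unsigned char>(ch);
                 c <= static_cast<unsigned char>(next); ++c)
                out.insert(static_cast<char>(c));
        } else {
            out.insert(ch);           // plain literal
        }
        ch = next;
    }
    return out;
}

// The two definitions from the question, checked against the model.
void model_examples()
{
    // "9-_" is read as a range, so '-' itself is not a member...
    assert(expand_definition("a-zA-Z0-9-_.~").count('-') == 0);
    // ...while unintended characters such as ':' are.
    assert(expand_definition("a-zA-Z0-9-_.~").count(':') == 1);
    // With the minus last, it is a literal member of the set.
    assert(expand_definition("a-zA-Z0-9_.~-").count('-') == 1);
}
```

Under this model, moving the minus to the end is what makes it a member of the set; the original ordering silently admits characters such as ':' and '^' instead.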

EDIT001: Here is a more concrete example, taken from @sehe's answer:

void spirit_direct(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
    result.clear();
    using namespace bsq;
    if(!parse(input.begin(), input.end(), raw[*(char_ - char_(delimiter))] % char_(delimiter), result))
        result.push_back(input);
}

If you want to ensure that the minus is treated as a minus and not as a range, you would alter the code as follows (according to @sehe's proposal below).

void spirit_direct(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
    result.clear();
    bsq::symbols<char, bsq::unused_type> sym_;
    std::string separators = delimiter;
    for(auto ch : separators)
    {
        sym_.add(std::string(1, ch));
    }
    using namespace bsq;
    if(!parse(input.begin(), input.end(), raw[*(char_ - sym_)] % sym_, result))
        result.push_back(input);
}

Which looks quite elegant. For a static constant rule I guess I can escape characters with '\'; square brackets are apparently among those "special" characters that need to be escaped. Why? What is the meaning of []? Are there any additional characters to escape?

score 1 · Accepted Answer · edited May 23 '17 at 12:30

Simple.

You devise and specify the supported patterns that the user can supply with their meanings.

Next,

1. you write the code that transforms it into a character set (e.g. expand all ranges (if supported in user input) and sort the - to be the first character by definition).
2. do not use a character set at all.
   a. why not use char_ [ _pass = my_match_predicate(_1) ]
   b. why not just make an alternation of literal characters? lit('a') | 'b' | '-' | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
   c. why not use qi::symbols<char, char> (or even qi::symbols<char, qi::unused_type> sym_; with raw [ sym_ ] or similar)

   Update: The qi::symbols<> approach is surprisingly fast: [Live On Coliru](http://coliru.stacked-crooked.com/a/81a85f4c3b8fc610). I had a recent optimization job where it disappointed: [see this answer (under "Spirit (Trie)")](http://stackoverflow.com/questions/29210120/binary-string-to-hex-c/29214966#29214966) – Binary String to Hex c++

In general, I don't know what you're trying to achieve, but Spirit is not well-suited for generating rules on the fly. See some of my existing boost-spirit answers on this site.
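
Options 1 and 2a can be sketched in plain standard C++, independent of Spirit. Both names below, `to_charset_definition` and `literal_set`, are hypothetical helpers made up for illustration, and the sketch assumes the user input is a flat list of literal characters with no range syntax of its own.

```cpp
#include <bitset>
#include <cassert>
#include <string>

// Option 1 (hypothetical helper): rebuild a list of literal characters so
// that '-' comes first, if present, and no character appears twice.
std::string to_charset_definition(const std::string& literals)
{
    std::bitset<256> seen;
    std::string rest;
    bool has_minus = false;
    for (unsigned char ch : literals) {
        if (seen[ch]) continue;       // drop duplicates
        seen[ch] = true;
        if (ch == '-')
            has_minus = true;         // remember it, emit it up front later
        else
            rest.push_back(static_cast<char>(ch));
    }
    std::string result = has_minus ? "-" + rest : rest;
    // No '-' may remain anywhere after the first position.
    assert(result.find('-', 1) == std::string::npos);
    return result;
}

// Option 2a (hypothetical stand-in for my_match_predicate): a lookup table
// in which every character of `literals` is taken literally, '-' included.
class literal_set
{
public:
    explicit literal_set(const std::string& literals)
    {
        for (unsigned char ch : literals) members_.set(ch);
    }

    bool operator()(char ch) const
    {
        return members_.test(static_cast<unsigned char>(ch));
    }

private:
    std::bitset<256> members_;
};
```

For the delimiter string `",|-~"` discussed in the comments, `to_charset_definition` yields `"-,|~"`, whose `c_str()` could then be handed to `char_(...)` in place of the raw user input. `literal_set` is the kind of predicate a semantic action such as `_pass = my_match_predicate(_1)` might consult; how it is wired into the grammar is left open here.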

edited May 23 '17 at 12:30 by Community · answered May 12 '15 at 11:28 by sehe

ho-ho, you have a lot :) let's look at your answer http://stackoverflow.com/questions/30046512/splitting-string-using-boost-spirit and consider the "delimiter" ",|-~": it will capture 4 delimiters (',','|','}','~') when the average Joe meant 4 delimiters too, but ',','|','-','~'. I cannot specify the pattern for the input, it's too "far away", and it should be backward compatible (minus just means minus). As for the next points - 1) not elegant. 2a) looks cumbersome, but maybe I didn't get it. 2b) much better, but again not the height of elegance. 2c) looks like just what the doctor ordered – kreuzerkrieg May 12 '15 at 11:42
@kreuzerkrieg "it should be backward compatible" - see, there's your specification! And about the linked answer... [what's wrong with a simple ordering](http://coliru.stacked-crooked.com/a/2b21fd6d120e393c)? – sehe May 12 '15 at 12:30
that's it, it is not MY specification... long story... The ordering stuff - the good one, it is nice and ugly :) IMHO, the symbols<> stuff is more readable and easier to understand – kreuzerkrieg May 12 '15 at 12:41
odd stuff, when I mention a user with "@" it just disappears from the comment – kreuzerkrieg May 12 '15 at 12:42
@kreuzerkrieg OT but that's because you're addressing the author of the post (Q or A) and nobody else was involved in the comment thread. Kinda violates [Principle Of Least Surprise](http://en.wikipedia.org/wiki/Principle_of_least_astonishment) but oh well :) – sehe May 12 '15 at 12:46
@kreuzerkrieg `qi::symbols` may be nice, but the charsets (char_) will compile down to significantly faster parsers. Don't forget your original goal :) Of course, you can do the preprocessing on the `delim` parameter once: **[Live On Coliru](http://coliru.stacked-crooked.com/a/fa4ce5d1972d2d78)** – sehe May 12 '15 at 12:56
Actually... strike that. The `qi::symbols<>` approach is surprisingly fast: **[Live On Coliru](http://coliru.stacked-crooked.com/a/81a85f4c3b8fc610)**. I had a recent optimization job where it disappointed: [see this answer (under "Spirit (Trie)")](http://stackoverflow.com/questions/29210120/binary-string-to-hex-c/29214966#29214966) – sehe May 12 '15 at 12:58

score 0 · Answer 2 · answered May 12 '15 at 11:39

Have you tried escaping the minus with \-, as in bk::char_("a-zA-Z0-9\\-_.~")?

answered May 12 '15 at 11:39 by Semen Tykhonenko

nope since I'm looking for something like a directive which disables special character meaning – kreuzerkrieg May 12 '15 at 12:18
@kreuzerkrieg it doesn't exist. – sehe May 12 '15 at 12:29
