When the input string is blank, boost::split returns a vector with one empty string in it.

Is it possible to have boost::split return an empty vector instead?

MCVE:

#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

int main() {
    std::vector<std::string> result;
    boost::split(result, "", boost::is_any_of(","), boost::algorithm::token_compress_on);
    std::cout << result.size();
}

Output:

1

Desired output:

0
sehe
rustyx

1 Answer

Compression merges adjacent delimiters into a single separator; it does not suppress empty tokens.

If you consider the following, you can see why this works consistently:

#include <boost/algorithm/string.hpp>
#include <string>
#include <iostream>
#include <iomanip>
#include <vector>

int main() {
    for (std::string const& test : {
            "", "token", 
            ",", "token,", ",token", 
            ",,", ",token,", ",,token", "token,,"
        })
    {
        std::vector<std::string> result;
        boost::split(result, test, boost::is_any_of(","), boost::algorithm::token_compress_on);
        std::cout << "\n=== TEST: " << std::left << std::setw(8) << test << " === ";
        for (auto& tok : result)
            std::cout << std::quoted(tok, '\'') << " ";
    }
}

Prints

=== TEST:          === '' 
=== TEST: token    === 'token' 
=== TEST: ,        === '' '' 
=== TEST: token,   === 'token' '' 
=== TEST: ,token   === '' 'token' 
=== TEST: ,,       === '' '' 
=== TEST: ,token,  === '' 'token' '' 
=== TEST: ,,token  === '' 'token' 
=== TEST: token,,  === 'token' '' 
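
One way to read the table above: every run of one or more delimiters acts as a single separator, and a token is emitted on each side of every separator, even when that token is empty. The following is a minimal hand-rolled model of that rule using only the standard library. It is not Boost's implementation, and the name `split_compress_model` is made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of split-with-compression: a run of delimiters is ONE
// separator, and there is always a token before and after each separator.
std::vector<std::string> split_compress_model(const std::string& input,
                                              const std::string& delims) {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    for (;;) {
        // Find the start of the next delimiter run.
        std::size_t end = input.find_first_of(delims, pos);
        if (end == std::string::npos) {
            // No more separators: the rest (possibly empty) is the last token.
            tokens.push_back(input.substr(pos));
            break;
        }
        tokens.push_back(input.substr(pos, end - pos));
        // Skip the whole delimiter run (this is the "compression").
        pos = input.find_first_not_of(delims, end);
        if (pos == std::string::npos) {
            // Input ended inside a delimiter run: the trailing token is empty.
            tokens.emplace_back();
            break;
        }
    }
    return tokens;
}
```

Under this model, blank input contains zero separators and therefore exactly one (empty) token, which matches the first row of the output.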

So, you might fix it by trimming delimiters from front and end and checking that the remaining input is non-empty:

#include <boost/algorithm/string.hpp>
#include <string>
#include <iostream>
#include <iomanip>
#include <vector>

int main() {
    auto const delim = boost::is_any_of(",");

    for (std::string test : {
            "", "token", 
            ",", "token,", ",token", 
            ",,", ",token,", ",,token", "token,,"
        })
    {
        std::cout << "\n=== TEST: " << std::left << std::setw(8) << test << " === ";

        std::vector<std::string> result;

        boost::trim_if(test, delim);
        if (!test.empty())
            boost::split(result, test, delim, boost::algorithm::token_compress_on);

        for (auto& tok : result)
            std::cout << std::quoted(tok, '\'') << " ";
    }
}

Printing:

=== TEST:          === 
=== TEST: token    === 'token' 
=== TEST: ,        === 
=== TEST: token,   === 'token' 
=== TEST: ,token   === 'token' 
=== TEST: ,,       === 
=== TEST: ,token,  === 'token' 
=== TEST: ,,token  === 'token' 
=== TEST: token,,  === 'token' 
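
If you want the same contract (no empty tokens, and an empty vector for blank or delimiter-only input) as a reusable function, it can also be sketched without Boost. This is a different technique from the `trim_if` + `split` workaround above: it simply collects maximal runs of non-delimiter characters. The name `split_nonempty` is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns only the non-empty tokens of `input`,
// treating every character in `delims` as a delimiter.
std::vector<std::string> split_nonempty(const std::string& input,
                                        const std::string& delims) {
    std::vector<std::string> tokens;
    // Skip any leading delimiters.
    std::size_t begin = input.find_first_not_of(delims);
    while (begin != std::string::npos) {
        // The token runs up to the next delimiter or the end of the input.
        std::size_t end = input.find_first_of(delims, begin);
        if (end == std::string::npos) {
            tokens.push_back(input.substr(begin));
            break;
        }
        tokens.push_back(input.substr(begin, end - begin));
        // Skip the delimiter run that follows the token.
        begin = input.find_first_not_of(delims, end);
    }
    return tokens;
}
```

For example, `split_nonempty("", ",")` and `split_nonempty(",,", ",")` both return an empty vector, while `split_nonempty(",token,", ",")` returns a single element, `token`.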

BONUS: Boost Spirit

Using Spirit X3 seems to me more flexible and potentially more efficient:

#include <boost/spirit/home/x3.hpp>
#include <string>
#include <iostream>
#include <iomanip>
#include <vector>

int main() {
    static auto const delim = boost::spirit::x3::char_(",");

    for (std::string test : {
            "", "token", 
            ",", "token,", ",token", 
            ",,", ",token,", ",,token", "token,,"
        })
    {
        std::cout << "\n=== TEST: " << std::left << std::setw(8) << test << " === ";

        std::vector<std::string> result;
        parse(test.begin(), test.end(), -(+~delim) % delim, result);

        for (auto& tok : result)
            std::cout << std::quoted(tok, '\'') << " ";
    }
}
sehe
  • Excellent example; however, I don't understand why you consider this to work consistently. Even more confusing is the fact that the Boost docs say `This function is equivalent to C strtok`, but strtok returns `NULL` for an empty string. – vasek Oct 03 '17 at 08:47
  • @vasek They must have meant functionally equivalent (for one thing, it doesn't modify its input :)). Also, it's consistent in that all tokens are always returned. `compress` just means that a new token starts _after_ all adjacent delimiters. I guess they should have named it **`delimiter_compression_on`** instead – sehe Oct 03 '17 at 08:48
  • Added a [Proof-of-Concept](http://coliru.stacked-crooked.com/a/233ef1bfb565f5c1) of the workaround – sehe Oct 03 '17 at 08:54
  • maybe, but still I would expect to return two empty strings for input `,` and empty vector for empty string (but the "workaround" for this is obvious). – vasek Oct 03 '17 at 09:00
  • Added an alternative using Spirit X3 that looks more elegant to me [Live On Coliru](http://coliru.stacked-crooked.com/a/3815775299c88b47) – sehe Oct 03 '17 at 09:01
  • @vasek I cannot think of a single possible consistent explanation for that behaviour you describe (if empty input results in no tokens, then surely one of the sides of `","` ***must*** also get that treatment, so you'd get at most 1 token). Boost Split is consistent (which is probably also clear by looking at the implementation) – sehe Oct 03 '17 at 09:03
  • Thanks for the Spirit X3 tip, is that the successor or predecessor of Spirit Qi? Gotta marvel at the expressiveness of `-(+~delim) % delim`... What does `~` do, is that documented anywhere? – rustyx Oct 03 '17 at 12:44
  • It is. Docs: http://www.boost.org/doc/libs/1_65_1/libs/spirit/doc/html/spirit/qi/reference/char/char.html#spirit.qi.reference.char.char.expression_semantics (same for X3, strangely missing that bit of docu http://ciere.com/cppnow15/x3_docs/spirit/quick_reference/char.html). – sehe Oct 03 '17 at 12:47