4

I am writing a function in C++ which should, in theory, take user input and splice this input into segments according to whitespace and return these segments as a vector.

What I am currently doing is by using strtok() on the input string in order to separate the words by whitespace. For each "word", I push it on the buffer vector. After iterating over each word, I return the vector.

So this is the code I have thus far:

#include <iostream>
#include <string>
#include <cstring>
#include <vector>

std::vector<char*> tokenize(std::string input_, char const* delims=" \t\r\n\a")
{
    char* input = (char*)input_.c_str();
    std::vector<char*> tk_stream;
    char* tk = strtok(input, delims);

    while(tk != NULL) {
        tk_stream.push_back(tk);
        tk = strtok(NULL, delims);
    }

    return tk_stream;
}

int main(int argc, char** argv)
{
    while (true) {
        std::string input;
        std::getline(std::cin, input);

        if (input.empty()) {
            continue;
        }

        std::vector<char*> tks = tokenize(input);
        for (char* el : tks) {
            std::cout << el << std::endl;
        }
    }
    return 0;
}

So what is supposed to happen? well, if I have an input of "1 2 3 4" it should print each of those numbers on separate lines. This actually works with that input. But when the length of the input string is greater, for example, "1 2 3 4 5 6 7 8 9", the output is different:

1 2 3 4 5 6 7 8 9




5 
6 
7 
8 
9

It is missing the first 4 numbers! This also happens for any string with a greater length above this and the number of missing numbers is constant. I also noticed this happens with longers sentences. For example "hello everyone this is a test" gives:

hello everyone this is a test
0��

this
is
a
test

I have already done some digging with gdb and found something that is interesting. With the input of "1 2 3 4 5 6 7 8 9", I set up a breakpoint before the 'tk_stream' is returned and checked the value of it:

(gdb) print tk_stream
$1 = std::vector of length 9, capacity 16 = {0x6176c0 "1", 0x6176c2 "2", 0x6176c4 "3", 0x6176c6 "4", 0x6176c8 "5", 0x6176ca "6", 0x6176cc "7", 0x6176ce "8", 0x6176d0 "9"}

This seems correct. But after I step a few lines when this is returned from the function and check the value of 'tks' (the vector which should contain the returned value of the 'tokenize' function); I receive this:

(gdb) print tks
$2 = std::vector of length 9, capacity 16 = {0x6176c0 "", 0x6176c2 "a", 0x6176c4 "", 0x6176c6 "", 0x6176c8 "5", 0x6176ca "6", 0x6176cc "7", 0x6176ce "8", 0x6176d0 "9"}

which is missing the first 4 entries with a garbled 2nd entry.

So something must happen in the returning of the 'tk_stream' vector.

What is the reason for this abnormal behavior? How can I fix this so no elements of the vector are deleted?

yrmy
  • 43
  • 3

2 Answers2

3

You don't want to be using raw pointers like char*, use std::string instead.

Something like:

std::vector<std::string> tokenize(std::string input_, const std:string delims=" \t\r\n\a")
{
    std::string input = input_;
    std::vector<std::string> tk_stream;
    // ...
Paul Evans
  • 27,315
  • 3
  • 37
  • 54
  • Thank you, but sorry, I forgot to mention in the initial post that I do prefer having the output as char* as opposed to std::string because this code will be used a lot with old C functions. I already fixed the program with UnholySheep's suggestion. – yrmy May 01 '19 at 11:37
  • 1
    That doesn't preclude using a vector of std::string. Just call .c_str() when you want to pass it to C functions. – Andre Kostur May 01 '19 at 14:04
1

You're passing your string by value into your tokenize function. You then call c_str() on that local string object and store pointers into that space into your vector. Your function then returns, and with it the storage in the local string object. Which now means that all of the pointers that you've stored into the vector are now all dangling pointers. Deferencing any of them is Undefined Behaviour.

It "appears to work" for short strings (likely string < 16 characters long) due to something called the Short String Optimization. Many implementations of std::string has a small buffer (a common size is 16 bytes, but it is not defined by Standard) inside the std::string object itself. Once the string gets longer than that, std::string will dynamically allocate a buffer to hold the string. When you use the short string, your dangling pointers are pointing into your stack, and your data has not yet been overwritten there. When you use a long string, your pointers are pointing into some arbitrary place in memory, which may have been overwritten by something else.

Oh, to fix, pass your std::string by reference: const std::string & input_.

Andre Kostur
  • 770
  • 1
  • 6
  • 15