I am writing a function in C++ which should, in theory, take user input and splice this input into segments according to whitespace and return these segments as a vector.
What I am currently doing is by using strtok() on the input string in order to separate the words by whitespace. For each "word", I push it on the buffer vector. After iterating over each word, I return the vector.
So this is the code I have thus far:
#include <iostream>
#include <string>
#include <cstring>
#include <vector>
std::vector<char*> tokenize(std::string input_, char const* delims=" \t\r\n\a")
{
char* input = (char*)input_.c_str();
std::vector<char*> tk_stream;
char* tk = strtok(input, delims);
while(tk != NULL) {
tk_stream.push_back(tk);
tk = strtok(NULL, delims);
}
return tk_stream;
}
int main(int argc, char** argv)
{
while (true) {
std::string input;
std::getline(std::cin, input);
if (input.empty()) {
continue;
}
std::vector<char*> tks = tokenize(input);
for (char* el : tks) {
std::cout << el << std::endl;
}
}
return 0;
}
So what is supposed to happen? well, if I have an input of "1 2 3 4" it should print each of those numbers on separate lines. This actually works with that input. But when the length of the input string is greater, for example, "1 2 3 4 5 6 7 8 9", the output is different:
1 2 3 4 5 6 7 8 9
5
6
7
8
9
It is missing the first 4 numbers! This also happens for any string with a greater length above this and the number of missing numbers is constant. I also noticed this happens with longers sentences. For example "hello everyone this is a test" gives:
hello everyone this is a test
0��
this
is
a
test
I have already done some digging with gdb and found something that is interesting. With the input of "1 2 3 4 5 6 7 8 9", I set up a breakpoint before the 'tk_stream' is returned and checked the value of it:
(gdb) print tk_stream
$1 = std::vector of length 9, capacity 16 = {0x6176c0 "1", 0x6176c2 "2", 0x6176c4 "3", 0x6176c6 "4", 0x6176c8 "5", 0x6176ca "6", 0x6176cc "7", 0x6176ce "8", 0x6176d0 "9"}
This seems correct. But after I step a few lines when this is returned from the function and check the value of 'tks' (the vector which should contain the returned value of the 'tokenize' function); I receive this:
(gdb) print tks
$2 = std::vector of length 9, capacity 16 = {0x6176c0 "", 0x6176c2 "a", 0x6176c4 "", 0x6176c6 "", 0x6176c8 "5", 0x6176ca "6", 0x6176cc "7", 0x6176ce "8", 0x6176d0 "9"}
which is missing the first 4 entries with a garbled 2nd entry.
So something must happen in the returning of the 'tk_stream' vector.
What is the reason for this abnormal behavior? How can I fix this so no elements of the vector are deleted?