18

In my C++ code, I want to read from a text file (*.txt) and tokenize every entry. More specifically, I want to be able to read individual words from a file, such as "format", "stack", "Jason", "europe", etc.

I chose to use fstream to perform this task, but I do not know how to set its delimiters to the ones I want to use (space, \n, as well as hyphens and even apostrophes, as in "McDonald's"). I figured space and \n are the default delimiters and hyphens are not, but I want to treat hyphens as delimiters too, so that when parsing the file I get the words in "blah blah xxx animal--cat" as simply "blah", "blah", "xxx", "animal", "cat".

That is, I want to be able to get two strings from "stack-overflow", "you're", etc, and still be able to maintain \n and space as delimiters at the same time.

FrozenLand

2 Answers

25

An istream treats "white space" as delimiters. It uses a locale to tell it what characters are white space. A locale, in turn, includes a ctype facet that classifies character types. Such a facet could look something like this:

#include <locale>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <vector>
#include <sstream>

class my_ctype : public std::ctype<char>
{
    mask my_table[table_size];
public:
    my_ctype(size_t refs = 0)  
        : std::ctype<char>(&my_table[0], false, refs)
    {
        std::copy_n(classic_table(), table_size, my_table);
        my_table['-'] = (mask)space;
        my_table['\''] = (mask)space;
    }
};

And a little test program to show it works:

int main() {
    std::istringstream input("This is some input from McDonald's and Burger-King.");
    std::locale x(std::locale::classic(), new my_ctype);
    input.imbue(x);

    std::copy(std::istream_iterator<std::string>(input),
        std::istream_iterator<std::string>(),
        std::ostream_iterator<std::string>(std::cout, "\n"));

    return 0;
}

Result:

This
is
some
input
from
McDonald
s
and
Burger
King.

istream_iterator<string> uses >> to read the individual strings from the stream, so if you use them directly, you should get the same results. The parts you need to include are creating the locale and using imbue to make the stream use that locale.
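Applied to a file stream, the same idea might look like the sketch below. It repeats the facet under the hypothetical name `word_ctype` so that it compiles on its own, and `read_words` is an illustrative helper, not part of the standard library. It uses `std::copy` rather than `std::copy_n` so that older compilers can build it too.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Same facet as above (hypothetical name), repeated so this sketch is self-contained.
class word_ctype : public std::ctype<char> {
    mask table_[table_size];
public:
    explicit word_ctype(std::size_t refs = 0)
        : std::ctype<char>(table_, false, refs)
    {
        // Start from the classic "C" table, then mark '-' and '\'' as white space.
        std::copy(classic_table(), classic_table() + table_size, table_);
        table_['-'] = static_cast<mask>(space);
        table_['\''] = static_cast<mask>(space);
    }
};

// Hypothetical helper: read every word from the named file, splitting on
// white space, '-' and '\''.
std::vector<std::string> read_words(const std::string& path)
{
    std::ifstream input(path.c_str());
    // The locale takes ownership of the facet and deletes it when no longer used.
    input.imbue(std::locale(std::locale::classic(), new word_ctype));

    std::vector<std::string> words;
    std::string word;
    while (input >> word)   // each >> extracts the next token
        words.push_back(word);
    return words;
}
```

Calling `read_words("book.txt")` on a file containing `blah blah xxx animal--cat` would return the five words blah, blah, xxx, animal, cat. The important part is that `imbue` is called on the stream before the first `>>`.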

Jerry Coffin
  • So are you using visual studio? I put the code in visual studio (properly) and it doesnt compile... – FrozenLand Apr 29 '12 at 22:16
  • @user1348863: Yes, I tested it with Visual Studio 10. – Jerry Coffin Apr 29 '12 at 22:17
  • Excellent! N.B: [**`std::copy_n()`**](http://en.cppreference.com/w/cpp/algorithm/copy_n) is a C++11ism. Older compilers will need `std::copy(classic_table(), classic_table() + table_size, my_table);` (or similar). – johnsyweb Apr 29 '12 at 22:17
  • Yes, if you're using an older compiler you can use `std::copy(classic_table(), classic_table()+table_size, my_table);` instead of `std::copy_n`. – Jerry Coffin Apr 29 '12 at 22:19
  • 1>------ Build started: Project: Lab5, Configuration: Debug Win32 ------ 1>Build started 4/29/2012 6:24:02 PM. 1>InitializeBuildStatus: 1> Touching "Debug\Lab5.unsuccessfulbuild". 1>ClCompile: 1> test2.cpp 1>e:\visual studio\lab5\lab5\test2.cpp(22): error C2628: 'my_ctype' followed by 'int' is illegal (did you forget a ';'?) 1>e:\visual studio\lab5\lab5\test2.cpp(22): error C3874: return type of 'main' should be 'int' instead of 'my_ctype' 1> 1>Build FAILED. 1> 1>Time Elapsed 00:00:00.75 ========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ========== – FrozenLand Apr 29 '12 at 22:24
  • @user1348863: It *sounds* like you missed copying/pasting the semicolon at the end of the definition of the `my_ctype` class (and the error message is telling you that fairly directly). – Jerry Coffin Apr 29 '12 at 22:26
  • This compiles and runs without any modification using GCC (gcc-4.5.1). [See it here](http://ideone.com/6a0CT)! – johnsyweb Apr 29 '12 at 22:33
  • yes it now compiles and runs. I am trying to incorporate this into my own codes. So how do I do something like : string word_holder; ifstream input('book.txt'); input>>word_holder; input>>word_holder ...... and so on, whenever i want the next token. – FrozenLand Apr 29 '12 at 22:42
  • @user1348863: You have to use `imbue` to tell that stream to use the locale. After that, `your_stream >> your_string` should read a token, treating either `-` or `'` as a delimiter. – Jerry Coffin Apr 29 '12 at 22:50
  • Is it possible to specify delimiters to istream in such a way that they are *not* thrown away? I.e., if one wants to read "{{a,bb},{ccc,dddd}}" as "{ { a , bb } , { ccc , ddd } }"? Standard istream doesn't stop reaching a whitespace, but the above throws away all the delimiters, resulting in "a bb ccc dddd". – James Apr 20 '15 at 19:48
  • @James: sorry, but no. For something like that, you'll probably have to do (at least some of) the work yourself, such as with Boost Spirit. – Jerry Coffin Apr 20 '15 at 19:55
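
For the delimiter-preserving case raised in the two comments above, doing the work yourself could be as small as the following sketch. `split_keep_delims` is a hypothetical helper written for illustration. It emits each delimiter character as its own token and skips white space, and it is not a substitute for a real parser such as Boost Spirit.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split `text` into tokens. Every character found in
// `delims` becomes its own one-character token; white space only separates.
std::vector<std::string> split_keep_delims(const std::string& text,
                                           const std::string& delims)
{
    std::vector<std::string> tokens;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        const bool is_delim = delims.find(c) != std::string::npos;
        const bool is_space = std::isspace(static_cast<unsigned char>(c)) != 0;
        if (is_delim || is_space) {
            if (!current.empty()) {          // finish the word in progress
                tokens.push_back(current);
                current.clear();
            }
            if (is_delim)
                tokens.push_back(std::string(1, c));  // keep the delimiter itself
        } else {
            current += c;
        }
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}
```

With `"{},"` as the delimiter set, the input `{{a,bb},{ccc,dddd}}` yields the tokens `{ { a , bb } , { ccc , dddd } }`.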
2

You can use

istream::getline(char* buffer, streamsize maxchars, char delim)

although this only supports a single delimiter. To further split the lines on your different delimiters, you could use

char* strtok(char* inString, const char* delims)  

which takes multiple delimiters. With strtok you pass the address of your buffer only on the first call; on later calls pass a null pointer and it returns the next token after the one it last returned. It returns a null pointer when there are no more tokens.

EDIT: A specific implementation would be something like

// needs <cstring> for strtok and <iostream> for cout
char buffer[120]; // this size depends on the longest line you expect the file to contain
// Loop on getline itself: it fails at end of file, and also if a line does not fit in the buffer
while (myIstream.getline(buffer, sizeof buffer)) // using default delimiter of \n
{
    char* tokBuffer = strtok(buffer, "'- ");
    while (tokBuffer != NULL) {
        cout << "token is: " << tokBuffer << "\n";
        tokBuffer = strtok(NULL, "'- "); // no need to pass the buffer again: strtok remembers it from the first call
    }
}
QuantumRipple
  • So could you be more specific? Let's say I want to read stack-overflow as two separate words stack and overflow, how do I do this? (I still need to use space and \n as delimiters at the same time.) Also, like, Let's into let and s. thank you! – FrozenLand Apr 29 '12 at 22:01
  • Edited version should tokenize on \n, ', -, and space. – QuantumRipple Apr 29 '12 at 22:17
  • Looks good, but what if my file is *.txt of 1MB? what do I put in place of 120? – FrozenLand Apr 29 '12 at 22:27
  • Do you have any limit on the line length? (or you might want to have the getLine tokenize on space) – QuantumRipple Apr 29 '12 at 22:59