This is another question that I can't seem to find an answer to because every example I can find uses vectors and my teacher won't let us use vectors for this class.

I need to read in a plain-text version of a book one word at a time, using (any number of) blank spaces ' ' and (any number of) non-letter characters as delimiters, so any run of spaces or punctuation needs to separate words. Here's how I did it when blank spaces were the only delimiter:

while(getline(inFile, line)) {
    istringstream iss(line);

    while (iss >> word) {
        table1.addItem(word);
    }
}

EDIT: An example of text read in, and how I need to separate it.

"If they had known;; you wished it, the entertainment.would have"

Here's how the first line would need to be separated:

If

they

had

known

you

wished

it

the

entertainment

would

have

The text will contain at the very least all standard punctuation, but also such things as ellipses ... double dashes -- etc.

As always, thanks in advance.

EDIT:

So using a second stringstream would look something like this?

while(getline(inFile, line)) {
    istringstream iss(line);

    while (iss >> word) {
        istringstream iss2(word);

        while(iss2 >> letter)  {
            if(!isalpha(letter))
                // do something?
        }
        // do something else?
        table1.addItem(word);
    }
}
user3776749
  • Use stream to extract one word ignoring whitespace (the default). Then put in a new stringstream and extract 1 char at a time using `std::isalnum` to test if the character should be stored. Or use `remove_if` on a string. – Neil Kirk Nov 23 '14 at 23:13
  • @Neil Kirk Original Post Edited. How would I discard/re-store each character once I've determined if it's a letter or not? – user3776749 Nov 23 '14 at 23:27
  • Don't add it to the output string if it's not alnum. Letter must be char – Neil Kirk Nov 23 '14 at 23:52

2 Answers


I haven't tested this, as I don't have a compiler in front of me right now, but it should work (aside from minor C++ syntax errors):

while (getline(inFile, line))
{
    istringstream iss(line);

    while (iss >> word)
    {
        // strip every non-alphanumeric character from word
        word.erase(std::remove_if(word.begin(), word.end(), 
                                  [](char& c){return !isalnum(c);}),
                   word.end());
        if (word != "")
            table1.addItem(word);
    }
}
vsoftco
  • This seems to work, though I haven't done a stress test yet. I think this would be a safer bet since it only requires . I do have one question though, could you explain exactly what is happening here: `[](char& c){return !isalnum(c);}` I have a decent idea and I recognize the various parts, but I don't have the context to place exactly what it's doing. – user3776749 Nov 24 '14 at 00:00
  • @user3776749 actually it doesn't really work, as if the string is just something like "test.;works", then the snippet removes the `.;` from it and spits out "testworks" in a single word. The function above is called a lambda function (C++11), and returns true whenever a character is not alphanumeric. I guess the best bet is to write your own tokenizer (or use Boost), although writing your own shouldn't be too much of a pain. For fun I wrote myself a tokenizer, and it's really simple, see: https://github.com/vsoftco/tokenizer/blob/master/src/token.cpp It gives you a general idea. – vsoftco Nov 24 '14 at 00:03
  • @user3776749 So what you should do is to read the `word`, start parsing it and find the first char that is not alphanumeric, add the word, then find the first char that IS alphanumeric, and keep repeating until the end of `word`. – vsoftco Nov 24 '14 at 00:08
  • I found that error in my testing, but in retrospect I don't believe it will be an issue. Since it is a book written in American English, two distinct words will always have at least one space between them. This will also correctly process contractions, e.g. can't, isn't, etc. Thanks for all your help! – user3776749 Nov 24 '14 at 00:09
  • @user3776749 yeah, if there is a whitespace guaranteed, then all pain is gone :) – vsoftco Nov 24 '14 at 00:09
  • I will note your tokenizer as well though. Just another tool in the toolbox! – user3776749 Nov 24 '14 at 00:11
  • @user3776749 make sure that it's ok, since I think it also removes `'` from `can't` – vsoftco Nov 24 '14 at 00:16
  • It does, but that's fine for my purposes. – user3776749 Nov 24 '14 at 00:19

If you are free to use Boost, you can do the following:

$ cat kk.txt
If they had known;; you ... wished it, the entertainment.would have

You can customize the behavior of tokenizer if needed but the default should be sufficient.

#include <iostream>
#include <fstream>
#include <string>

#include <boost/tokenizer.hpp>

int main()
{
  std::ifstream is("./kk.txt");
  std::string line;

  while (std::getline(is, line)) {
    boost::tokenizer<> tokens(line);

    for (const auto& word : tokens)
      std::cout << word << '\n';
  }

  return 0;
}

And finally, the output:

$ ./a.out
If
they
had
known
you
wished
it
the
entertainment
would
have
Jiří Pospíšil
  • This is an interesting solution, and I will save it for future use, but to ensure my teacher doesn't make a fuss I'd like to stick to solutions that only require very basic function libraries. – user3776749 Nov 24 '14 at 00:05