Regular expression to retrieve words from files

Question

I have a set of files in a particular directory.

After retrieving the contents of all the files (text files) in the directory, I have a List of Strings.

Each string element represents the content retrieved from one file, so the first String element in the list represents the content of the first file.

Now I want to split each string into words (the words will later be stored in an array of strings):

1) Words can be separated by a single space or multiple spaces.
2) Sentences end with a '.', so a new word can start after a '.'.
3) A new word can start after a '\n'.

So can anyone suggest a regular expression that would fit into the split() method?

This is probably a very similar question: http://stackoverflow.com/questions/2159026/regex-how-to-get-words-from-a-string-c – wlk, Apr 13 '12 at 10:58

score 4 · Answer 1 · answered Apr 13 '12 at 10:58

Perhaps the StringTokenizer class is a better fit for your need. The constructor takes the string to tokenize and a list of delimiters (in your case: space, ., and line break).
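A minimal sketch of what this could look like, assuming the delimiters are exactly space, full stop, and the line-break characters (`\r` is included here as an assumption, to cover Windows line endings). The class and method names are hypothetical. Note that the Javadoc describes `StringTokenizer` as a legacy class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerWords {
    // Every character in this string is treated as a delimiter (assumed set).
    private static final String DELIMITERS = " .\r\n";

    // Hypothetical helper: returns the words found in one file's content.
    static List<String> words(String content) {
        List<String> result = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(content, DELIMITERS);
        // Runs of consecutive delimiters are skipped, so no empty tokens appear.
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [First, sentence, Second, one, Third]
        System.out.println(words("First  sentence.Second one.\nThird"));
    }
}
```

Calling `words(...)` once per element of the list of file contents would give one word list per file.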

Mathias Schwarz

According to the `StringTokenizer` javadocs: `StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.` – John B Apr 13 '12 at 11:17

score 1 · Answer 2 · answered Apr 13 '12 at 11:19

String[] result = myString.split("[.\\s]+");
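A hedged sketch of how such a pattern might be applied to one file's content. It uses a character class followed by `+`, so that a run of spaces, or a full stop followed by a space, counts as a single delimiter. `split()` can still return an empty first element when the content starts with a delimiter, so the sketch filters empty strings out. The class and method names are hypothetical.

```java
import java.util.Arrays;

class SplitWords {
    // Hypothetical helper: splits one file's content into words.
    static String[] words(String content) {
        return Arrays.stream(content.split("[.\\s]+"))
                // Drop the empty element produced by a leading delimiter
                // (or by an entirely empty input string).
                .filter(word -> !word.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // prints [one, two, three, four]
        System.out.println(Arrays.toString(words("  one  two.three\nfour. ")));
    }
}
```

Trailing empty strings need no special handling, because `split()` with a single argument discards them.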

John B

score 0 · Answer 3 · answered Apr 13 '12 at 10:58

You probably don't need a regexp for this; just remove every non-letter character from the file and use a tokenizer to read each word.

wlk

`-` is a non-letter character. Doesn't seem like it should be removed. Also, if you remove all non-letter characters you end up with one single really long word. – John B Apr 13 '12 at 11:13
I retrieved the file contents as a list of strings in which each element represents the contents of an individual file. Now if I use delimiters in the split() method, what should I put in place of the delimiters? – Rahul Raj Apr 13 '12 at 11:15
@John B, Wojtek was probably suggesting detecting word boundaries by looking for non-letter characters. – Rahul Raj Apr 13 '12 at 11:17
Obviously you must not remove whitespace, but this was my general idea of how to resolve the issue. – wlk Apr 13 '12 at 12:30

score -1 · Answer 4 · answered Apr 13 '12 at 11:03

I would suggest using tokens for this ... simply go through each character and decide what to do based on what the character is. Here's the pseudo-code

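string word = "";

while ( not EOF ){

    char = getNextChar()

    if ( char is not a space, newline or full-stop ){
        append the char to the word
    }
    else if ( the word is not empty ){   /* empty word: repeated delimiter, ignore */
        add the word to an array of words
        reset the word to ""
    }
}

/* flush the last word if the input did not end with a delimiter */
if ( the word is not empty ){ add the word to an array of words }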

This way, you have complete control over how the data is processed, and you don't have to worry about which odd cases to cover in the regex rule. Most of all, this is the most efficient way (definitely better than a regex), since you make only a single pass through the data.
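For illustration, the pseudo-code could be sketched in Java roughly as follows. This is one possible reading, not a definitive implementation: it assumes the whole file content is already in a String (as in the question), treats any whitespace character or full stop as a delimiter, and adds a final step that flushes the last word. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CharScanner {
    // Hypothetical helper: single pass over the content, character by character.
    static List<String> words(String content) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c != '.' && !Character.isWhitespace(c)) {
                // Still inside a word: keep collecting characters.
                word.append(c);
            } else if (word.length() > 0) {
                // Delimiter after a word: store the word and start a new one.
                words.add(word.toString());
                word.setLength(0);
            }
            // Delimiter with an empty word (e.g. repeated spaces) is ignored.
        }
        // Flush the last word if the content did not end with a delimiter.
        if (word.length() > 0) {
            words.add(word.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Hello, world, Foo, bar]
        System.out.println(words("Hello   world.Foo\nbar"));
    }
}
```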

There are a lot of tools already written for doing this. I would not encourage reinventing the wheel. – John B, Apr 13 '12 at 11:12
