I want to extract all words from a Java String.
A word can be written in any European language and does not contain spaces, only alphabetic characters.
It can contain hyphens, though.
If you aren't tied to regular expressions, also have a look at BreakIterator, in particular the getWordInstance() method:
Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.
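As a rough sketch of the BreakIterator approach: the iterator reports boundaries, not words, so the segments between boundaries still have to be filtered. The class name `BreakIteratorWords`, the helper `words`, and the "contains at least one letter" filter are illustrative assumptions of mine, not part of the API. How hyphenated words are segmented depends on the iterator's rules, so check that against your own data.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorWords {

    // Hypothetical helper: returns the segments between word boundaries
    // that contain at least one letter (drops spaces and punctuation).
    static List<String> words(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Assumption: a "word" is any segment with at least one letter in it.
            if (segment.codePoints().anyMatch(Character::isLetter)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String w : words("Hello, wide world!", Locale.ENGLISH)) {
            System.out.println(w);
        }
    }
}
```

Passing a Locale lets the iterator apply language-specific rules where it has them; `Locale.ENGLISH` here is only an example.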
You can use a variation of (?<!\S)\S+(?!\S), i.e. any maximal sequence of non-whitespace characters. (Variations include replacing \S to look for something more specific, such as [A-Za-z-], etc.)
Here's a simple example to illustrate the idea, using [a-z-] as the alphabet character class:
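import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "--xx128736f-afasdf2137asdf-12387-kjs-23xx--";
// Builds (?<![a-z-])[a-z-]+(?![a-z-]) by substituting the character class for "alpha"
Pattern p = Pattern.compile(
    "(?<!alpha)alpha+(?!alpha)".replace("alpha", "[a-z-]")
);
Matcher m = p.matcher(text);
// Print every maximal run of characters from the alphabet class
while (m.find()) {
    System.out.println(m.group());
}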
This prints:
--xx
f-afasdf
asdf-
-kjs-
xx--
For words in other European languages, you may have to use the Unicode character classes, such as \p{L} (any letter), in place of [a-z].
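A minimal sketch of what a Unicode-aware version could look like, using the `\p{L}` (any letter) class from java.util.regex. Note that this is a different pattern from the lookaround one above: it treats a word as runs of letters joined by single hyphens, so leading, trailing, and doubled hyphens are not part of a match. The class name `UnicodeWords` and the helper `extract` are made up for illustration, and the sketch assumes accented letters are stored as single precomposed characters (NFC), since combining marks are not matched by `\p{L}`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWords {

    // One or more letters, optionally followed by hyphen-joined letter runs.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:-\\p{L}+)*");

    // Hypothetical helper: collects every match of WORD in the input.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // The escapes spell e-acute, a-grave and u-umlaut, keeping this source file ASCII-only.
        String text = "Jean-Luc aime l'\u00e9t\u00e9 \u00e0 Z\u00fcrich -- vraiment";
        for (String w : extract(text)) {
            System.out.println(w);
        }
    }
}
```

Because there are no lookarounds, a token like abc123 yields abc; wrap the pattern in (?<!\S) and (?!\S) if only whole whitespace-delimited tokens should count.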
This will match a single word, where a word is any run of non-whitespace characters (digits and punctuation included):
`([^\s]+)`