Use RegEx to (un)match all words with length above a specific value

Question

Question on Java RegEx:

I have a tokenizer, and I want it to return only tokens longer than a certain length.

For example, I need to return all tokens longer than 1 character in this text: "This is a text ."

I need to get 3 tokens: "This", "is", "text". The tokens "a" and "." are not needed. Note that the string can contain any character (not only alphabetic characters).

I tried this code, but I am not sure how to complete it:

    String lines[]  = {"This is o n e l e tt e r $ % ! sentence"};


    for(String line : lines)
    {
        String orig = line;

        Pattern Whitespace = Pattern.compile("[\\s\\p{Zs}]+");
        line = Whitespace.matcher(orig).replaceAll(" ").trim();
        System.out.println("Test:\t'" + line + "'");

        Pattern SingleWord = Pattern.compile(".+{1}");  //HOW CAN I DO IT?
        SingleWord.matcher(line).replaceAll(" ").trim();
        System.out.println("Test:\t'" + line + "'");



    }

Thanks

In your example, why is the dot separated from "text"? There is no space in between. – Omar Aflak, May 10 '17 at 16:28
Thanks Wiktor... what does p mean? Can you post your answer with some more explanation, please? – Samer Aamar, May 10 '17 at 16:38
In fact, there are a few good answers below; this is one of them. Thanks. – Samer Aamar, May 10 '17 at 20:20

Youcef LAIDANI · Accepted Answer · 2017-05-10T16:49:43.747

2

Why not use \w{2,}, like this:

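// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String line = "This is o n e l e tt e r $ % ! sentence";

// \w{2,} matches runs of two or more word characters (letters, digits, underscore)
Pattern pattern = Pattern.compile("\\w{2,}");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    System.out.println(matcher.group());
}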

Output

This
is
tt
sentence

Edit

Then you could use [A-Za-z0-9_@.-]{2,}, listing inside the character class the special characters you want to keep. Alternatively, use [^\s]{2,} or \S{2,}, which match runs of two or more non-whitespace characters:

Input

This is o email@gmail.com n e l e tt e r $ % ! sentence

Output

This
is
email@gmail.com
tt
sentence
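
To make the difference between the two patterns concrete, here is a minimal sketch that runs both on the input above. The class name TokenPatternDemo and the helper findAll are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenPatternDemo {

    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "This is o email@gmail.com n e l e tt e r $ % ! sentence";

        // \w stops at '@' and '.', so the email is split into pieces.
        System.out.println(findAll("\\w{2,}", line));
        // prints [This, is, email, gmail, com, tt, sentence]

        // \S only stops at whitespace, so the email stays in one token.
        System.out.println(findAll("\\S{2,}", line));
        // prints [This, is, email@gmail.com, tt, sentence]
    }
}
```

If the email has to survive as a single token, \S{2,} is probably the simpler choice, since it does not require listing every allowed special character.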

edited May 10 '17 at 16:49

answered May 10 '17 at 16:32

Youcef LAIDANI


Because I need characters that are not alphabetic. For example, if the string includes an email address, I need to get the email as a single token eventually. – Samer Aamar May 10 '17 at 16:40

score 1 · Answer 2 · answered May 10 '17 at 16:39

If you use Java 8, you can do it this way:

String line = "This is o n e l e tt e r $ % ! sentence";
ArrayList<String> array = new ArrayList<>(Arrays.asList(line.split(" ")));
array.removeIf(u -> u.length() == 1);

array now contains:

This
is
tt
sentence
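
A possible generalization of this idea, sketched with streams: the minimum length becomes a parameter, and the line is split on runs of whitespace, so the empty strings that split(" ") produces between consecutive spaces never appear. The class name MinLengthFilter, the method tokensAtLeast, and the parameter minLength are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MinLengthFilter {

    // Hypothetical helper: keeps only the tokens with at least minLength characters.
    static List<String> tokensAtLeast(String line, int minLength) {
        return Arrays.stream(line.trim().split("\\s+")) // split on runs of whitespace
                .filter(token -> token.length() >= minLength)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String line = "This is o n e l e tt e r $ % ! sentence";
        System.out.println(tokensAtLeast(line, 2));
        // prints [This, is, tt, sentence]
    }
}
```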

score 0 · Answer 3 · answered May 10 '17 at 16:32

I would just use something simple like

List<String> words = new LinkedList<String>();
Matcher m = Pattern.compile("\\S{2,}").matcher(line);
while(m.find())
{
    words.add(m.group(0));
}

\\S (with an uppercase 'S') matches any non-whitespace character, and {2,} requires at least two of them in a row.

Disclaimer: I haven't run this, but it should work (maybe with some minimal alterations)
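
One caveat worth sketching: the question normalizes whitespace with [\\s\\p{Zs}]+, where \p{Zs} is the Unicode "space separator" category (it includes the no-break space, for example). By default, Java's \S treats such characters as non-whitespace, so a token pattern can negate both classes instead. The class name UnicodeSpaceTokenizer, the method tokens, and the parameter minLength below are assumptions made for illustration, not part of any answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeSpaceTokenizer {

    // Hypothetical helper: returns tokens of at least minLength characters,
    // treating both \s whitespace and Unicode space separators (\p{Zs}) as delimiters.
    // Assumes minLength >= 1.
    static List<String> tokens(String line, int minLength) {
        Pattern pattern = Pattern.compile("[^\\s\\p{Zs}]{" + minLength + ",}");
        Matcher matcher = pattern.matcher(line);
        List<String> words = new ArrayList<>();
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // "This" and "is" are separated by a no-break space, not a regular space.
        String line = "This\u00A0is a text .";
        System.out.println(tokens(line, 2));
        // prints [This, is, text]
    }
}
```

With plain \\S{2,}, the first two words of that input would come back as one token, because the no-break space is not matched by \s under Java's default settings.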
