Question on Java RegEx:

I have a tokenizer, and I want it to return only tokens longer than a certain length.

For example, I need to return all tokens longer than 1 character from this text: "This is a text ."

I need to get 3 tokens: "This", "is", and "text". The tokens "a" and "." are not needed. Note that the string can contain any characters, not only alphabetic ones.

I tried this code, but I am not sure how to complete it:

    String lines[]  = {"This is o n e l e tt e r $ % ! sentence"};


    for(String line : lines)
    {
        String orig = line;

        Pattern Whitespace = Pattern.compile("[\\s\\p{Zs}]+");
        line = Whitespace.matcher(orig).replaceAll(" ").trim();
        System.out.println("Test:\t'" + line + "'");

        Pattern SingleWord = Pattern.compile(".+{1}");  //HOW CAN I DO IT?
        SingleWord.matcher(line).replaceAll(" ").trim();
        System.out.println("Test:\t'" + line + "'");



    }

Thanks

Samer Aamar

3 Answers

Why not use \w{2,}, like this:

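import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "This is o n e l e tt e r $ % ! sentence";

// \w{2,} matches runs of two or more word characters (letters, digits, underscore)
Pattern pattern = Pattern.compile("\\w{2,}");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    System.out.println(matcher.group());
}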

Output

This
is
tt
sentence

Edit

Then you could use [A-Za-z0-9_@.-]{2,}, where the character class lists the special characters you want to keep. Alternatively, use [^\s]{2,} or \S{2,}, which match runs of two or more non-whitespace characters:

Input

This is o email@gmail.com n e l e tt e r $ % ! sentence

Output

This
is
email@gmail.com
tt
sentence
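
To make the difference between the two patterns concrete, here is a minimal sketch that runs both on the input above. The class name RegexTokenDemo and the helper findAll are made up for this illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenDemo {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "This is o email@gmail.com n e l e tt e r $ % ! sentence";
        // \w stops at '@' and '.', so the email is split into pieces.
        System.out.println(findAll("\\w{2,}", line));
        // prints [This, is, email, gmail, com, tt, sentence]
        // \S only stops at whitespace, so the email stays in one token.
        System.out.println(findAll("\\S{2,}", line));
        // prints [This, is, email@gmail.com, tt, sentence]
    }
}
```
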
Youcef LAIDANI
  • Because I need characters that are not alphabetic. For example, if the string includes an email address, I need to get that email as a single token. – Samer Aamar May 10 '17 at 16:40

If you use Java 8, you can do it this way:

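import java.util.ArrayList;
import java.util.Arrays;

String line = "This is o n e l e tt e r $ % ! sentence";
ArrayList<String> array = new ArrayList<>(Arrays.asList(line.split(" ")));
array.removeIf(u -> u.length() < 2); // also drops empty strings produced by repeated spaces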

array now contains:

This
is
tt
sentence
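
The same split-and-filter idea can also be written with Java 8 streams. This is only a sketch: the class name StreamTokenFilter and the method longTokens are invented for the example, and it splits on \s+ on the assumption that the input may contain runs of whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamTokenFilter {
    // Hypothetical helper: splits on runs of whitespace and keeps tokens longer than 1 character.
    static List<String> longTokens(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .filter(token -> token.length() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(longTokens("This is o n e l e tt e r $ % ! sentence"));
        // prints [This, is, tt, sentence]
    }
}
```

Unlike removeIf, this leaves the array returned by split untouched and builds a new list.
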
Omar Aflak

I would just use something simple like:

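import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "This is o n e l e tt e r $ % ! sentence";
List<String> words = new LinkedList<>();
Matcher m = Pattern.compile("\\S{2,}").matcher(line);
while (m.find())
{
    words.add(m.group(0));
}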

The \\S (with an uppercase 'S') matches any non-whitespace character, so \\S{2,} matches runs of two or more of them.

Disclaimer: I haven't run this, but it should work (maybe with some minimal alterations)
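
Since the question asks for tokens above "a certain length", the minimum could also be passed in as a parameter. The following is a rough sketch under that assumption; MinLengthTokenizer and tokensAtLeast are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MinLengthTokenizer {
    // Hypothetical helper: returns whitespace-delimited tokens with at least minLength characters.
    static List<String> tokensAtLeast(String line, int minLength) {
        if (minLength < 1) {
            throw new IllegalArgumentException("minLength must be at least 1");
        }
        // Builds a pattern such as \S{2,} from the requested minimum length.
        Pattern pattern = Pattern.compile("\\S{" + minLength + ",}");
        List<String> words = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokensAtLeast("This is a text .", 2)); // prints [This, is, text]
        System.out.println(tokensAtLeast("This is a text .", 3)); // prints [This, text]
    }
}
```
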

TallChuck