Below are the intended output and the actual output I got from this line of code:

ArrayList<String> nodes = new ArrayList<String>(Arrays.asList(str.split("(?i:"+word+")"+"[.,!?:;]?")));

on this input:

input : "Cow shouts COW! other cows shout COWABUNGA! stupid cow."

The string should be split into an ArrayList at each acceptable variant of "cow".

Actual Output (from the line above) :
   ArrayList nodes = {, shouts , other , s shout ,ABUNGA! stupid }

vs

Intended Output :
   ArrayList nodes = {, shouts , other cows shout COWABUNGA! stupid }

What I'm trying to achieve :

  1. Case insensitive search. (ACHIEVED)
  2. Account for an optional punctuation mark from ".,:;!?" directly after the word being split on, hence "[.,!?:;]?". (ACHIEVED)
  3. Split only on the exact word, optionally followed by "[.,!?:;]?". It must not split at "cows" or "COWABUNGA!". (NOT ACHIEVED, need help)
  4. Collect the matched variants {Cow, COW!, cow.} into another ArrayList for use later in the method. (IN PROGRESS)

As you can see, I have achieved 1 and 2, and I am posting this question while I work on 4. I know this can be solved with a few extra lines, but I'd like to keep it minimal and efficient.

UPDATE : I found that "{"+input.length+"}" can limit matches by length, but I don't know whether it will work here.

All help will be appreciated. I apologize if this question is too trivial but alas, I am new. Thanks in advance!

  • What do you mean by 'maintains the original " " characters'? As separate tokens? Or attached spaces following or preceding a word? – BillRobertson42 Mar 17 '16 at 21:49
  • As attached spaces following or preceding a word. The examples are in the outputs. I achieved what I wanted (1.). The problem is (3.). Sorry if this is confusing. – Syukri Shukor Mar 17 '16 at 21:53

2 Answers

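The following code produces the output you specified for your input, except that the whitespace next to each match is consumed by the boundary groups, so the pieces come back without their surrounding spaces. I have broken the regular expression into named components so that each part is self-explanatory.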

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class Moo {

    public static void main(String[] args) {
        String input = "Cow shouts COW! other cows shout COWABUNGA! stupid cow.";
        System.out.println(splitter(input, "cow"));
    }

    public static List<String> splitter(String input, String word) {
        String beginningOfInputOrWordBoundary = "(\\A|\\W)";
        String caseInsensitiveWord = "(?i:"+Pattern.quote(word)+")";
        String optionalPunctuation = "\\p{Punct}?";
        String endOfInputOrWordBoundary = "(\\z|\\W)";
        String regex = 
                beginningOfInputOrWordBoundary +
                caseInsensitiveWord +
                optionalPunctuation +
                endOfInputOrWordBoundary;
        return Arrays.asList(input.split(regex));
    }
}

Resulting output:

[, shouts, other cows shout COWABUNGA! stupid]
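If the surrounding spaces should stay in the pieces (as in the intended output of the question), and the matched variants should also be collected (point 4), one possible variation is to use zero-width \b word boundaries instead of consuming groups. This is only a sketch: the class name MooBoundary and the helpers wordPattern, pieces and matches are made up for illustration, and it assumes the search word starts and ends with a letter or digit, since \b only works between a word character and a non-word character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MooBoundary {

    // Whole word only (\b on both sides), case-insensitive,
    // followed by at most one of the punctuation marks from the question.
    // Assumption: 'word' begins and ends with a word character.
    static Pattern wordPattern(String word) {
        return Pattern.compile("\\b(?i:" + Pattern.quote(word) + ")\\b[.,!?:;]?");
    }

    // The text between matches; \b is zero-width, so spaces are kept.
    static List<String> pieces(String input, String word) {
        return Arrays.asList(wordPattern(word).split(input));
    }

    // The matched variants themselves, in order of appearance.
    static List<String> matches(String input, String word) {
        List<String> found = new ArrayList<>();
        Matcher m = wordPattern(word).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Cow shouts COW! other cows shout COWABUNGA! stupid cow.";
        // pieces: "", " shouts ", " other cows shout COWABUNGA! stupid "
        System.out.println(pieces(input, "cow"));
        // matches: "Cow", "COW!", "cow."
        System.out.println(matches(input, "cow"));
    }
}
```

Because the boundaries match without consuming any characters, "cows" and "COWABUNGA!" are skipped and the spaces around each match remain attached to the neighbouring pieces.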
BillRobertson42

A word is a sequence of letters. Any character that is not a letter implies the end of a word.

Thus, this should provide the desired result:

(?i:Cow)[^\\p{IsAlphabetic}]
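To see how this pattern behaves with split, here is a small sketch. The class name NonLetterSplit and the helper splitOn are hypothetical, and Pattern.quote is added only so that the word is treated literally. Note two assumptions built into the pattern: the character after the word is consumed as part of the delimiter, and nothing is checked before the word, so the second call below shows a match inside "moscow" and a miss on a final "cow" that has no character after it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class NonLetterSplit {

    // Case-insensitive word followed by exactly one non-alphabetic character.
    static List<String> splitOn(String input, String word) {
        String regex = "(?i:" + Pattern.quote(word) + ")[^\\p{IsAlphabetic}]";
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        // pieces: "", "shouts ", " other cows shout COWABUNGA! stupid "
        System.out.println(splitOn(
                "Cow shouts COW! other cows shout COWABUNGA! stupid cow.", "cow"));

        // pieces: "mos", "cow"
        // ("cow " inside "moscow " matches; the last "cow" has nothing after it)
        System.out.println(splitOn("moscow cow", "cow"));
    }
}
```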
Tobb