Find space separated names using Apache OpenNLP

Question

I am using the NER component of Apache OpenNLP and have successfully trained a model on my custom data. When using the name finder, I split the input string on whitespace and pass the resulting string array, as shown below.

NameFinderME nameFinder = new NameFinderME(model);   
String []sentence = input.split(" "); //eg:- input = Give me list of test case in project X
Span nameSpans[] = nameFinder.find(sentence);

With this split, test and case end up as separate tokens, and the name finder never detects them as a single name. How can I overcome this? Is there a way to pass the complete string (without splitting it into an array) so that test case is treated as a single unit?

Iakovos · Answer 1 · 2017-02-02T10:49:37.463

You can do it using regular expressions. Try replacing the second line with this:

String []sentence = input.split("\\s(?<!(\\stest\\s(?=case\\s)))");

Maybe there is a better way to write the expression, but this works for me and the output is:

Give
me
list
of
test case
in
project
X

EDIT: If you are interested in the details, you can see exactly where the split happens here: https://regex101.com/r/6HLBnL/1
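
One caveat with the expression above: it requires a whitespace character before test and after case, so the phrase is still split when it is the very first or very last thing in the input. Below is a minimal sketch of a variant, assuming word boundaries (\b) are an acceptable replacement for those whitespace checks. The class name BoundarySplitDemo and the method splitKeepingPhrase are made up for illustration.

```java
import java.util.Arrays;

class BoundarySplitDemo {

    // Same idea as the regex above, but \b (word boundary) replaces the
    // surrounding \s checks, so "test case" may also start or end the input.
    private static final String REGEX = "\\s(?<!\\btest\\s(?=case\\b))";

    // Hypothetical helper: splits on whitespace except between "test" and "case".
    static String[] splitKeepingPhrase(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        // prints [test case, in, project, X]
        System.out.println(Arrays.toString(splitKeepingPhrase("test case in project X")));
        // prints [Give, me, list, of, test case]
        System.out.println(Arrays.toString(splitKeepingPhrase("Give me list of test case")));
    }
}
```

With the original expression, both of these inputs would come back with test and case as separate elements.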

EDIT 2: If you have many word pairs that should not be separated, I wrote a method that generates the regex for you. This is what the regex should look like in this case (if you don't want to separate 'test case' and 'in project'):

\s(?<!(\stest\s(?=case\s))|(\sin\s(?=project\s)))

Following is a simple program to demonstrate it. In this example you just put the words that don't need separation in the array unseparated.

class NoSeparation {

private static String[][] unseparated = {{"test", "case"}, {"in", "project"}};

private static String getRegex() {
    String regex = "\\s(?<!";

    for (int i = 0; i < unseparated.length; i++)
        regex += "(\\s" + unseparated[i][0] + "\\s(?=" + unseparated[i][1] + "\\s))|";

    // Remove the last |
    regex = regex.substring(0, regex.length() - 1);

    return (regex + ")");
}

public static void main(String[] args) {
    String input = "Give me list of test case in project X";
    String []sentence = input.split(getRegex());

    for (String i: sentence)
        System.out.println(i);
}
}

EDIT 3: The following is a very dirty way to handle phrases with more than 2 words. It works, but I am pretty sure it can be done more efficiently. It will work fine on short inputs, but on longer ones it will probably be slow.

You have to put the words that should not be split in a 2D array, as in unseparated. You should also choose a different separator if you don't want to use %% for some reason (e.g. if there is a chance your input contains it).

class NoSeparation {

private static final String SEPARATOR = "%%";
private static String[][] unseparated = {{"of", "test", "case"}, {"in", "project"}};

private static String[] splitString(String in) {
    String[] splitted;

    for (int i = 0; i < unseparated.length; i++) {
        String toReplace = "";
        String replaceWith = "";
        for (int j = 0; j < unseparated[i].length; j++) {
            toReplace += unseparated[i][j] + ((j < unseparated[i].length - 1)? " " : "");
            replaceWith += unseparated[i][j] + ((j < unseparated[i].length - 1)? SEPARATOR : "");
        }

        // replace() treats both arguments literally, so regex metacharacters in the words are safe
        in = in.replace(toReplace, replaceWith);
    }

    splitted = in.split(" ");

    for (int i = 0; i < splitted.length; i++)
        splitted[i] = splitted[i].replace(SEPARATOR, " ");

    return splitted;
}

public static void main(String[] args) {
    String input = "Give me list of test case in project X";
    // Uncomment this if there is a chance to have multiple spaces/tabs
    // input = input.replaceAll("[\\s\\t]+", " ");

    for (String str: splitString(input))
        System.out.println(str);
}
}
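
As a possible alternative to the placeholder trick above, here is a minimal sketch that splits on whitespace first and then merges consecutive words that match a known phrase, preferring the longest phrase at each position. It needs no separator string and makes a single pass over the words. The class name PhraseTokenizer, the method names, and the PHRASES array are assumptions made for illustration. Its output for the sample input is the same as the program above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PhraseTokenizer {

    // Hypothetical phrase list: each entry holds the words of one phrase
    // that must stay together as a single token.
    private static final String[][] PHRASES = {{"of", "test", "case"}, {"in", "project"}};

    // Splits on runs of whitespace, then glues known phrases back together.
    static List<String> tokenize(String input) {
        String[] words = input.trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int matched = longestPhraseAt(words, i);
            if (matched > 0) {
                tokens.add(String.join(" ", Arrays.copyOfRange(words, i, i + matched)));
                i += matched;
            } else {
                tokens.add(words[i]);
                i++;
            }
        }
        return tokens;
    }

    // Returns the word count of the longest phrase starting at 'start', or 0 if none matches.
    private static int longestPhraseAt(String[] words, int start) {
        int best = 0;
        for (String[] phrase : PHRASES) {
            if (phrase.length > best
                    && start + phrase.length <= words.length
                    && Arrays.equals(phrase, Arrays.copyOfRange(words, start, start + phrase.length))) {
                best = phrase.length;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Give me list of test case in project X"))
            System.out.println(token);
    }
}
```

Because the split uses \\s+, multiple consecutive spaces or tabs in the input are handled without a separate normalization step. To pass the result to the name finder, convert it with tokens.toArray(new String[0]).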

OK, what if I have a lot of space-separated words (around 15-20)? How would I use the `split()` function in that case? And would this approach be efficient then? — Hari Ram, Jan 31 '17 at 03:58
@HariRam please check my 2nd edit. I added some code that does that. — Iakovos, Jan 31 '17 at 11:06
Dude! But the thing is, I may also have 3 or 4 spaces (defect detected in cycle) between the words. What should the regex look like in case of 3 or 4 spaces between them? I won't mind writing a function that would generate the regex. I just need the format of the regex string in the above-mentioned case. — Hari Ram, Feb 01 '17 at 07:39
This case is a bit more tricky. The easiest way would be to replace the multiple spaces with single space and run the generated regex. Is this acceptable in your scenario? In this occasion, before running `split()` you should do `input = input.replaceAll("[\\s\\t]+", " ");` — Iakovos, Feb 01 '17 at 15:19
No no.. Actually what I mean by multiple spaces is that "test case id" has two spaces, whereas "test case" has one. I won't have consecutive white spaces (like "test  case") in my case. — Hari Ram, Feb 02 '17 at 01:17
Oh ok I got it now. Please check the 3rd edit. Probably it is not the most efficient code, but it does the job. — Iakovos, Feb 02 '17 at 10:51
