I have a List containing some inseparable phrases, for example:
List<String> lookUp = new ArrayList<>();
lookUp.add("New York");
lookUp.add("Big Apple");
Given a sentence, I want to split it into words without splitting the inseparable phrases in my list. For example,
String sentence = "New York is also called Big Apple";
should be split into
["New York", "is", "also", "called", "Big Apple"]
I started writing an algorithm that first splits the sentence on whitespace and then loops over the words: for every word, I check whether it and its right neighbour together occur in the lookUp list and, if so, join the two words into a single token. I have two questions:
1) Suppose my lookUp list also contains inseparable phrases of more than two words, like "George W. Bush". My algorithm would only look up "George W." and "W. Bush", find neither in the list, and split the name into three words. How can I handle phrases of arbitrary length?
2) The more important question (for which you can ignore question 1): is there already a library, or even a GATE plugin, that does this, so that I don't have to reinvent the wheel? And does one exist for German phrases as well? I couldn't find one =(
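For question 1, one possible direction is a greedy longest-match over the whitespace-split words, instead of only checking pairs. Below is a minimal sketch of that idea, not a finished solution. The class name PhraseTokenizer and the method tokenize are made up for illustration. It assumes matching is case-sensitive and that punctuation stays attached to the words (so "Bush," would not match "Bush").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of any library.
class PhraseTokenizer {

    // Splits on whitespace but keeps every phrase from lookUp as one token.
    // At each position the longest matching phrase wins.
    static List<String> tokenize(String sentence, List<String> lookUp) {
        List<String> result = new ArrayList<>();
        if (sentence == null || sentence.trim().isEmpty()) {
            return result;
        }

        // Normalise phrases to single spaces and remember the longest
        // phrase length (in words) to bound the look-ahead.
        Set<String> phrases = new HashSet<>();
        int maxWords = 1;
        for (String phrase : lookUp) {
            String[] parts = phrase.trim().split("\\s+");
            phrases.add(String.join(" ", parts));
            maxWords = Math.max(maxWords, parts.length);
        }

        String[] words = sentence.trim().split("\\s+");
        int i = 0;
        while (i < words.length) {
            int matched = 1; // default: the word stands alone
            // Try the longest candidate first, then shrink.
            for (int n = Math.min(maxWords, words.length - i); n > 1; n--) {
                String candidate =
                        String.join(" ", Arrays.copyOfRange(words, i, i + n));
                if (phrases.contains(candidate)) {
                    matched = n;
                    break;
                }
            }
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + matched)));
            i += matched;
        }
        return result;
    }
}
```

With lookUp containing "New York", "Big Apple" and "George W. Bush", calling PhraseTokenizer.tokenize("George W. Bush was born in New York", lookUp) yields ["George W. Bush", "was", "born", "in", "New York"]. For a large lookUp list, a trie over words would avoid rebuilding candidate strings at every position, but the idea stays the same.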