
I have a List containing some inseparable phrases, like

List<String> lookUp = new ArrayList<>();
lookUp.add("New York");
lookUp.add("Big Apple");

Given a sentence, I want to split it into words without splitting the inseparable phrases in my list. For example,

String sentence = "New York is also called Big Apple";

should return

["New York", "is", "also", "called", "Big Apple"]

I started to write an algorithm that first splits the sentence on whitespace and then loops over the words: for every word, I check whether this word together with its right neighbour occurs in the lookUp list and, if so, join the two words into one token.

1) Imagine my lookUp list also contains inseparable phrases of more than two words, like "George W. Bush". My algorithm would only look up "George W." and "W. Bush", find neither in the list, and so split the phrase into three words.

2) The more important question (for which you can ignore question 1): Is there already a library, or even a GATE plugin, that does this (so that I don't have to reinvent the wheel)? And does one exist for German phrases as well? I couldn't find one =(
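For reference, the neighbour-pair approach described above might look roughly like the sketch below. The class name `PairTokenizer` and the method `tokenize` are made up for illustration, and the lookup is held in a `Set` rather than a `List` purely for faster `contains` checks. It shows why a three-word phrase such as "George W. Bush" is never found: only pairs of adjacent words are ever looked up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PairTokenizer {

    // Joins a word with its right neighbour when the two-word pair is in lookUp.
    static List<String> tokenize(String sentence, Set<String> lookUp) {
        String[] words = sentence.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            if (i + 1 < words.length && lookUp.contains(words[i] + " " + words[i + 1])) {
                // The pair is a known phrase: emit it as one token and skip both words.
                result.add(words[i] + " " + words[i + 1]);
                i += 2;
            } else {
                result.add(words[i]);
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> lookUp = new HashSet<>(
                Arrays.asList("New York", "Big Apple", "George W. Bush"));
        // prints [New York, is, also, called, Big Apple]
        System.out.println(tokenize("New York is also called Big Apple", lookUp));
        // prints [George, W., Bush] because only two-word pairs are checked
        System.out.println(tokenize("George W. Bush", lookUp));
    }
}
```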

Munchkin
  • This is such a trivial problem that I doubt there is any special library for it. – Andremoniy Sep 24 '14 at 11:15
  • What if you get "a b c" and have "a b" and "b c" in your lookup? – aioobe Sep 24 '14 at 11:18
  • Alternate approach: 1) Split by `lookUp` entries, 2) Iterate, for each see whether it's a look-up word, 3) If so, continue, 4) If not, split on whitespace. – jensgram Sep 24 '14 at 11:19
  • Could you elaborate on your step 1? – aioobe Sep 24 '14 at 11:20
  • @aioobe: good point, I think I would prefer to receive ["a b", "c", "a", "b c"]. @jensgram: by "1) Split by lookUp entries" do you mean sentence.split(lookUp.get(i))? Or just for(phrase:lookUp){check if sentence contains phrase}? – Munchkin Sep 24 '14 at 11:35
  • @aioobe: another difficult case: with the lookup ["a b", "a b c", "c d"] and the sentence "a b c d", for example "New York Times Square". (But this is not part of the question anymore :P) – Munchkin Sep 24 '14 at 11:45
  • You mean "multiword expression", right? – Pierre Sep 24 '14 at 19:21
  • btw you should use a trie... it will do the job and it's super easy to implement. – Pierre Sep 25 '14 at 01:53
  • @aioobe I meant exactly what you compiled in your answer :) My own simple hack is [here](http://ideone.com/zBFN0E). – jensgram Sep 25 '14 at 07:34

1 Answer


Another implementation, in Java 7, which doesn't use regular expressions to match the phrases:

    List<String> lookUp = new ArrayList<>();
    lookUp.add("New York");
    lookUp.add("New Jersey");
    lookUp.add("Big Apple");
    lookUp.add("George W. Bush");

    String sentence = "New York is also called Big Apple . New Jersey is located near to New York . George W. Bush doesn't live in New Mexico";

    String currentPhrase = "";
    List<String> parseResult = new ArrayList<>();

    for (String word : sentence.split("\\s+")) {
        currentPhrase += (currentPhrase.isEmpty() ? "" : " ") + word;
        if (lookUp.contains(currentPhrase)) {
            parseResult.add(currentPhrase);
            currentPhrase = "";
            continue;
        }
        boolean phraseFound = false;
        for (String look : lookUp)
            if (look.startsWith(currentPhrase)) {
                phraseFound = true;
                break;
            }

        if (!phraseFound) {
            parseResult.addAll(Arrays.asList(currentPhrase.split("\\s+")));
            currentPhrase = "";
        } 
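    }

    // Flush a trailing partial match, e.g. when the sentence ends in "New".
    if (!currentPhrase.isEmpty()) {
        parseResult.addAll(Arrays.asList(currentPhrase.split("\\s+")));
    }

    System.out.println(parseResult);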

Output is:

[New York, is, also, called, Big Apple, ., New Jersey, is, located, near, to, New York, ., George W. Bush, doesn't, live, in, New, Mexico]
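The prefix-accumulating loop above emits a phrase as soon as it matches, so a lookup entry that is a prefix of a longer entry ("George W." versus "George W. Bush") wins over the longer one, and a failed partial match such as "Big New" can swallow the start of a later phrase. A greedy longest-match variant avoids both. The following is only a sketch under stated assumptions: the class `LongestMatchTokenizer` and its `tokenize` method are hypothetical names, lookup phrases are assumed to use single spaces between words, and `String.join` requires Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LongestMatchTokenizer {

    // At each position, tries the longest candidate phrase first and shrinks it
    // until a lookup entry matches; falls back to emitting a single word.
    static List<String> tokenize(String sentence, Set<String> lookUp) {
        // The longest phrase (in words) bounds how far ahead we need to look.
        int maxWords = 1;
        for (String phrase : lookUp) {
            maxWords = Math.max(maxWords, phrase.trim().split("\\s+").length);
        }

        String[] words = sentence.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            String token = words[i]; // default: the single word
            int consumed = 1;
            int limit = Math.min(maxWords, words.length - i);
            for (int len = limit; len > 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                if (lookUp.contains(candidate)) {
                    token = candidate;
                    consumed = len;
                    break;
                }
            }
            result.add(token);
            i += consumed;
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> lookUp = new HashSet<>(Arrays.asList(
                "New York", "Big Apple", "George W.", "George W. Bush"));
        // prints [George W. Bush, lives, near, Big, New York]
        System.out.println(tokenize("George W. Bush lives near Big New York", lookUp));
    }
}
```

Being greedy, this always prefers the longest phrase starting at the current word, so with the lookup ["a b", "a b c", "c d"] the sentence "a b c d" yields ["a b c", "d"]. For very large lookup lists, a trie over words would avoid rebuilding candidate strings at every position.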
Andremoniy
  • Works! And now imagine you have "George W." _and_ "George W. Bush" in your lookUp list: it should be parsed to "George W. Bush" (not: "George W.", "Bush")... but this goes far beyond my question :D – Munchkin Sep 24 '14 at 11:59
  • I played a little with it. When my lookup-list contains a lot of items, it fails. To be concrete: If you add 100 more phrases to lookup, then it will split the sentence word for word (yes, the sentence contains phrases from lookup!) I'm confused about that =/ – Munchkin Sep 24 '14 at 12:40
  • Forget about my last comment, it was a mistake of mine :P – Munchkin Sep 24 '14 at 13:12
  • A little workaround to solve my first comment: use `String[] splittedSentence = sentence.split("\\s+");` and do a loop using `i`. After `if (lookUp.contains(currentPhrase))` insert `if(i – Munchkin Sep 24 '14 at 13:58