
I'm using Java and have a large-ish set (~15,000) of keywords (strings), plus a document (a string) in which these keywords occur from time to time.

I'd like to find the index of each occurrence of a keyword in the document, preferring longer keywords (the ones with the most characters) when several match at the same place. For example, if my keywords were "water", "bottle", "drank", and "water bottle", and my document were "I drank from my water bottle", I'd like a result of:

2 drank

16 water bottle

My initial attempt was to use a trie and go through the document character by character; whenever a substring matches a keyword, I record its starting index. However, some of the keywords are prefixes of longer keywords (for example, "water" and "water bottle"), and the code never finds the longer one, because it records the index of "water" and then starts over.
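One way to avoid stopping at the shorter keyword is to keep walking the trie after a match and only remember the end of the longest keyword seen from the current starting position. The following is a minimal sketch of that idea, not a tuned implementation. The class name `LongestKeywordScanner` and its methods are hypothetical, keywords are assumed to be non-empty, and matching is on raw substrings with no word-boundary check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LongestKeywordScanner {
    // One trie node. Children are keyed by char, so letters, spaces,
    // hyphens and apostrophes are all handled and case is preserved.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isKeyword;
    }

    private final Node root = new Node();

    public void addKeyword(String keyword) {
        Node node = root;
        for (char c : keyword.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isKeyword = true;
    }

    // Returns "index keyword" strings in document order.
    public List<String> scan(String document) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < document.length()) {
            Node node = root;
            int longestEnd = -1; // end (exclusive) of the longest keyword starting at i
            for (int j = i; j < document.length(); j++) {
                node = node.children.get(document.charAt(j));
                if (node == null) break;                 // no keyword continues this way
                if (node.isKeyword) longestEnd = j + 1;  // remember it, but keep walking
            }
            if (longestEnd == -1) {
                i++;                                     // nothing starts here
            } else {
                hits.add(i + " " + document.substring(i, longestEnd));
                i = longestEnd;                          // resume after the match
            }
        }
        return hits;
    }
}
```

With the four example keywords added, scanning "I drank from my water bottle" should yield the two entries "2 drank" and "16 water bottle". The worst-case cost per starting position is the length of the longest keyword; an Aho-Corasick automaton would avoid the re-scanning, at the price of more setup code.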

If it matters, the keywords may contain lower case letters, upper case letters, spaces, hyphens, and apostrophes (and capitalization matters).

So, any help in finding the longest keywords would be much appreciated. Thanks.

user1535846
  • What do you mean, you want to find the largest keyword? – Mordechai Nov 16 '12 at 19:06
  • Best to post some code and show precisely what it's not doing. – Alan Krueger Nov 16 '12 at 19:20
  • @AlanKrueger My code isn't even close to working, I don't know how much use it'll be. I'm open to using any data structure, I just thought that the structure of a trie would fit best for going word-by-word. – user1535846 Nov 16 '12 at 19:30
  • what about `Pattern` / `Matcher` to search your document for patterns/your keywords? – Tedil Nov 16 '12 at 19:31
  • and if you've got lots of different keywords for a large document, consider using SuffixTrees / SuffixArrays (I'm just pointing you in different directions to investigate) – Tedil Nov 16 '12 at 19:33
  • I edited my first answer with some code that does the right task, take a look! – durron597 Nov 16 '12 at 20:26
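
The `Pattern` / `Matcher` suggestion from the comments can be sketched roughly as follows. Java's regex alternation tries alternatives from left to right, so listing the keywords longest first makes "water bottle" win over "water" at the same position. This is only a sketch under assumptions: the class name `RegexKeywordFinder` is hypothetical, the keyword list is assumed to be non-empty with no empty strings, and an alternation of ~15,000 keywords may be slow, which is not measured here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexKeywordFinder {
    private final Pattern pattern;

    public RegexKeywordFinder(List<String> keywords) {
        List<String> sorted = new ArrayList<>(keywords);
        // Longest first, so the longer alternative is tried before its prefix.
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder sb = new StringBuilder();
        for (String k : sorted) {
            if (sb.length() > 0) sb.append('|');
            sb.append(Pattern.quote(k)); // treat each keyword as a literal
        }
        pattern = Pattern.compile(sb.toString());
    }

    // Returns "index keyword" strings for non-overlapping matches, left to right.
    public List<String> find(String document) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(document);
        while (m.find()) {
            hits.add(m.start() + " " + m.group());
        }
        return hits;
    }
}
```

Matching is case-sensitive by default, which fits the requirement that capitalization matters.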

2 Answers


If keywords can be built up from smaller keywords, then all you have to do is check the longer keywords first and skip any shorter keyword that is contained in a longer one you have already found. Just a note: I haven't tested this at all; I think I've put enough work into this problem already! If this helps, don't forget to upvote and accept.

i.e.

import java.util.TreeSet;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.HashMap;
import java.util.Iterator;

public class KeywordSearcher {
    private TreeSet<String> ts;

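    public KeywordSearcher() {
        // Sort all the keywords by length, largest first. Ties are broken
        // alphabetically: returning 0 for equal lengths would make the
        // TreeSet treat distinct same-length keywords as duplicates.
        ts = new TreeSet<String>(new Comparator<String>() {
            public int compare(String arg0, String arg1) {
                if (arg0.length() != arg1.length()) {
                    return arg1.length() - arg0.length();
                }
                return arg0.compareTo(arg1);
            }
        });
    }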

    public void addKeyword(String s) {
        ts.add(s);
    }

    private LinkedList<Integer> findKeyword(String document, String s) {
        int start = 0;
        int index;
        LinkedList<Integer> indexes = new LinkedList<Integer>();        

        while(true) {
            index = document.indexOf(s, start);
            if (index == -1) break;
            indexes.add(index);
            start = index + s.length();
        }

        return indexes;
    }

    public HashMap<String, LinkedList<Integer>> findAllKeywords(String document) {
        Iterator<String> is = ts.iterator();
        HashMap<String, LinkedList<Integer>> allIndices = new HashMap<String, LinkedList<Integer>>();

        while(is.hasNext()) {
            String nextKeyword = is.next();
            // See if we found a larger keyword, if we did already, skip this keyword.
            // Note: this skips the shorter keyword everywhere in the document,
            // including occurrences outside the larger keyword's matches.
            boolean foundIt = false;
            for (String key : allIndices.keySet()) {
                if(key.contains(nextKeyword)) {
                    foundIt = true;
                    break;
                }
            }
            if (foundIt) continue;

            // We didn't find the larger keyword, look for the smaller keyword
            LinkedList<Integer> indexes = findKeyword(document, nextKeyword);

            if (indexes.size() > 0) allIndices.put(nextKeyword, indexes);
        }

        return allIndices;
    }
}
durron597
  • I'm pretty sure this code finds the longest word in the document (which, yes, is just a String), I'm looking to find keywords in the document, which I have in a data structure (the type of structure can change if needed). So, if my keywords are "water", "bottle", "drank", and "water bottle", and my document is "I drank from my water bottle", I'd like to find the indices of "drank" (2) and "water bottle" (16). – user1535846 Nov 16 '12 at 19:38
  • @user1535846: I edited my answer into something that answers your actual question. take a look – durron597 Nov 16 '12 at 20:16

If I understand you correctly, you want to skip searching for "water" if you find "water bottle" in the document. That implies some sort of tree structure for your keywords.

My suggestion would be to arrange your keywords in a sorted tree, with each compound keyword as the parent of the shorter keywords it contains, like this:

drank
water bottle
    bottle
    water

In your code, you would first search for the terms at the root ("drank" and "water bottle"). If there are no matches for "water bottle", you would then move down to the next level and search for those terms ("bottle" and "water").

Creating the tree would require a bit of work.

But with this tree structure, you can have multiple compound words.

clean water bottle
    clean bottle
        clean
    water bottle
        bottle
        water    
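
The search over such a tree can be sketched as below: look for each keyword at the current level, and descend to its children only when it has no matches. This is a minimal sketch under assumptions, not a full solution. The names `KeywordTree`, `Node`, `child` and `search` are hypothetical, the tree is built by hand rather than derived from the keyword list, and keywords are assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class KeywordTree {
    // A keyword plus the shorter keywords it contains.
    static final class Node {
        final String keyword;
        final List<Node> children = new ArrayList<>();

        Node(String keyword) { this.keyword = keyword; }

        Node child(Node c) { children.add(c); return this; }
    }

    // Records the start indices of every keyword found at this level;
    // a node's children are searched only if the node itself has no match.
    static void search(String document, List<Node> level, Map<String, List<Integer>> out) {
        for (Node node : level) {
            List<Integer> indices = new ArrayList<>();
            int from = 0;
            int at;
            while ((at = document.indexOf(node.keyword, from)) != -1) {
                indices.add(at);
                from = at + node.keyword.length(); // assumes a non-empty keyword
            }
            if (indices.isEmpty()) {
                search(document, node.children, out);
            } else {
                out.put(node.keyword, indices);
            }
        }
    }
}
```

For the first tree above, the roots would be a `Node` for "drank" and a `Node` for "water bottle" with children "bottle" and "water"; passing a `LinkedHashMap` as `out` keeps the results in the order the keywords were searched.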
Gilbert Le Blanc
  • I thought of this but the problem is, what if `clean water bottle` isn't a keyword – durron597 Nov 16 '12 at 20:16
  • I'm not sure I understand your question. I made up "clean water bottle" to show you can have more than one level of keywords in a sorted tree. Your sorted tree would only contain the keywords you want. – Gilbert Le Blanc Nov 16 '12 at 20:30
  • In other words, let's say `clean bottle` and `water bottle` and `bottle` are all words. So would `bottle` be a child of both? But then if you find `clean bottle`, you don't search for `bottle` under `water bottle`... this can get really complicated. `I am one man` and `am one` and `I am a canary` and `am a` and `am`... – durron597 Nov 16 '12 at 20:36
  • The word "bottle" would be a child of either "clean bottle" or "water bottle". Yes, it can get complicated, which is why I said in my answer that setting up the sorted tree would be a bit of work. – Gilbert Le Blanc Nov 16 '12 at 20:42