How do I find any word that contains all the given characters at least once?

Question

I am working with this code:

    while ((dictionaryWord = br_.readLine()) != null)
    {
        if (dictionaryWord.matches("^" + word.replace("*", ".") + "$"))
        {
            incrementCounter();
            System.out.println(dictionaryWord);
        }
    }

Desired goal: word = dgo

Output: dog, god, dogma, megalogdon, etc.

Could there be accented characters, or characters outside the [Basic Multilingual Plane](http://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane)? — Mark Byers, May 12 '12 at 22:04
@MarkByers yeah as long as each character is present at least once — stackoverflow, May 12 '12 at 22:05
I'm sure there is a regex to do this, but you can always use `indexOf` or `contains` in a loop over all the `chars` in the desired word. — twain249, May 12 '12 at 22:07
Repost? http://stackoverflow.com/questions/10567365/how-do-i-find-words-that-only-contain-consist-of-a-given-char-sequence — user845279, May 12 '12 at 22:09
@user845279 Thanks for tipping off. And he even got an accepted answer there! — Marko Topolnik, May 12 '12 at 22:26
@user845279 Actually the other question was slightly different. — trutheality, May 12 '12 at 22:29
possible duplicate of [GREP How do I only retrieve words with only the specified letters?](http://stackoverflow.com/questions/10566812/grep-how-do-i-only-retrieve-words-with-only-the-specified-letters) — user unknown, May 12 '12 at 22:46
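
The regex mentioned in the comments could be built from one lookahead per required character. This is only a minimal sketch, not code from the thread: the class name `LookaheadMatcher` and the method `allOf` are made up for illustration, and it assumes every required character is a single `char` (so nothing outside the Basic Multilingual Plane).

```java
import java.util.regex.Pattern;

final class LookaheadMatcher {
    // Hypothetical helper: for "dgo" it builds a pattern equivalent to
    // (?s)(?=.*d)(?=.*g)(?=.*o).*  (each character is quoted, so regex
    // metacharacters in the input are treated literally).
    static Pattern allOf(String required) {
        StringBuilder sb = new StringBuilder("(?s)");
        for (char c : required.toCharArray()) {
            sb.append("(?=.*").append(Pattern.quote(String.valueOf(c))).append(')');
        }
        return Pattern.compile(sb.append(".*").toString());
    }

    public static void main(String[] args) {
        Pattern p = allOf("dgo"); // compile once, reuse for every dictionary word
        for (String w : new String[] {"dog", "god", "dogma", "megalogdon", "dorm"}) {
            if (p.matcher(w).matches()) {
                System.out.println(w);
            }
        }
    }
}
```

Each lookahead rescans the word from the start, so this is convenient rather than fast; compiling the pattern once outside the read loop avoids the repeated compilation that `String.matches` does.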

amit · Accepted Answer · 2012-05-12T22:39:17.777

1

You can build a `Set<Character>` of all the chars in `word` and iterate over it. If any character is not in `dictionaryWord`, then `dictionaryWord` does not fit. Only if all of them appear, print `dictionaryWord`.

    String word = "dog";
    String  dictionaryWord;
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    while((dictionaryWord = br.readLine()) != null)  {
        Set<Character> chars = new HashSet<Character>();
        for (char c : word.toCharArray()) {
            chars.add(c);
        }
        boolean match = true;
        for (Character c : chars) {
            String s = "" + c;
            if (!dictionaryWord.contains(s)) {
                match = false;
                break;
            }
        }
        if (match) {
            System.out.println(dictionaryWord);
        }
    }

In the above code, the set creation can be moved out of the while loop, of course.

A more efficient solution could be to create a `Set` from `dictionaryWord` as well, and then check whether the intersection of the two sets is identical to the set representing `word`.
This will be:

    String word = "dog";
    Set<Character> set1 = new HashSet<Character>();
    for (char c : word.toCharArray()) {
        set1.add(c);
    }
    String dictionaryWord;
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    while ((dictionaryWord = br.readLine()) != null) {
        Set<Character> set2 = new HashSet<Character>();
        for (char c : dictionaryWord.toCharArray()) {
            set2.add(c);
        }
        // CollectionUtils.intersection returns a Collection, so copy it into a Set
        // before comparing it with set1.
        Set<Character> intersection = new HashSet<Character>(CollectionUtils.intersection(set1, set2));
        if (set1.equals(intersection)) {
            System.out.println(dictionaryWord);
        } else {
            System.out.println("bad");
        }
    }

This uses `CollectionUtils.intersection()` from Apache Commons Collections.
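
If pulling in Apache Commons is not an option, the same check can probably be done with the standard library alone, since "the intersection equals the required set" is the same as "the word's set contains every required character". A minimal sketch under that assumption; the class name `ContainsAllCheck` and its helpers are hypothetical, not part of the answer above.

```java
import java.util.HashSet;
import java.util.Set;

final class ContainsAllCheck {
    // Collects the distinct chars of s into a set.
    static Set<Character> charsOf(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // True when every required character occurs at least once in dictionaryWord.
    static boolean hasAll(String dictionaryWord, Set<Character> required) {
        return charsOf(dictionaryWord).containsAll(required);
    }

    public static void main(String[] args) {
        Set<Character> required = charsOf("dog"); // built once, reused per word
        for (String w : new String[] {"god", "dogma", "megalogdon", "dorm"}) {
            if (hasAll(w, required)) {
                System.out.println(w);
            }
        }
    }
}
```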

edited May 12 '12 at 22:39

answered May 12 '12 at 22:09

amit


Could you produce an example of this, if possible? Thanks. – stackoverflow May 12 '12 at 22:14
@stackoverflow: I just provided code that reads from System.in and matches for "dog"; you can check it out. – amit May 12 '12 at 22:21
@stackoverflow: I also added another - more efficient way to do it. – amit May 12 '12 at 22:34
You don't need `CollectionUtils`: `set2.retainAll(set1); if(set1.equals(set2))...` will accomplish the same. – trutheality May 12 '12 at 23:31
For the record, turning both strings (or char seqs) into sets cannot possibly improve the original performance, which is O(n+m). It may look in your code that there's one loop less, but think about what that call to `intersection` needs to do. The only optimization I see is making a set of `word` chars once and reusing that set since many words are checked against that same set. – Marko Topolnik May 13 '12 at 08:54
@MarkoTopolnik I disagree for two reasons. (1) The original is `O(|set|*|dictionaryWord| + |word|)`, since each `contains()` is itself `O(n)`, and you do it `|set|` times. (2) The size of `set1` and `set2` is bounded by 26, while the size of the string is not. So, combined with (1), minimizing the number of passes over `dictionaryWord` is a priority, and the second solution makes only one. – amit May 13 '12 at 10:38
This is what I have in mind: leave `dictionaryWord` as it is -- you only need to iterate over all its chars once. Make the required chars a Set to prevent O(n^2), but make it a singleton set and reuse it in each check of `dictionaryWord`. Now you have only O(|dictionaryWord|) (plus hash lookup overhead, whatever that is, but it's not much worse than O(1)). – Marko Topolnik May 13 '12 at 10:49
@MarkoTopolnik: If you want it to be as fast as possible (though unreadable), you can use a simple `int` as a bit-set (it fits, since 32 bits are enough for 26 possible elements), populate it in a single iteration over `dictionaryWord`, and use the `&` operator to create the intersection of the two sets. Then simply check whether `(set1 & set2) == set1`. I doubt you can get any faster than that. – amit May 13 '12 at 11:11
No, I said it's O(|dictionaryWord|) with a note that hash lookup itself is slightly worse than O(1), which makes the total complexity slightly worse than O(|dictionaryWord|). I wouldn't limit my solution to lowercase ascii, btw. – Marko Topolnik May 13 '12 at 11:14
@MarkoTopolnik: Yeah, I edited the comment - I misunderstood you on first read. The OP clearly asks for 26 chars, and SO is about answering a specific question, not general cases... so for this *specific* question the `int` solution is as fast as you can get. – amit May 13 '12 at 11:16
True, OP made that clear in the comment. `BitSet` seems like a nice approach in that case (instead of a raw `int`). – Marko Topolnik May 13 '12 at 11:22
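
The raw `int` bit-set idea from the comments above could be sketched roughly as follows. This is an illustration only, assuming the alphabet is limited to a-z as discussed; the class name `BitMaskCheck` and the methods `mask` and `hasAll` are invented here and do not appear in the thread.

```java
final class BitMaskCheck {
    // Packs the letters a-z of s into the low 26 bits of an int.
    // Characters outside a-z (after lowercasing) are ignored.
    static int mask(String s) {
        int m = 0;
        for (int i = 0; i < s.length(); i++) {
            int bit = Character.toLowerCase(s.charAt(i)) - 'a';
            if (bit >= 0 && bit < 26) {
                m |= 1 << bit;
            }
        }
        return m;
    }

    static boolean hasAll(String dictionaryWord, int requiredMask) {
        // The parentheses matter: in Java, == binds tighter than &.
        return (mask(dictionaryWord) & requiredMask) == requiredMask;
    }

    public static void main(String[] args) {
        int required = mask("dgo"); // computed once, reused for every word
        for (String w : new String[] {"god", "dogma", "megalogdon", "dorm"}) {
            if (hasAll(w, required)) {
                System.out.println(w);
            }
        }
    }
}
```

Unlike the early-exit loop in the `BitSet` answer, this always scans the whole dictionary word, so which one is faster in practice would need measuring.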

Marko Topolnik · Answer 2 · 2012-05-13T11:36:11.027

1

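    import java.util.BitSet;

    public class HasAllChars {
      public static void main(String[] args) {
        // Bit i is set when the letter 'a' + i is required.
        final BitSet reqChars = new BitSet(26);
        for (char c : "dog".toCharArray()) reqChars.set(Character.toLowerCase(c) - 'a');
        for (String w : new String[] {"god", "dogma", "megalogdon", "dorm"})
          if (hasAllChars(w, reqChars)) System.out.println(w);
      }

      // Reads as: "in" has all chars of "req".
      public static boolean hasAllChars(String in, BitSet req) {
        req = (BitSet) req.clone(); // work on a copy so the caller's set can be reused
        for (char c : in.toCharArray()) {
          int i = Character.toLowerCase(c) - 'a';
          if (i < 0 || i >= 26) continue; // skip chars outside a-z; a negative index would throw
          req.clear(i);
          if (req.isEmpty()) return true; // every required char has been seen
        }
        return false;
      }
    }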

edited May 13 '12 at 11:36

answered May 12 '12 at 22:16

Marko Topolnik


This is a bad solution because you go to the end of the word after a miss. Suppose you have a 20 character word and the second one misses. That's 5% efficiency. – Rob May 12 '12 at 22:20
@Rob You misunderstand my code. My set contains the required chars. The input word is allowed to contain chars outside the required set of chars, that's not a **miss**. – Marko Topolnik May 12 '12 at 22:23
Yeah the name is misleading. hasAllChars would imply the string would contain all the characters. I did a version of that in my answer. – Rob May 12 '12 at 22:29
@Rob It reads "in *has all chars of* req" and that's what it does. – Marko Topolnik May 12 '12 at 22:30

score 1 · Answer 3 · answered May 12 '12 at 22:23

1

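    public static boolean containsAllCharacters(String word, Set<Character> characters) {
        int i = 0;
        int wordLength = word.length();
        // Advance while each character of word is in the given set.
        while (i < wordLength && characters.contains(word.charAt(i))) {
            i++;
        }
        // True only if no character of word fell outside the set.
        return i == wordLength;
    }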

answered May 12 '12 at 22:23

Rob


I don't think that's correct. What happens with the input "dogma" and character set "ogd" when you reach the 'm'? – veefu May 12 '12 at 22:28
Yeah I saw the solution from @Marko and supplied an implementation of his method. I will modify my answer. – Rob May 12 '12 at 22:31

score 0 · Answer 4 · answered May 12 '12 at 22:44

Actually, the most interesting part of this question is how to avoid looking at every word in the dictionary (though the original code kind of glosses over that). A potentially interesting answer to that would be this:

Make a table of the 26 characters by frequency of occurrence.
Look up each of the given characters, and pick the least frequently occurring one.
Then do matches of words that contain that character.

This is assuming, of course, that a single-character match is cheaper than a regex.

Awesome Wikipedia page on the topic here. In this case, the difference might not be huge, but for a word containing e and x, for instance, it would be.
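
One way the three steps above might be sketched is to bucket the dictionary by letter up front, so that a query only scans the bucket of its rarest required letter. This is an assumption-laden illustration, not a tested design: the class `RarestLetterIndex` and its `find` method are hypothetical, bucket sizes stand in for the frequency table, and the required letters are assumed to be lowercase a-z.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RarestLetterIndex {
    // buckets.get(i) holds every dictionary word containing the letter 'a' + i.
    private final List<List<String>> buckets = new ArrayList<>();

    RarestLetterIndex(List<String> dictionary) {
        for (int i = 0; i < 26; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String w : dictionary) {
            boolean[] seen = new boolean[26]; // add each word to a bucket at most once
            for (char c : w.toCharArray()) {
                int i = c - 'a';
                if (i >= 0 && i < 26 && !seen[i]) {
                    seen[i] = true;
                    buckets.get(i).add(w);
                }
            }
        }
    }

    // Assumes required consists of lowercase a-z only.
    List<String> find(String required) {
        // Pick the required letter with the smallest bucket (the least frequent one).
        List<String> candidates = null;
        for (char c : required.toCharArray()) {
            List<String> bucket = buckets.get(c - 'a');
            if (candidates == null || bucket.size() < candidates.size()) {
                candidates = bucket;
            }
        }
        List<String> result = new ArrayList<>();
        if (candidates == null) {
            return result; // empty query: nothing to look up
        }
        // Only the words in that bucket need the full check.
        for (String w : candidates) {
            boolean ok = true;
            for (char c : required.toCharArray()) {
                if (w.indexOf(c) < 0) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                result.add(w);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RarestLetterIndex index =
            new RarestLetterIndex(Arrays.asList("dog", "god", "dogma", "dorm", "cat"));
        System.out.println(index.find("dgo"));
    }
}
```

Building the index costs one pass over the whole dictionary, so this only pays off when many queries run against the same word list.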
