Algorithm to find same substring from a list of strings

Question

I'm a bit lost here; some help would be welcome. The idea is to find a matching substring from a list of strings. It doesn't have to be perfect. Let me explain with an example:

"Country_Name" "Country_Id" "Rankcountry" "ThisWillNotMatch"

Would return "country"

It has to be something efficient, as a 'brute force' algo looks a bit scary.

What does "imperfect" match mean(that is, what is the criterion of a match)? — kraskevich, Jan 14 '15 at 15:06
It's a best effort strategy, the idea is proposing 'something' that the end user might change. — ic3, Jan 14 '15 at 15:07
Did you at least start with some code ? You could start from a "brute-force" algorithm that could be efficient enough with not-so-big strings and move to a more efficient one later. — Laurent S., Jan 14 '15 at 15:08
Do I understand correctly? "country" is not given as an input. Any recurring character sequence would need to be detected? What if input has multiple recurring patterns? "CountryHouse" "CountryFoo" "CountryBar" "HouseParty". Would the result be ["country","house" ]? — BitTickler, Jan 14 '15 at 15:09
What's the trade-off between substring length and number of list entries matched? After all, the substring `"o"` matches all list entries. Why does `"country"` rank higher? (And is case to be ignored, as seems to be indicated by the example?) What about the substring `"Country_"`, which matches two entries? Why isn't that better than `"Country"`? — Ted Hopp, Jan 14 '15 at 15:12
This is an interesting problem but far too vague for you to expect a good answer. How many entries are we talking about? What counts as efficient? How would you rank a longer common substring in fewer entries versus a shorter common substring in more entries? In your example it's fairly obvious but if for example your strings are `"Country_Name" "Country_Id" "Rankcountry" "Count Dracula"`, what would the expected result be? — biziclop, Jan 14 '15 at 15:13
Also a question helping to get better answers would be: Which programming language? — BitTickler, Jan 14 '15 at 15:15
@all, yes, it is vague and there is not always going to be a 'good' answer. Sorry, that's the way it is. The problem is the algorithm, not the programming language (so any). — ic3, Jan 14 '15 at 15:17
@user2225104, you're right but Country should have a higher score as it's present in 3 strings. First one is enough. — ic3, Jan 14 '15 at 15:20
@ic3 It is essential to have a clear problem statement to create an algorithm that solves it. I think you should model this problem using more precise terms. It is hard to do it for us because we don't know what you want to achieve exactly. — kraskevich, Jan 14 '15 at 15:22
Levenshtein distance seems to be partially what is asked - string similarity. This instance seems to be: finding the longest common string in objects with high similarity. see: http://en.wikipedia.org/wiki/Longest_common_substring_problem. @ic3 you do see that other strings like 'count' 'try' are also words. This implies that you also need to find the existence of the substrings you encounter in an English dictionary. Messy problem. IMO. — jim mcnamara, Jan 14 '15 at 15:56
Definitely needs a better definition. In particular, what is an optimum match? You'll need to give some weight to both length of sub-string as well as the "rate of inclusion" of the sub-string. From your example, 'country' is long and matches 3 elements, but the letter 'o' matches all 4... Without weights for each of these components to optimality, it is hard to even begin formulating an algorithm. Very interesting problem though! — Rubix Rechvin, Jan 14 '15 at 16:55
With the given input, Levenshtein distance would not yield the desired output, as the input "ThisWillNotMatch" would make it rule out "country" as a solution. — BitTickler, Jan 14 '15 at 18:27
Voted this question up for its "real life significance". Not every practical problem in the real world is accurately stated. — BitTickler, Jan 14 '15 at 18:44

BitTickler · Answer 1 · 2022-04-22T11:50:37.980

Not sure if it is "efficient" or considered brute force... I leave that up to others to judge.

1. input = list of strings.
2. For each s in input, compute getAllStrides s (string -> string list); see the code below.
3. Create an empty, mutable dictionary with keys of type string and values of type int.
4. Take all strides (the list of lists of strings from step 2) and update the dictionary with each stride: if the stride already exists in the dictionary, increase the value of its entry; if it does not, add (stride, 1).
5. Find the dictionary entry that yields the maximum value of stride.Length * frequency.
6. Report the entry found.

For a case-insensitive match, convert each input string to lowercase first.

    open System.Collections.Generic

    let input = ["Country_Name"; "Country_id"; "RankCountry"; "ThisWillNotMatch"; ]

    let rec getAllStrides text =
      let length = String.length text
      match length with
        | 0 -> []
        | 1 -> [text]
        | _ -> [ for i = 1 to length do yield text.Substring(0, i ) ] @ getAllStrides (text.Substring(1))

    type HashTable = System.Collections.Generic.Dictionary<string,int>

    let update (ht : HashTable) strides =
      List.iter (fun s ->
                 if ht.ContainsKey(s) then ht.[s] <- ht.[s] + 1 else ht.Add( s, 1 )
                 ) strides

    let computeStrideFrequencies input =
      let ht = new HashTable()
      input |> List.iter (fun i -> update ht (getAllStrides i) )
      ht


    let solve input =
      let theBest = input |> computeStrideFrequencies |> Seq.maxBy (fun (KeyValue(k,v)) -> k.Length * v)
      theBest.Key


    solve input;;
    val it : string = "Country"
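
One caveat with the counting above: a substring that repeats inside a single string is counted once per occurrence, while the question talks about how many strings a substring is present in. Below is a minimal C++ sketch of that variant, assuming the same length * frequency score and case-insensitive matching. `ToLower` and `BestSharedSubstring` are names made up for this illustration, not part of the F# code.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Lowercase copy, so that "Country" and "country" are the same substring.
std::string ToLower(std::string text) {
  std::transform(text.begin(), text.end(), text.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return text;
}

// Returns the substring maximizing length * (number of input strings that
// contain it). Assumed scoring rule, borrowed from the steps above.
std::string BestSharedSubstring(const std::vector<std::string> &input) {
  std::unordered_map<std::string, std::size_t> strings_containing;
  for (const std::string &raw : input) {
    const std::string text = ToLower(raw);
    // Collect the distinct substrings of this one string first, so that a
    // repeated substring is counted once per string, not once per occurrence.
    std::unordered_set<std::string> seen;
    for (std::size_t start = 0; start < text.size(); ++start) {
      for (std::size_t length = 1; start + length <= text.size(); ++length) {
        seen.insert(text.substr(start, length));
      }
    }
    for (const std::string &substring : seen) {
      ++strings_containing[substring];
    }
  }

  std::string best;
  std::size_t best_score = 0;
  for (const auto &entry : strings_containing) {
    const std::size_t score = entry.first.size() * entry.second;
    if (score > best_score) {
      best_score = score;
      best = entry.first;
    }
  }
  return best;
}
```

With the four strings from the question this returns "country" (7 * 3 = 21, ahead of "country_" at 8 * 2 = 16). It is still brute force: a string of length n contributes on the order of n² substrings.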

You could also use a [rolling hash](https://stackoverflow.com/a/52510230/975097) to search for matching substrings here. — Anderson Green, Apr 13 '22 at 16:49

score 1 · Accepted Answer · edited Apr 22 '22 at 11:56

Inspired by Jon Bentley's "Algorithm Alley" column in Dr. Dobb's.

Build an index of every suffix. Sorting the index brings common substrings together. Walk the sorted index comparing adjacent substrings, and you can easily find the longest one (or the most common one).

    #include <algorithm>
    #include <cstddef>
    #include <cstring>  // std::strcmp
    #include <iostream>
    #include <string>
    #include <vector>

    std::size_t LengthInCommon(const char *left, const char *right) {
      std::size_t length_of_match = 0;
      while (*left == *right && *left != '\0') {
        ++length_of_match;
        ++left;
        ++right;
      }
      return length_of_match;
    }

    std::string FindLongestMatchingSubstring(const std::vector<std::string> &strings) {
      // Build an index with a pointer to each possible suffix in the array.  O(n)
      std::vector<const char *> index;
      for (const auto &s : strings) {
        for (const auto &suffix : s) {
          index.push_back(&suffix);
        }
      }

      // Sort the index using the underlying substrings.  O(n log_2 n)
      std::sort(index.begin(), index.end(), [](const char *left, const char *right) {
        return std::strcmp(left, right) < 0;
      });

      // Common strings will now be adjacent to each other in the index.
      // Walk the index to find the longest matching substring.
      // O(n * m) where m is average matching length of two adjacent strings.
      std::size_t length_of_longest_match = 0;
      std::string match;
      for (std::size_t i = 1; i < index.size(); ++i) {
        const char *left = index[i - 1];
        const char *right = index[i];
        std::size_t length_of_match = LengthInCommon(left, right);
        if (length_of_longest_match < length_of_match) {
          length_of_longest_match = length_of_match;
          match.assign(index[i], index[i] + length_of_longest_match);
        }
      }

      return match;
    }

    int main () {
      std::vector<std::string> strings;
      strings.push_back("Country_Name");
      strings.push_back("Country_id");
      strings.push_back("RankCountry");
      strings.push_back("ThisWillNotMatch");
      std::cout << FindLongestMatchingSubstring(strings) << std::endl;
      return 0;
    }

Prints:

Country_
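
Note that two adjacent suffixes in the sorted index can come from the same input string, so a lone string such as "banana" would report "ana" by itself. If the match has to be shared by at least two different strings, one option is to remember which string each suffix came from and skip adjacent pairs with the same source. This is a hedged sketch of that variant; the `Suffix` struct and the function names are hypothetical and not part of the code above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One index entry per suffix.
struct Suffix {
  std::size_t source;  // index of the string this suffix belongs to
  std::size_t offset;  // position where the suffix starts in that string
};

// Length of the common prefix of a[i..] and b[j..].
std::size_t CommonPrefixLength(const std::string &a, std::size_t i,
                               const std::string &b, std::size_t j) {
  std::size_t n = 0;
  while (i + n < a.size() && j + n < b.size() && a[i + n] == b[j + n]) {
    ++n;
  }
  return n;
}

// Longest substring that occurs in at least two different input strings.
std::string LongestSharedSubstring(const std::vector<std::string> &strings) {
  std::vector<Suffix> index;
  for (std::size_t s = 0; s < strings.size(); ++s) {
    for (std::size_t i = 0; i < strings[s].size(); ++i) {
      index.push_back({s, i});
    }
  }

  // Sort the entries by the text of the suffix they refer to.
  std::sort(index.begin(), index.end(),
            [&strings](const Suffix &l, const Suffix &r) {
              return strings[l.source].compare(l.offset, std::string::npos,
                                               strings[r.source], r.offset,
                                               std::string::npos) < 0;
            });

  // Walk adjacent pairs, ignoring pairs that come from the same string.
  std::string best;
  for (std::size_t k = 1; k < index.size(); ++k) {
    const Suffix &l = index[k - 1];
    const Suffix &r = index[k];
    if (l.source == r.source) continue;
    const std::size_t n = CommonPrefixLength(strings[l.source], l.offset,
                                             strings[r.source], r.offset);
    if (n > best.size()) {
      best = strings[r.source].substr(r.offset, n);
    }
  }
  return best;
}
```

Checking only adjacent pairs is enough here: between any two suffixes from different strings there is some adjacent pair where the source changes, and that pair shares at least as long a prefix.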

score 0 · Answer 3 · answered Jan 14 '15 at 16:58

I still don't understand why "c" cannot be an answer. I guess you prefer longer strings. Need to get your optimization function straight!

In any case, you can solve this with tries. Create a trie for each string, inserting every suffix of the string so that each node corresponds to one substring, and set the count of each node to 1. Then merge all the tries by summing up the counts. This way you get all substrings and their counts. Now use your optimization function to pick the optimal one.
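
A rough C++ sketch of this idea follows. It makes a few assumptions that are not in the answer: instead of building one trie per string and merging them, it inserts every suffix of every string into a single shared trie and counts each string at most once per node, which should give the same counts; and it uses length * count as the optimization function, borrowed from the other answer. `TrieNode`, `InsertSuffixes`, `FindBest` and `BestSubstringByTrie` are illustrative names.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
  std::map<char, std::unique_ptr<TrieNode>> children;
  std::size_t count = 0;      // number of distinct strings containing this substring
  std::size_t last_seen = 0;  // 1-based id of the last string that reached this node
};

// Insert every suffix of text; each node on the way is one substring of text.
void InsertSuffixes(TrieNode &root, const std::string &text, std::size_t id) {
  for (std::size_t start = 0; start < text.size(); ++start) {
    TrieNode *node = &root;
    for (std::size_t i = start; i < text.size(); ++i) {
      std::unique_ptr<TrieNode> &child = node->children[text[i]];
      if (!child) child.reset(new TrieNode());
      node = child.get();
      if (node->last_seen != id) {  // count each string once per node
        node->last_seen = id;
        ++node->count;
      }
    }
  }
}

// Depth-first walk that keeps the path with the highest length * count.
void FindBest(const TrieNode &node, std::string &path, std::string &best,
              std::size_t &best_score) {
  for (const auto &edge : node.children) {
    path.push_back(edge.first);
    const std::size_t score = path.size() * edge.second->count;
    if (score > best_score) {
      best_score = score;
      best = path;
    }
    FindBest(*edge.second, path, best, best_score);
    path.pop_back();
  }
}

std::string BestSubstringByTrie(const std::vector<std::string> &strings) {
  TrieNode root;
  for (std::size_t i = 0; i < strings.size(); ++i) {
    InsertSuffixes(root, strings[i], i + 1);
  }
  std::string path;
  std::string best;
  std::size_t best_score = 0;
  FindBest(root, path, best, best_score);
  return best;
}
```

For "Country_Name", "Country_id", "RankCountry" and "ThisWillNotMatch" (case-sensitive) this picks "Country", since 7 * 3 = 21 beats every other substring. Swapping in a different optimization function only changes the `score` line.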

ElKamina


In my answer what you called "Tries" I called "Strides". Will I get shot? :) – BitTickler Jan 14 '15 at 17:20
