
To start with a simple case, take arrays of length 5 (in practice the array length might be 20 or so).

I have a predefined set of patterns, such as AAAAB, AAABA, BAABC, BCAAA, ... Each pattern has the same length as the input array. I need a function that takes any integer array as input and returns all the patterns it matches (an array may match several patterns), as fast as possible.

"A" means that in the pattern all numbers at the positions of A are equal. E.g. AAAAA simply means all numbers are equal, {1, 1, 1, 1, 1} matches AAAAA.

"B" means the numbers at the positions of B are not equal to the number at the positions of A (i.e. a wildcard for a number which is not A). Numbers represented by B don't have to be equal to each other. E.g. ABBAA means the 1st, 4th and 5th numbers are equal to, say, x, and the 2nd and 3rd are not equal to x. {2, 3, 4, 2, 2} matches ABBAA.

"C" means this position can be any number (i.e. a wildcard for a number). {1, 2, 3, 5, 1} matches ACBBA, {1, 1, 3, 5, 1} also matches ACBBA

I am looking for an algorithm that is efficient in terms of the number of comparisons. It doesn't have to be optimal, but it shouldn't be too far from optimal. I feel it is somewhat like a decision tree...

A very straightforward but inefficient way is like the following:

  • Try to match each pattern against the input, say AABCA against {a, b, c, d, e}. It checks if (a=b=e && a!=c).

  • If the number of patterns is n, the length of the pattern/array is m, then the complexity is about O(n*m)
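As a baseline, the straightforward approach above might look like the following minimal Python sketch. The function names `matches` and `match_all` are made up for illustration, and the code assumes every pattern contains at least one A and has the same length as the array.

```python
def matches(pattern, arr):
    """Return True if arr matches pattern (letters A, B, C as defined above)."""
    # Assumption: pattern contains at least one 'A' and len(pattern) == len(arr).
    x = arr[pattern.index("A")]  # the value that every 'A' position must share
    for letter, value in zip(pattern, arr):
        if letter == "A" and value != x:
            return False  # an 'A' position differs from x
        if letter == "B" and value == x:
            return False  # a 'B' position must not equal x
        # 'C' positions accept any value, so nothing to check
    return True


def match_all(patterns, arr):
    """Check every pattern in turn: O(n*m) for n patterns of length m."""
    return [p for p in patterns if matches(p, arr)]
```

For instance, `match_all(["AAAAA", "ABBAA", "ACBBA"], [2, 3, 4, 2, 2])` checks all three patterns even though only ABBAA can match.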

Update:

Please feel free to suggest better wording for the question, as I don't know how to make it simple to understand without causing confusion.

An ideal algorithm would need some kind of preparation, such as transforming the set of patterns into a decision tree, so that the complexity after preprocessing could be something like O(log n * log m) for some special pattern sets (just a guess).

Some figures that may be helpful: the predefined pattern set contains roughly 30 patterns. The number of input arrays to match is about 10 million.

Say AAAAA and AAAAC are both in the predefined pattern set. Then if AAAAA matches, AAAAC matches as well. I am looking for an algorithm that can recognize that.
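One way to recognize such relationships ahead of time is a position-by-position comparison of two patterns. The sketch below is an illustration only, not something proposed in the thread: `implies` and `implied_by` are hypothetical names, the test is a sufficient condition rather than a complete one, and it assumes both patterns contain at least one A.

```python
def implies(p, q):
    """True if every array matching p is guaranteed to match q.

    Sufficient condition only: each position of q is either a wildcard 'C'
    or carries the same letter as p. Assumes p and q both contain an 'A'.
    """
    return all(b == "C" or a == b for a, b in zip(p, q))


def implied_by(patterns):
    """For each pattern, list the other patterns that match whenever it does."""
    return {p: [q for q in patterns if q != p and implies(p, q)]
            for p in patterns}
```

With the patterns from the paragraph above, `implies("AAAAA", "AAAAC")` holds while `implies("AAAAC", "AAAAA")` does not, so a match on AAAAA could report AAAAC without any further comparisons.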

Update 2

@Gareth Rees's answer gives an O(n) solution, but under the assumption that there are not many "C"s (otherwise the storage is huge and there are many unnecessary comparisons).

I would also welcome any ideas on how to deal with situations where there are many "C"s, say, for an input array of length 20, at least 10 "C"s in each predefined pattern.

colinfang

2 Answers


Here's an idea that trades O(2^n) preparation and storage for O(n)-ish runtime. If your arrays are no longer than your machine's word size (you imply that 20 would be a typical size), or if there are not too many occurrences of C in the patterns, this idea might work for you. (If neither of these conditions is satisfied, avoid!)

  1. (Preparatory step, done once.) Create a dictionary d mapping numbers to sets of patterns. For each pattern p, and each subset S of the occurrences of C in that pattern, let n be the number that has a set bit corresponding to each A in the pattern, and for each occurrence of C in S. Add p to the set of patterns d[n].

  2. (Remaining steps are done each time a new array needs to be matched against the patterns.) Create a dictionary e mapping numbers to numbers.

  3. Let j run over the indexes of the array, and for each j:

    1. Let i be the j-th integer in the array.

    2. If i is not in the dictionary e, set e[i] = 0.

    3. Set e[i] = e[i] + 2^(ℓ − j − 1) where ℓ is the length of the array.

  4. Now the keys of e are the distinct numbers i in the array, and the value e[i] has a set bit corresponding to each occurrence of i in the array. For each value e[i] that is found in the dictionary d, all the patterns in the set d[e[i]] match the array.

(Note: in practice you'd build the bitsets the other way round, and use 2^j at step 3.3 instead of 2^(ℓ − j − 1), but I've described the algorithm this way for clarity of exposition.)

Here's an example. Suppose we have the patterns AABBA and ACBBA. In the preprocessing step AABBA turns into the number 25 (11001 in binary), and ACBBA turns into the numbers 25 (11001 in binary) and 17 (10001 in binary), for the two possible subsets of the occurrences of C in the pattern. So the dictionary d looks like this:

  • 17 → {ACBBA}
  • 25 → {AABBA, ACBBA}

After processing the array {1, 2, 3, 5, 1} we have e = {1 → 17, 2 → 8, 3 → 4, 5 → 2}. The value e[1] = 17 is found in d, so this input matches the pattern ACBBA.

After processing the array {1, 1, 2, 3, 1} we have e = {1 → 25, 2 → 4, 3 → 2}. The value e[1] = 25 is found in d, so this input matches the patterns AABBA and ACBBA.
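The steps and the worked example above can be sketched in Python roughly as follows. This is only an illustration of the described idea: the function names `build_index` and `match` are invented here, indexes j are counted from 0, and patterns are assumed to contain at least one A.

```python
from itertools import combinations


def build_index(patterns):
    """Step 1 (done once): map each bitmask to the set of patterns it satisfies."""
    d = {}
    for p in patterns:
        n = len(p)
        # Position j contributes bit 2**(n - j - 1), so the leftmost position is the high bit.
        base = sum(1 << (n - j - 1) for j, ch in enumerate(p) if ch == "A")
        c_bits = [1 << (n - j - 1) for j, ch in enumerate(p) if ch == "C"]
        # Enumerate every subset of the C positions (2**k subsets for k occurrences of C).
        for r in range(len(c_bits) + 1):
            for subset in combinations(c_bits, r):
                d.setdefault(base + sum(subset), set()).add(p)
    return d


def match(d, arr):
    """Steps 2-4: build one bitmask per distinct number, then look each mask up in d."""
    n = len(arr)
    e = {}
    for j, i in enumerate(arr):
        e[i] = e.get(i, 0) + (1 << (n - j - 1))
    found = set()
    for mask in e.values():
        found |= d.get(mask, set())
    return found


d = build_index(["AABBA", "ACBBA"])
print(d)                          # masks 25 and 17, as in the example above
print(match(d, [1, 2, 3, 5, 1]))  # only ACBBA
print(match(d, [1, 1, 2, 3, 1]))  # both AABBA and ACBBA
```

Each input array costs one pass to build `e` plus one dictionary lookup per distinct number, regardless of how many patterns are stored.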

Gareth Rees
  • BAABC is not the same as ABBAC, as "Numbers represented by B don't have to be equal" – colinfang Dec 15 '12 at 21:11
  • @colinfang: Thanks: I should have read more carefully. Here's a different idea. – Gareth Rees Dec 15 '12 at 22:00
  • I've made the description of the algorithm even more explicit. We'll get there in the end, I hope. – Gareth Rees Dec 16 '12 at 01:08
  • Thank you, i gotcha finally. – colinfang Dec 16 '12 at 01:21
  • @colinfang: Re your "Update 2", if you have 30 patterns with 10 occurrences of **C** per pattern, then the dictionary *d* contains up to 30,720 keys. This doesn't seem like very many, compared with 10,000,000 arrays. Also "many unnecessary comparisons" seems wrong: there's just one hash table lookup for each unique number in each input array: hard to see how you could do better than that. – Gareth Rees Dec 17 '12 at 14:00

Get the index of the first A in the pattern, get the value for that position, then loop through the positions.

To check whether the array `array` matches the pattern in the string `pattern`, with the result in the boolean `match`:

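// Assumes the pattern contains at least one 'A' and has the same length as the array.
int index = pattern.IndexOf('A');
int value = array[index]; // the value every 'A' position must equal
bool match = true;
for (int i = 0; i < array.Length; i++) {
  // 'C' positions are wildcards; the reference position needs no check.
  if (pattern[i] != 'C' && i != index) {
    // 'A' positions must equal value; 'B' positions must differ from it.
    if ((pattern[i] == 'A') != (array[i] == value)) {
      match = false;
      break;
    }
  }
}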
Guffa
  • hmm am I missing something, how is this not O(N*M)? – Woot4Moo Dec 15 '12 at 19:23
  • @Woot4Moo It is *O(N)*, because there are at most 2 sequential linear scans (1st - IndexOf, 2nd - for loop). So it is: O(N) + O(M) = O(N). – oleksii Dec 15 '12 at 19:30
  • The posted code is O(N), but it must be run M times due to there being M different patterns. This solution is O(N*M). – goat Dec 15 '12 at 20:25
  • As the length of the pattern is the same as the length of the array, the code is O(N+N), which is the same as O(N). Looping through the patterns get you O(N*M). – Guffa Dec 15 '12 at 20:27
  • How is it different from my "straightforward but inefficient" way? O(N * M) is not efficient. Say, if AAAAA and AAAAC are both in the pre defined pattern set. Then if AAAAA matches, AAAAC matches as well. I am looking for an algorithm which could recognize that. – colinfang Dec 15 '12 at 23:59
  • Sorry I forgot to mention that the pattern and array are always of the same length. – colinfang Dec 16 '12 at 00:11
  • @colinfang: It's an efficient way of doing it. Both the `IndexOf` and the matching loop are short circuiting, so the actual performance is somewhere between O(1*M) and O(N*M). – Guffa Dec 16 '12 at 00:59
  • First, the number of patterns is not relevant here, as they must be considered as different problems because there is no relationship between any of them: you cannot reduce the time it takes to solve the second pattern based on the solution found for the first. However, what you can do is try to reduce the time it takes for each integer array by storing the values of the index (the location of the first 'A') and all of the subsequent 'A' and 'B'. This way, you can directly retrieve the values required for the boolean tests after the first array without having to re-scan the pattern. – SylvainL Dec 16 '12 at 02:06
  • @SylvainL: The number of patterns is relevant if the OP says so, and he has. It's not possible to do better asymptotically than Guffa's O(N) solution *if M=1*, but for larger M it may well be possible to do better than O(NM) by rearranging, combining or otherwise preprocessing the set of all M patterns. It's often possible to decrease the time needed to solve a second pattern after solving the first one -- e.g if both consist of all As and Cs, and the set of A positions in pattern 2 is a superset of the set of A positions in pattern 1, then it's not necessary to recheck those positions. – j_random_hacker Dec 16 '12 at 07:18
  • This is true if there is a relationship that exists between the patterns; however, the OP did not make any mention of it, therefore we cannot analyse it and we must consider that all the patterns are independent. Also, as a sideline, the Guffa solution is not linear but logarithmic: if you increase the length of the patterns by a factor of two, you don't expect to double the required time because you can break out of the scan at the first mismatched character. (This is for the given examples of patterns where the distribution of the letters A, B and C looks more or less random.) – SylvainL Dec 16 '12 at 08:19
  • @SylvainL: Even though there are no constraints given on the patterns, to claim that no solution better than O(NM) is possible it has to be proven that one cannot be discovered, and that hasn't been done yet. Until then it's possible that someone will come up with a way to arrange or preprocess the patterns that guarantees better-than-O(NM) time performance for *any* set of M patterns. Regarding your O(log N) claim, that may be true *on average* for some distributions of patterns, but Guffa's solution remains O(N) in the worst case. – j_random_hacker Dec 16 '12 at 10:32
  • @SylvainL sorry for the confusion. Please see my updated question. – colinfang Dec 16 '12 at 14:22