String Finding Alg w/ Lowest Freq Char

Question

I have 3 text files. One with a set of text to be searched through
(ex. ABCDEAABBCCDDAABC)
One contains a number of patterns to search for in the text
(ex. AB, EA, CC)
And the last containing the frequency of each character
(ex.
A 4
B 4
C 4
D 3
E 1
)
I am trying to write an algorithm to find the least frequent occurring character for each pattern and search a string for those occurrences, then check the surrounding letters to see if the string is a match. Currently, I have the characters and frequencies in their own vectors, respectively. (Where i=0 for each vector would be A 4, respectively.

Is there a better way to do this? Maybe a faster data structure? Also, what are some efficient ways to check the pattern string against the piece of the text string once the least frequent letter is found?

https://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm — danh, Oct 03 '16 at 16:23

Ami Tavory · Accepted Answer · 2016-10-03T20:39:44.820

You can run the Aho-Corasick algorithm. Its complexity (once the preprocessing - whose complexity is unrelated to the text - is done), is Θ(n + p), where

n is the length of the text
p is the total number of matches found

This is essentially optimal. There is no point in trying to skip over letters that appear to be frequent:

If the letter is not part of a match, the algorithm takes unit time.
If the letter is part of a match, then the match includes all letters, irrespective of their frequency in the text.

score 0 · Answer 2 · answered Oct 03 '16 at 16:19

You could run an iteration loop that keeps a count of instances and has a check to see if a character has appeared more than a percentage of times based on total characters searched for and total length of the string. i.e. if you have 100 characters and 5 possibilities, any character that has appeared more than 20% of the hundred can be discounted, increasing efficiency by passing over any value matching that one.

String Finding Alg w/ Lowest Freq Char

2 Answers2