2

I have 3 text files. One with a set of text to be searched through
(ex. ABCDEAABBCCDDAABC)
One contains a number of patterns to search for in the text
(ex. AB, EA, CC)
And the last containing the frequency of each character
(ex.
A 4
B 4
C 4
D 3
E 1
)
I am trying to write an algorithm to find the least frequent occurring character for each pattern and search a string for those occurrences, then check the surrounding letters to see if the string is a match. Currently, I have the characters and frequencies in their own vectors, respectively. (Where i=0 for each vector would be A 4, respectively.

Is there a better way to do this? Maybe a faster data structure? Also, what are some efficient ways to check the pattern string against the piece of the text string once the least frequent letter is found?

Racehorse35
  • 121
  • 10

2 Answers2

2

You can run the Aho-Corasick algorithm. Its complexity (once the preprocessing - whose complexity is unrelated to the text - is done), is Θ(n + p), where

  • n is the length of the text

  • p is the total number of matches found

This is essentially optimal. There is no point in trying to skip over letters that appear to be frequent:

  1. If the letter is not part of a match, the algorithm takes unit time.

  2. If the letter is part of a match, then the match includes all letters, irrespective of their frequency in the text.

Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
0

You could run an iteration loop that keeps a count of instances and has a check to see if a character has appeared more than a percentage of times based on total characters searched for and total length of the string. i.e. if you have 100 characters and 5 possibilities, any character that has appeared more than 20% of the hundred can be discounted, increasing efficiency by passing over any value matching that one.

DRoc101
  • 63
  • 8