Questions tagged [string-algorithm]
70 questions
14
votes
4 answers
Find substring in text which has the highest similarity to a given keyword
Say I have this text = "I love apples, kiwis, oranges and bananas" and the searchString = "kiwis and bananas", and a similarity algorithm, say the Jaccard index. How can I efficiently find the substring in text which has the highest similarity to…

pathikrit
- 32,469
- 37
- 142
- 221
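A minimal brute-force baseline for the question above, assuming the Jaccard index is taken over sets of words (the excerpt does not say whether words or characters are meant). The names `jaccard` and `best_window` are hypothetical, and this sketch is quadratic in the number of words, so it illustrates the objective rather than an efficient answer:

```python
import re


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_window(text, query):
    """Return the contiguous run of words in `text` most similar to `query`.

    Assumption: similarity is word-level Jaccard; punctuation is dropped.
    Every (start, end) window is scored, so the cost grows quadratically.
    """
    words = re.findall(r"\w+", text)
    target = re.findall(r"\w+", query)
    best_score, best_span = 0.0, (0, 0)
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            score = jaccard(words[start:end], target)
            if score > best_score:  # keep the first window on ties
                best_score, best_span = score, (start, end)
    start, end = best_span
    return " ".join(words[start:end]), best_score


print(best_window("I love apples, kiwis, oranges and bananas",
                  "kiwis and bananas"))
# → ('kiwis oranges and bananas', 0.75)
```

With the example from the question, the best window shares three words with the search string out of four distinct words in total, hence 0.75.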
12
votes
5 answers
Find the words in a long stream of characters. Auto-tokenize
How would you find the correct words in a long stream of characters?
Input :
"The revised report onthesyntactictheoriesofsequentialcontrolandstate"
Google's Output:
"The revised report on syntactic theories sequential controlandstate"
(which is…

unj2
- 52,135
- 87
- 247
- 375
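One common way to approach this kind of segmentation is dynamic programming over a word list. The sketch below assumes a known vocabulary is available (the tiny set used here is made up for illustration) and returns any one valid split; it makes no attempt to rank alternative splits the way a statistical tokenizer would. The function name `segment` is hypothetical:

```python
def segment(chars, vocabulary):
    """Split `chars` into vocabulary words, or return None if impossible."""
    n = len(chars)
    max_len = max(map(len, vocabulary), default=0)
    # best[i] holds one segmentation of chars[:i], or None if there is none.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        # Only look back as far as the longest vocabulary word.
        for start in range(max(0, end - max_len), end):
            word = chars[start:end]
            if best[start] is not None and word in vocabulary:
                best[end] = best[start] + [word]
                break
    return best[n]


# Hypothetical vocabulary, chosen to cover the sample input only.
vocabulary = {"on", "the", "syntactic", "theories", "of",
              "sequential", "control", "and", "state"}
print(segment("onthesyntactictheoriesofsequentialcontrolandstate", vocabulary))
# → ['on', 'the', 'syntactic', 'theories', 'of', 'sequential', 'control', 'and', 'state']
```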
10
votes
3 answers
Fast substring search algorithm to be used by a sort of IDE with tens of thousands of very big files
I'm developing something quite similar to an IDE that will handle tens of thousands of very large (text) files and I'm surveying what the state of the art in the subject is.
As an example, Intellij's searching algorithm for standard (non-regex)…

devoured elysium
- 101,373
- 131
- 340
- 557
8
votes
3 answers
Longest palindrome prefix
How to find the longest palindrome prefix of a string in O(n)?

Neerad
- 101
- 1
- 3
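One known linear-time approach uses the KMP prefix function on the string, a separator, and the reversed string: the longest border of that combined string is the longest palindromic prefix. The sketch below assumes the separator character `"\x00"` never occurs in the input; the function name is hypothetical:

```python
def longest_palindrome_prefix(s):
    """Return the longest prefix of `s` that is a palindrome, in O(n)."""
    if not s:
        return ""
    sep = "\x00"  # assumption: this character does not appear in s
    combined = s + sep + s[::-1]
    # pi[i] = length of the longest proper prefix of combined[:i + 1]
    # that is also a suffix of it (the KMP prefix function).
    pi = [0] * len(combined)
    for i in range(1, len(combined)):
        k = pi[i - 1]
        while k > 0 and combined[i] != combined[k]:
            k = pi[k - 1]
        if combined[i] == combined[k]:
            k += 1
        pi[i] = k
    # A prefix of s that equals a suffix of reversed(s) reads the same
    # in both directions, so it is a palindrome.
    return s[:pi[-1]]


print(longest_palindrome_prefix("abacdc"))      # → aba
print(longest_palindrome_prefix("racecarxyz"))  # → racecar
```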
8
votes
5 answers
Books on string algorithms
There have been numerous posts on string algorithms:
Algorithm to find articles with similar text,
Similar String algorithm,
Efficient string matching algorithm
However, no general literature was mentioned.
Could anyone recommend a book(s) that…

Max
- 1,741
- 3
- 23
- 40
8
votes
2 answers
Alternative to Levenshtein distance for prefixes / suffixes
I have a big city database which was compiled from many different sources. I am trying to find a way to easily spot duplicates based on city name. The naive answer would be to use the Levenshtein distance. However, the problem with cities is that…

scottmrogowski
- 2,063
- 4
- 23
- 32
6
votes
4 answers
Faster Aho-Corasick PHP implementation
Is there a working implementation of Aho–Corasick in PHP? There is one Aho-Corasick string matching in PHP mentioned on the Wikipedia article:

Nikola Obreshkov
- 1,698
- 4
- 21
- 32
6
votes
7 answers
Constant-time hash for strings?
Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a…

San Jacinto
- 8,774
- 5
- 43
- 58
5
votes
3 answers
Find longest adjacent repeating non-overlapping substring
(This question isn't about music but I'm using music as an example of
a use case.)
In music a common way to structure phrases is as a sequence of notes
where the middle part is repeated one or more times. Thus, the phrase
consists of an…

Björn Lindqvist
- 19,221
- 20
- 87
- 122
5
votes
1 answer
What is a generalized suffix tree?
I saw the Wikipedia page but am still not clear on the idea.
To find the longest common substring of 2 strings (T and S), I've read that we must build a suffix tree for the string T($1)S($2), where ($1) and ($2) are special characters not part of…

batman
- 5,022
- 11
- 52
- 82
4
votes
2 answers
Algorithm for finding bags of elements in a sequence
Say that I have a sequence of elements of interest A, B, C... interspersed with don't care symbols x. I want to identify bags of elements from a predefined set of interesting combinations that happen within a predefined distance. There can be…

tonicebrian
- 4,715
- 5
- 41
- 65
4
votes
4 answers
Find all concatenations of two strings in a huge set
Given a set of 50k strings, I need to find all pairs (s, t), such that s, t and s + t are all contained in this set.
What I've tried
, there's an additional constraint: s.length() >= 4 && t.length() >= 4. This makes it possible to group the strings…

maaartinus
- 44,714
- 32
- 161
- 320
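A straightforward baseline for this problem, not necessarily what the asker tried: put all strings in a hash set, then cut each string at every allowed position and look up both halves. The name `find_concatenations` and the `min_len` parameter are hypothetical; the default of 4 mirrors the length constraint quoted in the excerpt:

```python
def find_concatenations(strings, min_len=4):
    """Return all pairs (s, t) with s, t and s + t in `strings`.

    Assumption: both parts must be at least `min_len` characters long.
    """
    pool = set(strings)
    pairs = []
    for whole in pool:
        # Each cut position yields exactly one candidate pair (s, t).
        for cut in range(min_len, len(whole) - min_len + 1):
            s, t = whole[:cut], whole[cut:]
            if s in pool and t in pool:
                pairs.append((s, t))
    return sorted(pairs)


print(find_concatenations({"abcd", "efgh", "abcdefgh", "xyzw", "abc"}))
# → [('abcd', 'efgh')]
```

The work is proportional to the number of cut positions across all strings, plus the cost of slicing and hashing each half, which is usually acceptable for tens of thousands of short strings.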
4
votes
1 answer
How to find number of 010 in a certain range of a binary string
Given a binary string, how do I find the occurrences of "010" within a certain range of the string? For example, I have the string "0100110". If the given range is 3 7 (1-based indexing), then the output will be 4. I could not find any faster way to solve…

ssavi
- 112
- 1
- 12
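The expected output of 4 for the range 3 7 matches the number of "010" subsequences in "00110" (two choices for the first 0, two for the 1, one for the last 0), not the number of contiguous matches, which would be 0. Assuming subsequences are what is meant, a single query can be answered in one pass over the range with three running counters. The function name is hypothetical, and this sketch does not address the faster many-queries setting the asker is after:

```python
def count_010_subsequences(bits, left, right):
    """Count subsequences "010" in bits[left..right] (1-based, inclusive)."""
    zeros = 0          # subsequences "0" seen so far
    zero_one = 0       # subsequences "01" seen so far
    zero_one_zero = 0  # subsequences "010" seen so far
    for ch in bits[left - 1:right]:
        if ch == "0":
            # This 0 completes every "01" seen so far, and also starts
            # a new potential subsequence.
            zero_one_zero += zero_one
            zeros += 1
        elif ch == "1":
            # This 1 extends every single "0" seen so far.
            zero_one += zeros
    return zero_one_zero


print(count_010_subsequences("0100110", 3, 7))  # → 4
```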
4
votes
3 answers
python str.index time complexity
For finding the position of a substring inside a string, a naive algorithm will take O(n^2) time. However, using a more efficient algorithm (e.g. the KMP algorithm), this can be achieved in O(n) time:
s = 'saurabh'
w = 'au'
def get_table():
i = 0; j…

Saurabh Verma
- 6,328
- 12
- 52
- 84
3
votes
2 answers
Multiple keyword (100s to 1000s) search (string-search algorithm) in PHP
I have this problem to solve in my PHP project where some keywords (from a few hundreds to a few thousands, lengths can vary) need to be searched in a string about 100-300 characters long, sometimes of lesser length 30-50 chars. I can preprocess the…

aditya
- 143
- 6