Questions tagged [string-algorithm]

70 questions
14
votes
4 answers

Find substring in text which has the highest similarity to a given keyword

Say I have text = "I love apples, kiwis, oranges and bananas", searchString = "kiwis and bananas", and a similarity algorithm, say the Jaccard index. How can I efficiently find the substring in text which has the highest similarity to…
pathikrit
  • 32,469
  • 37
  • 142
  • 221
12
votes
5 answers

Find the words in a long stream of characters. Auto-tokenize

How would you find the correct words in a long stream of characters? Input : "The revised report onthesyntactictheoriesofsequentialcontrolandstate" Google's Output: "The revised report on syntactic theories sequential controlandstate" (which is…
unj2
  • 52,135
  • 87
  • 247
  • 375
10
votes
3 answers

Fast substring search algorithm to be used by a sort of IDE with tens of thousands of very big files

I'm developing something quite similar to an IDE that will handle tens of thousands of very large (text) files and I'm surveying what the state of the art in the subject is. As an example, Intellij's searching algorithm for standard (non-regex)…
devoured elysium
  • 101,373
  • 131
  • 340
  • 557
8
votes
3 answers

Longest palindrome prefix

How to find the longest palindrome prefix of a string in O(n)?
Neerad
  • 101
  • 1
  • 3
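
One common way to approach this question in linear time is the KMP prefix (border) function applied to the string followed by a separator and its reverse. The sketch below is only an illustration under that assumption; the function name `longest_palindrome_prefix` is hypothetical and not taken from the question or its answers.

```python
def longest_palindrome_prefix(s: str) -> str:
    """Return the longest prefix of s that is a palindrome, in O(len(s))."""
    # Build s + sentinel + reversed(s). None never equals a character,
    # so no border can span the sentinel.
    seq = list(s) + [None] + list(reversed(s))
    border = [0] * len(seq)  # border[i]: longest proper border of seq[:i + 1]
    for i in range(1, len(seq)):
        k = border[i - 1]
        while k > 0 and seq[i] != seq[k]:
            k = border[k - 1]
        if seq[i] == seq[k]:
            k += 1
        border[i] = k
    # The final border is a prefix of s that also ends reversed(s),
    # which means it reads the same in both directions.
    return s[:border[-1]]
```

For example, `longest_palindrome_prefix("abacdc")` gives `"aba"`.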
8
votes
5 answers

Books on string algorithms

There have been numerous posts on string algorithms: Algorithm to find articles with similar text, Similar String algorithm, Efficient string matching algorithm. However, no general literature was mentioned. Could anyone recommend a book(s) that…
Max
  • 1,741
  • 3
  • 23
  • 40
8
votes
2 answers

Alternative to Levenshtein distance for prefixes / suffixes

I have a big city database which was compiled from many different sources. I am trying to find a way to easily spot duplicates based on city name. The naive answer would be to use the Levenshtein distance. However, the problem with cities is that…
scottmrogowski
  • 2,063
  • 4
  • 23
  • 32
6
votes
4 answers

Faster Aho-Corasick PHP implementation

Is there a working implementation of Aho–Corasick in PHP? There is one Aho-Corasick string matching in PHP mentioned on the Wikipedia article:
Nikola Obreshkov
  • 1,698
  • 4
  • 21
  • 32
6
votes
7 answers

Constant-time hash for strings?

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are dictionary<> in .NET and the {} storage structure in Python. Other languages certainly support such a…
San Jacinto
  • 8,774
  • 5
  • 43
  • 58
5
votes
3 answers

Find longest adjacent repeating non-overlapping substring

(This question isn't about music but I'm using music as an example of a use case.) In music a common way to structure phrases is as a sequence of notes where the middle part is repeated one or more times. Thus, the phrase consists of an…
5
votes
1 answer

What is a generalized suffix tree?

I saw the Wikipedia page but am still not clear on the idea. To find the longest common substring of 2 strings (T and S), I've read that we must build a suffix tree for the string T($1)S($2), where ($1) and ($2) are special characters not part of…
batman
  • 5,022
  • 11
  • 52
  • 82
4
votes
2 answers

Algorithm for finding bags of elements in a sequence

Say that I have a sequence of elements of interest A, B, C... interspersed with don't care symbols x. I want to identify bags of elements from a predefined set of interesting combinations that happen within a predefined distance. There can be…
tonicebrian
  • 4,715
  • 5
  • 41
  • 65
4
votes
4 answers

Find all concatenations of two string in a huge set

Given a set of 50k strings, I need to find all pairs (s, t), such that s, t and s + t are all contained in this set. What I've tried, there's an additional constraint: s.length() >= 4 && t.length() >= 4. This makes it possible to group the strings…
maaartinus
  • 44,714
  • 32
  • 161
  • 320
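
A simple baseline for this problem is to put all strings in a hash set and try every split point of every string. The sketch below shows that idea only; the name `find_concatenation_pairs` and the `min_len` parameter are hypothetical, and it is not the grouping approach the question goes on to describe.

```python
def find_concatenation_pairs(strings, min_len=4):
    """Return all pairs (s, t) such that s, t and s + t are in `strings`.

    Both s and t must be at least `min_len` characters long, which
    mirrors the constraint stated in the question.
    """
    pool = set(strings)
    pairs = []
    for whole in pool:
        # Try every cut that leaves at least min_len characters on each side.
        for cut in range(min_len, len(whole) - min_len + 1):
            s, t = whole[:cut], whole[cut:]
            if s in pool and t in pool:
                pairs.append((s, t))
    return sorted(pairs)
```

Each membership test is a hash lookup, so the total work is roughly proportional to the sum of the string lengths times the cost of slicing.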
4
votes
1 answer

How to find number of 010 in a certain range of a binary string

Given a binary string, how can I find occurrences of "010" within a certain range of the string? For example, I have the string "0100110". If the given range is 3 7 (1-based indexing), then the output will be 4. I could not find any faster way to solve…
ssavi
  • 112
  • 1
  • 12
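
The example in this question (range 3 7 of "0100110" giving 4) only works out if "occurrences" means subsequences rather than contiguous substrings; that reading is an assumption inferred from the numbers. Under it, a single pass over the range can count them. The sketch below is a per-query baseline with a hypothetical function name, not the faster method the question is looking for.

```python
def count_010_subsequences(bits: str, left: int, right: int) -> int:
    """Count subsequences equal to "010" in bits[left..right] (1-based, inclusive)."""
    zeros = 0     # number of "0" subsequences seen so far
    zero_one = 0  # number of "01" subsequences seen so far
    total = 0     # number of "010" subsequences seen so far
    for ch in bits[left - 1:right]:
        if ch == "0":
            # This 0 can close every "01" seen so far.
            total += zero_one
            zeros += 1
        else:
            # This 1 can extend every "0" seen so far.
            zero_one += zeros
    return total
```

With the question's input, `count_010_subsequences("0100110", 3, 7)` returns 4, matching the stated output.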
4
votes
3 answers

python str.index time complexity

For finding the position of a substring inside a string, a naive algorithm will take O(n^2) time. However, using some efficient algorithms (e.g. the KMP algorithm), this can be achieved in O(n) time: s = 'saurabh' w = 'au' def get_table(): i = 0; j…
Saurabh Verma
  • 6,328
  • 12
  • 52
  • 84
3
votes
2 answers

Multiple keyword (100s to 1000s) search (string-search algorithm) in PHP

I have this problem to solve in my PHP project where some keywords (from a few hundreds to a few thousands, lengths can vary) need to be searched in a string about 100-300 characters long, sometimes of lesser length 30-50 chars. I can preprocess the…
aditya
  • 143
  • 6