Questions tagged [string-matching]

String matching is the problem of finding occurrences of one string (“pattern”, “needle”) in another (“text”, “haystack”).

There are two types of string matching:

  • Exact
  • Approximate

Exact string matching is the problem of finding occurrence(s) of a pattern string within another string or body of text (NIST). For example, the pattern CGATCGATTA occurs in the text CTAGATCCTGCGATCGATTAAGCCTGA.
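As a rough illustration of the exact case, the sketch below reports every start offset of a pattern in a text. The helper name find_all is made up for this example; it relies only on Python's built-in str.find and is not one of the specialised algorithms such as Knuth-Morris-Pratt or Boyer-Moore.

```python
def find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text.

    Overlapping occurrences are reported too. find_all is a hypothetical
    helper name used only for this illustration.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are not skipped.
        start = text.find(pattern, start + 1)
    return positions


text = "CTAGATCCTGCGATCGATTAAGCCTGA"
pattern = "CGATCGATTA"
print(find_all(text, pattern))  # [10]
```

Offsets are zero-based, so the DNA pattern from the example above is found at offset 10.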

A comprehensive online reference of string matching algorithms is Exact String Matching Algorithms by Christian Charras and Thierry Lecroq.

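Approximate string matching, also called fuzzy string matching, finds substrings of the text that match the pattern approximately, typically within a small edit distance (the number of insertions, deletions, and substitutions needed to turn one string into the other).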
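One common way to make this concrete is a dynamic-programming scan in which a match may begin at any position of the text. The sketch below assumes unit-cost Levenshtein edits; the function name approx_find and the max_dist parameter are invented for this illustration and do not come from any particular library.

```python
def approx_find(text, pattern, max_dist):
    """Return (end_index, distance) pairs for approximate matches.

    A pair (end, d) means some substring of text ending just before index
    `end` is within d <= max_dist edits of pattern. Hypothetical helper,
    assuming unit-cost insertions, deletions, and substitutions.
    """
    m = len(pattern)
    # prev[i]: fewest edits to align pattern[:i] with a substring ending at
    # the previous text position. Before any text is read, that is i.
    prev = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text):
        curr = [0] * (m + 1)  # curr[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr[i] = min(prev[i - 1] + cost,  # match or substitution
                          prev[i] + 1,         # extra character in the text
                          curr[i - 1] + 1)     # pattern character missing
        if curr[m] <= max_dist:
            hits.append((j + 1, curr[m]))
        prev = curr
    return hits


print(approx_find("abcdef", "bxd", 1))  # [(4, 1)]
```

With max_dist set to 0 the scan degenerates to exact matching and reports only the end positions of exact occurrences. The running time is proportional to len(text) * len(pattern), which is fine for short inputs but not for genome-scale searches.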

2278 questions
32
votes
3 answers

Removing an item from list matching a substring

How do I remove an element from a list if it matches a substring? I have tried removing an element from a list using pop() and enumerate, but it seems I'm missing a few contiguous items that need to be removed: sents = ['@$\tthis…
alvas
  • 115,346
  • 109
  • 446
  • 738
30
votes
6 answers

Remove ends of string entries in pandas DataFrame column

I have a pandas Dataframe with one column a list of files import pandas as pd df = pd.read_csv('fname.csv') df.head() filename A B C fn1.txt 2 4 5 fn2.txt 1 2 1 fn3.txt .... .... I would like to delete the file…
ShanZhengYang
  • 16,511
  • 49
  • 132
  • 234
26
votes
2 answers

agrep: only return best match(es)

I'm using the 'agrep' function in R, which returns a vector of matches. I would like a function similar to agrep that only returns the best match, or best matches if there are ties. Currently, I am doing this using the 'sdist()' function from the…
Zach
  • 29,791
  • 35
  • 142
  • 201
26
votes
3 answers

Normalizing the edit distance

Can we normalize the Levenshtein edit distance by dividing the edit distance value by the length of the two strings? I am asking this because, if we compare two strings of unequal length, the difference between the lengths of the two…
25
votes
11 answers

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, tv-s etc) that come from different sources to a canonical name in the database. For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"…
Ash
25
votes
14 answers

Search for string allowing for one mismatch in any location of the string

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasite). I am not sure how large the genome is, but much longer than 230,000…
Vincent
  • 1,579
  • 4
  • 23
  • 38
24
votes
2 answers

XPath partial of attribute known

I know the partial value of an attribute in a document, but not the whole thing. Is there a character I can use to represent any value? For example, a value of a label for an input is "A. Choice 1". I know it says "Choice 1", but not whether it…
avaleske
  • 1,793
  • 5
  • 16
  • 26
24
votes
2 answers

Regex for existence of some words whose order doesn't matter

I would like to write a regex for searching for the existence of some words, but their order of appearance doesn't matter. For example, search for "Tim" and "stupid". My regex is Tim.*stupid|stupid.*Tim. But is it possible to write a simpler regex…
Tim
  • 1
  • 141
  • 372
  • 590
23
votes
4 answers

what is a good metric for deciding if 2 Strings are "similar enough"

I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm also using Levenshtein Distance to calculate the edit distance between the Strings. What I'm doing currently is basically taking the total number of edits…
Hristo
  • 45,559
  • 65
  • 163
  • 230
23
votes
15 answers

Delete duplicate strings in string array

I am writing a string-processing program in Java in which I need to remove duplicate strings from a string array. In this program, all strings are the same size. The 'array' which is a string array contains a number of strings in which…
user1339752
21
votes
7 answers

c# string comparison method returning index of first non match

Is there an existing string comparison method that will return a value based on the first occurrence of a non-matching character between two strings? i.e. string A = "1234567890" string B = "1234567880" I would like to get a value back that would…
Andy
  • 425
  • 1
  • 5
  • 16
21
votes
3 answers

Find numbers after specific text in a string with RegEx

I have a multiline string like the following: 2012-15-08 07:04 Bla bla bla blup 2012-15-08 07:05 *** Error importing row no. 5: The import of this line failed because bla bla 2012-15-08 07:05 Another text that I don't want to search... 2012-15-08…
Patric
  • 2,789
  • 9
  • 33
  • 60
20
votes
4 answers

strstr faster than algorithms?

I have a file that's 21056 bytes. I've written a program in C that reads the entire file into a buffer, and then uses multiple search algorithms to search the file for a token that's 82 chars. I've used all the implementations of the algorithms from…
Josh
  • 6,046
  • 11
  • 52
  • 83
20
votes
5 answers

Python: optimal search for substring in list of strings

I have a particular problem where I want to search for many substrings in a list of many strings. The following is the gist of what I am trying to do: listStrings = [ACDE, CDDE, BPLL, ... ] listSubstrings = [ACD, BPI, KLJ, ...] The above entries…
Alopex
  • 323
  • 2
  • 8
20
votes
2 answers

Fast partial string matching in R

Given a vector of strings texts and a vector of patterns patterns, I want to find any matching pattern for each text. For small datasets, this can be easily done in R with grepl: patterns = c("some","pattern","a","horse") texts = c("this is a text…
Mulone
  • 3,603
  • 9
  • 47
  • 69