Questions tagged [string-matching]

String matching is the problem of finding occurrences of one string (“pattern”, “needle”) in another (“text”, “haystack”).

There are two types of string matching:

  • Exact
  • Approximate

Exact string matching is the problem of finding occurrence(s) of a pattern string within another string or body of text (NIST). For example, finding CGATCGATTA in CTAGATCCTGCGATCGATTAAGCCTGA.
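
As a rough illustration of the example above, the following Python sketch reports every position at which a pattern occurs in a text. The helper name `find_all` is hypothetical and only for illustration; it relies on the built-in `str.find` and is not meant to represent any particular algorithm from the literature.

```python
def find_all(pattern: str, text: str) -> list[int]:
    """Return the start index of every (possibly overlapping) exact match."""
    if not pattern:
        return []  # assumption: an empty pattern is treated as "no matches"
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character so that overlapping matches are also found.
        start = text.find(pattern, start + 1)
    return positions


if __name__ == "__main__":
    print(find_all("CGATCGATTA", "CTAGATCCTGCGATCGATTAAGCCTGA"))  # [10]
```

Here the pattern is found once, starting at 0-based index 10 of the text.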

A comprehensive online reference of string matching algorithms is Exact String Matching Algorithms by Christian Charras and Thierry Lecroq.

Approximate string matching, also called fuzzy string matching, searches for parts of the text that are close to the pattern rather than identical to it, where closeness is measured by the edit distance between the pattern and the text.
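
One possible way to make this concrete is the classic dynamic-programming formulation in which a match may start anywhere in the text at no cost. The Python sketch below assumes unit costs for insertion, deletion and substitution (Levenshtein distance); the function name `best_approximate_match` is hypothetical.

```python
def best_approximate_match(pattern: str, text: str) -> tuple[int, int]:
    """Return (distance, end) for the lowest edit distance between `pattern`
    and any substring of `text`; `end` is the exclusive end index of the
    first substring that reaches that distance."""
    m = len(pattern)
    # col[i] = edit distance between pattern[:i] and the best substring of
    # the text ending at the current position. Against empty text it is i.
    col = list(range(m + 1))
    best_dist, best_end = col[m], 0
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # value of row i-1 in the previous column
        # col[0] stays 0: a match may begin at any text position for free.
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_val = min(
                col[i] + 1,        # skip the text character
                col[i - 1] + 1,    # skip the pattern character
                prev_diag + cost,  # match or substitute
            )
            prev_diag = col[i]
            col[i] = new_val
        if col[m] < best_dist:
            best_dist, best_end = col[m], j
    return best_dist, best_end
```

With the text from the exact-matching example, the pattern CGATCGATTA is found with distance 0, while CGATCGTTTA (one character changed) is found with distance 1. The running time is proportional to the product of the pattern and text lengths, so this is a sketch of the idea rather than a tuned implementation.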

2278 questions
6
votes
9 answers

Database/datasource optimized for string matching?

I want to store a large number (~thousands) of strings and be able to perform matches using wildcards. For example, here is a sample content: Folder1 Folder1/Folder2 Folder1/* Folder1/Folder2/Folder3 Folder2/Folder* */Folder4 */Fo*4 (each line has…
Matthieu Napoli
  • 48,448
  • 45
  • 173
  • 261
6
votes
3 answers

Identify Columns Index Matching Given Vector of String

I've got a vector of strings x<-c('a','b') and I have a matrix with multiple columns, whose column names include the names in that vector. I would like to get the column numbers/indices whose names match. which(colnames(sample_matrix) == x) This…
user1234440
  • 22,521
  • 18
  • 61
  • 103
6
votes
3 answers

Efficient algorithm for string matching with a very large pattern set

I'm looking for an efficient algorithm able to find all patterns that match a specific string. The pattern set can be very large (more than 100,000) and dynamic (patterns added or removed at anytime). Patterns are not necessarily standard regexp,…
lquerel
  • 153
  • 1
  • 8
6
votes
1 answer

Fast and efficient computation on arrays

I want to count the number of occurrences of a particular phrase in a document. For example "stackoverflow forums". Suppose D represents the set of documents containing both terms. Now, suppose I have the following data…
DotNet
  • 697
  • 2
  • 7
  • 23
6
votes
3 answers

how to implement near matches of strings in java?

Hello fellow programmers, I would like to ask for some help with regard to near matches of strings. Currently, I have a program that stores description strings; users can search for a description by typing it completely or partially. I would…
melyong
  • 83
  • 1
  • 7
6
votes
3 answers

How can I find the first occurrence of a substring occurring after another substring in python?

Strings in Python have a find("somestring") method that returns the index number for "somestring" in your string. But let's say I have a string like the following: "$5 $7 $9 Total Cost: $35 $14" And I want to find the index of the first…
CQP
  • 937
  • 2
  • 11
  • 17
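
For the question above, one minimal sketch uses the optional start argument of `str.find` to begin searching just past the first substring. It assumes the goal is the first "$" after "Total Cost" in the sample string; the helper name `find_after` is hypothetical.

```python
def find_after(text: str, anchor: str, target: str) -> int:
    """Return the index of the first `target` located after the first
    `anchor`, or -1 if either one is missing."""
    anchor_pos = text.find(anchor)
    if anchor_pos == -1:
        return -1
    # str.find accepts a start offset; begin right after the anchor ends.
    return text.find(target, anchor_pos + len(anchor))


if __name__ == "__main__":
    print(find_after("$5 $7 $9 Total Cost: $35 $14", "Total Cost", "$"))  # 21
```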
6
votes
1 answer

Record Matching algorithms for an inconsistent dataset

I'm working with a large dataset of products (~1 million). These products come from many different sources, and thus the way they have data listed is inconsistent. One of the big issues is variance in product brand names (~17,000 unique brands). …
NSjonas
  • 10,693
  • 9
  • 66
  • 92
6
votes
2 answers

Finding a Russian character in NSString

I have to check whether a Russian character is present in an NSString or not. I am using the following code for that: NSCharacterSet * set = [[NSCharacterSet characterSetWithCharactersInString:@"БГДЁЖИЙЛПФХЦЧШЩЪЫЭЮЯ"] invertedSet]; BOOL…
Rachit
  • 1,159
  • 3
  • 13
  • 23
5
votes
2 answers

Order-independent fuzzy matching of "Firstname Lastname"/"Lastname Firstname" in R?

I have two lists of names for the same set of students which have been collected separately. There are numerous typographical errors and I have been using fuzzy matching to link the two lists. I am 99+% there with agrep and similar, but am stuck on…
Jonathan Burley
  • 771
  • 1
  • 6
  • 8
5
votes
5 answers

regex to match html tags with specific attributes

I am trying to match all HTML tags that do not have the attribute "term" or "range". Here is the sample HTML format: DATE: 12/01/10 MR: 1234567
user253530
  • 2,583
  • 13
  • 44
  • 61
5
votes
1 answer

Real time Prefix matching and auto-complete in Quora

How is real-time autocomplete with prefix matching implemented in Quora? Since Solr and Sphinx don't support real-time updating, what changes were made to support it?
r15habh
  • 1,468
  • 3
  • 19
  • 31
5
votes
4 answers

Powershell binary grep

Is there a way to determine whether a specified file contains a specified byte array (at any position) in PowerShell? Something like: fgrep --binary-files=binary "$data" "$filepath" Of course, I can write a naive implementation: function…
Sasha
  • 3,599
  • 1
  • 31
  • 52
5
votes
1 answer

Product name string matching against a trie (supporting omissions)

I have a list of CPU models. Right now, I think the most suitable approach would be forming a trie from the list, like this: Intel -- Core -- i -- 3 | | |- 5 | | |- 7 | | -- 9 | | | …
5
votes
1 answer

Tips for efficient string matching (and indexing) for large data in R?

What I want to do: I have a number of unique ids, e.g. id1, id2, etc. They appear in a number of groups, and each group is a random sample of between 1 and 100 ids, e.g. [1] "id872- id103- id746-" [2] "id830- id582-" …
5
votes
4 answers

R: Compare character strings across multiple columns to character string in a single column by row

I am trying to create a variable that is a logical value when comparing one character string to more than two other character strings in a data.table and I need to ignore NA's. Sample data for D2: structure(list(ID = c("a001", "a002", "a003"), var1…
user3594490
  • 1,949
  • 2
  • 20
  • 26