Questions tagged [string-matching]

String matching is the problem of finding occurrences of one string (“pattern”, “needle”) in another (“text”, “haystack”).

There are two types of string matching:

  • Exact
  • Approximate

Exact string matching is the problem of finding occurrence(s) of a pattern string within another string or body of text (NIST). For example, finding CGATCGATTA in CTAGATCCTGCGATCGATTAAGCCTGA.

A comprehensive online reference of string matching algorithms is Exact String Matching Algorithms by Christian Charras and Thierry Lecroq.
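As a minimal sketch of exact matching, the snippet below finds every start position of the pattern from the example above. The helper name `find_all` is hypothetical, and it leans on Python's built-in `str.find` rather than on any particular algorithm from the reference.

```python
def find_all(pattern, text):
    """Return the start index of every occurrence of pattern in text.

    Overlapping occurrences are reported too, because the search
    resumes one character after the previous hit.
    """
    if not pattern:
        return []  # assumption: an empty pattern yields no matches
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


print(find_all("CGATCGATTA", "CTAGATCCTGCGATCGATTAAGCCTGA"))  # [10]
```

The pattern occurs once, starting at 0-based index 10.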

Approximate string matching, also called fuzzy string matching, searches for matches based on the edit distance between the pattern and the text.
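To make "edit distance" concrete, here is a small sketch of the plain Levenshtein distance (insertions, deletions, substitutions, each costing 1). This is only one possible metric, chosen here for illustration. The function name `levenshtein` is hypothetical, and the code compares two whole strings rather than searching a text for approximate occurrences.

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A fuzzy matcher would typically accept a candidate when this distance is at or below a chosen threshold.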

2278 questions
13
votes
1 answer

Find matches of a vector of strings in another vector of strings

I'm trying to create a subset of a data frame of news articles that mention at least one element of a set of keywords or phrases. # Sample data frame of articles articles <- data.frame(id=c(1, 2, 3, 4), text=c("Lorem ipsum dolor sit amet,…
Andrew
  • 36,541
  • 13
  • 67
  • 93
13
votes
2 answers

Which algorithm is being used in Android's spell checker?

I am doing some research on string matching algorithms. One of the most usable I came across is the one my cellphone uses (Android 2.3.4 on an SE Xperia neo V). As seen in the screenshot, I pressed the characters jiw which are near the ones I wanted…
Odys
  • 8,951
  • 10
  • 69
  • 111
12
votes
3 answers

r dplyr ends_with multiple string matches

Can I use dplyr::select(ends_with) to select column names that fit any of multiple conditions? Considering my column names, I want to use ends_with instead of contains or matches, because the strings I want to select are relevant at the end of the…
user42485
  • 751
  • 2
  • 9
  • 19
12
votes
1 answer

Joining two datasets using fuzzy logic

I’m trying to do a fuzzy logic join in R between two datasets: the first data set has the name of a location and a column called config; the second data set has the name of a location and two additional attributes that need to be summarized before they are…
steppermotor
  • 701
  • 6
  • 22
12
votes
2 answers

Excel - Match substring from list of choices - INDEX, MATCH, and FIND used together

I'd like to search for a specific movie title within a list of video titles, search for MATCH, and use Index to return its description. I know this can be done with a text search in a filter via Column A, but I'd like to do it with a…
James
  • 209
  • 3
  • 13
12
votes
5 answers

What is the best algorithm for matching two strings containing less than 10 words in Latin script

I'm comparing song titles, using Latin script (although not always). My aim is an algorithm that gives a high score if the two song titles seem to be the same title and a very low score if they have nothing in common. Now I already had to code…
Paul Taylor
  • 13,411
  • 42
  • 184
  • 351
12
votes
1 answer

Log4Net StringMatchFilter is not filtering anything

I'm logging all SQL generated by nHibernate because we have a weird issue. This alone generates huge logs so I'm trying to shorten them up a bit by trying to only log lines that contain a certain ID. It still seems like everything is coming…
Ryan Bosinger
  • 1,832
  • 2
  • 16
  • 25
12
votes
2 answers

Damerau–Levenshtein distance (Edit Distance with Transposition) c implementation

I implemented the Damerau–Levenshtein distance in C++ but it does not give the correct output for the input (pantera, aorta): the correct output is 4 but my code gives 5. int editdist(string s,string t,int n,int m) { int d1,d2,d3,cost; int i,j; …
user1413523
  • 345
  • 1
  • 7
  • 15
11
votes
1 answer

Best library for fuzzy document match / text fingerprinting

I am thinking of building an API that would let a program submit a "fingerprint" of an academic publication, match this against a database of articles from Open Access journals, and if found, send the user the canonical citation information.…
Stian Håklev
  • 1,240
  • 2
  • 14
  • 26
11
votes
3 answers

fuzzy version of stringr::str_detect for filtering dataframe

I've got a database with free text fields that I want to use to filter a data.frame or tibble. I could perhaps with lots of work create a list of all possible misspellings of my search terms that currently occur in the data (see example of all the…
11
votes
6 answers

Match vectors in sequence

I have 2 vectors. x=c("a", "b", "c", "d", "a", "b", "c") y=structure(c(1, 2, 3, 4, 5, 6, 7, 8), .Names = c("a", "e", "b", "c", "d", "a", "b", "c")) I would like to match a to a, b to b in sequence accordingly, so that x[2] matches y[3] rather…
Sati
  • 716
  • 6
  • 27
11
votes
2 answers

Rating the quality of string matches

What would be the best way to compare a pattern with a set of strings, one by one, while rating the amount with which the pattern matches each string? In my limited experience with regex, matching strings with patterns using regex seems to be a…
user35288
11
votes
5 answers

Check for a key pattern in a dictionary in python

dict1=({"EMP$$1":1,"EMP$$2":2,"EMP$$3":3}) How to check if EMP exists in the dictionary using Python: dict1.get("EMP##")?
Rajeev
  • 44,985
  • 76
  • 186
  • 285
11
votes
1 answer

String Matching Using Recurrent Neural Networks

I have recently started exploring Recurrent Neural Networks. So far I have trained a character-level language model on TensorFlow using Andrej Karpathy's blog. It works great. I couldn't however find any study on using RNNs for string matching or…
11
votes
3 answers

Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

I'm trying to fuzzy match two csv files, each containing one column of names, that are similar but not the same. My code so far is as follows: import pandas as pd from pandas import DataFrame from fuzzywuzzy import process import csv save_file =…
Kvothe
  • 1,341
  • 7
  • 20
  • 33