Questions tagged [agrep]

An approximate grep for fuzzy matching

agrep (approximate grep) is a proprietary fuzzy string searching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows. For each query it selects the best-suited algorithm from a set of fast built-in string searching algorithms, including Manber and Wu's bitap algorithm, which is based on Levenshtein distance. agrep is also the search engine in the indexer program GLIMPSE. agrep is free for private and non-commercial use only, and belongs to the University of Arizona.
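As a rough illustration of the bitap idea mentioned above, here is a minimal Python sketch of approximate substring search with at most k edits (insertions, deletions, substitutions). It is not agrep's actual source code; the function name `fuzzy_contains` and its parameters are made up for this example.

```python
def fuzzy_contains(pattern: str, text: str, k: int) -> bool:
    """Return True if some substring of `text` is within Levenshtein
    distance `k` of `pattern` (bitap / shift-and with k errors).

    Illustrative sketch only; the name and signature are hypothetical.
    """
    m = len(pattern)
    if m == 0:
        return True  # the empty pattern matches anywhere

    # masks[c] has bit i set when pattern[i] == c
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    full = (1 << m) - 1      # keep state words to m bits
    accept = 1 << (m - 1)    # bit set when the whole pattern has matched

    # state[d], bit i: pattern[0..i] matches a suffix of the text read
    # so far with at most d errors. Initially, the first d pattern
    # characters can be "matched" by deleting them.
    state = [((1 << d) - 1) & full for d in range(k + 1)]

    for ch in text:
        mask = masks.get(ch, 0)
        prev_old = state[0]                      # state[d-1] before this char
        state[0] = ((state[0] << 1) | 1) & mask  # exact-match transition
        for d in range(1, k + 1):
            cur_old = state[d]
            state[d] = (
                (((cur_old << 1) | 1) & mask)  # characters match
                | prev_old                     # extra character in text
                | ((prev_old << 1) | 1)        # substituted character
                | ((state[d - 1] << 1) | 1)    # pattern character missing in text
            ) & full
            prev_old = cur_old
        if state[k] & accept:
            return True
    return False
```

Note that the search is asymmetric: the whole pattern must be matched against some part of the text, so swapping the two arguments can change the answer.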

89 questions
2 votes, 2 answers

Fuzzy matching strings within a single column and documenting possible matches

I have a relatively large dataset of ~ 5k rows containing titles of journal/research papers. Here is a small sample of the dataset: dt = structure(list(Title = c("Community reinforcement approach in the treatment of opiate addicts", "Therapeutic…
jmogil
2 votes, 1 answer

What is the logic of approximate string matching?

Does anybody know the reason for the following result: agrepl("cold", "cool") #> [1] FALSE agrepl("cool", "cold") #> [1] TRUE
2 votes, 1 answer

Extract substring match from agrep

My goal is to identify whether a given text has a target string in it, but I want to allow for typos / small deviations and extract the substring that "caused" the match (to use it for further text analysis). Example: target <- "target string" text…
Tlatwork
2 votes, 3 answers

How to fix error agrep: pattern too long (has > 32 chars) it doesn't show error if there is no full stop in the string?

agrep gives the error agrep: pattern too long (has > 32 chars) when there is a full stop (.) in the pattern string, but not otherwise. I want to compare two strings approximately, so I'm using agrep for that, but it's giving an error agrep: pattern too…
Manik
2 votes, 3 answers

Identify fuzzy duplicates from a single column and create a subset containing records of fuzzy duplicates using R

I have a dataset which contains a field with individual's name. Some of the names are similar with minute differences like 'CANON INDIA PVT. LTD' and 'CANON INDIA PVT. LTD.', 'Antila,Thomas' and 'ANTILA THOMAS', 'Z_SANDSTONE COOLING LTD' and…
Jazz
2 votes, 1 answer

R: Fuzzy merge using agrep and data.table

I try to merge two data.tables, but due to different spelling in stock names I lose a substantial number of data points. Hence, instead of an exact match I was looking into a fuzzy merge. library("data.table") dt1 = data.table(Name = c("ASML…
Hjalmar
2 votes, 2 answers

Partial string matching in R and trim the characters

Here is a dataframe and a vector. df1 <- tibble(var1 = c("abcd", "efgh", "ijkl", "mnopqr", "qrst")) vec <- c("ab", "mnop", "ijk") Now, for all the values in var1 that match most closely (I would like to match the first n characters) with the values…
Geet
2 votes, 1 answer

fuzzy string matching with agrep()

I'm matching a list of company names against itself with R and agrep() because the data was stored incorrectly in a legacy system: no 4th normal form, and companies were recorded on the same level as customers, which means a new company entry for every new…
Salfii
2 votes, 2 answers

Alternative approach to using agrep() for fuzzy matching in R

I have a large file of administrative data, about 1 million records. Individual people can be represented multiple times in this dataset. About half the records have an identifying code that maps records to individuals; for the half that don't, I…
edstatsuser
2 votes, 1 answer

Fuzzy mapping in R

I am trying to use the agrep command for fuzzy matching. I have a data frame in which one column contains the audience response, and another data frame in which segment and subsegment are listed. The audience response column contains the words that are…
Shaz
2 votes, 1 answer

R agrep() function behaviour

I have some trouble understanding the result of the agrep() function. I don't understand what I have missed in the description of the function. agrep() is for fuzzy matching and I'd like to use it to correct some misspellings. I'd like to allow only a…
Vivien
2 votes, 0 answers

R: slow fuzzy matching with agrep

I have a vector of patterns and a large vector of potential match candidates. For each element in x I use agrep to obtain a list of close matches in y. The problem is that the code is very slow: it takes approximately 2 seconds per element of x.…
Alexey Ferapontov
2 votes, 1 answer

agrep string matching in R

I have two lists of product names. My problem is that "Operating system" is matching with "system", "cooling system", etc., but it should match only with "Operating" or "OS". Another example is "Key Board", which should be matched with "key" or "KB" but not with…
Kavipriya
2 votes, 6 answers

SQL: match a string pattern irrespective of its case and whitespace in a column

I need to find the frequency of a string in a column, irrespective of its case and any white space. For example, if my string is My Tec Bits and it occurs in my table as shown below: 061 MYTECBITS 12123 102 mytecbits 24324 103…
sunitprasad1
2 votes, 2 answers

Approximate string matching with a letter confusion matrix?

I'm trying to model a phonetic recognizer that has to isolate instances of words (strings of phones) out of a long stream of phones that doesn't have gaps between each word. The stream of phones may have been poorly recognized, with letter…