Questions tagged [fuzzy-comparison]

Fuzzy comparison is the colloquial name for approximate string matching, the technique of finding strings that match a pattern approximately (rather than exactly). This problem is typically divided into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that match the pattern approximately.
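
The two sub-problems can be illustrated with a minimal Python sketch. It is not taken from any question on this page: the function names `levenshtein` and `best_substring_match` and the sample city list are assumptions made for illustration. Edit distance is used as the similarity measure, and the dictionary lookup relies on `difflib.get_close_matches` from the standard library, which ranks by a similarity ratio rather than by edit distance.

```python
from difflib import get_close_matches


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def best_substring_match(pattern: str, text: str) -> tuple[int, int]:
    """Sub-problem 1: (distance, end index) of the substring of text closest to pattern."""
    # Same recurrence as above, but row 0 is all zeros so that a match
    # may start at any position in text instead of only at position 0.
    previous = [0] * (len(text) + 1)
    for i, char_p in enumerate(pattern, start=1):
        current = [i]
        for j, char_t in enumerate(text, start=1):
            cost = 0 if char_p == char_t else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    distance = min(previous)
    # end index is exclusive: text[:end] ends with the best-matching substring
    return distance, previous.index(distance)


if __name__ == "__main__":
    # Sub-problem 1: approximate substring match inside a given string.
    print(best_substring_match("chcago", "I live in chicago now"))

    # Sub-problem 2: dictionary strings that match the pattern approximately.
    cities = ["chicago", "boston", "chennai"]  # hypothetical dictionary
    print(get_close_matches("chcago", cities, n=1, cutoff=0.6))
```

Both functions run in time proportional to the product of the two string lengths, so for large dictionaries an index (for example n-grams or a BK-tree) is usually placed in front of the distance computation.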


361 questions
11
votes
3 answers

fuzzy version of stringr::str_detect for filtering dataframe

I've got a database with free text fields that I want to use to filter a data.frame or tibble. I could perhaps with lots of work create a list of all possible misspellings of my search terms that currently occur in the data (see example of all the…
10
votes
3 answers

FuzzyWuzzy error: WARNING:root:Applied processor reduces input query to empty string, all comparisons will have score 0. [Query: '/']

Trying to write code that will compare multiple files and return the highest fuzzratio between multiple options. The problem is I'm getting an error message: WARNING:root:Applied processor reduces input query to empty string, all comparisons will have…
Hofbr
10
votes
3 answers

How to group / compare similar news articles

In an app that I'm creating, I want to add functionality that groups news stories together. I want to group news stories about the same topic from different sources into the same group. For example, an article on XYZ from CNN and MSNBC would be in…
Randy
10
votes
4 answers

Easiest way to compare two files with lists of song titles

I have two lists of song titles, each in a plain text file, which are the filenames of licensed lyric files. I want to check if the titles in the shorter list (needle) are in the longer list (haystack). The script/app should return the list of titles in…
pbhj
10
votes
0 answers

Fuzzy merging in R - seeking help to improve my code

Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself which combines exact and fuzzy (by string distances) matching. The merging job I have to do is quite big (resulting in multiple string distance…
10
votes
2 answers

Using MinHash to find similarities between 2 images

I am using the MinHash algorithm to find similar images. I ran across the post "How can I recognize slightly modified images?", which pointed me to the MinHash algorithm. I was using a C# implementation from this blog post, Set Similarity…
dance2die
9
votes
2 answers

fuzzy join with stringdist_join() in R, Error: NAs are not allowed in subscripted assignments

First of all, I am sorry if my formatting is bad; this is my first time posting (I am also new to programming and R). I am trying to merge two data frames together on string variables. I am merging university names, which might not match up perfectly, so I…
Brian
9
votes
3 answers

Fast way to match strings with typo

I have a huge list of strings (city names) and I want to find the name of a city even if the user makes a typo. Example: the user types "chcago" and the system finds "Chicago". Of course I could calculate the Levenshtein distance of the query for all…
user2033412
9
votes
2 answers

Comparing (similar) images with Python/PIL

I'm trying to calculate the similarity (read: Levenshtein distance) of two images, using Python 2.6 and PIL. I plan to use the python-levenshtein library for fast comparison. Main question: What is a good strategy for comparing images? My idea is…
Attila O.
9
votes
4 answers

SQL and fuzzy comparison

Let's assume we have a table of People (name, surname, address, SSN, etc). We want to find all rows that are "very similar" to a specified person A. I would like to implement some kind of fuzzy logic comparison of A and all rows from the table People.…
running.t
8
votes
4 answers

Canonical URL compare in Python?

Are there any tools to do a URL compare in Python? For example, if I have http://google.com and google.com/ I'd like to know that they are likely to be the same site. If I were to construct a rule manually, I might uppercase it, then strip off the…
user396243
7
votes
1 answer

Fuzzy Match Across Columns in R

How can I measure the degree to which names are similar in R? In other words, the degree to which a fuzzy match can be made. For example, I am working with a data frame that looks like this: Name.1 <- c("gonzalez", "wassermanschultz",…
Sharif Amlani
7
votes
2 answers

Fuzzy record matching with multiple columns of information

I have a question that is somewhat high level, so I'll try to be as specific as possible. I'm doing a lot of research that involves combining disparate data sets with header information that refers to the same entity, usually a company or a…
7
votes
3 answers

How can I find the best fit subsequences of a large string?

Say I have one large string and an array of substrings that when joined equal the large string (with small differences). For example (note the subtle differences between the strings): large_str = "hello, this is a long string, that may be made up of…
Josh Voigts
7
votes
1 answer

How to perform a fuzzy join with fuzzyjoin::difference_* in R

I'm working with two different datasets that I want to merge based on a threshold. Let's say the two dataframes look like this: library(dplyr) library(fuzzyjoin) library(lubridate) df1 = data_frame(Item=1:5, DateTime=c("2015-01-01…
tblznbits