Questions tagged [jaro-winkler]

An algorithm for measuring the similarity of two strings, often used for duplicate detection.
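
As a rough illustration of what the tag covers, the sketch below computes the Jaro similarity (matching characters within a sliding window, minus half-counted transpositions) and then applies Winkler's common-prefix boost. The function names `jaro` and `jaro_winkler` and the parameters `prefix_scale` and `max_prefix` are illustrative choices, not taken from any library mentioned on this page; the defaults (0.1 and 4) are the conventional values, assumed here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order count as half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to `max_prefix` characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```

Note that this returns a similarity (higher means more alike); some libraries report a distance instead, typically 1 minus this value, so check which convention a given package uses before choosing a cut-off.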

78 questions
3
votes
0 answers

ElasticSearch using Jaro-Winkler & Levenshtein algorithms

I'm trying to use ElasticSearch as a data store to find people by their name. I've tried creating an index, adding words, and changing the mapping, but when I try to find people by name with the Jaro-Winkler & Levenshtein algorithms, it returns nothing…
3
votes
2 answers

Jaro-Winkler function: why is the same score matching very similar and very different words?

I am using Jaro-Winkler fuzzy matching to match names. I am trying to determine a cut-off for the similarity score. If the names are too different, I want to exclude them for manual review. While anything below 0.4 seemed to be different…
akline
  • 31
  • 1
  • 2
3
votes
1 answer

Speeding up loop calculating Jaro-Winkler distance in R

I'm new here in more than one sense. This is my first post, regarding my first script in my first attempt at acquainting myself with any programming language. In light of that, you might find this project overly ambitious, but hey, learning by doing has always been…
Morten Nielsen
  • 325
  • 2
  • 4
  • 19
3
votes
0 answers

How to match Amazon / CJ / Linkshare Products

I need to create a database from the Amazon, Commission Junction & LinkShare APIs and data feeds, and then match identical products to compare product information. My problem is related to the matching process. I start by matching…
2
votes
3 answers

Jaro-Winkler string comparison function in SAS

Is there an implementation of the Jaro-Winkler string comparison in SAS? It looks like Link King has Jaro-Winkler, but I'd prefer the flexibility of calling the function myself. Thanks!
Richard Herron
  • 9,760
  • 12
  • 69
  • 116
2
votes
2 answers

Compare and link strings with different word orders / word counts

I am trying to use the recordLinkage package to link together two datasets where one dataset tends to give multiple last / middle names and the other just gives a single last name. Currently the string comparison function that's being used is the…
2
votes
0 answers

Matching text with Arabic speech-to-text

I made an Arabic speech-to-text application. The recognized text is compared to the existing text in an array using the Jaro-Winkler string matching distance. I've been manually calculating the distance between every text input and the text that is in…
2
votes
0 answers

What is a sensible way to combine multiple Jaro-Winkler calculations?

Let's say I am comparing two individuals, each with a first name, last name, postal code, address(line1), address(line2), and phone number. These all have varying reliability and importance for determining a match. I can generate a J-W distance for…
Daniel Paczuski Bak
  • 3,720
  • 8
  • 32
  • 78
2
votes
2 answers

Doing order by using the Jaro-Winkler distance algorithm?

I am wondering how I could run a SQLite ORDER BY in this manner: select * from contacts order by jarowinkler(contacts.name,'john smith'); I know Android has a bottleneck with user-defined functions. Do I have an alternative?
Pentium10
  • 204,586
  • 122
  • 423
  • 502
2
votes
0 answers

Memory-efficient string comparison with blocking in R

I have a record linkage problem with very large datasets (2,000 entries in the A-file, ~70,000,000 entries in the B-file) and want to do distance-based matching with the Jaro-Winkler algorithm in R. Both files are data.tables filled with…
C Krüger
  • 21
  • 2
2
votes
1 answer

Fast Levenshtein Distance (and Jaro Winkler) in R for numeric vectors

Is there a package in R that contains a Levenshtein distance function that computes the distance for numeric vectors? All I have found are string-based. I am also looking for a Jaro-Winkler package that does the same, but the Levenshtein distance…
POD
  • 509
  • 8
  • 20
1
vote
2 answers

FIRST() and LAST() for MATCH_RECOGNIZE

We are analyzing streaming Twitter data to find users who are posting similar (almost identical) tweets over and over. I am using MATCH_RECOGNIZE for this. It is able to find the pattern, but I am not able to get the FIRST() and the LAST() values…
Saqib Ali
  • 3,953
  • 10
  • 55
  • 100
1
vote
1 answer

poetry error "'setup.py' [...] not found" when it exists

I'm migrating my packaging tool for a Python project from pipenv to poetry. However, when attempting to install jaro-winkler (using poetry add jaro-winkler), I get the following error: • Installing jaro-winkler (2.0.1.linux-x86_64): Failed …
Ian
  • 3,605
  • 4
  • 31
  • 66
1
vote
0 answers

Computing JaroWinkler Similarity for unordered and differently sized dataframes

I have two dataframes extracted from two attached files. I want to compute the JaroWinkler similarity for tokens inside the files. I am using the code below: from similarity.jarowinkler import JaroWinkler jarowinkler = JaroWinkler() df_gt['jarowinkler_sim']…
Pert8S
  • 582
  • 3
  • 6
  • 21
1
vote
1 answer

Applying Jaro-Winkler distance to dataframe

I have a dataframe with two columns. The first contains correct strings, the second corrupted ones. I want to apply the Jaro-Winkler distance and store it in a new third column. import pandas as pd from pyjarowinkler.distance import get_jaro_distance df =…
Arthur
  • 13
  • 1
  • 3