Questions tagged [sequencematcher]

For questions pertaining to SequenceMatcher from the Python difflib module. This is a flexible class for comparing pairs of sequences of any type, as long as the sequence elements are hashable. difflib is part of the Python standard library.
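As a minimal sketch of what the tag description means by "sequences of any type", the snippet below compares two strings and then two lists of tuples. The sample data and variable names are made up for illustration; only `SequenceMatcher` and `ratio()` come from difflib.

```python
from difflib import SequenceMatcher

# Strings are sequences of characters.
text_ratio = SequenceMatcher(None, "abcd", "bcde").ratio()

# Lists of hashable items (here, tuples) are compared the same way.
rows_a = [("id", 1), ("id", 2), ("id", 3)]  # illustrative data
rows_b = [("id", 1), ("id", 3)]
rows_ratio = SequenceMatcher(None, rows_a, rows_b).ratio()

print(text_ratio, rows_ratio)
```

The first argument is the optional "junk" function; passing `None` means no element is treated as junk.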

Documentation

72 questions
2
votes
2 answers

Is there an equivalent to Python's SequenceMatcher in SQL Server to join on columns that are similar?

In Python there is a nice built-in function that lets me check the difference between the sequence of two strings. Example below: from difflib import SequenceMatcher def similar(a, b): return SequenceMatcher(None, a,…
CandleWax
  • 2,159
  • 2
  • 28
  • 46
2
votes
1 answer

Python Comparing text files for similar or equal lines

I have two text files. My goal is to find the lines in First.txt that are not in Second.txt and write those lines to a third text file, Missing.txt. I have that done: fn = "Missing.txt" try: fileOutPut = open(fn, 'w') except IOError: …
Fidycent
  • 21
  • 4
2
votes
1 answer

Working of methods set_seq1 and set_seq2, difflib Python

I have checked the docs of difflib and I'm confused about how difflib.SequenceMatcher.ratio() actually works. Consider this: s = difflib.SequenceMatcher(None, "hey here", "hey there").ratio() print s gives s = 0.9411764705882353 I wanted to know…
Hypothetical Ninja
  • 3,920
  • 13
  • 49
  • 75
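The value quoted in the excerpt above can be reproduced by hand. As documented for `ratio()`, the score is 2.0*M/T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below recomputes it from `get_matching_blocks()`; the variable names are illustrative only.

```python
from difflib import SequenceMatcher

a, b = "hey here", "hey there"
matcher = SequenceMatcher(None, a, b)

# get_matching_blocks() ends with a zero-length sentinel block,
# so summing the sizes counts only the real matches.
matched = sum(block.size for block in matcher.get_matching_blocks())
total = len(a) + len(b)

# Same formula that ratio() uses: 2.0 * M / T
manual_ratio = 2.0 * matched / total
print(matched, total, manual_ratio, matcher.ratio())
```

Here the matched blocks are "hey " and "here", so M is 8 and T is 17.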
1
vote
2 answers

SequenceMatcher: Recording no match just once?

I am using SequenceMatcher to find a set of words within a group of texts. The problem I am having is that I need to record when it does not find a match, but one time per text. If I try an if statement, it gives me a result each time the comparison…
Connie
  • 13
  • 2
1
vote
1 answer

Print Rows that are "Near Duplicates" in Pandas DataFrame

I'm working on a personal project that performs Web Scraping on multiple databases of research articles (thus far I have done PubMed and Scopus) and extracts the titles of the articles. I've actually managed to pull this off on my own without…
TrevorM
  • 55
  • 6
1
vote
3 answers

Best way to recognize same club names that are written in a different way

for x in range(len(fclub1)-1): for y in range(x+1,len(fclub1)-1): if SequenceMatcher(None,fclub1[x], fclub1[y]).ratio() > 0.4: if SequenceMatcher(None,fclub2[x], fclub2[y]).ratio() > 0.4: …
1
vote
0 answers

How to write the output from a defined function related to opcodes into a new column in pandas dataframe

I am trying to write the output from a defined function in a new column in pandas dataframe & export it to excel, however when I open the excel I see blank values in the derived column. Example & the code used is given below. Dataframe name =…
1
vote
1 answer

Using difflib to compare a string with a row in a dataframe

I have a string email = 'joe@gmail.com' and a DF df = DataFrame({ 'id': [1, 2, 3], 'email_address': ['steve@gmail.com', 'joe@hotmail.com', 'bill@hotmail.com' ]}) I want to add a column named 'score' and score each email_address against my email…
RiotF
  • 71
  • 5
1
vote
0 answers

FuzzyWuzzy token sort vs difflib SequenceMatcher

I am trying to figure out the difference between the two. I get the same results (similarity scores) using the two for the same strings. Can somebody please explain the difference between the two using the formula for each of them? Any idea if one…
Samit Saxena
  • 99
  • 1
  • 9
1
vote
3 answers

How to detect sequences in an interleaved log file

I would like to match patterns from a given pattern library, returning the longest detected patterns. However I only have the interleaved result of multiple parallel tasks in a log file, e.g. from multiple cores of a processor. Is this a known…
1
vote
1 answer

Finding all similar values in pandas using SequenceMatcher Python

I'm trying to filter on a specific value in pandas in a column but also allow for typing mistakes. I thought using SequenceMatcher was a good solution but I don't know what the best way is to apply it within a DataFrame. Let's say the headers are…
Hestaron
  • 190
  • 1
  • 8
1
vote
1 answer

Drop similar text rows of one column in Python

import pandas as pd from difflib import SequenceMatcher df = pd.DataFrame({"id":[9,12,13,14], "text":["Error number 609 at line 10", "Error number 609 at line 22", "Error string 'foo' at line 11", "Error string 'bar' at line…
ah bon
  • 9,293
  • 12
  • 65
  • 148
1
vote
2 answers

Find match percentage between two strings also taking into consideration the order of the words - Python

I am looking for a way to output the match percentage between two strings (ex: names) while also taking into consideration they might be the same but with the words in a different order. I tried using SequenceMatcher() but the results are…
calin.bule
  • 95
  • 1
  • 15
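One common way to make a SequenceMatcher score insensitive to word order is to sort the words of each string before comparing. The sketch below assumes whitespace-separated words and case-insensitive matching; `order_insensitive_ratio` is a hypothetical helper, not part of difflib and not taken from the question.

```python
from difflib import SequenceMatcher

def order_insensitive_ratio(a: str, b: str) -> float:
    """Hypothetical helper: compare two strings after sorting their words."""
    tokens_a = " ".join(sorted(a.lower().split()))
    tokens_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, tokens_a, tokens_b).ratio()

# Plain comparison penalizes the swapped word order.
plain = SequenceMatcher(None, "John Smith", "Smith John").ratio()

# Sorting the words first makes the two names identical.
sorted_score = order_insensitive_ratio("John Smith", "Smith John")
print(plain, sorted_score)
```

Sorting discards word order entirely, so this approach suits names and short labels better than full sentences.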
1
vote
2 answers

difflib SequenceMatcher with sentences

I have the following dataframe Column1 Column2 tomato fruit tomatoes are not a fruit potato la best potatoe are some sort of fruit apple there are great benefits to appel pear peer and I would like to look up the…
PRIME
  • 73
  • 1
  • 3
  • 10
1
vote
0 answers

Sequence clustering in R

I'm trying to write a simple R sequence clustering/grouping/simplification solution. I'm rather a beginner, haven't used R for a while, so please forgive simple and stupid questions/solutions. Tasks are taken from SAP and they represent execution of…
user2433705
  • 141
  • 1
  • 10