
I'm new to Python and I'm looking to remove partially similar entries within the same column of a dataframe. For example, these are the entries in one of the columns:

Row 1 - "I have your Body Wash and I wonder if it contains animal ingredients. Also, which animal ingredients? I prefer not to use product with animal ingredients."

Row 2 - "This also doesn't have the ADA on there. Is this a fake toothpaste an imitation of yours?"

Row 3 - "I have your Body Wash and I wonder if it contains animal ingredients. I prefer not to use product with animal ingredients."

Row 4 - "I didn't see the ADA stamp on this box. I just want to make sure it was still safe to use?"

Row 5 - "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the packaging"

Row 6 - "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the box."

So in this column, rows 1 & 3 and rows 5 & 6 are similar (partial duplicates). I want Python to recognize these as duplicates, retain the longer sentence, drop the shorter one, and export the new data to a CSV file.

Expected output - Row 1 - "I have your Body Wash and I wonder if it contains animal ingredients. Also, which animal ingredients? I prefer not to use product with animal ingredients."

Row 2 - "This also doesn't have the ADA on there. Is this a fake toothpaste an imitation of yours?"

Row 3 - "I didn't see the ADA stamp on this box. I just want to make sure it was still safe to use?"

Row 4 - "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the packaging"

I tried FuzzyWuzzy's similarity sort function, but it didn't give me the expected output. Is there a simpler way to do this?

Shrumo

1 Answer

Here is my approach; hopefully the comments are self-explanatory.

from fuzzywuzzy import fuzz, process

rows = [
    "I have your Body Wash and I wonder if it contains animal ingredients. Also, which animal ingredients? I prefer not to use product with animal ingredients.",
    "This also doesn't have the ADA on there. Is this a fake toothpaste an imitation of yours?",
    "I have your Body Wash and I wonder if it contains animal ingredients. I prefer not to use product with animal ingredients.",
    "I didn't see the ADA stamp on this box. I just want to make sure it was still safe to use?",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the packaging",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the box.",
]

clean = []
threshold = 80  # this is arbitrary
for row in rows:
    # score this sentence against every sentence in the column,
    # best match first: [('string', score), ...]
    scores = process.extract(row, rows, scorer=fuzz.token_set_ratio)
    # scores[0] is the row matched against itself (score 100); if the
    # runner-up is also a close match, keep the longer of the two
    if scores[1][1] > threshold:
        clean.append(max([x[0] for x in scores[:2]], key=len))
    else:
        clean.append(scores[0][0])
# remove dupes
clean = set(clean)
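One caveat on the last step, since the question also asks to export the result to CSV: `set()` discards the original row order. A sketch of an export helper that de-dupes with `dict.fromkeys` (which keeps first-seen order) and writes the rows out with the stdlib `csv` module; the function name and the `complaint` column name are just placeholders.

```python
import csv

def export_clean(rows, path, column="complaint"):
    """De-dupe while keeping first-seen order (set() scrambles it),
    then write one row per line to a CSV file."""
    ordered = list(dict.fromkeys(rows))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([column])        # header row
        writer.writerows([r] for r in ordered)
    return ordered
```

Call it as `export_clean(clean_list, "clean.csv")` on the list built above (before the `set()` call, so the order is still intact).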
chris
  • Thanks for the help, it's working! However, when I export this clean data to an Excel file, it changes the order of the original file. Is there a way to retain the order of the original Excel column and just output a clean file? – Shrumo Jun 25 '21 at 05:45
  • Also, I noticed it takes a lot of time to process when I run it on a larger dataset. Any way to reduce that time? – Shrumo Jun 26 '21 at 03:36
  • Is there a way I can reduce runtime using this code? As of now, it takes about 45 min to 1 hour to process a dataset with 800+ rows, and when I tried to process a dataset with 2K rows, it kept running for 2 hours with no result. It's taking way too much time. I know about the RapidFuzz library, but I was hoping to make this work with a parameter to match only 10-15 rows up and down instead of matching one row against the whole dataset. I think that might reduce the runtime. Any suggestions? – Shrumo Jul 27 '21 at 04:19
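The windowed idea from the last comment can be sketched as follows: instead of scoring each row against the entire column (quadratic in the number of rows), compare each row only with its next `window` neighbours. The scorer below is the stdlib `difflib.SequenceMatcher` (ratio 0-1) so the snippet is self-contained; swapping in `rapidfuzz.fuzz.token_set_ratio` (which scores 0-100, so raise the threshold to around 80) should be both faster per comparison and closer to the fuzzywuzzy behaviour in the answer. The function name is hypothetical.

```python
from difflib import SequenceMatcher

def dedupe_windowed(rows, window=10, threshold=0.8):
    """Keep the longer of any near-duplicate pair found within
    `window` rows of each other; preserves original order."""
    drop = set()
    for i, row in enumerate(rows):
        if i in drop:
            continue
        for j in range(i + 1, min(i + 1 + window, len(rows))):
            if j in drop:
                continue
            if SequenceMatcher(None, row, rows[j]).ratio() >= threshold:
                if len(rows[j]) > len(row):
                    drop.add(i)   # current row is the shorter duplicate
                    break
                drop.add(j)       # neighbour is the shorter duplicate
    return [r for k, r in enumerate(rows) if k not in drop]
```

This does at most `n * window` comparisons instead of `n * n`, which is what makes the difference on a few thousand rows; the trade-off is that duplicates more than `window` rows apart are missed.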