
I want to match an input string against a list of tuples and find the top N closest matches. The list of tuples has around 2000 items. The problem I am facing is that fuzzywuzzy's process.extract method returns a huge number of tuples with the same confidence score, and the quality of the matches is also not good. What I would like instead is to get all the matches for my input (word order is not important).

Example: 
input string: 'fruit apple'
    
List of tuples = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

From here I want to find all strings from the list which contain both the words 'fruit' and 'apple', in any order.

Expected output:
[('apple fruit', 91), ('the fruit is an apple', 34)]
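Put differently, a candidate should be kept when every word of the input appears somewhere in it. A minimal pure-Python sketch of that rule (variable names are mine, for illustration only):

```python
# Keep a tuple when every input word occurs among the candidate's words.
candidates = [('apple fruit', 91), ('the fruit is an apple', 34),
              ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]
wanted = set('fruit apple'.split())

matches = [t for t in candidates if wanted <= set(t[0].split())]
print(matches)  # [('apple fruit', 91), ('the fruit is an apple', 34)]
```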

I know this is one line of code with fuzzywuzzy, but the issue is that when the list of tuples to check against is very large, fuzzywuzzy assigns the same confidence score to unrelated items.

Attaching the code tried so far for reference:

import csv, re

import nltk
from fuzzywuzzy import process
from nltk.corpus import stopwords

def preprocessing(fruit_string):
    stop_words = stopwords.words('english')
    # Strip patterns like 'a/b' or 'a/b/c', then any non-alphanumeric characters
    fruit_string = re.sub(r'[a-z][/][a-z][/]*[a-z]{0,1}', '', fruit_string)
    fruit_string = re.sub(r'[^A-Za-z0-9\s]+', '', fruit_string)
    return ' '.join(each_word for each_word in fruit_string.split()
                    if each_word not in stop_words and len(each_word) > 2)
    

#All possible fruit combination list
nrows=[]
with open("D:/fruits.csv", 'r') as csvfile: 
    csvreader = csv.reader(csvfile)
    fields = next(csvreader)
    for row in csvreader: 
        nrows.append(row)
        
flat_list = [item for items in nrows for item in items]  # flatten csv rows into one list of strings



def get_matching_fruits(input_raw_text):
    preprocessed_synonym = preprocessing(input_raw_text)
    text = nltk.word_tokenize(preprocessed_synonym)
    pos_tagged = nltk.pos_tag(text)
    # Collect the singular, proper and plural nouns, in that order
    list_nn = [tag for tag in pos_tagged if tag[1] == 'NN']
    list_nnp = [tag for tag in pos_tagged if tag[1] == 'NNP']
    list_nns = [tag for tag in pos_tagged if tag[1] == 'NNS']
    comb_nouns = list_nn + list_nnp + list_nns
    input_nouns = ' '.join(i[0] for i in comb_nouns)
    ratios = process.extract(input_nouns, flat_list, limit=1000)
    # Only keeps matches that contain the joined nouns as a contiguous substring
    result = [i for i in ratios if input_nouns in i[0]]
    return result

get_matching_fruits('blue shaped pear was found today')

So, in my code, I want the result list to contain all possible matches for any given input. Any help on this will be highly appreciated.

Erich
  • As your function `def get_matching_fruits(input_raw_text, n)` has two parameters, non-optional, how can this code possibly work when the call only provides one parameter `get_matching_fruits('blue shaped pear was found today')`. Please post a [mre] that works. – DisappointedByUnaccountableMod Aug 13 '20 at 16:49
  • What output does your now working code produce? – DisappointedByUnaccountableMod Aug 13 '20 at 16:51
  • So, if I return ratios from the `get_matching_fruits` function, I will be getting a huge list of tuples but they are not sorted in the format that I want. The result list that I have created was to do the same thing which I raised the question about but it isn't working and provides only the matches that are in sequence. i.e if input is 'fruit apple' only those strings will be matched and not strings having 'apple fruit' etc etc – Erich Aug 13 '20 at 16:55

2 Answers


The simplest solution, for me, is this.

foo = 'fruit apple'
bar = [('apple fruit', 91), 
       ('the fruit is an apple', 34), 
       ('banana apple', 78), 
       ('guava tree', 11), 
       ('delicious apple', 88)]

matches = []
for entry in bar:
    for word in foo.split():
        # break if we meet a point where the word isn't found
        if word not in entry[0]:
            break
    # the else is met if we didn't break from the for loop
    else:
        matches.append(entry)

print(matches)
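One caveat (my addition, not part of the original answer): `word in entry[0]` is a substring test, so 'fruit' would also match a candidate containing 'fruitful'. If whole-word matching is wanted, compare against the candidate's split word list instead:

```python
foo = 'fruit apple'
bar = [('apple fruit', 91), ('the fruit is an apple', 34),
       ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

# all() over the candidate's word list avoids partial-word hits
matches = [entry for entry in bar
           if all(word in entry[0].split() for word in foo.split())]
print(matches)  # [('apple fruit', 91), ('the fruit is an apple', 34)]
```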
Axe319

Sorry if I didn't quite understand the question properly, but why do you even need the NLTK library for this? It's a simple list comprehension problem.

In [1]: tup = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

In [2]: input_string = 'fruit apple'

In [3]: input_string_set =  set(input_string.split(' '))

In [4]: input_string_set
Out[4]: {'apple', 'fruit'}

In [10]: [t for t in tup if input_string_set.issubset(set(t[0].split(' ')))]
Out[10]: [('apple fruit', 91), ('the fruit is an apple', 34)]
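Since the question also asks for the top N matches, the same subset filter can be combined with a sort on the stored score. A sketch of that extension (my addition, with a hypothetical cutoff `n`):

```python
tup = [('apple fruit', 91), ('the fruit is an apple', 34),
       ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]
input_string_set = set('fruit apple'.split())

n = 2  # hypothetical top-N cutoff
hits = [t for t in tup if input_string_set.issubset(set(t[0].split(' ')))]
# Sort the surviving matches by their score, highest first, and truncate
top_n = sorted(hits, key=lambda t: t[1], reverse=True)[:n]
print(top_n)  # [('apple fruit', 91), ('the fruit is an apple', 34)]
```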
