
I'm trying to use a shortest-path function to find the distance between strings in a graph. The problem is that sometimes there are close matches that I want to count. For example, I would like "communication" to count as "communications", or "networking device" to count as "network device". Is there a way to do this in Python (e.g., extract the root of words, compute a string distance, or perhaps a Python library that already has word-form relationships like plural/gerund/misspelled/etc.)? My problem right now is that my process only works when there is an exact match for every item in my database, which is difficult to keep clean.

For example:

List_of_tags_in_graph = ['A', 'list', 'of', 'tags', 'in', 'graph']

given_tag = 'lists'

if min_fuzzy_string_distance_measure(given_tag, List_of_tags_in_graph) < threshold:
    index_of_min = index_of_min_fuzzy_match(given_tag, List_of_tags_in_graph)
    given_tag = List_of_tags_in_graph[index_of_min]

#... then use given_tag in the graph calculation because now I know it matches ...
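One direction I've been considering for the fuzzy-match step is the standard library's difflib, which needs no extra dependencies. A minimal sketch of the pseudocode above (the 0.8 cutoff is just a guess I'd still have to tune, and get_close_matches is a real stdlib function, not my placeholder names):

import difflib

List_of_tags_in_graph = ['A', 'list', 'of', 'tags', 'in', 'graph']
given_tag = 'lists'

# get_close_matches returns the best matches whose similarity ratio
# exceeds the cutoff, most similar first (empty list if none qualify)
matches = difflib.get_close_matches(given_tag, List_of_tags_in_graph, n=1, cutoff=0.8)
if matches:
    given_tag = matches[0]  # snap to the closest known tag, here 'list'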

Any thoughts on an easy or quick way to do this? Or perhaps a different way to think about accepting close-match strings... or perhaps just better error handling when strings don't match?

1 Answer


Try using nltk's WordNetLemmatizer; it is designed to extract the root form (lemma) of words. https://www.nltk.org/_modules/nltk/stem/wordnet.html
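A minimal sketch of what that looks like (the WordNet corpus has to be downloaded once before the lemmatizer will work):

import nltk
nltk.download('wordnet')  # one-time download of the WordNet corpus

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# lemmatize treats the word as a noun by default (pos='n')
print(lemmatizer.lemmatize('lists'))           # -> 'list'
print(lemmatizer.lemmatize('communications'))  # -> 'communication'

If you lemmatize both the tags in your graph and the incoming tag the same way, exact comparison should then work for plural/singular variants.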

Bruno Mello
  • this works well for what I wanted to do... thanks! (although it was very painful to download the wordnet dictionary required through my company's firewall) – Kenneth Crowther Nov 06 '19 at 19:51