
How to find the nearest distance of two words in a text or in a large collection of text files.

For example, I want to find the nearest distance between two words, such as "is" and "are", in a text. Here is what I have:

text = "is there a way to find the nearest distance of two words - like is and are - from each other."

import numpy as np

def dis_words_text(text, word1, word2):
    # str.find returns the index of the first occurrence only, or -1 if absent
    ind1 = text.find(word1)
    ind2 = text.find(word2)
    dis = "at least one of the words not in text" if -1 in (ind1, ind2) else np.abs(ind1 - ind2)
    return dis

dis_words_text(text, "is", "are")
Output: 25

dis_words_text(text, "why", "are")
Output: "at least one of the words not in text"

It looks like the above code measures the distance between the first occurrences of "is" and "are", not the nearest pair, which should be 7 characters apart. Please see also Finding the position of a word in a string and How to find index of an exact word in a string in Python as references. My questions are: 1) how can I find the closest distance between two words (the number of characters between them) when the words are repeated in the text, and 2) how can I do this quickly, since it will be applied to a large number of texts?

Sam S.
  • Does this answer your question? [Determining proximity between 2 words in a sentence in Python](https://stackoverflow.com/questions/33389108/determining-proximity-between-2-words-in-a-sentence-in-python) – ranka47 Aug 10 '22 at 00:52
  • This is a standard coding question, https://www.geeksforgeeks.org/minimum-distance-between-words-of-a-string/. See if this blog helps. – ranka47 Aug 10 '22 at 00:53
  • j1-lee, the nearest distance of two words is the closest distance between two words. – Sam S. Aug 10 '22 at 01:04
  • Thanks all for your comments; the answer/link provided by ranka47 is the closest one to what I am looking for. – Sam S. Aug 10 '22 at 01:06
  • The answer provided there looks fast but measures the distance in words. Can we find the closest distance based on the number of characters instead of the number of words? – Sam S. Aug 10 '22 at 01:09

1 Answer


Here is a solution to find the closest distance of two words in a text based on the number of characters:

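import re

def nearest_values_twolist(list1, list2):
    # Brute-force search over every pair of positions for the smallest gap
    r1, r2 = list1[0], list2[0]
    min_val = float("inf")
    for row1 in list1:
        for row2 in list2:
            t = abs(row1 - row2)
            if t < min_val:
                min_val = t
                r1, r2 = row1, row2
    return r1, r2

def closest_distance_words(text, w1, w2):
    # Start offsets of all whole-word matches; re.escape guards against regex metacharacters
    ind1 = [m.start() for m in re.finditer(r'\b' + re.escape(w1) + r'\b', text)]
    ind2 = [m.start() for m in re.finditer(r'\b' + re.escape(w2) + r'\b', text)]
    if not ind1 or not ind2:
        return None  # at least one of the words is not in the text
    i1, i2 = nearest_values_twolist(ind1, ind2)
    return abs(i2 - i1)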

Test:

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
closest_distance_words(text, "is", "are")

Output: 7
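
Since speed matters for a large number of texts: the nested loop above compares every pair of positions, so its cost grows with len(ind1) * len(ind2). re.finditer yields matches from left to right, so both position lists are already sorted, and a two-pointer sweep can find the minimum in a single pass. This is only a minimal sketch, assuming the same start-to-start character distance as above; the name closest_distance_words_fast is made up for illustration and returning None for a missing word is my own choice.

```python
import re

def closest_distance_words_fast(text, w1, w2):
    """Smallest start-to-start character distance between whole-word
    matches of w1 and w2, or None if either word is missing."""
    ind1 = [m.start() for m in re.finditer(r"\b" + re.escape(w1) + r"\b", text)]
    ind2 = [m.start() for m in re.finditer(r"\b" + re.escape(w2) + r"\b", text)]
    if not ind1 or not ind2:
        return None

    i = j = 0
    best = abs(ind1[0] - ind2[0])
    # Both lists are sorted, so always advance the pointer that sits
    # at the smaller position; each element is visited at most once.
    while i < len(ind1) and j < len(ind2):
        best = min(best, abs(ind1[i] - ind2[j]))
        if ind1[i] < ind2[j]:
            i += 1
        else:
            j += 1
    return best

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
print(closest_distance_words_fast(text, "is", "are"))  # 7
```

For many files, the same function can be called once per file's contents; if the same word pair is searched repeatedly, compiling the two patterns once with re.compile outside the function would avoid rebuilding them each time.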

Sam S.