How can I find the nearest distance between two words in a text or in a large collection of text files?
For example, I want to find the nearest distance between two words - like "is" and "are" - in a text. Here is what I have:
text = "is there a way to find the nearest distance of two words - like is and are - from each other."
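def dis_words_text(text, word1, word2):
    import numpy as np
    # str.find returns the index of the first occurrence only, or -1 if absent
    ind1 = text.find(word1)
    ind2 = text.find(word2)
    # distance is the absolute difference between the two start indices
    dis = "at least one of the the words not in text" if -1 in (ind1, ind2) else np.abs(ind1 - ind2)
    return dis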
dis_words_text(text, "is","are")
Output: 25
dis_words_text(text, "why","are")
Output: "at least one of the the words not in text"
It looks like the above code measures the distance between the first "is" and the first "are", not the nearest distance, which should be 7 characters. For reference, please see also "Finding the position of a word in a string" and "How to find index of an exact word in a string in Python". My questions are: 1) How can I find the closest distance between two words (the number of characters between them) when the words are repeated in the text? 2) How can this be done quickly? Speed matters because it will be applied to a large number of texts.
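To make the goal concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes that "distance" means the difference between the start indices of the two words (which gives 7 for the example above) and that both words are plain alphanumeric tokens, so `\b` word boundaries behave as expected. The function name `nearest_word_distance` is hypothetical. It collects all whole-word positions with `re.finditer` and then walks both sorted position lists once with two pointers.

```python
import re

def nearest_word_distance(text, word1, word2):
    """Smallest difference between start indices of word1 and word2, or None if either is missing."""
    # \b restricts matches to whole words, so "are" inside "nearest" is skipped
    pos1 = [m.start() for m in re.finditer(r"\b" + re.escape(word1) + r"\b", text)]
    pos2 = [m.start() for m in re.finditer(r"\b" + re.escape(word2) + r"\b", text)]
    if not pos1 or not pos2:
        return None

    # Both lists are already sorted, so one linear pass finds the closest pair
    best = abs(pos1[0] - pos2[0])
    i = j = 0
    while i < len(pos1) and j < len(pos2):
        best = min(best, abs(pos1[i] - pos2[j]))
        # Advance the pointer at the smaller position to try to close the gap
        if pos1[i] < pos2[j]:
            i += 1
        else:
            j += 1
    return best

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
print(nearest_word_distance(text, "is", "are"))   # 7
print(nearest_word_distance(text, "why", "are"))  # None
```

The two-pointer pass is linear in the number of matches, which avoids comparing every pair of positions. For a large collection of files, the same function could be called once per file; whether that is fast enough would need to be measured on the real data.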