
So the purpose of this program is to find an example sentence for each word in ner.txt. For example, if the word apple is in ner.txt, I would like to find out whether any sentence contains the word apple and output something like apple: you should buy an apple juice.

The logic of the code is pretty simple, as I only need one example sentence per word in ner.txt. I am using NLTK to split the text into sentences.

The problem is at the bottom of the code: I am using two nested for loops to find an example sentence for each word. This is painfully slow and not usable for large files. How can I make this efficient, or is there a better way to do this without my current logic?

from nltk.tokenize import sent_tokenize

news_articles = "test.txt"
oov_ner = "ner.txt"

news_data = ""
with open(news_articles, "r") as inFile:
    news_data = inFile.read()

# split the article text into individual sentences
base_news = sent_tokenize(news_data)

# read the word list, one word per line
with open(oov_ner, "r") as oovNER:
    oov_ner_content = oovNER.readlines()

oov_ner_data = [x.strip() for x in oov_ner_content]

my_dict = {}

# for every word, check every sentence; the last matching sentence wins
for oovner in oov_ner_data:
    for news in base_news:
        if oovner in news:
            my_dict[oovner] = news
            print(my_dict)
DSMK Swab
Hi, the best way to do this is to first tokenize your detected sentences and then build an inverted index; after that, each word is just a query to look up in that index. Alternatively, if you don't want to use an inverted index, you can use the product() method in Python's itertools module to create all pairs of your sentences and words in one list and then use a single loop to compare them. – armin ajdehnia May 07 '21 at 05:53
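For reference, here is a minimal sketch of the itertools.product variant the comment describes, reusing the names from the question (oov_ner_data, base_news); note that it still compares every word against every sentence, so it mainly flattens the nested loops rather than reducing the work:

from itertools import product

my_dict = {}

# product() yields every (word, sentence) pair, so one loop replaces the
# nested pair; the total number of comparisons stays the same
for oovner, news in product(oov_ner_data, base_news):
    if oovner not in my_dict and oovner in news:
        my_dict[oovner] = news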

2 Answers


Here is what I would do: Split up the process into two steps, index creation and lookup.

from nltk.tokenize import sent_tokenize, word_tokenize

# 1. create a reusable word index like {'worda': [2, 4, 10], 'wordb': [1, 9]}
with open("test.txt", "r", encoding="utf8") as fp:
    news_sentences = sent_tokenize(fp.read())

index = {}
for i, sentence in enumerate(news_sentences):
    for word in word_tokenize(sentence):
        word = word.lower()
        if word not in index:
            index[word] = []
        index[word].append(i)

# 2. look up words from that index and retrieve the associated sentences
with open("ner.txt", "r", encoding="utf8") as fp:
    oov_ner_data = [l.strip() for l in fp.readlines()]

matches = {}

for word in oov_ner_data:
    word = word.lower()
    if word in index:
        matches[word] = [news_sentences[i] for i in index[word]]

print(matches)

Step 1 takes however long it takes to run sent_tokenize() and word_tokenize() over your text. There is not a whole lot you can do about that. But you only need to do it once, and you can then run different word lists against it very quickly.
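If you want to reuse the index across separate runs, one option (not part of the answer above; "index.pkl" is a hypothetical file name) is to persist the tokenized sentences and the index to disk, for example with pickle:

import pickle

# save the tokenized sentences and the word index once
with open("index.pkl", "wb") as fp:
    pickle.dump((news_sentences, index), fp)

# in a later run, load them instead of re-tokenizing the whole corpus
with open("index.pkl", "rb") as fp:
    news_sentences, index = pickle.load(fp)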

The advantage of running both sent_tokenize() and word_tokenize() is that it prevents false positives caused by partial matches. E.g., your solution would find a positive match for "bark" if the sentence contained "embark"; mine would not. In other words, a faster solution that produces incorrect results isn't an improvement.
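A small illustrative check of that difference, using the same word_tokenize import as above:

from nltk.tokenize import word_tokenize

sentence = "They will embark on a journey."

print("bark" in sentence)                 # True  - substring match, a false positive
print("bark" in word_tokenize(sentence))  # False - exact token match only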

Tomalak

Instead of iterating over the words in the outer loop, as you do now, I would swap the loops around and break as soon as a word that matches the sentence has been found. Right now you take an 'oovner' and try to match it against every single sentence 'news' in 'base_news'; if you swap the loops around, you can move on as soon as you have found a match.

This:

for oovner in oov_ner_data:
    for news in base_news:
        if oovner in news:
            my_dict[oovner] = news
            print(my_dict)

Into this:

for news in base_news:
    for oovner in oov_ner_data:
        if oovner in news:
            my_dict[oovner] = news
            print(my_dict)
            break 

I wouldn't call it optimal, but it should give you some speed-up.
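For what it's worth, a further tweak along the same lines (not part of this answer) would be to remove each word from the search set once it has an example sentence and stop scanning when nothing is left to match; a rough sketch:

remaining = set(oov_ner_data)
my_dict = {}

for news in base_news:
    if not remaining:
        break  # every word already has an example sentence
    # iterate over a copy so matched words can be removed while looping
    for oovner in list(remaining):
        if oovner in news:
            my_dict[oovner] = news
            remaining.remove(oovner)

print(my_dict)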

garff