I am trying to write a function called 'and_query' that takes a single string of one or more words and returns a list of the documents whose abstracts contain all of those words.
First, I built an inverted index that maps each word to the ids of the documents containing it. In Abstracts, the key (id) is the document id and the value (abstract) is the plain text.
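    from collections import defaultdict

    # Map each term to the set of ids of the documents whose abstract contains it.
    inverted_index = defaultdict(set)
    for (id, abstract) in Abstracts.items():
        for term in preprocess(tokenize(abstract)):
            inverted_index[term].add(id)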
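To make the structure concrete, here is a minimal self-contained sketch of the same kind of index on toy data. The `abstracts` dictionary and the `simple_terms` helper are made up for illustration; they only stand in for my real `Abstracts`, `tokenize` and `preprocess`, which do more than this.

```python
from collections import defaultdict

# Toy stand-in for Abstracts: document id -> abstract text (made-up data).
abstracts = {
    1: "vaccine trial in the netherlands",
    2: "vaccine study in germany",
    3: "netherlands tourism report",
}

def simple_terms(text):
    """Hypothetical stand-in for preprocess(tokenize(text)): lowercase, split on whitespace."""
    return text.lower().split()

# Build the index: term -> set of ids of the documents containing that term.
toy_index = defaultdict(set)
for doc_id, abstract in abstracts.items():
    for term in simple_terms(abstract):
        toy_index[term].add(doc_id)

print(sorted(toy_index["vaccine"]))      # [1, 2]
print(sorted(toy_index["netherlands"]))  # [1, 3]
```

Each entry is the set of document ids for one term, so a query for several words has to combine several of these sets.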
Then, I wrote a query function where finals is a list of all the matching documents.
Because it should only return documents that contain every word of the query string, I used the set operation 'intersection'.
    def and_query(tokens):
        documents = set()
        finals = []
        terms = preprocess(tokenize(tokens))

        for term in terms:
            for i in inverted_index[term]:
                documents.add(i)

        for term in terms:
            temporary_set = set()
            for i in inverted_index[term]:
                temporary_set.add(i)
            finals.extend(documents.intersection(temporary_set))
        return finals

    def finals_print(finals):
        for final in finals:
            display_summary(final)

    finals_print(and_query("netherlands vaccine trial"))
However, the function still seems to return documents in which only one of the words appears in the abstract.
Does anyone know what I did wrong with my set operations?
(I think the fault is somewhere in this part of the code):
        for term in terms:
            temporary_set = set()
            for i in inverted_index[term]:
                temporary_set.add(i)
            finals.extend(documents.intersection(temporary_set))
        return finals
Thanks in advance
Basically, what I want to do, in short:
    for word in words:
        id_set_for_one_word = set()
        for i in get_id_of_that_word[word]:
            id_set_for_one_word.add(i)
Pseudo:
    id_set_for_one_word intersection (id_set_of_other_words)
    finals.extend(set of all intersections for all words)
So I need the intersection of the id sets of all these words: a set containing only the ids that occur for every word in words.
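A minimal sketch of the behaviour I am after, on made-up data. The `toy_index` contents and the `and_query_sketch` name are hypothetical, and a plain lowercase and whitespace split stands in for my real `preprocess(tokenize(...))`. It is only meant to show the intended result, not a finished implementation.

```python
from collections import defaultdict

# Made-up toy index: term -> set of document ids.
toy_index = defaultdict(set, {
    "netherlands": {1, 3},
    "vaccine": {1, 2},
    "trial": {1, 4},
})

def and_query_sketch(query, index):
    """Return the sorted ids of the documents that contain every term of the query."""
    # Assumption: lowercase + whitespace split replaces preprocess(tokenize(query)).
    terms = query.lower().split()
    if not terms:
        return []
    # .get avoids inserting empty entries for unknown terms into the defaultdict.
    posting_sets = [index.get(term, set()) for term in terms]
    # Keep only the ids that are present in every term's set.
    return sorted(set.intersection(*posting_sets))

print(and_query_sketch("netherlands vaccine trial", toy_index))  # [1]
print(and_query_sketch("vaccine", toy_index))                    # [1, 2]
```

Here only document 1 contains all three words, so it is the only id I would expect back for the full query.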