writing a AND query for to find matching documents within a dataset (python)

Question

I am trying to construct a function called 'and_query' that takes as input a single string, consisting of one or more words, such that the function returns a list of matching documents for the words being in the abstracts of the documents.

First, I put all the words in an inverted index with the id being the id of the document and the abstract the plain text.

inverted_index = defaultdict(set)

for (id, abstract) in Abstracts.items():
for term in preprocess(tokenize(abstract)):
    inverted_index[term].add(id)

Then, I wrote a query function where finals is a list of all the matching documents.

Because it should only return documents for which every word of the function parameter has a match in the document, I used the set operation 'intersecton'.

def and_query(tokens):
    documents=set()
    finals = []
    terms = preprocess(tokenize(tokens))

    for term in terms:
        for i in inverted_index[term]:
            documents.add(i)

    for term in terms:
        temporary_set= set()
        for i in inverted_index[term]:
            temporary_set.add(i)
        finals.extend(documents.intersection(temporary_set))
    return finals

def finals_print(finals):
    for final in finals:
        display_summary(final)        

finals_print(and_query("netherlands vaccine trial"))

However, it seems like the function is still returning documents for which only 1 word is in the abstract of the document.

does anyone know what i did wrong concerning my set operations??

(I think the fault should be anywhere in this part of the code):

for term in terms:
    temporary_set= set()
    for i in inverted_index[term]:
        temporary_set.add(i)
    finals.extend(documents.intersection(temporary_set))
return finals

Thanks in advance

basically what i want to do in short:

for word in words:
    id_set_for_one_word= set()
    for  i  in  get_id_of that_word[word]:
        id_set_for_one_word.add(i)
pseudo:
            id_set_for_one_word intersection (id_set_of_other_words)

finals.extend( set of all intersections for all words)

and then i need the intersection of the id sets on all of these words, returning a set in which the id's are that exist for every word in words.

Could you provide some input data to be able to test the code? — Franco Piccolo, Nov 11 '18 at 15:48
not really actually. A lot of preprocessing and other operations are performed before the data actually is being used to query on. Also a lot of modules have to be imported to make it work. gonna be a lot of work to provide that here. — Jorian Onderwater, Nov 11 '18 at 16:15
I updated my question with something in a sort of pseudocode make be somewhat more clear what i'm trying to do — Jorian Onderwater, Nov 11 '18 at 16:33
TLDR, but if you want to ‘and’ several criteria so that only abstracts matching return then I would 1. prep in advance, outside matchers. 2. call matchers in sequence, passing in the list of abstracts. 3. prune non matching abstracts within each simple matcher function. having ‘extends’ is code smell here for me. — JL Peyret, Nov 11 '18 at 17:34

JL Peyret · Answer 1 · 2018-11-11T18:02:34.313

To elaborate on my code smells comment, here's a rough draft of what I have done before to solve this kind of problems.

def tokenize(abstract):
    #return <set of words in abstract>
    set_ = .....
    return set_

candidates = (id, abstract, tokenize(abstract)) for abstract in Abstracts.items():


all_criterias = "netherlands vaccine trial".split()


def searcher(candidates, criteria, match_on_found=True):

    search_results = []
    for cand in candidates:
        #cand[2] has a set of tokens or somesuch...  abstract.
        if criteria in cand[2]:
            if match_on_found:
                search_results.append(cand)
            else:
                #that's a AND NOT if you wanted that
                search_results.append(cand)
    return search_results


for criteria in all_criterias:
    #pass in the full list every time, but it gets progressively shrunk
    candidates = searcher(candidates, criteria)

#whats left is what you want
answer = [(abs[0],abs[1]) for abs in candidates]

score 0 · Answer 2 · answered Nov 11 '18 at 20:02

Question: returns a list of matching documents for the words being in the abstracts of the documents

The term with the min number of documents, hold always the result.
If a term does not exists in inverted_index, gives no match at all.

For the sake of simplicity, predefined data:

Abstracts = {1: 'Lorem ipsum dolor sit amet,',
             2: 'consetetur sadipscing elitr,',
             3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
             4: 'sed diam voluptua.',
             5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
             6: 'Stet clita kasd gubergren,',
             7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',
            }


inverted_index = {'Stet': {6}, 'ipsum': {1, 7}, 'erat,': {3}, 'ut': {3}, 'dolores': {5}, 'gubergren,': {6}, 'kasd': {6}, 'ea': {5}, 'consetetur': {2}, 'sit': {1, 7}, 'nonumy': {3}, 'voluptua.': {4}, 'est': {7}, 'elitr,': {2}, 'At': {5}, 'rebum.': {5}, 'magna': {3}, 'sadipscing': {2}, 'diam': {3, 4}, 'dolore': {3}, 'sanctus': {7}, 'labore': {3}, 'sed': {3, 4}, 'takimata': {7}, 'Lorem': {1, 7}, 'invidunt': {3}, 'aliquyam': {3}, 'accusam': {5}, 'duo': {5}, 'amet.': {7}, 'et': {3, 5}, 'sea': {7}, 'dolor': {1, 7}, 'vero': {5}, 'no': {7}, 'eos': {5}, 'tempor': {3}, 'amet,': {1}, 'clita': {6}, 'justo': {5}, 'eirmod': {3}}

def and_query(tokens):
    print("tokens:{}".format(tokens))
    #terms = preprocess(tokenize(tokens))
    terms = tokens.split()

    term_min = None
    for term in terms:
        if term in inverted_index:
            # Find min
            if not term_min or term_min[0] > len(inverted_index[term]):
                term_min = (len(inverted_index[term]), term)
        else:
            # Break early, if a term is not in inverted_index
            return set()

    finals = inverted_index[term_min[1]]
    print("term_min:{} inverted_index:{}".format(term_min, finals))
    return finals


def finals_print(finals):
    if finals:
        for final in finals:
            print("Document [{}]:{}".format(final, Abstracts[final]))
    else:
        print("No matching Document found")

if __name__ == "__main__":
    for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
        finals_print(and_query(tokens))
        print()

Output:

tokens:sed diam voluptua.
term_min:(1, 'voluptua.') inverted_index:{4}
Document [4]:sed diam voluptua.

tokens:Lorem ipsum dolor
term_min:(2, 'Lorem') inverted_index:{1, 7}
Document [1]:Lorem ipsum dolor sit amet,
Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

tokens:Lorem ipsum dolor test
No matching Document found

Tested with Python: 3.4.2

score 0 · Answer 3 · answered Nov 12 '18 at 09:34

Found the solution eventually myself. replacing

    finals.extend(documents.intersection(id_set_for_one_word))
return finals

with

    documents = (documents.intersection(id_set_for_one_word))
return documents

seems to work here.

Still, thanks for all the effort y'all.

writing a AND query for to find matching documents within a dataset (python)

3 Answers3