I am implementing a simple search engine that searches a source data set of about 12k pieces of written news on different topics. We assume that the search engine only needs to respond to:

  1. Phrase Queries, which are enclosed in double quotation marks
  2. Not Queries, which come after an exclamation mark
  3. And Queries, which come without any specific mark

For instance this query:

"global warming" worldwide !USA

is a query whose results should:

  1. contain the Phrase Query: "global warming"
  2. contain the And Query: worldwide
  3. not contain the Not Query: USA

The point is that the words of a Phrase Query must appear consecutively, as one unit, with no other words between them. My problem is splitting the query into these three types using Python string operations or the re library.

I have written this piece of code to extract the Phrase Queries and Not Queries, but I have not managed to extract the And Queries yet:

import re

query = input()
phrase_query = re.findall(r'"([^"]*)"', query)
not_query = re.findall(r'!(\w+)', query)
print(phrase_query)
print(not_query)

For the input of:

"global warming" worldwide !USA

the above code returns:

['global warming']
['USA']

Which is great. However, I cannot extract the And Query. How can I extract the And Query (worldwide) into a separate list?

1 Answer

If I understand the problem correctly, anything that is not part of a phrase query or a not query is part of the and query. So we can simply remove the terms that belong to those queries from the string and then split what remains to get the individual terms.

import re

data = '"global warming" worldwide !USA'

query = data
phrase_query = re.findall(r'"([^"]*)"', query)
not_query = re.findall(r'!(\w+)', query)

and_query = data[:]

for q in phrase_query:
    complete_text = '"' + q + '"'
    and_query = and_query.replace(complete_text, "")
for q in not_query:
    complete_text = "!" + q
    and_query = and_query.replace(complete_text, "")

and_query = and_query.split()


print(and_query)
print(phrase_query)
print(not_query)



In the first for loop, I go over all the phrase queries and rebuild each one by adding the quotes before and after it, just as it appears in the original query. I then replace it with an empty string, which removes it. After that, I do the same with all the not queries, this time adding an exclamation mark in front.

Then, the remaining terms in the search are all and queries, so we can split them to get those terms individually in a list.
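
The remove-then-split idea described above can also be sketched as a small reusable function. This is only a sketch under some assumptions: the names split_query, PHRASE and NOT are illustrative and do not come from the original code, and the phrases are removed before the not terms are searched for, so that a "!" inside a quoted phrase is not picked up as a Not Query.

```python
import re

# Illustrative names (not from the original post)
PHRASE = re.compile(r'"([^"]*)"')  # text between a pair of double quotes
NOT = re.compile(r'!(\w+)')        # a word directly after "!"

def split_query(query):
    """Return (and_terms, phrases, not_terms) for one query string."""
    phrases = PHRASE.findall(query)
    # Blank out the quoted phrases first.
    rest = PHRASE.sub(" ", query)
    # Look for not terms only outside the phrases, then blank them out too.
    not_terms = NOT.findall(rest)
    rest = NOT.sub(" ", rest)
    # Whatever is left consists of and terms separated by whitespace.
    return rest.split(), phrases, not_terms

print(split_query('"global warming" worldwide !USA'))
# (['worldwide'], ['global warming'], ['USA'])
```

Using re.sub with the same patterns that found the terms avoids rebuilding each matched string by hand before removing it.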


EDIT: a more robust solution (one that tolerates extra spaces inside the quotation marks and after the exclamation mark):


import re

data = '" global warming " worldwide ! USA'

query = data
phrase_query = re.findall(r'"([^"]*)"', query)
# Capture optional spaces after "!" so that "!" + q below still matches the original text
not_query = re.findall(r'!(\s*\w+)', query)

and_query = data[:]

for q in phrase_query:
    complete_text = '"' + q + '"'
    and_query = and_query.replace(complete_text, "")
for q in not_query:
    complete_text = "!" + q
    and_query = and_query.replace(complete_text, "")

and_query = [answer.strip() for answer in and_query.split()]
phrase_query = [answer.strip() for answer in phrase_query]
not_query = [answer.strip() for answer in not_query]


print(and_query)
print(phrase_query)
print(not_query)
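
As an alternative to removing matches from the string, the query could be scanned once with a single pattern that has one alternative per query type. This is a sketch rather than a definitive implementation: the names TOKEN and parse_query and the group names phrase, neg and term are made up for illustration, and it assumes that And and Not terms consist of word characters only.

```python
import re

# One pattern with three alternatives, tried left to right at each position:
#   "..."   -> phrase query            (group "phrase")
#   ! word  -> not query, spaces after "!" are allowed (group "neg")
#   word    -> and query               (group "term")
TOKEN = re.compile(r'"(?P<phrase>[^"]*)"|!\s*(?P<neg>\w+)|(?P<term>\w+)')

def parse_query(query):
    """Return (and_terms, phrases, not_terms) for one query string."""
    result = {"phrase": [], "neg": [], "term": []}
    for match in TOKEN.finditer(query):
        kind = match.lastgroup  # name of the alternative that matched
        result[kind].append(match.group(kind).strip())
    return result["term"], result["phrase"], result["neg"]

print(parse_query('" global warming " worldwide ! USA'))
# (['worldwide'], ['global warming'], ['USA'])
```

Because the quoted alternative is tried first, words inside a phrase are consumed as part of the phrase and never show up again as And Queries.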


Ved Rathi
  • Thanks man. That is a good solution if there are no spaces inserted after the double-quotation and exclamation marks. But what is the solution if there are some spaces in between? – Alireza Tehrani Jun 28 '22 at 15:28
  • I don't think I fully understand what you mean. The solution works for this input: ``data = '" global warming " worldwide !USA'``. Is this what you mean (spaces inside the double quotation marks)? – Ved Rathi Jun 28 '22 at 15:33
  • For instance, this solution does not work for this query: " global warming " worldwide ! USA (I inserted a single space between "!" and USA). Also, I need to extract the Phrase Queries, Not Queries, and And Queries without leading and trailing spaces. – Alireza Tehrani Jun 28 '22 at 15:39
  • I have edited the solution to meet your needs, please check if this works better – Ved Rathi Jun 28 '22 at 15:47
  • Nope. It returns the following for the Phrase Query, Not Query, and And Query respectively: ['global warming'] [] ['"', 'global', 'warming', '"', 'worldwide', '!', 'USA'] – Alireza Tehrani Jun 28 '22 at 15:56
  • I guess it's better to correct the regular expressions I defined for the Phrase and Not Queries, right? – Alireza Tehrani Jun 28 '22 at 15:58
  • No. When I run this query I get ['worldwide'] ['global warming'] ['USA'], which is what we want. – Ved Rathi Jun 29 '22 at 02:54
  • Here, try going on this link https://replit.com/@VedRathi/String-operation-in-python-handling-the-queries-of-a-simple#main.py and running the given repl. It matches the edited code exactly and gives the correct output – Ved Rathi Jun 29 '22 at 02:57