0

I need a Python package that can retrieve the relevant sentence from a text, based on the keywords provided.

For example, below is an excerpt from the Wikipedia page of J. Robert Oppenheimer:

Early life

Childhood and education
J. Robert Oppenheimer was born in New York City on April 22, 1904,[note 1][7] to Julius Oppenheimer, a wealthy Jewish textile importer who had immigrated to the United States from Germany in 1888, and Ella Friedman, a painter. 
Julius came to the United States with no money, no baccalaureate studies, and no knowledge of the English language. He got a job in a textile company and within a decade was an executive with the company. Ella was from Baltimore.[8] The Oppenheimers were non-observant Ashkenazi Jews.[9]

The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico. 
Oppenheimer later remarked that it brought to mind words from the Bhagavad Gita: "Now I am become Death, the destroyer of worlds."

If the string I pass is "JJ Oppenheimer birth date", it should return "J. Robert Oppenheimer was born in New York City on April 22, 1904".

If the string I pass is "JJ Oppenheimer Trinity test", it should return "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico".

I searched a lot but found nothing close to what I want, and I don't know much about NLP vectorization techniques. It would be great if someone could suggest a package, if one exists.

MaxUU

3 Answers

1

A module may well exist that does this for you, but you could also build it yourself: parse the text and, for each field you care about, keep a list of words such as ["date of birth", "born", "birth date", etc.]. Do this for multiple fields, and you can then find whatever information is available in the text.

The idea is:

you take your text,

you decide what you are looking for (for example, date of birth),

you then map "date of birth" to a list of similar words,

you look through your text for a sentence that contains any of them.

I am not aware of a module that does exactly this; maybe I am wrong, but something like this should work.
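The steps above could be sketched roughly as follows. This is only a minimal illustration: the FIELD_KEYWORDS table, the find_sentences helper, and the sample sentences are made up for the example, and the phrase lists would have to be filled in by hand for each field.

```python
import re

# Hypothetical table mapping a field to phrases that signal it.
# These lists are assumptions for the example; extend them as needed.
FIELD_KEYWORDS = {
    "date of birth": ["date of birth", "born", "birth date", "birthday"],
    "trinity test": ["trinity test", "detonated"],
}

def find_sentences(sentences, field):
    """Return the sentences that contain any trigger phrase for `field`."""
    phrases = FIELD_KEYWORDS[field]
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Match whole words/phrases only, so "born" does not match "stubborn".
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases):
            hits.append(sentence)
    return hits

sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "Ella was from Baltimore.",
    "The first atomic bomb was successfully detonated on July 16, 1945, "
    "in the Trinity test in New Mexico.",
]

print(find_sentences(sentences, "date of birth"))
```

The obvious limitation is that every field needs its own hand-written list, so it only works for questions you anticipated in advance.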

CSman
1

You could use fuzzywuzzy:

fuzz.ratio(search_text, sentence)

This gives you a score for how similar the two strings are.

https://github.com/seatgeek/fuzzywuzzy

ssnk001
  • If I am not wrong, wouldn't it compare two strings and return the score? What if I have a complete text? Should I sentence-tokenize it, loop through the list comparing the query with each sentence, and finally choose the one with the maximum score? – MaxUU Apr 12 '21 at 21:37
  • There are probably better ways to do it, but yes, to start I'd go that route. It also provides a module called process, where you can call process.extract(search_text, sentences_to_search), with sentences_to_search being a list of sentences. This returns the top N sentences with the highest scores (you can set N and the scoring method to use). – ssnk001 Apr 12 '21 at 21:48
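The loop discussed in these comments might look something like the sketch below. To keep it dependency-free it uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio (so scores are in the range 0 to 1 rather than 0 to 100); split_sentences, similarity, and best_match are hypothetical helper names, not part of fuzzywuzzy.

```python
import re
from difflib import SequenceMatcher

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Assumption: good enough for a demo. It will wrongly split after
    abbreviations such as "J." so a real sentence tokenizer is preferable.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in for fuzz.ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(query, sentences):
    """Score every sentence against the query and return (sentence, score)."""
    scored = ((s, similarity(query, s)) for s in sentences)
    return max(scored, key=lambda pair: pair[1])

text = "It rained all day. We stayed in! Why not?"
print(best_match("we stayed in", split_sentences(text)))
```

One caveat: a whole-string ratio penalizes length differences, so a short keyword query compared with a long sentence tends to score low across the board. Token-based or partial scorers are usually a better fit for that case.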
0

The task you describe looks like Information Retrieval. Given a query (the keywords), the model should return a list of documents (the sentences) that best match the query.

This is essentially what the answer using fuzzywuzzy is suggesting. But maybe just counting the number of occurrences of the query words in each sentence is enough (and more efficient).
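A minimal sketch of that counting idea, assuming a simple lowercase word tokenizer (tokenize and rank_by_overlap are names made up for this example):

```python
import re

def tokenize(text):
    """Lowercase the text and extract alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_by_overlap(query, sentences):
    """Return (count, sentence) pairs, highest count first.

    The count is the number of tokens in the sentence that also
    appear in the query.
    """
    query_words = set(tokenize(query))
    scored = [
        (sum(1 for word in tokenize(sentence) if word in query_words), sentence)
        for sentence in sentences
    ]
    # sorted() is stable, so ties keep their original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "Ella was from Baltimore.",
    "The first atomic bomb was successfully detonated on July 16, 1945, "
    "in the Trinity test in New Mexico.",
]

for count, sentence in rank_by_overlap("Oppenheimer Trinity test", sentences):
    print(count, sentence)
```

Here the Trinity sentence gets a count of 2, the birth sentence 1 (for "oppenheimer"), and the Baltimore sentence 0. Note that every word counts equally, so common words in the query would inflate the scores.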

The next step would be to use Tf-Idf. It is a weighting scheme that gives high scores to words that are specific to a document with respect to a set of documents (the corpus).

This results in every document having an associated vector; you can then sort the documents by their similarity to the query vector (see the linked SO answer on how to do that).
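For illustration, here is a small pure-Python sketch of Tf-Idf ranking with cosine similarity. It is not taken from the linked answer: the smoothed IDF formula, the tokenizer, and the function name tfidf_rank are assumptions chosen for the example, and in practice a library vectorizer would normally replace this hand-rolled version.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(query, sentences):
    """Return (cosine_similarity, sentence) pairs, most similar first."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n_docs = len(docs)

    # Document frequency: in how many sentences does each word occur?
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())

    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

    def vectorize(counts):
        # Words never seen in the corpus are dropped from the vector.
        return {w: tf * idf[w] for w, tf in counts.items() if w in idf}

    def cosine(u, v):
        dot = sum(x * v.get(w, 0.0) for w, x in u.items())
        norm_u = math.sqrt(sum(x * x for x in u.values()))
        norm_v = math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query_vec = vectorize(Counter(tokenize(query)))
    scored = [(cosine(query_vec, vectorize(c)), s) for c, s in zip(docs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "Ella was from Baltimore.",
    "The first atomic bomb was successfully detonated on July 16, 1945, "
    "in the Trinity test in New Mexico.",
]

print(tfidf_rank("Trinity test", sentences)[0][1])
```

Because the matching is purely lexical, a query word like "birth" does not match "born" in the text; stemming or embedding-based similarity would be needed to bridge that gap.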

ygorg