
I have a string, for example:

"This is a string.Is this a question?What is the Question? I Dont know what the question is. Can you please list out the question?"

I want to extract the questions from this text using regex.

What I tried:

re.findall(r'(how|can|what|where|describe|who|when)(.*?)\s*\?', message, re.I | re.M)

But it returns other things as well, and when it does return a question it splits the question word (how, what, which, etc.) from the rest of the question.

For the above example my output is

[('is', ' is a string.Is this a question'), ('What', ' is the Question'), ('what', ' the question is. Can you please list out the question')]

Whereas I want each entire question to be kept together.

3 Answers

0

To keep the entire question together, you should just enclose the whole pattern in parentheses.
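The tuples appear because `re.findall` returns the capturing groups, not the whole match, whenever the pattern contains groups. A minimal sketch, assuming the sample string from the question (the variable names are only for illustration), comparing the grouped pattern with a non-capturing `(?:...)` version:

```python
import re

# Sample string from the question.
message = ("This is a string.Is this a question?What is the Question? "
           "I Dont know what the question is. Can you please list out the question?")

keywords = r'how|can|what|where|describe|who|when'

# Two capturing groups: findall returns one (group1, group2) tuple per match.
grouped = re.findall(r'(' + keywords + r')(.*?)\s*\?', message, re.I)

# Non-capturing group and no other groups: findall returns the whole match.
whole = re.findall(r'(?:' + keywords + r').*?\s*\?', message, re.I)

print(grouped)
print(whole)
```

The second result is one string per match, but a keyword in the middle of a sentence ("know what ...") still starts a match, so the output is not limited to clean questions.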

Here is another, simplified version:

\b([A-Z][^.!]*[?])
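A minimal sketch, assuming the sample string from the question, of what this pattern returns in Python. One thing to watch: `[^.!]` also matches `?`, so two questions with no period between them come back as a single match. The second pattern, which also excludes `?` from the class, is an assumed adjustment and not part of the answer as posted.

```python
import re

# Sample string from the question.
message = ("This is a string.Is this a question?What is the Question? "
           "I Dont know what the question is. Can you please list out the question?")

# Pattern as posted: [^.!]* can run across a '?', so neighbouring
# questions may be merged into one match.
as_posted = re.findall(r'\b([A-Z][^.!]*[?])', message)

# Assumed adjustment: also exclude '?' from the sentence body.
adjusted = re.findall(r'\b([A-Z][^.!?]*[?])', message)

print(as_posted)
# ['Is this a question?What is the Question?', 'Can you please list out the question?']
print(adjusted)
# ['Is this a question?', 'What is the Question?', 'Can you please list out the question?']
```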
Maria Ivanova
  • I get the following output after adding the () to (how|can|what|where|describe|who|when)(.*?)\s*\? – Ashish Cherian Jul 01 '16 at 09:30
  • [('is is a string.Is this a question?', 'is', ' is a string.Is this a question'), ('What is the Question?', 'What', ' is the Question'), ('what the question is. Can you please list out the question?', 'what', ' the question is. Can you please list out the question')] – Ashish Cherian Jul 01 '16 at 09:30
  • \b([A-Z][^.!]*[?]) this pattern works only when the first letter is a capital! You can add a-z as well. – Deca Jul 01 '16 at 09:31
  • @Deca, that's true, but I assume that a sentence would always start with a capital letter. – Maria Ivanova Jul 01 '16 at 09:33
  • @AshishCherian, you do not need to add the `(how|can|what|where|describe|who|when)(.*?)\s*\?` part. You can simply use the pattern as it is. It captures any sentence starting with a capital letter and ending with a question mark, so you do not need to know what word it actually starts with. – Maria Ivanova Jul 01 '16 at 09:35
  • Thank you guys but this solves the problem: \s*([^.?]*(?:how|can|what|where|is|describe|who|when)[^.?]*?\s*\?) – Ashish Cherian Jul 01 '16 at 09:38
0

It's totally impractical to search for key words when determining whether a sentence is a question. Given your list: how|can|what|where|describe|who|when, I can easily write sentences containing one of those words, which are not questions!

There are many ways you could tackle matching a sentence. For example, taking this as a baseline:

^\s*[A-Za-z,;'"\s]+[.?!]$

We could first alter it to match multiple sentences in the same string:

(^|(?<=[.?!]))\s*[A-Za-z,;'"\s]+[.?!]

This uses a look-behind to ensure that a sentence has just finished (unless we're at the start of the string).

And then adjust it to match only sentences which end with ?:

(^|(?<=[.?!]))\s*[A-Za-z,;'"\s]+\?
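A minimal sketch, assuming Python's `re` module and the sample string from the question, of running this last pattern. Because its only capturing group is the zero-width start-of-sentence check, `findall` would return empty strings, so the sketch reads the full match with `finditer` and `group(0)` instead.

```python
import re

# Sample string from the question.
message = ("This is a string.Is this a question?What is the Question? "
           "I Dont know what the question is. Can you please list out the question?")

# Final pattern from above, verbatim (triple quotes because it contains ' and ").
pattern = re.compile(r"""(^|(?<=[.?!]))\s*[A-Za-z,;'"\s]+\?""")

# group(0) is the whole match; strip() drops the leading whitespace
# that \s* picks up after the previous sentence.
questions = [m.group(0).strip() for m in pattern.finditer(message)]

print(questions)
# ['Is this a question?', 'What is the Question?', 'Can you please list out the question?']
```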

Here is an online demo of my regex, on your original string.

Tom Lord
  • https://regex101.com/r/rT1mQ0/4 – Ashish Cherian Jul 01 '16 at 09:52
  • An interrogative word or question word is a function word used to ask a question, such as what, when, where, who, whom, why, and how. They are sometimes called wh-words, because in English most of them start with wh- (compare Five Ws). They may be used in both direct questions (Where is he going?) and in indirect questions (I wonder where he is going). In English and various other languages the same forms are also used as relative pronouns in certain relative clauses (The country where he was born) and certain adverb clauses (I go where he goes). – Ashish Cherian Jul 01 '16 at 09:53
  • https://en.wikipedia.org/wiki/Interrogative_word – Ashish Cherian Jul 01 '16 at 09:53
  • @AshishCherian My point is simply that regex cannot be used reliably to match on such words. For example: *"When I'm hungry, I eat." "What I said was correct."* The only reliable indicator of a question is the terminating `?` character. – Tom Lord Jul 01 '16 at 09:57
  • There are various ways you could extend this, but I didn't want to over-complicate it for your needs. For example, you may want to include closing brackets in the "end of sentence" characters: `(^|(?<=[.?!)]))\s*[A-Za-z,;'"\s]+\?` -- https://regex101.com/r/rT1mQ0/5 – Tom Lord Jul 01 '16 at 10:00
  • Also, you can arguably write [questions without a `?` character](https://en.wikipedia.org/wiki/Question), such as *"Tell me your name."* If you wish to also include these, then regex is not a viable option. – Tom Lord Jul 01 '16 at 10:05
  • Thank you, your code works fine. However, it can't seem to handle a string like this: `['Categories\\t\\t\se time - HTTP compression\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\tHow to troubleshoot sudden CPU spikes?']` – Ashish Cherian Jul 01 '16 at 13:39
  • OK, so you could tweak it further by defining the start of a sentence to also include "after a tab": `(^|(?<=[.?!)\t]))\s*[A-Za-z,;'"\s]+\?`. And perhaps you'd also like to include hyphens in the allowed list of "sentence characters": `(^|(?<=[.?!)\t]))\s*[A-Za-z,;'"\s-]+\?`. Don't take this as an absolute, definitive answer; play around with it to suit your needs. – Tom Lord Jul 01 '16 at 13:49
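A rough sketch of the "after a tab" tweak from the last comment. The input here is a hypothetical, simplified stand-in for the scraped string in the comments, and it assumes the tabs and newlines are real whitespace characters rather than escaped `\\t` text.

```python
import re

# Hypothetical, shortened input with real tab/newline characters.
text = "Response time - HTTP compression\t\t\n\t\tHow to troubleshoot sudden CPU spikes?"

# Tab-aware pattern from the comment above, verbatim.
pattern = re.compile(r"""(^|(?<=[.?!)\t]))\s*[A-Za-z,;'"\s]+\?""")

# The attempt at the start of the string stops at the hyphen; the next
# allowed start is right after a tab, and \s* then skips the remaining
# whitespace before the question.
questions = [m.group(0).strip() for m in pattern.finditer(text)]

print(questions)
# ['How to troubleshoot sudden CPU spikes?']
```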
0

Thank you for helping me out. The answer was provided by @Fredrik and can be found here: https://regex101.com/r/rT1mQ0/2

\s*([^.?]*(?:how|can|what|where|describe|who|when)[^.?]*?\s*\?)
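A minimal sketch, assuming the sample string from the question, of what this pattern returns with `re.findall`. As written, the keyword list has no `is`, so "Is this a question?" is skipped. The second pattern adds `is`, as in the variant quoted in the comments on the first answer.

```python
import re

# Sample string from the question.
message = ("This is a string.Is this a question?What is the Question? "
           "I Dont know what the question is. Can you please list out the question?")

# Pattern as posted in this answer.
posted = r'\s*([^.?]*(?:how|can|what|where|describe|who|when)[^.?]*?\s*\?)'

# Variant with "is" added to the keyword list.
with_is = r'\s*([^.?]*(?:how|can|what|where|is|describe|who|when)[^.?]*?\s*\?)'

# One capturing group, so findall returns that group for each match.
print(re.findall(posted, message, re.I))
# ['What is the Question?', 'Can you please list out the question?']
print(re.findall(with_is, message, re.I))
# ['Is this a question?', 'What is the Question?', 'Can you please list out the question?']
```

Either way, a sentence only matches if it contains one of the listed words and ends in `?`, so questions that use other words are missed.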