Regular Expression to split text based on different patterns (within a single expression)

Question

I have some patterns that detect questions, and I split the text on them. I'm relying on a few assumptions:

Every pattern starts with a \n
Every pattern ends with \s+

The patterns I want to detect are:

<NUM>.
Q <NUM>.
Q <NUM>
<Q.NUM.>
<NUM>
Question <NUM>
<Example>
Problem <NUM>
Problem:
<Alphabet><Number>.
<EXAMPLE>
Example <NUM>

Someone suggested the regex below:

((Q|Question|Problem:?|Example|EXAMPLE)\.? ?\d+\.? ?|(Question|Problem:?|Example|EXAMPLE) ?)

but it also matches in the middle of a line, which is a problem for me because Q. or Example. 2 can appear in the middle of the string too, and it does not capture <NUM>.

The list is ordered by priority, so the best I could come up with is compiling one expression per pattern and looping over them in priority order, for example:

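import re

# Raw strings keep \d and \s as regex escapes instead of Python string escapes.
QUESTIONS = [
    re.compile(r"\n\d+\."),
    re.compile(r"\nQ.\s*\d+\."),        # the unescaped "." after Q matches any character
    re.compile(r"\nExample.\s*\d+\."),  # same here after "Example"
]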

but this is very inefficient. How can I combine these into one expression?

HERE IS THE TEST STRING:

'TEStlabZ\nEDULABZ\nINTERNATIONAL\nLOGARITHMS AND INDICES\n\nQ.1. (A) Convert each of the following to logarithmic form.\n(i) \\( 5^{2}=25 \\)\n(ii) \\( 3^{-3}=\\frac{1}{27} \\)\n(iii) \\( (64)^{\\frac{1}{3}}=4 \\)\n(iv) \\( 6^{0}=1 \\)\n(v) \\( 10^{-2}=0.01 \\) (vi) \\( 4^{-1}=\\frac{1}{4} \\)\nAns. We know that \\( a^{b}=x \\Rightarrow b=\\log _{a} x \\)\n(i) \\( 5^{2}=25 \\quad \\therefore \\log _{5} 25=2 \\)\n(ii) \\( 3^{-3}=\\frac{1}{27} \\therefore \\log _{3}\\left(\\frac{1}{27}\\right)=-3 \\)\n(iii) \\( (64)^{\\frac{1}{3}}=4 \\therefore \\log _{64} 4=\\frac{1}{3} \\)\n(iv) \\( 6^{0}=1 \\quad \\therefore \\log _{6} 1=0 \\)\n(v) \\( 10^{-2}=0.01 \\therefore \\log _{10}(0.01)=-2 \\)\n(vi) \\( 4^{-1}=\\frac{1}{4} \\therefore \\log _{4}\\left(\\frac{1}{4}\\right)=-1 \\)\nQ.1. (B) Convert each of the following to exponential form.\n(i) \\( \\log _{3} 81=4 \\)\n(ii) \\( \\log _{8} 4=\\frac{2}{3} \\)\n(iii) \\( \\log _{2} \\frac{1}{8}=-3 \\)\n(iv) \\( \\log _{10}(0.01)=-2 \\)\n(v) \\( \\log _{5}\\left(\\frac{1}{5}\\right)=-1 \\) (vi) \\( \\log _{a} 1=0 \\)\nAns.\n(i) \\( \\log _{3} 81=4 \\quad \\therefore 3^{4}=81 \\)\n(ii) \\( \\log _{8} 4=\\frac{2}{3} \\quad \\therefore 8^{\\frac{2}{3}}=4 \\)\n(iii) \\( \\log _{2} \\frac{1}{8}=-3 \\quad \\therefore \\quad 2^{-3}=\\frac{1}{8} \\)\n(iv) \\( \\log _{10}(0.01)=-2 \\quad \\therefore \\quad 10^{-2}=0.01 \\)\n(v) \\( \\log _{5}\\left(\\frac{1}{5}\\right)=-1 \\quad \\therefore \\quad 5^{-1}=\\frac{1}{5} \\)\n(vi) \\( \\log _{a} 1=0 \\)\n\\( \\therefore a^{0}=1 \\)\nMath Class IX\n1\nQuestion Bank'

Does `(?im)^(?!$)(?:(Question|Problem:?|Example|[A-Z])[. ]?)?(\d+[. ]?)?` work as you expect? See [the regex demo](https://regex101.com/r/QDiaD0/1). — Wiktor Stribiżew, Nov 23 '22 at 09:48
@WiktorStribiżew Thanks a lot. It's working with most of the cases. Just a small help though, how could I add `\s+` condition in the pattern? Because sometimes I could get `\n2.2` or `\n2` as part of text while my question will start most likely with `\n2.` or `\n2` — Deshwal, Nov 23 '22 at 09:55
Do you want to say there MUST be any whitespace immediately on the right? Then you can add `(?=\s)` at the end of the regex, see [this regex demo](https://regex101.com/r/QDiaD0/2). Note I left out the `\n` at the start since `^` with `m` flag means the match can only occur at the start of string or after a newline. Is that fine? — Wiktor Stribiżew, Nov 23 '22 at 09:59
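To illustrate the point in the comment above, here is a minimal sketch comparing a leading `\n` with `(?m)^`, both followed by the `(?=\s)` lookahead. The sample string and the simplified `\d+\.` heading pattern are assumptions made for the illustration, not the asker's data.

```python
import re

# Made-up sample: two numbered headings and one line that only looks like one.
text = "1. first\n2. second\n2.2 not a heading"

# A literal \n can never match the heading on the very first line.
with_newline = re.findall(r"\n\d+\.(?=\s)", text)

# (?m)^ matches at the start of the string and after every newline.
with_anchor = re.findall(r"(?m)^\d+\.(?=\s)", text)

print(with_newline)  # ['\n2.']
print(with_anchor)   # ['1.', '2.']
```

In both cases the lookahead rejects `2.2`, because the character after `2.` is a digit rather than whitespace.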
Please let me know if this is worth posting, I could explain the choice of patterns. — Wiktor Stribiżew, Nov 23 '22 at 11:13
Hey @WiktorStribiżew, it's not working with `re.split()` on the text given in the question. You can [check it out here](https://regex101.com/r/bm6aO2/1). The questions do get split with `re.compile("\nQ.\s*\d+\.\s+")` — Deshwal, Nov 24 '22 at 06:25
1. Never test online with string literals, only literal strings ([how to obtain it](https://ideone.com/2iCK1c) and [here is the correct demo](https://regex101.com/r/bm6aO2/2)). 2. Why `re.split`? Please share the relevant code in the question body. — Wiktor Stribiżew, Nov 24 '22 at 08:35
Oh okay, got it. Thanks. Also, I'm using `re.split` to split the OCR results into their respective question-answer pairs. What else can I use here? Would it be any different from using `re.match` or `re.findall`? `[A-Z]` in the regex is matching all the letters here (not needed, as it's mostly `Q.1 , Q.10., Q 13, Q 23.` etc.) — Deshwal, Nov 24 '22 at 10:57
Try using `re.findall(r'(?m)^(?!$)((?:(?:(?i:Question|Problem:?|Example)|[A-Z])[. ]?)?(?:\d+[. ]?)?)[^\S\r\n]+(.*)', text)`. See https://regex101.com/r/QDiaD0/4 — Wiktor Stribiżew, Nov 24 '22 at 11:04
@WiktorStribiżew It's actually detecting all of the text as you can see in your demo. Almost all of the text has been detected. — Deshwal, Nov 24 '22 at 11:40
@WiktorStribiżew [The first one works](https://regex101.com/r/4VMdmV/1) with a modification for `Q.1 , Q.10., Q 13, Q 23.` only, **minus** matching every letter on every line. — Deshwal, Nov 24 '22 at 11:58
You should not test my last regex with `Python` option at regex101, there is a bug and the site does not handle that regex well. Test with PCRE option. So, - `(?m)^(?!$)(?:((?i:Question|Problem:?|Example)|[A-Z])[. ]?)?(\d+[. ]?)?(?=\s)` - https://regex101.com/r/4VMdmV/2 works correctly? — Wiktor Stribiżew, Nov 24 '22 at 12:00
@WiktorStribiżew Can you please put it as answer and explain the regex like you usually do so that I can understand my own `regex` for `Answer` too. — Deshwal, Nov 25 '22 at 04:58

score 1 · Accepted Answer · answered Nov 25 '22 at 08:55

You can use

(?m)^(?!$)(?:((?i:Question|Problem:?|Example)|[A-Z])[. ]?)?(\d+[. ]?)?(?=\s)

See the regex demo.

Details:

(?m)^ - start of a line (m allows ^ to match any line start position)
(?!$) - no end of line allowed at the same location (i.e. no empty line match allowed)
(?:((?i:Question|Problem:?|Example)|[A-Z])[. ]?)? - an optional sequence of
- ((?i:Question|Problem:?|Example)|[A-Z]) - Group 1: Question, Problem, Problem: or Example case insensitively, or an uppercase letter
- [. ]? - an optional space or .
(\d+[. ]?)? - an optional capturing group with ID 2 matching one or more digits and then an optional . or space
(?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location.
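
As a rough sketch of how the pattern described above could be used for splitting, the snippet below cuts a string just before each heading match. The helper name `split_on_headings` and the sample text are hypothetical and are not part of the original answer. Empty matches are skipped, since both groups in the pattern are optional.

```python
import re

# The pattern from the answer above, unchanged.
HEADING = re.compile(
    r"(?m)^(?!$)(?:((?i:Question|Problem:?|Example)|[A-Z])[. ]?)?(\d+[. ]?)?(?=\s)"
)

def split_on_headings(text):
    """Cut `text` just before every non-empty heading match (hypothetical helper)."""
    # Both groups are optional, so the pattern can match an empty string
    # (for example on a line that starts with whitespace); skip those.
    starts = [m.start() for m in HEADING.finditer(text) if m.group(0)]
    if not starts:
        return [text]
    chunks = [text[:starts[0]]] if starts[0] > 0 else []
    bounds = starts + [len(text)]
    chunks.extend(text[a:b] for a, b in zip(bounds, bounds[1:]))
    return chunks

# Made-up sample, not the OCR text from the question.
sample = "Intro\nQ.1. First\n(i) part\nQ 2 Second\n3. Third\n2.2 is not a heading\n"
chunks = split_on_headings(sample)
for chunk in chunks:
    print(repr(chunk))
```

On this sample the cuts fall before `Q.1.`, `Q 2` and `3.`, while `2.2` is left alone because the lookahead needs whitespace right after the match. Slicing at match offsets keeps the heading text inside each chunk, which `re.split` would drop unless the whole pattern were wrapped in a capturing group.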

Wiktor Stribiżew


One question, What if I have to change the `(?m)^` (which I think is for `\n` + Starting position) to something like `>\s*\n*` (means the char `>` followed by 0+ spaces OR 0+ new line chars in starting of string ) what do I have to change in the regex? It means, instead of the new line and start of line? – Deshwal Nov 27 '22 at 07:24
@Deshwal `\s` also matches `\n`, so you should not use `>\s*\n*`, you should either use `>[^\S\n]*\n*` or ` >\s*` depending on what you need. So, replace `(?m)^` with that pattern. – Wiktor Stribiżew Nov 27 '22 at 11:06
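A small sketch of the difference described in the comment above, using a made-up sample string (an assumption, not the asker's data): `\s*` already consumes newlines, whereas `[^\S\n]*` matches only whitespace other than `\n`.

```python
import re

# Made-up sample: ">" followed by two spaces, two newlines, then a heading.
sample = ">  \n\nQ.1. The length"

# \s already covers \n, so in >\s*\n* the trailing \n* has nothing left to match.
assert re.fullmatch(r"\s", "\n")

any_whitespace = re.match(r">\s*", sample).group(0)
horizontal_then_newlines = re.match(r">[^\S\n]*\n*", sample).group(0)
horizontal_only = re.match(r">[^\S\n]*", sample).group(0)

print(repr(any_whitespace))            # '>  \n\n'
print(repr(horizontal_then_newlines))  # '>  \n\n'
print(repr(horizontal_only))           # '>  '
```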
Hey, Wiktor, what do I need to change if I want to remove the line starting condition at all. It seems like my things can be in the middle of the paragraph (may or may not be preceded by `>` or `\s+`.) what do I need to change in that one? – Deshwal Dec 01 '22 at 08:49
I removed `(?m)^` and did it with removing `(?!$)` too – Deshwal Dec 01 '22 at 08:56
`^` just meant to match at the start of a line, so removing it, you can find matches anywhere inside the string. – Wiktor Stribiżew Dec 01 '22 at 09:07
Actually instead of latex, I've been working on `html`. I know, I shouldn't use `regex` to parse HTML but it's very curated from an API and I want to split this HTML based on Text only. My data looks like this: `[CBSE Marking Scheme, 2015]
Ques. 6. The length`, `
Problem. 8. The `, `

AI] Question. 9. A h` etc. I tested individual patterns on multiple data points according to book's patterns. Just want a single `regex` like you mentioned which can do it automatically for all.
– Deshwal Dec 01 '22 at 09:25

score 0 · Answer 2 · answered Nov 23 '22 at 04:08

No shame in just doing the dumb solution:

^(\d+\.|Q \d+\.|Q \d+|Q\.\d+\.|\d+|Question \d+|Example( \d+)?|Problem \d+|Problem:|[A-Z]\d\.|EXAMPLE)\s+
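
A minimal sketch of how this alternation might be used from Python. Two things here are assumptions on top of the answer: the `re.MULTILINE` flag, so that `^` matches at every line start, and the switch to non-capturing groups, so that `re.split` does not insert the group values into its result. The sample text is made up.

```python
import re

# Same alternatives as above, with (?:...) instead of (...) (an assumption
# made so that re.split returns only the text between headings).
PATTERN = re.compile(
    r"^(?:\d+\.|Q \d+\.|Q \d+|Q\.\d+\.|\d+|Question \d+|Example(?: \d+)?"
    r"|Problem \d+|Problem:|[A-Z]\d\.|EXAMPLE)\s+",
    re.MULTILINE,
)

# Made-up sample, not the OCR text from the question.
sample = (
    "Preface\n"
    "Q.1. Convert to log form\n"
    "Example 2 Solve for x\n"
    "Problem: Find y\n"
    "12 Evaluate z\n"
)

# The headings themselves, with the trailing whitespace stripped.
headings = [m.group(0).strip() for m in PATTERN.finditer(sample)]
# The text between headings; the headings are removed by the split.
bodies = PATTERN.split(sample)

print(headings)  # ['Q.1.', 'Example 2', 'Problem:', '12']
print(bodies)
```

The alternatives are tried left to right, so their order acts as the priority list from the question.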

dc-ddfe


Can you please explain it too so that I can make the necessary changes if required. Thanks. – Deshwal Nov 24 '22 at 11:40
It's just [first format] or [second format] or [third format] etc. For example `\d+\.` matches any number followed by a dot (like `127.` as shown in your first example). – dc-ddfe Nov 24 '22 at 23:29
