
Is there a way to match the following pattern in a string?

Pattern (case insensitive): "\bfactuur(nummer)" **OR** "Nr." **OR** "Nr(:)", followed by the actual value "\d+" OR "\d{3,4} - \d{3,4}".

Nice to have (if it is realistic at all): "Factuur\n", then "Nr." and "\d+", ignoring everything in between.

Tested on:

Factuur: 2018-4005 

Factuur

Nr. 90424571 

 Factuurnummer: 2019-010

 factuur : 281319261

factuurnummer: 63

Factuurnummer: 281319264565

Factuur assdg 236373

   Factuurnummer 281319265

Factuurnummer 0723 - 1345

Factuur nr. 180262

Factuurnummer : 6322232

DEMO Regex:

https://regex101.com/r/PuGrqn/37

Lieven Keersmaekers
rumba

1 Answer


Your regex needed just the following two changes to work for all your samples. You can use this regex:

\bfactuur(?:nummer|\n)?.*?(?<=\s)(\d+(?:\s*-\s*\d+)?)(?=\s|$)

Check online demo

Here are the two changes I made:

  • If factuur can be immediately followed by a newline instead of nummer, just put \n in alternation with nummer.
  • Extend (\d+) to (\d+(?:\s*-\s*\d+)?) so that it matches not only a single number but also, optionally, some whitespace, a hyphen, more whitespace, and a second number. That is why I added (?:\s*-\s*\d+)? after \d+.
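
As a rough illustration, the pattern can be tried against the question's samples in Python. This is only a sketch: the helper name `extract_invoice_number` is hypothetical, the `IGNORECASE` and `MULTILINE` flags are assumed to mirror the regex101 demo settings, and the "Factuur" / "Nr." sample is assumed to be separated by a single newline.

```python
import re

# Pattern copied from the answer. The flags are an assumption meant to
# mirror the case-insensitive, multiline settings of the online demo.
INVOICE_RE = re.compile(
    r"\bfactuur(?:nummer|\n)?.*?(?<=\s)(\d+(?:\s*-\s*\d+)?)(?=\s|$)",
    re.IGNORECASE | re.MULTILINE,
)


def extract_invoice_number(text):
    """Return the first captured invoice number, or None if nothing matches."""
    match = INVOICE_RE.search(text)
    return match.group(1) if match else None


samples = [
    "Factuur: 2018-4005",
    "Factuur\nNr. 90424571",  # assumes one newline between the two lines
    " Factuurnummer: 2019-010",
    " factuur : 281319261",
    "factuurnummer: 63",
    "Factuurnummer: 281319264565",
    "Factuur assdg 236373",
    "   Factuurnummer 281319265",
    "Factuurnummer 0723 - 1345",
    "Factuur nr. 180262",
    "Factuurnummer : 6322232",
]

for sample in samples:
    print(repr(sample), "->", extract_invoice_number(sample))
```

Because `.` does not match a newline by default, the `\n` alternative is what lets the match continue from "Factuur" onto the following "Nr." line. A blank line between the two would not be bridged by this pattern.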

Hope this solves the issue. Let me know if you have any further samples that don't match.

Edit: For capturing a string like this

Factuurnummer Factuurdate 1234 3-21-2019

The pattern can be enhanced to capture multiple numbers separated by space or hyphen using this regex,

\bfactuur(?:nummer|\n)?.*?(?<=\s)(\d+(?:\s*-?\s*\d+)*)(?=\s|$)

Check this demo with additional sample data
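
A minimal Python sketch of the enhanced pattern, again with assumed flags and a hypothetical helper name (`extract_number_run`). It returns the whole run of numbers as one string, so separating the invoice number from the date is left to the caller.

```python
import re

# Enhanced pattern copied from the edit above; the flags are an assumption.
INVOICE_RUN_RE = re.compile(
    r"\bfactuur(?:nummer|\n)?.*?(?<=\s)(\d+(?:\s*-?\s*\d+)*)(?=\s|$)",
    re.IGNORECASE | re.MULTILINE,
)


def extract_number_run(text):
    """Return the captured run of numbers, or None if nothing matches."""
    match = INVOICE_RUN_RE.search(text)
    return match.group(1) if match else None


line = "Factuurnummer Factuurdate 1234 3-21-2019"
run = extract_number_run(line)
print(run)          # the full captured run
print(run.split())  # whitespace-separated tokens within the run
```

Splitting on whitespace separates "1234" from "3-21-2019" here, but it would also break a value such as "0723 - 1345" into three tokens, so the post-processing step has to know which layout it is dealing with.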

Pushpesh Kumar Rajwanshi
  • Thanks. I am trying to figure out how to make string extraction reliable using regex. I have various strings, and factuurnummer is one of the fields that I want to extract. In this particular use case it works, although I guess other strings may cause problems, meaning I will always have to adjust the patterns. Is there any workaround? There might be tables in the initial PDF, but the output string is printed line by line, so it would look like this: Factuurnummer Factuurdate 1234 3-21-2019 – rumba Mar 21 '19 at 12:47
  • @rumba: I understand that the text can be large and that it is not always clear how to match or filter the data you want. You will either need to describe in which cases the data occurs, so that rules for a proper regex can be derived, or provide enough samples to cover all your cases. If you know the basics of regex, you can make some tweaks yourself, and I can help with the others. That should work well, as it will improve your skills too. – Pushpesh Kumar Rajwanshi Mar 21 '19 at 12:54
  • I am looking for a way to add "Factuurbedrag" to the existing regex so that it finds the value before it (factuurbedrag = totaal). Here is the regex that searches for Totaal (https://regex101.com/r/PuGrqn/41); I want it to remain the same. Here is the regex for Factuurbedrag (https://regex101.com/r/PuGrqn/42); it mistakenly outputs everything after factuurbedrag as well. How can I improve it and add it to the main totaal pattern? – rumba Mar 21 '19 at 16:26
  • @rumba: I am not sure I understand your question, but do you want to capture the number in all the cases mentioned in your input? [Check this](https://regex101.com/r/PuGrqn/44). It captures the number whether it comes before or after the keyword, and whether the keyword is `totaal` or `Factuurbedrag`. Let me know if this is what you wanted. – Pushpesh Kumar Rajwanshi Mar 21 '19 at 20:12
  • That is exactly what I was trying to figure out, I appreciate it – rumba Mar 21 '19 at 20:46
  • Glad I could help :) – Pushpesh Kumar Rajwanshi Mar 21 '19 at 20:51
  • Testing this regex (**https://regex101.com/r/PuGrqn/51**), I noticed that the expression also matches the patterns listed under "Should not match". Is there a way I can separate them? – rumba Mar 25 '19 at 08:13
  • @rumba: OK, based on your data samples, it looks like all your valid matches contain a `%`, so you can use a positive lookahead `(?=[^%]*%)` at the beginning of your regex to reject all text that doesn't contain a `%`. [Check this](https://regex101.com/r/PuGrqn/52). Let me know if you face any issues. – Pushpesh Kumar Rajwanshi Mar 25 '19 at 08:41
  • For testing purposes I placed "BTW 21" in the Should Match section and it was highlighted as matched, although the same occurrence in Should not Match remains unmatched. **DEMO** https://regex101.com/r/PuGrqn/53 . Each line is a unique substring from a different string, to get wider coverage, as you proposed, and cover all possible cases. – rumba Mar 25 '19 at 12:30
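
The regexes behind the regex101 links are not reproduced in this thread, so the following is only a sketch of how a `(?=[^%]*%)` lookahead behaves, using a hypothetical `BTW` pattern and made-up sample text. It may be relevant to the last comment: `[^%]` also matches newlines, so the lookahead can be satisfied by a `%` on a later line. Excluding `\n` as well keeps the check on the current line.

```python
import re

# Hypothetical patterns for illustration only; they are not the regexes
# from the linked demos.
# LOOSE: the lookahead may run across line breaks until it finds a '%'.
LOOSE = re.compile(r"(?=[^%]*%)\bBTW\s+(\d+)", re.IGNORECASE)
# STRICT: the lookahead must find a '%' before the end of the current line.
STRICT = re.compile(r"(?=[^%\n]*%)\bBTW\s+(\d+)", re.IGNORECASE)

# Three lines: no '%', with '%', and no '%' (with nothing after it).
text = "BTW 21\nBTW 21 %\nBTW 21"

loose_starts = [m.start() for m in LOOSE.finditer(text)]
strict_starts = [m.start() for m in STRICT.finditer(text)]

print("loose match offsets: ", loose_starts)
print("strict match offsets:", strict_starts)
```

In this sample the loose variant matches the first line as well, because a `%` appears later in the text, while the last line is rejected by both variants since no `%` follows it.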