
I am trying to find addresses in different texts. It works quite well, except that it also matches a word followed by a date (foobar 22.01.2012 => address: foobar 22). So I would like to improve the regex so that a street number MUST NOT be followed by "(.|:)\d", that is, by a "." or ":" and then a digit.

This is what I have:

(?<str>\b([a-zA-Z]+-*[a-zA-Z]+(-|\s)*([a-zA-Z]|-)+)\b\.?\s{1})(?<no>\d+(\s?[a-zA-Z])?\b)

A representative text:

Consultation hours
Monday, the 06.02. until Friday, the 10.02.2012 and
Monday, the 13.02. until Tuesday, the 14.02.2012,
each 14.00-15.30 o'clock, second floor,
Am Fasanengarten 12 foobar
Schlossstr. 34

What should be found?
Am Fasanengarten 12
Schlossstr. 34

What is found?
the 06
the 10
the 13
the 14
each 14
Am Fasanengarten 12
foobar // why is this a match? Without number?
Schlossstr. 34

I tried various positive and negative lookbehinds and lookaheads, but with no luck.

stema
  • What is supposed to distinguish "Am Fasanengarten 12 Schlossstr. 34" from the rest of the text? It has words composed of alphabetic characters, numbers, and a period, and each word is separated by spaces. That is also true of the text as a whole. Is the fact that the words are capitalized supposed to be significant? I think you can take a lesson from this: whenever you set out to write a regexp, you need to be very clear on exactly what you want to match and what you don't want to match. – Alex D Jan 22 '12 at 17:47

1 Answer


Try this:

(?<str>\b(?:[a-zA-Z]+-*[a-zA-Z]+(?:[ \t-])*(?:[a-zA-Z]|-)+)\b\.?\s)(?<no>\d+(?:\s?[a-zA-Z])?\b)(?![.:]\d)

See it here on Regexr

The negative lookahead (?![.:]\d) at the end ensures that the street number is not followed by a "." or ":" and then a digit (\d).
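To check the effect of the lookahead, here is a minimal sketch in Python that runs the pattern above against the representative text from the question. Python is an assumption on my part (the question does not name a regex flavor); its re module spells named groups as (?P<name>...) instead of (?<name>...), so that is the only change to the pattern. The names ADDRESS, TEXT and found are made up for this example.

```python
import re

# The pattern from this answer. The only change is the named-group syntax:
# Python's re module needs (?P<name>...) rather than (?<name>...).
ADDRESS = re.compile(
    r"(?P<str>\b(?:[a-zA-Z]+-*[a-zA-Z]+(?:[ \t-])*(?:[a-zA-Z]|-)+)\b\.?\s)"
    r"(?P<no>\d+(?:\s?[a-zA-Z])?\b)"
    r"(?![.:]\d)"  # reject a number that is followed by "." or ":" and a digit
)

# The representative text from the question.
TEXT = """Consultation hours
Monday, the 06.02. until Friday, the 10.02.2012 and
Monday, the 13.02. until Tuesday, the 14.02.2012,
each 14.00-15.30 o'clock, second floor,
Am Fasanengarten 12 foobar
Schlossstr. 34
"""

found = [m.group(0) for m in ADDRESS.finditer(TEXT)]
print(found)  # ['Am Fasanengarten 12', 'Schlossstr. 34']
```

The candidates "the 06", "the 10", "the 13", "the 14" and "each 14" are rejected because each number is directly followed by "." and a digit. The \b after the number also stops the engine from backtracking to a shorter number such as "0" out of "06".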

foobar // why is this a match? Without number?
Schlossstr. 34

This is a match because you allow \s between the words of the street name:

(?<str>\b([a-zA-Z]+-*[a-zA-Z]+(-|\s)*([a-zA-Z]|-)+)\b\.?\s{1})(?<no>\d+(\s?[a-zA-Z])?\b)
                                 ^^ here

In my solution I replaced this with [ \t-], which allows only space, tab, and hyphen.

\s means "whitespace", and that includes the line break characters. That is why foobar is matched: if you had looked at the captured group, you would have seen that it matches the address "foobar Schlossstr. 34" across the line break.
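The line-break behaviour can be made visible by printing the captured groups with repr. This is a small sketch in Python (an assumed flavor, with the named groups written as (?P<name>...)) that runs the original pattern from the question on the last two lines of the sample text. ORIGINAL, SAMPLE and groups are names invented for the example.

```python
import re

# The original pattern from the question, with Python's (?P<name>...) syntax.
ORIGINAL = re.compile(
    r"(?P<str>\b([a-zA-Z]+-*[a-zA-Z]+(-|\s)*([a-zA-Z]|-)+)\b\.?\s{1})"
    r"(?P<no>\d+(\s?[a-zA-Z])?\b)"
)

# The last two lines of the representative text, joined by a line break.
SAMPLE = "Am Fasanengarten 12 foobar\nSchlossstr. 34"

groups = [(m.group("str"), m.group("no")) for m in ORIGINAL.finditer(SAMPLE)]
for street, number in groups:
    # repr() shows the "\n" that was captured inside the street name.
    print(repr(street), repr(number))

# \s accepts a line break, while the class [ \t-] does not.
newline_is_whitespace = re.fullmatch(r"\s", "\n") is not None
newline_in_class = re.fullmatch(r"[ \t-]", "\n") is not None
print(newline_is_whitespace, newline_in_class)
```

The second tuple has the street part 'foobar\nSchlossstr. ', so "foobar" is not a match on its own. It is the first word of a street name that continues on the next line.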

stema