I'm looking for a regex pattern to find German addresses.
The problem is that the format is a bit odd and changes frequently. Samples:

Falcken Str. 45 F
Heinrich-Heine-Straße 62A, Berlin-Kreuzberg
Lindenstrasse 113; Kreuzberg; 10969 Berlin
Erkstrasse 7; Neuköln; 12043 Berlin
Werbellin Strasse 69; Neuköln; 12053 Berlin
Anschrift; Rudolfstrasse 8-10; Friedrichshain; 10245 Berlin
Maybachufer 3, Neukölln, 12047, Berlin, Germany (?)
Skalitzer Strasse 31-32; Kreuzberg; 10999 Berlin
Mühlen Strasse 17; Friedrichshain; 10243 Berlin
Am Flutgraben 1; Treptow; 12435 Berlin; Germany (?)
Rigaer Strasse 89; Friedrichshain; 10247 Berlin
Köpenicker Str. 12, 10997 Berlin-Kreuzberg
Schliemannstraße 27; 10437; Berlin
Michaelkirchstr. 32, 10179 Berlin
Maybachufer 44, Neukölln, 12045, Berlin, Germany
Alexanderstrasse 11; Mitte; 10178 Berlin
Café Dritter Raum - Hertzbergstr. 14 - 12055 Berlin

I've tried to divide them into groups (at least [Address] [zipcode] [berlin]),
but I couldn't catch all of them. The best I could come up with was

^([a-zäöüß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?;*\s*(\d{5})\s*(.+)?$

(thanks to another question on Stack Overflow).

Any ideas?

Asaf
  • So you want to separate German from non-German addresses? Then provide some sample input lines and say what to match. If you want to extract certain information from them, specify that too. – buckley Jun 07 '12 at 20:50
  • I want to separate them into groups (address, city and zipcode) in order to insert them into a DB – Asaf Jun 07 '12 at 20:53
  • OK, and now for the first part of my question: do you let your regex loose only on German addresses? – buckley Jun 07 '12 at 20:54
  • Regexes are not magic; one needs to know the format, which does not seem to be well defined here. You could match the postal code easily enough and get the substrings before and after it, but beyond that it gets complicated. – Qtax Jun 07 '12 at 20:55
  • Are you sure you should post a list of actual addresses here? – Junuxx Jun 07 '12 at 21:25
  • They're random addresses; the format is the same, though. – Asaf Jun 08 '12 at 09:01

1 Answer

Irregular data leads to inconsistent results. In addition, regular expressions are not the right hammer for every crystal decanter.

From a practical point of view, I'd just parse the standardized addresses (whatever that means for German addresses), and dump the leftovers to another file for manual address correction. If most of your addresses are malformed, then you might need to get access to an address-correction database of some sort--usually commercial, and often available from the postal service involved.
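
As a rough illustration of that triage, here is a minimal Python sketch. It assumes that a "standardized" line is one containing a five-digit postal code, with a street field ending in a house number somewhere before it and a city after it. The names `parse_address` and `triage`, the three patterns, and the output field names are all made up for this example and only cover the layouts in the question's samples, so treat it as a starting point rather than a real address parser.

```python
import re

# Five-digit German postal code, not part of a longer run of digits.
PLZ = re.compile(r"(?<!\d)\d{5}(?!\d)")
# Field separators seen in the samples: ';', ',' or a spaced hyphen ' - '.
SEP = re.compile(r"\s*[;,]\s*|\s+-\s+")
# A street field ends in a house number such as '14', '8-10' or '62A'.
STREET = re.compile(
    r"^(?P<street>.*?\D)\s*(?P<number>\d+(?:\s?-\s?\d+)?(?:\s?[A-Za-z])?)$"
)


def split_fields(text):
    """Split on the separators and drop empty fields."""
    return [p.strip() for p in SEP.split(text) if p.strip()]


def parse_address(line):
    """Return a dict of fields, or None if the line needs manual review."""
    m = PLZ.search(line)
    if m is None:
        return None  # no postal code: leave it for a human
    before = split_fields(line[:m.start()])
    after = split_fields(line[m.end():])
    if not after:
        return None  # nothing after the postal code, so no city
    for i, part in enumerate(before):
        s = STREET.match(part)
        if s:
            # Assumption: a field right after the street is the district.
            district = before[i + 1] if i + 1 < len(before) else None
            return {
                "street": s.group("street"),
                "number": s.group("number"),
                "district": district,
                "zipcode": m.group(),
                "city": after[0],
            }
    return None  # no field looked like 'street + house number'


def triage(lines):
    """Split lines into (parsed records, leftovers for manual correction)."""
    parsed, leftovers = [], []
    for line in lines:
        record = parse_address(line)
        if record is None:
            leftovers.append(line)
        else:
            parsed.append(record)
    return parsed, leftovers
```

With the question's samples, a line such as `Lindenstrasse 113; Kreuzberg; 10969 Berlin` ends up in the parsed list, while `Falcken Str. 45 F` has no postal code and lands in the leftovers. Anchoring on the postal code first is the simplest part to get right; everything around it is guesswork that should be checked against the real data.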

Todd A. Jacobs