6

Good evening,

I'm trying to split a German address string into its parts in Java. Does anyone know a regex or a library that does this? It should split the string like the following:

Name der Straße 25a 88489 Teststadt
to
Name der Straße|25a|88489|Teststadt

or

Teststr. 3 88489 Beispielort (Großer Kreis)
to
Teststr.|3|88489|Beispielort (Großer Kreis)

Ideally the regex or library would still work when parts such as the zip code or the city are missing.

Is there any regex or library out there with which I could achieve this?

EDIT: Rules for German addresses:
Street: Characters, numbers and spaces
House no: Number and any characters (or space) until a series of numbers (zip) (at least in these examples)
Zip: 5 digits
Place or City: The rest maybe also with spaces, commas or braces

Matt
Christian Kolb
  • For those of us unfamiliar with German addresses, what is the rule? Is it "something with spaces but not numbers", "something with numbers but no spaces", "numbers and no spaces", "no numbers and no spaces"? – Oliver Charlesworth Mar 25 '12 at 20:16
  • You don't need a regex for this. Just split the string using a space delimiter and then join it using the bar `|` delimiter. Oli's comment above is also pertinent, though, as I am assuming that German addresses are split with spaces. – Robbie Mar 25 '12 at 20:18
  • @Robbie: I can't just split them by spaces, because a street name and a city/place can contain spaces too. – Christian Kolb Mar 25 '12 at 20:22
  • I don't think it is that easy. There are plenty of street names with spaces in them. Also, some people write '25 a' instead of '25a'. I would normally write my address with ',' to delimit the parts. Are you getting the addresses from some other system in a defined format? – bert Mar 25 '12 at 20:22
  • @Christian: Ok. Then the answer to your question is: yes, this can be done with a regex. – Oliver Charlesworth Mar 25 '12 at 20:23
  • @OliCharlesworth: I thought so, I hoped there is someone who has already done it and any one of you would have a link to it ;) – Christian Kolb Mar 25 '12 at 20:26
  • @bert Yes, I get them from another system, but the addresses are unnormalized and sometimes parts are missing. That's what makes it so complex. But I thought I can't be the first with this problem and there has to be a library or regex for this stuff. – Christian Kolb Mar 25 '12 at 20:32
  • The *normal* rule (the one that I’m familiar with) is to delimit each part by newlines or commas. I’ve never seen the above form without either … but it shouldn’t be too hard to deal with it; it doesn’t scale, however: there are other optional parts to the address which might render this ambiguous. – Konrad Rudolph Mar 25 '12 at 20:34
  • Some people write "D 72116" instead of just "72116", especially to disambiguate from Austrian and Swiss locations. – Ingo Mar 26 '12 at 17:30

6 Answers

17

I came across a similar problem, tweaked the solutions provided here a little, and arrived at this one, which also works but is (imo) a bit simpler to understand and to extend:

/^([a-zäöüß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/i

Here are some example matches.

It can also handle missing street numbers and is easily extensible by adding special characters to the character classes.

([a-zäöüß\s\d,.-]+?)                        # Street name (lazy)
([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?     # Street number (optional)

After that, there has to be the zip code. It is the only part that is absolutely necessary, because it is the only part with a fixed format. Everything after the zip code is treated as the city name.
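
A minimal sketch of how this pattern might be used from Java to produce the pipe-separated form asked for in the question. The class name `GermanAddressSplitter` and the `split` helper are made up for illustration. The `/i` flag is assumed to correspond to `Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE`, and each group is trimmed because the number group can pick up a trailing space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GermanAddressSplitter {
    // The pattern from above as a Java string. The umlauts and the sharp s are
    // written as Unicode escapes so that the source file stays ASCII-only.
    private static final Pattern ADDRESS = Pattern.compile(
        "^([a-z\u00e4\u00f6\u00fc\u00df\\s\\d.,-]+?)\\s*"       // 1: street (lazy)
        + "([\\d\\s]+(?:\\s?[-|+/]\\s?\\d+)?\\s*[a-z]?)?\\s*"    // 2: house number (optional)
        + "(\\d{5})\\s*"                                        // 3: zip code (required)
        + "(.+)?$",                                             // 4: city (optional)
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    /** Returns "street|number|zip|city", or null if the input has no zip code. */
    static String split(String address) {
        Matcher m = ADDRESS.matcher(address.trim());
        if (!m.matches()) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 1; i <= m.groupCount(); i++) {
            if (i > 1) {
                out.append('|');
            }
            String part = m.group(i);   // null when an optional group did not take part
            out.append(part == null ? "" : part.trim());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(split("Name der Stra\u00dfe 25a 88489 Teststadt"));
        System.out.println(split("Teststr. 88489 Ort"));
    }
}
```

In this sketch a missing house number or city simply leaves its field empty, while an input without a five-digit zip code yields `null`.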

F.P
6

I’d start from the back since, as far as I know, a city name cannot contain numbers (but it can contain spaces; first example I’ve found: “Weil der Stadt”). Then the five-digit number before that must be the zip code.

The number (possibly followed by a single letter) before that is the street number. Note that this can also be a range. Anything before that is the street name.

Anyway, here we go:

^((?:\p{L}| |\d|\.|-)+?) (\d+(?: ?- ?\d+)? *[a-zA-Z]?) (\d{5}) ((?:\p{L}| |-)+)(?: *\(([^\)]+)\))?$

This correctly parses even arcane addresses such as “Straße des 17. Juni 23-25 a 12345 Berlin-Mitte”.
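
To try the expression from Java, something like the following sketch should do. The class name `AddressRegexDemo` and the `parse` helper are hypothetical. The city is trimmed because its group can swallow the space in front of a parenthesised suffix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressRegexDemo {
    private static final Pattern ADDRESS = Pattern.compile(
        "^((?:\\p{L}| |\\d|\\.|-)+?) "          // 1: street (lazy)
        + "(\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) "    // 2: house number or range
        + "(\\d{5}) "                           // 3: zip code
        + "((?:\\p{L}| |-)+)"                   // 4: city
        + "(?: *\\(([^\\)]+)\\))?$");           // 5: optional suffix in parentheses

    /** Returns {street, number, zip, city, suffix} (suffix may be null), or null if nothing matches. */
    static String[] parse(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1), m.group(2), m.group(3), m.group(4).trim(), m.group(5)
        };
    }

    public static void main(String[] args) {
        String[] parts = parse("Stra\u00dfe des 17. Juni 23-25 a 12345 Berlin-Mitte");
        System.out.println(String.join("|", parts[0], parts[1], parts[2], parts[3]));
    }
}
```

Because every part except the suffix is mandatory in this expression, an address without a house number is rejected rather than partially parsed.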

Note that this doesn’t work with address extensions (such as “Gartenhaus” or “c/o …”). I have no clue how to handle those. I rather doubt that there’s a viable regular expression to express all this.

As you can see, this is quite a complex regular expression with lots of capture groups. If I were to use such an expression in code, I would use named captures (Java 7 supports them) and break the expression up into smaller morsels using the x flag. Unfortunately, Java makes this awkward: the flag exists (`Pattern.COMMENTS`), but without multi-line string literals the pattern still has to be glued together from escaped strings. This s*cks because it effectively renders complex regular expressions unusable.

Still, here’s a somewhat more legible regular expression:

^
(?<street>(?:\p{L}|\ |\d|\.|-)+?)\ 
(?<number>\d+(?:\ ?-\ ?\d+)?\ *[a-zA-Z]?)\ 
(?<zip>\d{5})\ 
(?<city>(?:\p{L}|\ |-)+)
(?:\ *\((?<suffix>[^\)]+)\))?
$

In Java 7, the closest we can achieve is this (untested; may contain typos):

String pattern =
    "^" +
    "(?<street>(?:\\p{L}| |\\d|\\.|-)+?) " +
    "(?<number>\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) " +
    "(?<zip>\\d{5}) " +
    "(?<city>(?:\\p{L}| |-)+)" +
    "(?: *\\((?<suffix>[^\\)]+)\\))?" +
    "$";
Konrad Rudolph
  • Works like a charm for your street string, but unfortunately not for the example street. You can test it with the code example of radzio and your Regex (with escaped parts) `^((?:\\p{L}| |\\d|\\.|-)+?) ((?:\\d+ ?- ?)\\d+ *[a-zA-Z]?) (\\d{5}) ((?:\\p{L}| |-)+)$` – Christian Kolb Mar 26 '12 at 07:32
  • @Christian Because I’m an idiot. Try again, I wrote the street number range the wrong way round and didn’t make it optional. It works on the example addresses now. – Konrad Rudolph Mar 26 '12 at 08:41
  • Found one missing part. Sometimes there are parentheses "(" and ")" in the place name which define the area. They aren't part of an official address, but they are found in such data. Would it be possible to add this to the regex? And if so, how? – Christian Kolb Mar 26 '12 at 15:52
  • @Christian Hmm, where exactly would you have the parentheses? Please edit the question with an example. – Konrad Rudolph Mar 26 '12 at 15:55
  • Is it possible to say the place can consist of any characters at all, except numbers? Because I found the following address: "Schwabacher Str. 22 48516 Eschborn, Taunus". An address with a comma ",". This won't work with the regex. – Christian Kolb Mar 26 '12 at 17:25
  • Sorry to bring up so much, but it also seems not to work with . or / in the Street when there is no space between them, like the following "Hölzlestr.44/1". And would it be possible to make the house no optional? Sometimes there is just one house in a street and then they don't put a house no. Sorry again for asking for so much and thanks again for all your help. – Christian Kolb Mar 26 '12 at 17:37
  • Oh my. But yes, this should work. Just adapt the “city” part of the regular expression and either add the comma to the alternatives, or replace all those alternatives by a wildcard – in fact, city and suffix could probably be merged by just saying `(?<city>.+)` in their place. My original expression is quite careful in what it allows and what it doesn’t allow. – Konrad Rudolph Mar 26 '12 at 17:38
  • @Christian I think now you’ve reached the end of what’s possible with regular expressions, this is inherently ambiguous. What does “Straße 1 00000 Stadt” mean? A street called “Straße 1” with just one house? Or a house no. 1 in a street called “Straße”? I don’t see how to distinguish without machine learning. Surely not with regular expressions. But the space after the street name can be made optional. Just put a question mark on the space after the “street” part of the regular expression. – Konrad Rudolph Mar 26 '12 at 17:42
  • Yes you're right. A regex isn't a brain :) But one more thing should be possible: Using "/" in a house no. Like "Hölzlestr.44/1". What do I have to change in the regex to let it work with it? – Christian Kolb Mar 26 '12 at 17:54
  • @Christian Hmm. House numbers already allow ranges. Just replace the ` ?- ?` (note the spaces) part with ` ?[-/] ?` … this should work. – Konrad Rudolph Mar 26 '12 at 17:59
2

Here is my suggestion, which could be fine-tuned further, e.g. to allow missing parts.

Regex Pattern:

^([^0-9]+) ([0-9]+.*?) ([0-9]{5}) (.*)$
  • Group 1: Street
  • Group 2: House no.
  • Group 3: ZIP
  • Group 4: City
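
Applied from Java, the four groups could be joined like this. It is only a sketch: the class name `SimpleAddressSplitter` and the `split` helper are invented for the example, and `String.join` assumes Java 8 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleAddressSplitter {
    private static final Pattern ADDRESS =
        Pattern.compile("^([^0-9]+) ([0-9]+.*?) ([0-9]{5}) (.*)$");

    /** Joins the four groups with '|', or returns null when the pattern does not match. */
    static String split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        // Group 1: street, group 2: house no., group 3: ZIP, group 4: city
        return String.join("|", m.group(1), m.group(2), m.group(3), m.group(4));
    }
}
```

Since all four groups are mandatory here, an address that lacks a house number or a zip code does not match at all.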
Michael Schmeißer
  • This regex is broken since street names *can contain numbers*. For instance (but not exclusively), streets can be numbered before being assigned names so that you end up with “Straße 42”. Another example is the “Straße des 17. Juni”. – Konrad Rudolph Mar 25 '12 at 20:35
  • OP didn't mention numbers in street names. Maybe not necessary to hold for those? – keyser Mar 25 '12 at 20:42
  • @KonradRudolph You're right. That's a possibility which completely slipped my mind. Is there a "system" that defines how a German address is built? – Christian Kolb Mar 25 '12 at 20:44
  • @KonradRudolph The question clearly defines the street part as "Street: Characters and spaces until a number" so my regex is **not** broken. I just answered the question. If, as Christian confirmed, the question is not correct considering this part, a different solution may be needed, but I would ask Christian to change the question then as well. – Michael Schmeißer Mar 25 '12 at 21:43
  • @Michael Yes, the question didn’t specify the input correctly. The regex is *still* broken on realistic input. Not your fault, still true. – Konrad Rudolph Mar 25 '12 at 23:55
  • @KonradRudolph We have a different understanding of "broken" then. To me, something which fulfills the specification (the question in this case) is not broken. – Michael Schmeißer Mar 26 '12 at 07:05
  • @Michael Indeed it seems this way. If the specs are clearly broken, so is the code fulfilling the specs. After all, we program for the real world, not for some idealistic model described on paper. – Konrad Rudolph Mar 26 '12 at 08:37
  • @KonradRudolph Many software developers program to fulfill contracts (based on a specification) which have been made with customers who want to get the software. It often happens then that the customer did not say what they actually wanted in their "real world". The programmer, however, cannot arbitrarily guess what the customer may want to get, because if they guess wrong, they will cause significant losses to the company. So I would not take a guess against my specification. The most I would do is ask for clarification if it seems really weird. – Michael Schmeißer Mar 26 '12 at 19:58
1
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressSplitter {
    public static void main(String[] args) {
        String data = "Name der Strase 25a 88489 Teststadt";
        // Groups: street (letters and spaces), house number, zip, city
        String regexp = "([ a-zA-Z]+) (\\w+) (\\d+) ([a-zA-Z]+)";

        Pattern pattern = Pattern.compile(regexp);
        Matcher matcher = pattern.matcher(data);

        if (matcher.find()) {
            // Group 0 is the whole match; groups 1..n are the address parts
            for (int i = 0; i <= matcher.groupCount(); i++) {
                System.out.println(matcher.group(i));
            }
        } else {
            System.out.println("nothing found");
        }
    }
}

I guess it doesn't work with German umlauts, but you can fix this on your own. Anyway, it's a good starting point.

I recommend visiting this site; it's a great resource on regular expressions. Good luck!

Radek Busz
0

At first glance it looks like splitting on whitespace would do it. Looking closer, however, the address always has 4 parts, and the first part can contain whitespace.

What I would do is something like this (pseudocode):

address[4] = empty
split[?] = address_string.split(" ")
address[3] = split[last]
address[2] = split[last - 1]
address[1] = split[last - 2]
address[0] = join split[first] through split[last - 3] with whitespace, trim trailing whitespace with trim()

However, this will only handle one form of address. If addresses are written in multiple ways, it could be much trickier.
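
The pseudocode above could be written in Java roughly as follows. This is a sketch under the same assumption that the last three space-separated tokens are house number, zip and city. The class name `WhitespaceAddressSplitter` is made up, and the length check is an addition that the pseudocode does not have.

```java
import java.util.Arrays;

class WhitespaceAddressSplitter {
    /**
     * Treats the last three space-separated tokens as house number, zip and
     * city, and everything before them as the street.
     */
    static String[] split(String addressString) {
        String[] tokens = addressString.trim().split(" +");
        if (tokens.length < 4) {
            throw new IllegalArgumentException("expected at least 4 tokens: " + addressString);
        }
        int last = tokens.length - 1;
        String[] address = new String[4];
        address[3] = tokens[last];       // city
        address[2] = tokens[last - 1];   // zip
        address[1] = tokens[last - 2];   // house number
        // Everything before the house number is joined back into the street name.
        address[0] = String.join(" ", Arrays.copyOfRange(tokens, 0, last - 2));
        return address;
    }
}
```

A city that itself contains spaces, such as `Beispielort (Großer Kreis)`, shifts every field: the last three tokens are then `Beispielort`, `(Großer` and `Kreis)`.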

vgel
0

Try this:

^[^\d]+[\d\w]+(\s)\d+(\s).*$

It captures a group for each of the spaces that delimit the four sections of the address.

OR

This one gives you groups for each of the address parts:

^([^\d]+)([\d\w]+)\s(\d+)\s(.*)$

I don't know Java, so I'm not sure of the exact code for replacing captured groups.
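
In Java, captured groups can be referenced as `$1`, `$2` and so on in the replacement string passed to `Matcher.replaceFirst`. The sketch below uses a slightly adjusted variant of the second pattern: a `\s` is placed between the first two groups (an assumption, not part of the original pattern) so that the street group does not keep its trailing space. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class GroupReplacementExample {
    // Variant of the second pattern above, with the separating space moved
    // out of group 1 so the street is captured without a trailing blank.
    private static final Pattern ADDRESS =
        Pattern.compile("^([^\\d]+)\\s([\\d\\w]+)\\s(\\d+)\\s(.*)$");

    /** Rewrites "street number zip city" as "street|number|zip|city". */
    static String toPipeSeparated(String address) {
        // $1..$4 refer to the four capture groups; '|' is literal in a replacement.
        return ADDRESS.matcher(address).replaceFirst("$1|$2|$3|$4");
    }
}
```

If the pattern does not match, `replaceFirst` returns the input unchanged, so callers may want to check `matches()` first.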

Robbie