
I'm trying to write a Python regex to extract German addresses like the ones shown below.

Abc Gmbh Ensisheimer Straße 6-8 79346 Endingen

Def Gmbh Keltenstr . 16 77971 Kippenheim Deutschland

Ghi Deutschland Gmbh 53169 Bonn

Jkl Gmbh Ensisheimer Str . 6 -8 79346 Endingen

I wrote the code below to extract the individual address components, and also combined them into a single regex, but it still fails to detect the addresses above. Can anyone please help me with it?

import re

# TEST COMPANY NAME
string = 'Telekom Deutschland Gmbh 53169 Bonn Datum'
result = re.findall(r'([a-zA-Zäöüß]+\s*?[A-Za-zäöüß]+\s*?[A-Za-zäöüß]?)',string,re.MULTILINE)
print(result)

# TEST STREET NAME
result = re.findall(r'([a-zA-Zäöüß]+\s*\.)',string)
print(result)

# TEST STREET NUMBER
result = re.findall(r'(\d{1,3}\s*[a-zA-Z]?[+|-]?\s*[\d{1,3}]?)',string)
print(result)

# TEST POSTAL CODE
result = re.findall(r'(\d{5})',string)
print(result)

# TEST CITY NAME
result = re.findall(r'([A-Za-z]+)?',string)
print(result)

# TEST COMBINED ADDRESS COMPONENTS GROUP
result = re.findall(r'([a-zA-Zäöüß]+\s+?[A-Za-zäöüß]+\s+?[A-Za-zäöüß]+\s+([a-zA-Zäöüß]+\s*\.)+?\s+(\d{1,3}\s*[a-zA-Z]?[+|-]?\s*[\d{1,3}]?)+\s+(\d{5})+\s+([A-Za-z]+))',string)
print(result)

Please note my objective: if any of these addresses appear within a large paragraph of text, the regex should extract and print only the addresses. Can someone please help me?

Wiktor Stribiżew
Richie
  • 3
    This is close to impossible unless there are some other properties/markers in the text that suggest where an address starts, especially if addresses may omit parts like street names. I don't suppose you're only looking for "GmbH"es? – Tim Pietzcker Sep 10 '18 at 10:43
  • 2
    Provide _real_ text for this. Extracting addresses from arbitrary text is next to impossible. Your example only has `GmbH` in all its addresses; is it the same for your data? Do you want `UG`s as well? What about `eGmbH` and `gGmbH`? Are you able to use other lists/dicts to enhance this (e.g. a list of all German towns)? Company names do not have to follow strict rules: you could create an `it4you GmbH` if you had the 25k € and expenses, or `Richies Warengesellschaft mbH`, which would be legal as well – Patrick Artner Sep 10 '18 at 10:45
  • Hi @TimPietzcker, actually there is a solution for it in Java and it's really helpful. The code is below. I'm trying to convert it into Python. – Richie Sep 10 '18 at 10:47
  • `final Pattern pattern = Pattern.compile("([a-zA-Zäöüß]+\\s+?[A-Za-zäöüß]+)"// company name + "[\\s*.:\\/#,-]+?"// characters + "([a-zA-Zäöüß\\s\\d.,-]+)"// street name + "([\\d\\s]+(?:\\s?[-|+/]\\s?\\d+)?\\s*[a-z]?)?"// street code + "[\\s*.:\\/#,-]+?"// characters + "([A-Za-z]?[\\s*.:\\/#,-]?\\d{5})"// postal code + "[\\s*.:\\/#,]+?"// characters + "([A-Za-z]+)?");// city name` – Richie Sep 10 '18 at 10:48
  • Hi @PatrickArtner, Please read my above comment. There is a solution for it in java done by someone else here. I'm trying to convert it to Python. – Richie Sep 10 '18 at 10:49
  • Hi, Please refer this link. I'm trying to recreate this solution in Python [https://stackoverflow.com/questions/9863630/regex-for-splitting-a-german-address-into-its-parts] – Richie Sep 10 '18 at 10:51
  • It seems you already converted it to Python - your `# TEST COMBINED ADDRESS COMPONENTS GROUP` looks like it, so if that doesn't work, it's probably because of the reasons Patrick and I stated. – Tim Pietzcker Sep 10 '18 at 10:53
  • @TimPietzcker Yes, I tried my best, but when I executed the Java code it seemed to capture the address, while my Python code returns an empty string for the same input. – Richie Sep 10 '18 at 10:55
  • 2
    Then there probably is a typo somewhere. Have you tried https://regex101.com - you can try your regex on your text and debug it. Unless you share your text with us, we can't really do that for you... – Tim Pietzcker Sep 10 '18 at 11:00
  • 1
    `"Street Clean Gmbh found a lot of dog poo in a Straße 16000 to be exact."` no address there, only a sentence, but regex will match. – Cid Sep 10 '18 at 11:05
  • 1
    The problem with regex is not getting it to match; it is refining it until almost no wrong hits come through. Your specification is far too loose to make this happen. `Postfach 20421` is another form that can occur in German addresses, and then you'll get into trouble. Mannheim, for example, has block numbers, not street names. House numbers in Bavaria can contain letters as well as numbers, which means more trouble (25h is possible as well, and more common than C3). – Patrick Artner Sep 10 '18 at 11:19
  • @PatrickArtner In France as well, 3b and 3t are valid address numbers (there's a 3, a 3 bis and a 3 ter in the same street) – Cid Sep 10 '18 at 11:22
  • 1
    I'd like to add a valid german address like "Quadrate-Buchhandlung R1, 7 68161 Mannheim". Regex? No chance! – Matthias Sep 10 '18 at 12:08
  • Hi @TimPietzcker, My text can be anything. Its just that if the address is present among some text then it should be extracted. That's it. If you want an example please refer the below text. _This is a test to extract only German addresses. This is the address - Abc Gmbh Ensisheimer Straße 6-8 79346 Endingen. The address will be clear. I hope the address has been extracted from this text. etc etc etc..._ – Richie Sep 11 '18 at 07:11
  • @Cid Luckily I won't get any invalid address format like the one you commented above. It will be a valid company address. I just added abc,def,etc to not mention the company name. All other address components will be a valid one. – Richie Sep 11 '18 at 07:14
  • @PatrickArtner, For the moment we don't have postfach type of address... just the ones I mentioned above. – Richie Sep 11 '18 at 07:15
  • Hi @Matthias, Thank you for adding another address. I forgot to mention to everyone that I have written a filter method to replace all special characters like "-" or ",", etc. So I tested your address with the Java code mentioned in the comments above. It gets detected in Java! But there is still some problem with the Python regex. :( – Richie Sep 11 '18 at 07:19
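
For reference, here is a minimal sketch of a much narrower Python pattern that covers only the four sample addresses from the question. It is not a conversion of the Java pattern quoted in the comments. It assumes that every address starts with one to four capitalised words ending in `Gmbh`/`GmbH`, that the street part is optional, and that the city is a single capitalised word after a five-digit postal code. The names `WORD`, `ADDRESS_RE` and `find_addresses` are made up for this illustration, and the caveats raised in the comments (Postfach, Mannheim block numbers, other legal forms) still apply.

```python
import re

# One capitalised word; umlauts and ß are allowed.
# Assumption: company, street and city words all start with an upper-case letter.
WORD = r"[A-ZÄÖÜ][A-Za-zÄÖÜäöüß]*"

ADDRESS_RE = re.compile(
    rf"""
    \b(?P<company>(?:{WORD}\s+){{1,4}}?G[Mm][Bb][Hh])   # words up to the legal form
    (?:                                                # optional street part
        \s+(?P<street>{WORD}(?:\s+{WORD})*)\s*\.?      # street name, optional dot
        \s+(?P<number>\d{{1,4}}[a-z]?(?:\s*-\s*\d{{1,4}}[a-z]?)?)  # 6, 16, 6-8, 6 -8
    )?
    \s+(?P<postcode>\d{{5}})\b                         # five-digit postal code
    \s+(?P<city>{WORD})                                # single-word city
    """,
    re.VERBOSE,
)


def find_addresses(text):
    """Return one dict of named groups per address-like match in text."""
    return [m.groupdict() for m in ADDRESS_RE.finditer(text)]


text = ("This is the address - Abc Gmbh Ensisheimer Straße 6-8 79346 Endingen. "
        "The address will be clear.")
for address in find_addresses(text):
    print(address)
```

Requiring the `Gmbh` suffix and a five-digit postal code directly before the city is what keeps ordinary sentences from matching; anything outside those assumptions is silently skipped rather than guessed at.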

1 Answer


I would opt against a regex solution and use libpostal instead. It has bindings for several languages (for Python, use postal). You will have to install libpostal separately, since it includes 1.8 GB of training data.

The good thing is that you can give it address parts in any order and it will pick the right parts most of the time. It uses machine learning, trained on OpenStreetMap data in many languages.

For the examples given, you would not necessarily need to cut the company name and country out of the string:

from postal.parser import parse_address
parse_address('Telekom Deutschland Gmbh 53169 Bonn Datum')

[('telekom deutschland gmbh', 'house'),
 ('53169', 'postcode'),
 ('bonn', 'city'),
 ('datum', 'house')]

parse_address('Keltenstr . 16 77971 Kippenheim')

[('keltenstr', 'road'),
 ('16', 'house_number'),
 ('77971', 'postcode'),
 ('kippenheim', 'city')]
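
Since the parser returns a flat list of `(value, label)` pairs, it can help to group them by label before further processing. The following is a minimal sketch, not part of postal itself: `group_components` is a hypothetical helper, and the input is copied from the first output above so the snippet runs without libpostal installed.

```python
from collections import defaultdict


def group_components(parsed):
    """Group (value, label) pairs, as shown in the outputs above, by label."""
    grouped = defaultdict(list)
    for value, label in parsed:
        grouped[label].append(value)
    return dict(grouped)


# Output of the first parse_address call above, copied verbatim.
parsed = [('telekom deutschland gmbh', 'house'),
          ('53169', 'postcode'),
          ('bonn', 'city'),
          ('datum', 'house')]

print(group_components(parsed))
# {'house': ['telekom deutschland gmbh', 'datum'], 'postcode': ['53169'], 'city': ['bonn']}
```

Note that the trailing word `datum` also ends up under `house`, so surrounding text that is not part of the address may still need to be filtered out afterwards.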
ENOTTY