4

As input I am getting an address as a String. It may say something like "123 Fake Street\nLos Angeles, CA 99988". How can I convert this into an object with fields like this:

Address1
Address2
City
State
Zip Code

Or something similar to this? If there is a Java library that can do this, all the better.

Unfortunately, I don't have a choice about the String as input. It's part of a specification I'm trying to implement.

The input is not going to be very well structured so the code will need to be very fault tolerant. Also, the addresses could be from all over the world, but 99 out of 100 are probably in the US.

Daniel Kaplan
  • 1
    Are the formats of the input String always going to be the same? Do you have an example input with address2. Also are these only US addresses, or also other countries? – Alvin Bunk Oct 15 '14 at 21:24
  • Hi there; I looked into making some Regex code based on @ChrisS example, however I agree with Matt that using Regex's is hard with addresses. You might want to use something else. – Alvin Bunk Oct 15 '14 at 22:55

5 Answers

3

You can use JGeocoder

public static void main(String[] args) {
    Map<AddressComponent, String> parsedAddr = AddressParser.parseAddress("Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043");
    System.out.println(parsedAddr);

    Map<AddressComponent, String> normalizedAddr = AddressStandardizer.normalizeParsedAddress(parsedAddr);
    System.out.println(normalizedAddr);
}

Output will be:

{street=Amphitheatre, city=Mountain View, number=1600, zip=94043, state=CA, name=Google Inc, type=Parkway}
{street=AMPHITHEATRE, city=MOUNTAIN VIEW, number=1600, zip=94043, state=CA, name=GOOGLE INC, type=PKWY}

There is another library, International Address Parser; you can try its trial version. It supports country detection as well.

AddressParser addressParser = AddressParser.getInstance();
AddressStandardizer standardizer = AddressStandardizer.getInstance();//if enabled
AddressFormater formater = AddressFormater.getInstance();

String rawAddress = "101 Avenue des Champs-Elysées 75008 Paris";

//you can try to detect the country
CountryDetector detector = CountryDetector.getInstance();
String countryCode = detector.getCountryCode("7580 Commerce Center Dr ALABAMA");
System.out.println("detected country=" + countryCode);

Also, please check Implemented Countries in this library.

Cheers !!

Sachin Thapa
2

I work at SmartyStreets where we develop address parsing and extraction algorithms.

It's hard.

If most of your addresses are in the US, you can use an address verification service to provide guaranteed accurate parse results (since the addresses are checked against a master list).

There are several providers out there, so take a look around and find one that suits you. Since you probably won't be able to install the database locally (not without a big fee, because address data is licensed by the USPS), look for one that offers a REST endpoint so you can just make an HTTP request. Since it sounds like you have a lot of addresses, make sure the API is high-performing and lets you do batch requests.
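
As a rough illustration of the "just make an HTTP request" idea, here is a minimal Java sketch (Java 11+) that builds a GET request for a freeform address. The endpoint URL and the `address` query parameter are placeholders I made up for the example, not any real provider's API; check your provider's documentation for the actual URL, parameters, and authentication.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class VerificationRequestSketch {
    // Hypothetical endpoint; substitute the URL your provider documents.
    static final String ENDPOINT = "https://api.example.com/verify";

    /** Builds (but does not send) a GET request carrying the URL-encoded address. */
    static HttpRequest build(String freeformAddress) {
        // "address" is an assumed parameter name, used only for illustration.
        String query = "address=" + URLEncoder.encode(freeformAddress, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(ENDPOINT + "?" + query))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = build("13001 Point Richmond Dr NW, Gig Harbor WA");
        System.out.println(req.method() + " " + req.uri());
        // To send it: HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString())
    }
}
```

For many addresses, a provider's batch endpoint (if it has one) would usually take a POST with several addresses in the body rather than one GET per address.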

For example, with ours:

Input:

13001 Point Richmond Dr NW, Gig Harbor WA

Output:

Address verified

Or the more specific breakdown of components, if needed:

components

If the input is even messier, there are a few address extraction services available that can handle a little bit of noise within an address and parse addresses out of text and turn them into their components. (SmartyStreets offers this also, as a beta API. I believe some other NLP services do similar things too.)

Granted, this only works for US addresses. I'm not as expert on UK or Canadian addresses, but I believe they may be slightly simpler in general.

(Beyond a small handful of well-developed countries, international data is really hit-and-miss. Reliable data sets are hard to obtain or don't exist. But if you're on a really tight budget you could write your own parser for all the address formats.)

Matt
1

If you are sure of the format, you can use regular expressions to get the address out of the string. For the example you provided, something like this:

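String address = "123 Fake Street\nLos Angeles, CA 99988";
// split() would throw away the matched text, so read the capture groups from a Matcher instead
Matcher m = Pattern.compile("(.*)\\n(.*), ([A-Z]{2}) ([0-9]{5})").matcher(address); // java.util.regex
if (m.matches()) { /* m.group(1) = street, group(2) = city, group(3) = state, group(4) = ZIP */ }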
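
A slightly more tolerant sketch of the same regex idea, turning the string into an object with the fields the question asks for. The `Address` class and the pattern are my own illustration, not part of any library; it assumes a two-line US address with a two-letter state code and a 5-digit ZIP (optionally ZIP+4), and returns an empty `Optional` for anything else.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UsAddressRegexSketch {
    /** Simple value holder; field names follow the question, not any library. */
    static final class Address {
        final String address1, city, state, zip;
        Address(String address1, String city, String state, String zip) {
            this.address1 = address1;
            this.city = city;
            this.state = state;
            this.zip = zip;
        }
    }

    // street line, line break, city, comma, two-letter state, 5-digit ZIP with optional +4
    private static final Pattern TWO_LINE_US = Pattern.compile(
            "\\s*(.+?)\\s*\\R\\s*(.+?)\\s*,\\s*([A-Z]{2})\\s+(\\d{5}(?:-\\d{4})?)\\s*");

    /** Returns the parsed address, or empty if the input does not fit the assumed shape. */
    static Optional<Address> parse(String raw) {
        Matcher m = TWO_LINE_US.matcher(raw);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Address(m.group(1), m.group(2), m.group(3), m.group(4)));
    }

    public static void main(String[] args) {
        parse("123 Fake Street\nLos Angeles, CA 99988").ifPresent(a ->
                System.out.println(a.address1 + " | " + a.city + " | " + a.state + " | " + a.zip));
    }
}
```

Anything that deviates from this shape (no comma, spelled-out state, extra lines) simply fails to match, which is the main weakness of the regex approach for messy input.
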
Chris Stillwell
0

I assume the sequence of information is always the same, as in the user will never enter the postal code before the state. If I understand your question correctly, you need logic to process an address that may be incomplete (for example, missing a portion). One way to do it is to look for portions of the string you know are correct and treat those known parts of the address as separators. You will need arrays of city names, state names, and address words (such as "Street", "Avenue", "Road", etc.).

  1. Find the index of each city, state, and address word in the string (e.g. with indexOf) and store the results.
  2. Substring and cut out the first line of the address (from the start to the index of the address-signifying word plus its length).
  3. Check the index of the city name (found in step 1). If it's -1, skip this step. If it's 0, take it out (0 also means address line 2 is not in the string). If it's more than 0, substring and cut out everything from the start of the string to the index of the city name as the second line of the address.
  4. Check the index of the state name. Once again, if it's -1, skip this step. If it's 0, substring and cut it out as the state name.
  5. Whatever remains is your postal code.
  6. Check the strings you just extracted for leftover separators (commas, dots, new lines, etc.) and remove them.
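
The steps above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the three lookup lists are tiny stand-ins for the real city, state, and street-word lists you would need, the class and method names are made up, keywords are matched as case-sensitive whole words, and the state is cut out wherever it is found rather than only at index 0.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordAddressSplitter {
    // Tiny illustrative lookup lists; a real parser would need far larger ones.
    static final List<String> SUFFIXES = Arrays.asList("Street", "Avenue", "Road");
    static final List<String> CITIES = Arrays.asList("Los Angeles", "Gig Harbor");
    static final List<String> STATES = Arrays.asList("CA", "WA");

    static Map<String, String> split(String raw) {
        Map<String, String> out = new LinkedHashMap<>();
        String rest = raw;

        // Steps 1-2: address line 1 runs from the start to the end of the street word.
        int[] hit = firstHit(rest, SUFFIXES);
        if (hit != null) {
            out.put("address1", clean(rest.substring(0, hit[0] + hit[1])));
            rest = rest.substring(hit[0] + hit[1]);
        }
        // Step 3: whatever precedes the city name is address line 2.
        hit = firstHit(rest, CITIES);
        if (hit != null) {
            String before = clean(rest.substring(0, hit[0]));
            if (!before.isEmpty()) {
                out.put("address2", before);
            }
            out.put("city", rest.substring(hit[0], hit[0] + hit[1]));
            rest = rest.substring(hit[0] + hit[1]);
        }
        // Step 4: cut the state out of what is left.
        hit = firstHit(rest, STATES);
        if (hit != null) {
            out.put("state", rest.substring(hit[0], hit[0] + hit[1]));
            rest = rest.substring(0, hit[0]) + rest.substring(hit[0] + hit[1]);
        }
        // Steps 5-6: the remainder, stripped of separators, is taken as the postal code.
        String zip = clean(rest);
        if (!zip.isEmpty()) {
            out.put("zip", zip);
        }
        return out;
    }

    /** Returns {start, length} of the earliest whole-word keyword hit, or null if none. */
    static int[] firstHit(String text, List<String> keywords) {
        int[] best = null;
        for (String k : keywords) {
            Matcher m = Pattern.compile("\\b" + Pattern.quote(k) + "\\b").matcher(text);
            if (m.find() && (best == null || m.start() < best[0])) {
                best = new int[] {m.start(), m.end() - m.start()};
            }
        }
        return best;
    }

    /** Strips leading and trailing whitespace, commas, dots, and semicolons. */
    static String clean(String s) {
        return s.replaceAll("^[\\s,.;]+|[\\s,.;]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(split("123 Fake Street\nApt 4\nLos Angeles, CA 99988"));
    }
}
```

Missing parts are simply absent from the returned map, so the caller can tell which fields were recognized.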

If the address is missing both state and city, you would actually need a list of zip codes too, so better ensure the user enters at least one of them.

It's not impossible to implement what you need, but you probably don't want to spend all that time doing it. It's easier to just ensure the user enters everything correctly.

Zero
  • This is a good start, but this algorithm assumes that [these words](http://pe.usps.com/text/pub28/28apc_002.htm) don't appear in the street name, and it doesn't consider directionals. Street suffixes are not always present, and some streets don't have suffixes (I live on a street without a suffix. In fact, my street name is a number.) This also won't work too well if the user's input has spelling errors or something is out of order. It might also have trouble if a city or state name or directional is the street name, which is fairly common. – Matt Oct 16 '14 at 02:09
-4

Maybe you can use Regular Expression

Stuk4