-1

I am looking for a Java library that can parse an address out of an ordinary String of text. The text could contain all kinds of special and non-special characters, but all I really want to pull out of the original string is a rough address string.

In other words, how would I pull an address out of a random String that contains an address somewhere in it? The format doesn't matter much, as long as the output contains the street and number. Would you use regular expressions for this if there aren't any libraries?

El Duderino
  • 1
    You need to provide more details about the input string. Is it guaranteed to be a string containing a single address and nothing else? Is it a paragraph containing a single address somewhere in it? Can there be multiple addresses in the string? The underlying problem ranges from moderately simple (if the input is very constrained) to potentially difficult (consider the problem of finding all valid international addresses in a page of text that can be in an arbitrary language). – Stuart Golodetz Apr 03 '12 at 19:10
  • 1
    @StuartGolodetz I think the latter of your statements is true - he said he's wanting to pull addresses out of a "random String", so I think it's safe to say that he just wanted to find any addresses he can out of a huge pile of characters. No more info on the input string is necessary. – CodeBlind Apr 03 '12 at 19:24
  • Street address? For what countries? – Mike Clark Apr 03 '12 at 19:29
  • The question states a random string of varying special and non-special characters and specifically asks for libraries that can parse out character patterns, or any good regex solutions. A random input string of special and non-special characters is exactly what I mean; in this case you can assume it's less than 500 characters. – El Duderino Apr 05 '12 at 17:36

1 Answer

2

I don't know of any libraries that do this... but, this sounds like an excellent artificial intelligence problem :)

If you have any existing address books in ASCII/Unicode form, you could potentially use them to generate regex patterns, then run all known address regex patterns against your random text and see what comes out. This way you could kind of "teach" your algorithm how to behave based on known address formats. I suspect if any libraries do exist for this sort of thing, this is probably how they'd do it, because there are probably a TON of different ways to format a street address.

One example could be in the typical US street address. For instance:

1234 Main St. NW, Some City, ST, 12345 //[ST] = two-letter state abbreviation

You could write a regular expression that looks for two numbers (the street number and the ZIP code) with a two-letter state abbreviation somewhere in between. Of course, this would only work for US street addresses, and it wouldn't catch all of them. You'd also have to be careful to constrain your regex to avoid false positives, but you could add that regular expression to your list of possibilities.
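As a rough sketch of that idea in Java (Java 9+), something like the following could work. The class name `UsAddressFinder`, the method `findAddresses`, and the exact pattern are my own assumptions for illustration, not a tested production regex. It only handles the US format shown above, with an uppercase state abbreviation and a 5-digit (or ZIP+4) code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds rough US-style addresses of the form
// "<number> <street text>, <city>, <ST>, <ZIP>" inside arbitrary text.
class UsAddressFinder {

    // Street number, then a lazily matched run of street/city text,
    // then a two-letter uppercase state code and a 5-digit ZIP (optionally ZIP+4).
    private static final Pattern US_ADDRESS = Pattern.compile(
            "\\b\\d{1,6}\\s+[A-Za-z0-9 .,'-]{3,80}?,?\\s+[A-Z]{2},?\\s+\\d{5}(?:-\\d{4})?\\b");

    static List<String> findAddresses(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = US_ADDRESS.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String noisy = "@@## call me :( 1234 Main St. NW, Some City, ST, 12345 $$%%";
        for (String address : findAddresses(noisy)) {
            System.out.println(address);
        }
    }
}
```

The pattern does not check that the two letters are a real state code, so validating against a list of state abbreviations would be one way to cut down on false positives.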

CodeBlind
  • This is a good solution, and it is how I've begun to implement mine. A state abbreviation is not actually guaranteed, so I'm using logic that looks for a pure number (the street number), then "records" tokens until I hit a common street address ending (taken from the official USPS street endings). This works most of the time, but some of the abbreviations are problematic (BY for bayou, for example). Upvoted, but I'm leaving this open in the hope that someone will still have a killer regex library or parsing library. Thanks for the answer! I'll accept in a few days if there are no more answers. – El Duderino Apr 05 '12 at 17:39
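The token-recording approach described in the comment above could look roughly like this in Java (Java 9+). This is only a sketch under my own assumptions: the class name `StreetTokenScanner`, the method `findStreets`, the `MAX_TOKENS` cutoff, and the short suffix set are all hypothetical, and the suffix set is a small illustrative subset rather than the actual USPS list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: start at a purely numeric token and record the
// following tokens until a known street suffix is reached.
class StreetTokenScanner {

    // Illustrative subset only; the real USPS suffix list is much longer.
    private static final Set<String> SUFFIXES = Set.of(
            "ST", "STREET", "AVE", "AVENUE", "RD", "ROAD",
            "BLVD", "DR", "DRIVE", "LN", "LANE");

    // Assumed cutoff: give up if no suffix appears within this many tokens.
    private static final int MAX_TOKENS = 6;

    static List<String> findStreets(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].matches("\\d+")) {
                continue; // not a pure number, so not a street number candidate
            }
            StringBuilder candidate = new StringBuilder(tokens[i]);
            int end = Math.min(tokens.length, i + 1 + MAX_TOKENS);
            for (int j = i + 1; j < end; j++) {
                // Drop punctuation such as the dot in "St." or a trailing comma.
                String cleaned = tokens[j].replaceAll("[^A-Za-z0-9]", "");
                if (cleaned.isEmpty()) {
                    break; // a punctuation-only token ends this candidate
                }
                candidate.append(' ').append(cleaned);
                if (SUFFIXES.contains(cleaned.toUpperCase(Locale.ROOT))) {
                    found.add(candidate.toString());
                    i = j; // resume scanning after the recorded street
                    break;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findStreets("@@ send to 1234 Main St. NW, Some City ##"));
    }
}
```

Adding an ambiguous abbreviation such as BY to the suffix set would make it match ordinary English words, which is the false-positive problem mentioned in the comment. Requiring such suffixes to appear in uppercase or to be followed by a comma might be one way to limit that.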