1

To ensure data privacy, I have to publish a list of addresses after removing the street numbers.

So, for example:

1600 Amphitheatre Parkway, Mountain View, CA

needs to be published as

Amphitheatre Parkway, Mountain View, CA

What's the best way to do this in Java? Does this require regex?

Matt
natchy
  • Are you just trying to replace all numerical values with the empty string? – I82Much Sep 03 '10 at 14:20
  • 1
    That won't make sense for something like "120 7th Street NW". Also, are you limited to US addresses and will they always be in "Street, City, State" format? – Jaime Garcia Sep 03 '10 at 14:21
  • Removing numbers is not enough. Guess who lives at "One Microsoft Way". :-) – Steven Sudit Sep 03 '10 at 14:22
  • Don't forget about P.O. boxes, apartment, floor and suite numbers, etc. Would those need to be removed as well? – Eric Mickelsen Sep 03 '10 at 14:23
  • If you have a ZIP code you might also use that to retrieve an address. That way you won't ever accidentally publish a house number. – extraneon Sep 03 '10 at 14:25
  • @tehMick: They're all actually international billing addresses for customers. @extraneon: Not sure what you mean about the zip/postal code. Obviously there will be multiple customers with the same zipcode so I'm not sure what you mean about using it to "retrieve an address". – natchy Sep 03 '10 at 14:30
  • @natchy: He means that rather than sanitizing the address, lookup what range of addresses the zipcode corresponds to and use that as your sanitized address. This may or may not be a good idea. – Brian Sep 03 '10 at 15:40

4 Answers

3

EDIT : How about...

addressString = addressString.replaceFirst("^\\s*[0-9]+\\s+", "");

or JavaScript...

addressString.replace(/^\s*[0-9]+\s+/,'');

My original suggestion was (JavaScript)...

addressString.replace(/^\s*[0-9]+\s*(?=.*$)/,'');
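For Java, here is a minimal sketch of the simpler pattern above, assuming one address per string and a single call per address. Java's `String.replace` treats its first argument as literal text, so the regex has to go through `replaceFirst` or a compiled `Pattern`. The class and method names are illustrative only and are not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips one leading run of digits from an address line.
final class StreetNumberStripper {
    // Optional leading whitespace, one or more digits, then at least one
    // whitespace character. Requiring the trailing whitespace leaves
    // "7th Street" untouched.
    private static final Pattern LEADING_NUMBER =
            Pattern.compile("^\\s*[0-9]+\\s+");

    private StreetNumberStripper() {}

    // Returns a new string; the input is not modified (Java strings are immutable).
    static String strip(String address) {
        return LEADING_NUMBER.matcher(address).replaceFirst("");
    }
}
```

As the comments point out, this is not safe to apply twice: "415 20 Road" becomes "20 Road" on the first call and "Road" on the second. It also does nothing for spelled-out numbers such as "One Microsoft Way", or for suite and apartment numbers.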
El Ronnoco
  • Be careful not to call it twice on '123 2nd Street, Nowhereville' – Wrikken Sep 03 '10 at 14:55
  • I did intend that it was only called once per line :D – El Ronnoco Sep 03 '10 at 15:02
  • In fact `/^\s*[0-9]+\s+/` is simpler and probably works better. The lookahead isn't necessary. Also this will ensure that '7th street' doesn't get turned into 'th street' – El Ronnoco Sep 03 '10 at 15:23
  • @Wrikken My updated answer will be safe to use on this as it insists on a following whitespace character. – El Ronnoco Sep 03 '10 at 15:29
  • 1
    The OP has already asked a separate question asking this to be translated into valid Java code, but would you care to fix it here for posterity? It should be `addressString.replace("^\\s*[0-9]+\\s*(?=.*$)", "");` – Mark Peters Sep 03 '10 at 15:33
  • Also @El Ronnoco: I'm not sure that makes it idempotent. Is it universally accepted that no road begins with a number not followed by "th" or "st", etc.? For example, I can think of a local road named "Twenty Road", and I can imagine somebody listing their address as "415 20 Road, ...". I think your solution is exactly what the OP asked for; I just think the real solution here is to use an existing library that takes locales, etc. into consideration and even looks the address up in a database like Google Maps before stripping the street number. – Mark Peters Sep 03 '10 at 15:41
  • @Mark Peters: Apologies - I put my code in JavaScript syntax. I shall update. With regard to the second point - I don't think I have ever seen an address of the form "1st High Street" - certainly not in the UK (although the OP is from the US, apparently). '415 20 Road' would become '20 Road', as the regex insists on matching a following whitespace. However, '20 Road' would be changed to 'Road'. I'm not sure exactly how critical the OP's problem is, but for a quick-fix initial data cleanse this seems simpler than (sourcing and) plugging into an existing library solution – El Ronnoco Sep 06 '10 at 11:25
  • @Mark Peters - Further apologies - I've just looked up what idempotent means :D I think the OP will just be iterating address lines and performing one operation per line. – El Ronnoco Sep 06 '10 at 14:05
3

This is a technically difficult problem to solve. But I don't think that matters.

You say you want to strip out the street number from the address to ensure data privacy. How in the world do you think that ensures privacy? I mean, it might give a little privacy to those who live on a street with a few thousand homes, but on a medium street it narrows it down to a few hundred people; on a small street there are maybe a few choices and on some rural roads it may tell you exactly which house the address corresponds to.

This is not sanitization.

The problem is then compounded greatly if you are associating any other data with that address.

Mark Peters
  • +1 because even though the regex answer technically addresses the question, THIS answer seems much more relevant. – Jake Sep 03 '10 at 15:09
1

One possibility is to use a CASS (Coding Accuracy Support System) service, which typically parses the address and returns the result as XML. Then you can easily grab the street name, city, and state, ignoring the street number.

pkananen
0

Natchy, I work for an address verification company called SmartyStreets, and parsing street addresses is our area of expertise. I'll reinforce what pkananen and Mark have said: this is far beyond the capabilities of regular expressions, and, data privacy aside, your current approach is less effective than others.

The USPS authorizes certain vendors of address parsers to use its official data and return certified results, specifically "CASS-Certified" results. CASS is usually associated with mailings, but it extends well into the realm of what you need to do. There are APIs (for point-of-entry use) and batch services (such as uploading a list) that will validate and componentize an address.

When an address is broken into components, it's very easy to use only the pieces you actually need. You'll also verify that the address exists, is complete, accurate, and will serve your purposes.
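To illustrate what "using only the pieces you need" could look like in Java, here is a rough sketch. The `AddressComponents` class and its field names are hypothetical and do not reflect the response format of any particular CASS vendor or API; they stand in for whatever componentized result a parsing service returns.

```java
// Hypothetical container for an already-parsed address. Field names are
// assumptions made for illustration, not a real vendor schema.
final class AddressComponents {
    final String primaryNumber; // street number, e.g. "1600"
    final String streetName;    // e.g. "Amphitheatre Parkway"
    final String city;
    final String state;

    AddressComponents(String primaryNumber, String streetName,
                      String city, String state) {
        this.primaryNumber = primaryNumber;
        this.streetName = streetName;
        this.city = city;
        this.state = state;
    }

    // Builds the publishable form by simply leaving the street number out.
    String withoutStreetNumber() {
        return streetName + ", " + city + ", " + state;
    }
}
```

With components in hand, nothing has to be guessed from the raw string: the street number is a separate field that is never written to the output.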

For example, on LiveAddress' API page (which you can use as a springboard for your own research), you can see how it works and, from the docs, that you can pick and choose which pieces of the addresses you'll want to display or store. (Funny thing! Our default sample address on that page is also Google's address in Mountain View, CA.)

If you have any further questions about parsing addresses, I'll be happy to personally help you.

Matt