
I have a block of text that includes a name, possibly a company name, an address, and possibly an email address. I want to extract the street address from it, and preferably the name along with the address.

This data is siphoned from multiple sources, so I have no idea what the actual formatting will be. It could be something like this:

Company name, owner@domain.com
ATTN John Doe
care of Company Name
123 Street St
New York, NY 12345
US
123-456-7890

But any of those lines could be rearranged or missing (the phone number could come first, there might be no ATTN or c/o, etc.). Also, the address could be from any country.

The goal is to a) plug the address into the Google Maps API, and b) create a contact with as much information as possible.

Here is a random idea I had:

  1. Take any line with an email address (can be found with a regex easily), store the email address and remove the line from further consideration.
  2. Take any line with a phone number (digits only, and [-+()]), store that number, and remove the line from further consideration.
  3. Take the last three lines and consider those the street address - plug them into Google Maps and hope for the best.
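
The three steps above can be sketched in Python roughly as follows. This is only a minimal sketch of the heuristic, not a robust parser: the function name `split_contact_block`, both regexes, and the "at least 7 digits" threshold for a phone line are assumptions made for illustration. The Google Maps lookup itself is left out; the returned address lines would be joined and passed to whatever geocoder is used.

```python
import re

# Loose email pattern (assumption): good enough to spot an address on a line,
# not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# A "phone line" (assumption) consists only of digits, whitespace and - + ( ) .
PHONE_RE = re.compile(r"^[\d\s()+.-]+$")


def split_contact_block(text, address_lines=3):
    """Split a free-form contact block into emails, phones and address lines."""
    emails, phones, rest = [], [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Step 1: store any email addresses and drop the whole line.
        found = EMAIL_RE.findall(line)
        if found:
            emails.extend(found)
            continue
        # Step 2: store phone-like lines; the digit count keeps a bare
        # postal code such as "12345" from being mistaken for a phone number.
        if PHONE_RE.match(line) and sum(c.isdigit() for c in line) >= 7:
            phones.append(line)
            continue
        rest.append(line)
    # Step 3: treat the last few remaining lines as the street address.
    return {
        "emails": emails,
        "phones": phones,
        "address": rest[-address_lines:],
        "other": rest[:-address_lines],
    }


SAMPLE = """Company name, owner@domain.com
ATTN John Doe
care of Company Name
123 Street St
New York, NY 12345
US
123-456-7890"""

result = split_contact_block(SAMPLE)
print(", ".join(result["address"]))
```

Note that step 1 as written discards the company name that shares a line with the email address, which is one of the weaknesses of this approach.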

Obviously, that's a lot of juju magic. Is there a smarter approach? Are there any libraries that have good regexes to look for street addresses of different countries?

EboMike
  • @Nemi: Nope, although the app I needed it for is on the backburner. Still an interesting problem, would be nice to find a solution for it. – EboMike Mar 15 '12 at 17:45
  • Best you can do for this problem is to train an entity resolution model – Yeikel Dec 10 '20 at 20:26

1 Answer


It depends on your source. If you have control over how the data arrives from your source, then you can apply some formatting to it up front.

Keith Mattix