0

I have been fighting with this for a while, so hopefully someone can help me out. I'm open to any and all suggestions.

When I query QGeoAddress::street(), I may receive both the street number and the street name. I would like to get just the street name.

Example:

King St W -> King St W
99 King St W -> King St W
99a King St W -> King St W ...

1st St -> 1st St
99 1st St -> 1st St
99a 1st St -> 1st St ...

315 W. 42nd -> W. 42nd
42 St. Paul Drive -> St. Paul Drive

I need to do this so that the location of two separate devices can be compared via the most recent street name. If a device is at "99 King St W", it is on the same street as "113 King St W", or "113a King St W".

As it stands, I don't believe regex is a good, reliable solution, as there are too many rules to impose and the variability of street names is working against me. Theoretically, there may be a street called "1 St", which a regex that normalizes "1 1st St" to "1st St" would wrongly reduce to "St".

Writing my own fuzzy matcher may provide better results, but may fail for shorter street names.

I have also considered querying a REST web service, however many of the free services have limitations on requests per day, or a minimum time between requests that would render that method too expensive.

Like I say, I'd love to hear what you guys can come up with.

Much appreciated :)

Ro Yo Mi
Justin Jasmann
  • The problem isn't how to program it; the problem is, as you seem to understand, how to define what needs to be programmed. Especially as there isn't any real rule: in France, the house number will come after the street, for example ("rue Dupont 42"). – James Kanze Jun 06 '13 at 17:15
  • There are a lot of assumptions I would need to make when creating these rules, which plays into why I think regex has its weaknesses. It would be much better if I could find an API that would provide separate, normalized address fields that include a street name. – Justin Jasmann Jun 06 '13 at 17:52
  • The problem is that such a thing really can't exist, because there is no such standardization with regards to addresses. What do you do if the address is `"PO Box 42"`, for example. The problem is that it's the wrong question. If you have to solve it (e.g. because of incompetent management), the best you can do is invent and implement some ad hoc heuristic, and hope for the best. – James Kanze Jun 06 '13 at 17:55
  • I know what you mean. I think for this use case, the most important point is that one address can compare against another with reasonable accuracy, rather than normalizing an individual street address. Given that, the best solution I can think of would be a fuzzy match on longest common substring, though it may fail for `"PO BOX 42"` & `"PO BOX 43"`...arg – Justin Jasmann Jun 06 '13 at 18:08
  • Is the problem to find potentially neighboring addresses? If so, the heuristic might be different. In that case, I'd try for a minimum editing difference, perhaps replacing strings of digits with the number value (so that 99 and 100 would have the same edit difference as 98 and 99). Or better: use some sort of geographical data base, which returns the exact latitude and longitude for a given address. – James Kanze Jun 06 '13 at 18:20
  • Precisely! This is a point in favor of a fuzzy match, comparing the Levenshtein distance against some acceptability threshold. – Justin Jasmann Jun 06 '13 at 19:15
  • The only problem with that is for longer addresses, with a single character difference (which may be of great importance). These may still score above the threshold. – Justin Jasmann Jun 06 '13 at 19:22
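
The threshold idea discussed in the comments above can be sketched in C++ as follows. This is only a minimal illustration, not code from any participant: the function names `levenshtein` and `probablySameStreet` and the `maxEdits` parameter are hypothetical, and the distance is the textbook dynamic-programming edit distance computed over raw bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Classic Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
// Only two rows of the DP table are kept in memory.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        prev[j] = j;  // turning an empty prefix of a into b[0..j) costs j inserts
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // turning a[0..i) into an empty string costs i deletes
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // delete from a
                                curr[j - 1] + 1,      // insert into a
                                prev[j - 1] + cost}); // substitute or keep
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Hypothetical helper: treats two address strings as "the same street"
// when their edit distance does not exceed maxEdits (an assumed threshold).
bool probablySameStreet(const std::string& a, const std::string& b,
                        std::size_t maxEdits) {
    return levenshtein(a, b) <= maxEdits;
}
```

With this sketch, "99 King St W" and "113 King St W" are 3 edits apart, while "PO BOX 42" and "PO BOX 43" are only 1 edit apart, so any threshold loose enough to accept the first pair also accepts the second. That is the false-positive risk raised in the comments.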

2 Answers

2

Description

This regex looks for the street suffix St or Ave and captures the preceding word plus the rest of the line. I allowed both St and Ave in case you want to expand the test beyond streets named "xxx St". If your use case requires only St, replace (St|Ave) with just St.

(\b\S*\b\s(St|Ave)\b.*?)$

Example

I include this PHP example only to demonstrate how the expression works and what the captured groups look like.

<?php
$sourcestring="King St W 
99 King St W 
99a King St W 

1st St 
99 1st St 
99a 1st St";
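// The m flag lets $ match at the end of every line, so each address is tested separately.
preg_match_all('/(\b\S*\b\s(St|Ave)\b.*?)$/m', $sourcestring, $matches);
// $matches[0] holds the full matches, [1] the street names, [2] the suffix (St or Ave).
echo "<pre>" . print_r($matches, true);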
?>

$matches Array:
(
    [0] => Array
        (
            [0] => King St W 
            [1] => King St W 
            [2] => King St W 
            [3] => 1st St 
            [4] => 1st St 
            [5] => 1st St
        )

    [1] => Array
        (
            [0] => King St W 
            [1] => King St W 
            [2] => King St W 
            [3] => 1st St 
            [4] => 1st St 
            [5] => 1st St
        )

    [2] => Array
        (
            [0] => St
            [1] => St
            [2] => St
            [3] => St
            [4] => St
            [5] => St
        )

)
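
Since the question is about Qt, here is a rough C++ equivalent of the same expression using `std::regex`. It is a sketch under assumptions rather than part of the original answer: the function name `streetFromAddress` is hypothetical, and it expects one address per call (for example the string returned by `QGeoAddress::street()` converted to `std::string`) instead of relying on a multiline flag.

```cpp
#include <regex>
#include <string>

// Applies the answer's pattern to a single address line.
// Returns capture group 1 (the word before St/Ave plus the rest of the
// line), or an empty string when no "St" or "Ave" token is found.
std::string streetFromAddress(const std::string& address) {
    // Same expression as above; a raw string literal avoids doubling the
    // backslashes. The default ECMAScript grammar supports \b, \S, and .*?
    static const std::regex pattern(R"((\b\S*\b\s(St|Ave)\b.*?)$)");
    std::smatch match;
    if (std::regex_search(address, match, pattern)) {
        return match[1].str();
    }
    return std::string();
}
```

On the question's examples this returns "King St W" for "99a King St W" and "1st St" for "99 1st St". It returns an empty string for "315 W. 42nd" and leaves the number in place for "42 St. Paul Drive", because the first "St" it finds there is the abbreviation for "Saint".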
Ro Yo Mi
  • Of course, this doesn't work if the street name doesn't have "St" or "Ave" on it, which is often the case outside English-speaking countries. – user2448027 Jun 06 '13 at 17:51
  • Thanks for the work! I'm no wizard in regular expressions, so I like to see other examples. A large part of the problem is the inconsistency in the address response. It 'may' or 'may not' contain a street number and I don't really know if 'St' is standard to represent 'Street' from this API. – Justin Jasmann Jun 06 '13 at 17:54
  • Of course, this doesn't work except in a few common cases. What if I spell it `"Street"`? Or misspell it `"Streat"`? What if the address is a PO Box? What if I just write "315 W. 42nd" (unambiguous in New York)? And how about `"42 St. Paul Drive"`? – James Kanze Jun 06 '13 at 18:00
  • I'm taking a fairly big assumption that the street address coming back from this API call is of a certain level of quality. I'm not sure how dangerous that is. – Justin Jasmann Jun 06 '13 at 18:11
  • Your original question showed some pretty generic examples. I edited your question to show the new examples you included in the comments here. Given the complexity of the task I'm sure an expression can be written to cover most of your edge cases, however maintaining that expression will become nigh impossible as you discover more edge cases which need to be covered. You would be better off writing a string of if then logic like "if the address fits this format, then ...;" – Ro Yo Mi Jun 06 '13 at 18:38
  • Thanks Denomales. It would be hard for me to show all of the edge cases, especially if we leave North America. I can hit most of the cases with regex or if logic (like you say), but for each of the addresses I can't think of, those matchers would likely fail. – Justin Jasmann Jun 06 '13 at 18:58
2

As I said in the comments, the problem here is that the wrong question is being asked. But if you have to, and you can exclude PO boxes (the string ends in a number?), and you limit yourself to addresses in the USA (because you wouldn't believe some of the things you see in the UK), then you might start by detecting a leading number, then appending everything that isn't separated from it by a space. It's hardly perfect, because there'll always be people who write "99 A King St." rather than "99a King St.". (But then, in the first, is the name of the street "King St." or "A King St."? Unless you know the street yourself, you can't be sure.) The regular expression for this would be "\\d+\\w*". Beyond that, you can try certain heuristics with the results: if they are a single word, exactly matching "St", "Street", "Ave", etc. (there are probably about 20 different words you should check, with or without a trailing "." in the case of abbreviations), then you probably have just the street.
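
A minimal C++ sketch of that heuristic might look like the following. The names `isStreetTypeWord` and `stripHouseNumber` are hypothetical, the word list is a short illustrative sample and not the full set of roughly 20 words, and PO boxes and non-US formats are deliberately not handled.

```cpp
#include <regex>
#include <set>
#include <string>

// Illustrative, incomplete list of street-type words (an assumption).
// A single trailing '.' is ignored so that "St." matches "St".
bool isStreetTypeWord(std::string word) {
    static const std::set<std::string> types = {
        "St", "Street", "Ave", "Avenue", "Rd", "Road",
        "Dr", "Drive", "Blvd", "Boulevard"};
    if (!word.empty() && word.back() == '.') {
        word.pop_back();
    }
    return types.count(word) > 0;
}

// Removes a leading house-number token (digits plus anything attached to
// them, e.g. "99" or "99a") unless what remains is only a street-type
// word, in which case the "number" is taken to be part of the street name.
std::string stripHouseNumber(const std::string& address) {
    // "\\d+\\w*" from the answer, anchored at the start and followed by
    // the whitespace that separates it from the rest of the address.
    static const std::regex leadingNumber(R"(^\d+\w*\s+)");
    std::smatch match;
    if (!std::regex_search(address, match, leadingNumber)) {
        return address;  // no leading number: assume it is already a street
    }
    const std::string rest = match.suffix().str();
    if (isStreetTypeWord(rest)) {
        return address;  // e.g. "1st St" or "1 St": keep the whole string
    }
    return rest;
}
```

Under these assumptions, "99a King St W" becomes "King St W", "42 St. Paul Drive" becomes "St. Paul Drive", and "1st St" is left alone. The ambiguity mentioned above remains: "99 A King St." becomes "A King St.", which may or may not be the real street name.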

But before even starting, I would insist that you query the assignment. It's well known, for example, that when inputting addresses, about all you can do is "First line:", "Second line:", etc. Even asking for a post code can be tricky.

James Kanze
  • Why I chose such a difficult task, I do not know. I was assuming `QGeoAddress::street()` would give me a street, instead of an entire address. Do you have any other suggestions of how you would determine if two devices are on the same street, ignoring an address line? – Justin Jasmann Jun 06 '13 at 18:20
  • What is the relevance as to whether two devices are on the same street? If it is important, the only real solution is to use some sort of geographical data base, which can convert the address into latitude and longitude, and then reconvert the latitude and longitude into a canonical address. – James Kanze Jun 06 '13 at 18:23
  • I wanted to compare two speeds, recorded at different times, on the same street. I'm starting to believe that I'll need a database to match against. – Justin Jasmann Jun 06 '13 at 19:04