1

We are developing a C# application that imports address data into a CRM system. The CSV file contains an address column like 'Somethingstreet 34'. Our CRM, however, uses two separate fields for the streetname and the housenumber. Of course, the given example poses no problem, but our Dutch addressing system can be a bit of a pain.

Real world examples:

  • Somestreet 88a (where 'Somestreet' is the streetname and 88a the housenumber)
  • 2e van Blankenburgstraat 123a (where '2e van Blankenburgstraat' is the streetname, and '123a' is the housenumber)
  • 2e van Blankenburgstraat 123-a (where '2e van Blankenburgstraat' is the streetname, and '123-a' is the housenumber)
  • 2e van Blankenburgstraat 123 a (where '2e van Blankenburgstraat' is the streetname, and '123 a' is the housenumber)

Now I'm looking for a nice function (RegEx or something) that splits these address lines correctly into two fields. Is there a nice, clean way to do this?


edit:

I did some further investigation on our addressing system and it seems (thank you government) that the above examples are not even the 'worst' ones.

Some more (these are real streets and numbers):

  • Rivium 1e Straat 53/ET6 (where 'Rivium 1e Straat' is the street and '53/ET6' is the housenumber)
  • Plein 1940-1945 34 (where 'Plein 1940-1945' is the street and '34' is the housenumber)
  • Apollo 11-Laan 11 (where 'Apollo 11-Laan' is the street and '11' (the second one) is the housenumber)
  • Charta 77 Vaart 159 3H (where 'Charta 77 Vaart' is the streetname and '159 3H' is the housenumber)
  • Charta 77 Vaart 44/2 (where 'Charta 77 Vaart' is the streetname and '44/2' is the housenumber)
Matt
WowtaH
  • It seems like this is not strictly a programming question but a data analysis problem. – mfloryan Jun 29 '09 at 17:42
  • As I haven't seen DataOverflow.com yet, and the problem asked for a RegEx and notes C# as the language... this seems more concrete than many of the questions asked here. What has been left underdefined is what is the Dutch addressing system (is it *always* one of the choices)? – Godeke Jun 29 '09 at 18:19
  • I've added some extra examples – WowtaH Jun 29 '09 at 18:37

4 Answers

1

The best solution for data correctness would be to compare the existing database against a known address API that has a function to do this for you. Otherwise you're just giving your best guess, and some, if not all, of the data should be manually reviewed.

Greg
0

What I did (though I doubt it is the most performant solution) was to reverse the address and then take the first part up to and including the first run of digits, i.e. a lazy regex such as ^.*?\d+ on the reversed address (a greedy .*\d+ would run on to the last digit in the line). This solves your problem when the street name contains a digit.
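
A minimal sketch of that reverse-and-match idea, in case it helps. The class and method names (ReverseSplit, Split) are made up for illustration, and the lazy pattern ^.*?\d+ is an assumption about what was intended: it takes the shortest prefix of the reversed line that ends in a run of digits.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReverseSplit
{
    // Reverses a string character by character.
    static string Reverse(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // Hypothetical helper: returns (street, houseNumber),
    // or null when the line contains no digit at all.
    public static Tuple<string, string> Split(string addressLine)
    {
        string line = addressLine.Trim();
        string reversed = Reverse(line);

        // Lazy match: shortest prefix of the reversed line ending in digits.
        Match m = Regex.Match(reversed, @"^.*?\d+");
        if (!m.Success) return null;

        string houseNumber = Reverse(m.Value);
        string street = line.Substring(0, line.Length - m.Length).TrimEnd();
        return Tuple.Create(street, houseNumber);
    }
}
```

With this sketch, 'Plein 1940-1945 34' and 'Apollo 11-Laan 11' come out as expected, but a number with several parts does not: 'Charta 77 Vaart 159 3H' yields only '3H' as the housenumber, so lines like that would still need a manual check.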

Ruben
0

Can you do something where you split on spaces, and then check to see if the first character of some interior string is an integer?

like

 // Split on spaces, then look for the first token that starts with a digit.
 string[] split = addressLine.Split(' ');
 int splitLoc = -1;
 for (int i = 1; i < split.Length; i++) { // start at 1 to skip a leading '2e' in street names
     if (split[i].Length > 0 && char.IsDigit(split[i][0])) {
         splitLoc = i;
         break;
     }
 }
 if (splitLoc < 0) return; // busted: no token starting with a digit

 // Tokens before splitLoc form the street name, the rest the house number.
 string field1 = string.Join(" ", split, 0, splitLoc);
 string field2 = string.Join(" ", split, splitLoc, split.Length - splitLoc);

Depends on what you mean by 'clean', but it does look like that would work, if all addresses can be formed the way you specified.

mmr
0

There are too many different ways someone could enter this data. I often write my address as:

123 Foo Street Apt#3

i.e. with the house and apartment numbers on either end of the street name.

If this were my problem, I would write a regex that handles the "easy" ones and flags the complicated ones for human review.
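
A rough sketch of what that could look like. The definition of "easy" here is my own assumption, not a rule of the Dutch addressing system: an optional ordinal prefix such as '2e', a street name without digits, then a number with at most one trailing letter. The names EasyAddressSplitter and TrySplit are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class EasyAddressSplitter
{
    // Assumed "easy" shape: optional ordinal ("2e "), a digit-free street name,
    // then a number with an optional single letter ("88a", "123-a", "123 a").
    static readonly Regex Easy = new Regex(
        @"^(?<street>(?:\d+e\s+)?\D+?)\s+(?<number>\d+(?:[ -]?[a-zA-Z])?)$");

    // Returns true and fills both fields for easy lines; returns false
    // (both fields null) when the line should be flagged for human review.
    public static bool TrySplit(string addressLine, out string street, out string houseNumber)
    {
        Match m = Easy.Match(addressLine.Trim());
        street = m.Success ? m.Groups["street"].Value : null;
        houseNumber = m.Success ? m.Groups["number"].Value : null;
        return m.Success;
    }
}
```

Lines like 'Somestreet 88a' or '2e van Blankenburgstraat 123 a' are split directly, while 'Plein 1940-1945 34', 'Charta 77 Vaart 159 3H' and '123 Foo Street Apt#3' do not match and end up in the review pile. The pattern is deliberately strict: a false "needs review" costs a minute, a wrong split silently corrupts the CRM.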

You can find a list of street names in the US from the Census Bureau, but it is buried inside a monster datafile.

Autodidact