I have a class that describes an address. It looks like this:

class Address
{
    public string AddressLine1 { get; set; }
    public string AddressLine2 { get; set; }
    public string City { get; set; }
    public string Zip { get; set; }
    public string Country { get; set; }
} 

I'm looking for a way to create a unique identifier for this structure (I assume it should also be a string) that depends on all of its properties (e.g. a change to AddressLine1 should also change the identifier).

I know I could just concatenate all the properties, but that gives too long an identifier. I'm looking for something significantly shorter.

I also assume that the number of different addresses will not exceed 100M.

Any ideas on how this identifier can be generated?

Thanks in advance.

Some background:

There are several different tables in the database that hold some information plus address data. The data is stored in a format similar to the one described above.

Unfortunately, moving the address data into a separate table is very costly right now, but I hope it will be done in the future.

I need to associate some additional properties with the address data, and I am going to create a separate table for this. That's why I need to uniquely identify the address data.

Oleks
  • Please give us more context. There's almost certainly a better way of approaching the problem. – Jon Skeet Apr 07 '13 at 11:54
  • There is no way to make a perfect hash function in the general sense. You need to have all those 100M unique addresses first, then there are algorithms and software out there that can create your function that will map each one into a unique number without necessarily storing them all. As Jon said, there is very likely a better way to approach your problem than trying to make a perfect hash. – Lasse V. Karlsen Apr 07 '13 at 11:57

3 Answers

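Serialize all fields to a single binary value, for example by concatenating them with proper domain separation, so that ("ab", "c") and ("a", "bc") do not serialize to the same bytes.

Then hash that value with a cryptographic hash of sufficient length. I prefer 256 bits, but 128 are probably fine. Collisions are extremely rare with good hashes; with a 256-bit hash like SHA-256 they are practically impossible. For 100M addresses and an ideal hash, the birthday bound puts the chance of any collision on the order of 10^-23 for 128 bits and 10^-62 for 256 bits.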
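The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `AddressId` and `Compute` are hypothetical names, and the length-prefixed encoding is one assumed way to get domain separation, not necessarily what the answer had in mind.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: hashes a list of fields with unambiguous boundaries.
static class AddressId
{
    public static string Compute(params string[] fields)
    {
        using (var stream = new MemoryStream())
        using (var sha = SHA256.Create())
        {
            foreach (string field in fields)
            {
                if (field == null)
                {
                    // Length -1 marks null, so null and "" hash differently.
                    WriteInt32(stream, -1);
                    continue;
                }
                // Each field is written as: 4-byte length, then its UTF-8 bytes.
                byte[] bytes = Encoding.UTF8.GetBytes(field);
                WriteInt32(stream, bytes.Length);
                stream.Write(bytes, 0, bytes.Length);
            }
            byte[] hash = sha.ComputeHash(stream.ToArray());
            return Convert.ToBase64String(hash);
        }
    }

    // Explicit little-endian encoding, independent of the platform.
    private static void WriteInt32(Stream s, int value)
    {
        s.WriteByte(unchecked((byte)value));
        s.WriteByte(unchecked((byte)(value >> 8)));
        s.WriteByte(unchecked((byte)(value >> 16)));
        s.WriteByte(unchecked((byte)(value >> 24)));
    }
}
```

For the question's class, a call might look like `AddressId.Compute(a.AddressLine1, a.AddressLine2, a.City, a.Zip, a.Country)`. The Base64 form of a SHA-256 digest is always 44 characters long, regardless of how long the address is.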

CodesInChaos
  • Thanks for the answer. This seems to be the simplest solution. I already had something similar to this before asking the question :) But I'd like to wait, maybe someone will offer another solution for this. – Oleks Apr 07 '13 at 13:34

Here is the approach most people think of:

  1. Normalize the address
  2. Create a hash from the normalized address
  3. Done…

But the real problem comes when you need to normalize the address. For instance, these streets are all the same:

  • "place Saint-François 14"
  • "place saint françois 14"
  • "place st. françois 14"
  • "place st. francois 14"
  • "14 Place saint François"

You could try to normalize the address by lower-casing the text, replacing accents, cedillas and dashes with the closest ASCII characters, and parsing the street number so it can be kept aside, but there will still be unforeseen exceptions. And a single different character produces a completely different hash.
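To make the limits of this concrete, here is a minimal sketch of such a normalizer. `AddressNormalizer` and `Normalize` are hypothetical names, and the rules (drop accents, lower-case, collapse punctuation and whitespace) are an assumed subset of what a real normalizer would need.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: a deliberately naive address-line normalizer.
static class AddressNormalizer
{
    public static string Normalize(string line)
    {
        // Decompose accented letters into base letter + combining mark.
        string decomposed = line.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        bool pendingSpace = false;

        foreach (char c in decomposed)
        {
            // Drop combining marks (accents, cedillas).
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            if (char.IsLetterOrDigit(c))
            {
                // Emit a single space between words, never at the start.
                if (pendingSpace && sb.Length > 0) sb.Append(' ');
                pendingSpace = false;
                sb.Append(char.ToLowerInvariant(c));
            }
            else
            {
                // Dashes, dots and whitespace all collapse to one separator.
                pendingSpace = true;
            }
        }
        return sb.ToString();
    }
}
```

With these rules the first two variants in the list become "place saint francois 14", but the "st." variants become "place st francois 14" and the last one becomes "14 place saint francois". Abbreviations and word order still produce different strings, and therefore different hashes.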

Unless all your addresses are perfectly normalized, I would suggest relying on an external service like here.com.

There are 3 ways of using the service:

  1. Either use the service to find the coordinates of your address (long, lat, altitude) then use those as your id (or 3 ids)
  2. Or you could use the service to find the address and keep their own ID in your DB. The drawback is that they do not guarantee that their ID will not change.
  3. The last is to use their service to find your address in their registry, then use their entry (which will be normalized according to their standard) to create the hash.

My favorite is option 1, as we can still get the address back from the coordinates (which is impossible with a hash). Moreover, an address might change (a new street name, for instance) while the coordinates should not. Last but not least, you might have two completely different addresses for the same location, and these are easier to reconcile using coordinates.
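If coordinates are used as the identifier, they need a stable textual form. The sketch below assumes the latitude and longitude have already been obtained from some geocoding service (no service is called here); `CoordinateId`, `FromCoordinates` and the choice of five decimals are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: turns already-geocoded coordinates into a string key.
static class CoordinateId
{
    public static string FromCoordinates(double latitude, double longitude)
    {
        // Fixed number of decimals and invariant culture, so the key does not
        // depend on the machine's regional settings (e.g. "," vs "." decimals).
        string lat = latitude.ToString("F5", CultureInfo.InvariantCulture);
        string lon = longitude.ToString("F5", CultureInfo.InvariantCulture);
        return lat + "," + lon;
    }
}
```

Five decimal places correspond to roughly one metre of latitude. Two results that differ slightly around a rounding boundary would still get different keys, so this is a starting point rather than a complete reconciliation scheme.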

Flavien Volken

Here is a complete example using serialization, SHA-256 hashing and Base64 encoding (based on CodesInChaos's answer):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Runtime.Serialization.Formatters.Binary;

namespace Uniq
{
    [Serializable]
    class Address
    {
        public string AddressLine1 { get; set; }
        public string AddressLine2 { get; set; }
        public string City { get; set; }
        public string Zip { get; set; }
        public string Country { get; set; }
    } 
    class MainClass
    {
        public static void Main (string[] args)
        {
            Address address1 = new Address(){AddressLine1 = "a1"};
            Address address2 = new Address(){AddressLine1 = "a1"};
            Address address3 = new Address(){AddressLine1 = "a2"};
            string unique1 = GetUniqueIdentifier(address1);
            string unique2 = GetUniqueIdentifier(address2);
            string unique3 = GetUniqueIdentifier(address3);
            Console.WriteLine(unique1);
            Console.WriteLine(unique2);
            Console.WriteLine(unique3);
        }
        public static string GetUniqueIdentifier(object obj){
            if (obj == null) return "0";
            // Serialize the object to bytes, hash them, and Base64-encode the digest.
            using (SHA256 mySHA256 = SHA256.Create())
            using (MemoryStream stream = new MemoryStream())
            {
                BinaryFormatter formatter = new BinaryFormatter ();
                formatter.Serialize(stream, obj);
                // ToArray() returns only the bytes written; GetBuffer() would include unused capacity.
                byte[] hash = mySHA256.ComputeHash(stream.ToArray());
                return Convert.ToBase64String(hash);
            }
        }
    }
}

Edit: this is a version that does not use BinaryFormatter. You may replace the null representation and the field separator with anything that suits your needs.

// Requires: using System.Linq; using System.Text;
public static string GetUniqueIdentifier(object obj){
    if (obj == null) return "0";
    StringBuilder stringRep = new StringBuilder();
    // GetProperties() does not guarantee an order, so sort by name to keep the hash stable.
    foreach (var prop in obj.GetType().GetProperties().OrderBy(x => x.Name, StringComparer.Ordinal))
    {
        // '¨' stands in for null; '^' separates the fields.
        stringRep.Append(prop.GetValue(obj, null) ?? '¨').Append('^');
    }
    using (SHA256 mySHA256 = SHA256.Create())
    {
        byte[] hash = mySHA256.ComputeHash(Encoding.Unicode.GetBytes(stringRep.ToString()));
        return Convert.ToBase64String(hash);
    }
}
Ahmed KRAIEM
  • I'm not a fan of using `BinaryFormatter` for this. You want some kind of function that *guarantees* giving the same result every time you call it, no matter which version of .net or mono you use. I don't think `BinaryFormatter` does guarantee that. I'd probably use [netstrings](http://en.wikipedia.org/wiki/Netstring) together with concatenation on the individual values. – CodesInChaos Apr 07 '13 at 13:15
  • You also have a bug: `stream.GetBuffer()` should be `stream.ToArray()`. – CodesInChaos Apr 07 '13 at 13:15