I have a CSV of names, transaction amounts, and the exact longitude and latitude of the location where each transaction was performed. I want the final document to be anonymized. For that I need to turn it into a CSV where the names are hashed (that should be easy enough) and the longitude and latitude are obscured within a radius of 2 km. That is, I want to change the coordinates so that they are no more than 2 km from the original location, but in a randomized way, so that the change cannot be reversed by a formula. Does anyone know how to work with coordinates that way?

Gal Chen
  • Hashing the names won't anonymize them: you will still be able to correlate all the transactions of a single person. For the coordinates, you could round them (e.g., to the nearest 0.1 arcminute) rather than randomize them, but deanonymization can still do things you don't expect! – Constance Mar 21 '18 at 07:52

1 Answer

You could use locality-sensitive hashing (LSH) to map similar co-ordinates (i.e., those within a 2 km radius) to the same value with high probability. Co-ordinates that map to the same bucket are then likely to be close together in Euclidean space.
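As a rough illustration of the bucketing idea, the sketch below snaps latitude/longitude pairs to a grid of roughly 2 km cells so that nearby points share a bucket. It is a plain grid quantization rather than a full LSH scheme, and the name `grid_bucket`, the cell size, and the figure of about 111,320 m per degree of latitude are assumptions made for this example.

```python
import math

METERS_PER_DEG_LAT = 111_320.0  # approximate value; an assumption of this sketch


def grid_bucket(lat, lon, cell_m=2000.0):
    """Return an integer (row, col) grid cell id for a lat/lon pair.

    Points in the same cell get the same id. Points close to a cell
    border can still land in different cells even when they are near
    each other.
    """
    # Height of one cell in degrees of latitude
    dlat = cell_m / METERS_PER_DEG_LAT
    row = math.floor(lat / dlat)
    # A degree of longitude shrinks with latitude. Use the row's centre
    # latitude so every point in the row shares the same column width.
    centre_lat = (row + 0.5) * dlat
    dlon = cell_m / (METERS_PER_DEG_LAT * math.cos(math.radians(centre_lat)))
    col = math.floor(lon / dlon)
    return row, col


# Two points a few metres apart fall into the same bucket
print(grid_bucket(32.0800, 34.7800) == grid_bucket(32.0801, 34.7801))
```

Because the mapping is deterministic, repeated transactions at one location always produce the same bucket. Publishing the cell centre instead of the id would move each point by at most about half the cell diagonal, roughly 1.4 km for a 2 km cell.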

Alternatively, you could use any standard hash function y = H(x) and compute y modulo N, where N is the size of the allowed offset range (Range in the snippet below). Assume your co-ordinates are P = (500, 700) and you would like to return a randomized value in a range of [-x, x] km from P.

import random

P = (500, 700)
Range = 1000  # 1000 meters, for example
# Derive an offset in [0, Range) for each co-ordinate from its hash
ANON_X = hash(P[0]) % Range
ANON_Y = hash(P[1]) % Range
# Randomly add or subtract each offset (index P; a tuple cannot be added to an int)
P = (P[0] + ANON_X * random.choice([-1, 1]), P[1] + ANON_Y * random.choice([-1, 1]))
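For latitude/longitude data, a different approach is to draw the offset from a random number generator instead of a hash, so the shift is not a function of the input. The sketch below is one possible way to do that, not the method from the snippet above: it picks a point uniformly inside a disk of radius `max_m` around the original location. The name `jitter`, the spherical Earth radius, and the small-distance conversion from metres to degrees are assumptions of this example, and the conversion becomes inaccurate near the poles.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def jitter(lat, lon, max_m=2000.0, rng=random):
    """Return (lat, lon) moved by a random offset of at most max_m metres."""
    # Taking the square root spreads points evenly over the disk's area
    # instead of clustering them near the centre
    dist = max_m * math.sqrt(rng.random())
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    # Small-distance approximation: turn metre offsets into angles (radians)
    dlat = (dist * math.cos(bearing)) / EARTH_RADIUS_M
    dlon = (dist * math.sin(bearing)) / (
        EARTH_RADIUS_M * math.cos(math.radians(lat))
    )
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


# Example: obscure one point within 2 km
print(jitter(32.0800, 34.7800))
```

Each call gives a fresh offset, so the same input location yields a different output every time. That is what makes the shift irreversible by formula, but it also means many outputs for one location can be averaged.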
gratio
  • If you "return a randomized value in a range of [-x,x] KM from P", then all somebody needs to do is take lots of them with the same P and average them, then they find P. – Constance Mar 21 '18 at 10:14
  • Also, `random.choice([-1, 1])` only chooses *either* -1 or 1, nothing in between! – Constance Mar 21 '18 at 10:15
  • Averaging over a large number of P's would theoretically work, but I am assuming that the original co-ordinates are sparsely distributed in Euclidean space, hence you wouldn't be able to take many points to average over. `random.choice()` would work because `ANON_X` already stores a random value within the provided range, and we have to decide to either add or subtract the value in the range. – gratio Mar 22 '18 at 03:36
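
The averaging concern raised in the comments can be checked with a quick simulation. The sketch below assumes independent uniform noise in [-1000, 1000] metres on a single co-ordinate and a hypothetical true value of 500; the helper `noisy_copies` and the sample count are illustrative choices, not part of the thread.

```python
import random
import statistics


def noisy_copies(x, radius, n, rng):
    """Return n copies of x, each shifted by independent uniform noise."""
    return [x + rng.uniform(-radius, radius) for _ in range(n)]


rng = random.Random(42)   # fixed seed so the run is repeatable
true_x = 500.0            # one co-ordinate of the hypothetical point P
samples = noisy_copies(true_x, 1000.0, 10_000, rng)
estimate = statistics.mean(samples)

# The error of the average is small compared with the 1000 m noise radius
print(abs(estimate - true_x))
```

How much this matters depends on how many obscured records share one true location: with only a handful of samples the average stays far less precise.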