1

Geohash string is a feature in my sparse logistic regression model. So I used java string hashCode to generate int value on geohash string in order to get feature id. But I found hashCode method performs badly on similar geohash strings. It cause different features has the same feature id which may be bad in model optimization even the feature is similar. For example, those similar geohash string pairs have the same hashCode.

<"wws8vw", "wws8x9">
    "wws8vw".hashCode() = -774715770
    "wws8x9".hashCode() = -774715770
<"wmxy0", "wmxwn">
    "wmxy0".hashCode() = 113265337
    "wmxwn".hashCode() = 113265337

I guess it has some relationship between the geohash generator method and java hashCode method. So, anyone can explain me the true reason and how to decrease collisions on geohash string?

Andy Turner
  • 137,514
  • 11
  • 162
  • 243
formath
  • 319
  • 3
  • 17
  • 2
    Why? Because that's how `String.hashCode` is implemented, and it works well for strings in general. Now another why question: why do you care? What is the problem are experiencing? – Andy Turner Jul 29 '16 at 07:14
  • 1
    What kind of objects are these? If they are taken as `String`s, the hashCodes are different. – Dominik Sandjaja Jul 29 '16 at 07:17
  • 2
    @DaDaDom I think OP means that `"wws8vw".hashCode() == "wws8x9".hashCode()` etc. See [this](http://ideone.com/LaoEH9). – Andy Turner Jul 29 '16 at 07:17
  • Oh, OK. I thought that the pairs itself were matching. Well, I agree with your first comment: It looks like @formath is trying to solve the wrong problem. – Dominik Sandjaja Jul 29 '16 at 07:20
  • @formath [It sounds like you are asking the XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). What are you *really* trying to do? – Andy Turner Jul 29 '16 at 07:21
  • The "true reason" is that it just so happens that the contribution to the hash codes of the last 2 characters in each string is identical. But this isn't anything that you can change using `String.hashCode()`, and nor should you need to if you understand what `hashCode()` is actually for. So, third time lucky, perhaps: What are you really trying to do? – Andy Turner Jul 29 '16 at 07:27
  • @AndyTurner I have optimized the question. Thanks. – formath Jul 29 '16 at 07:28
  • 1
    @formath Could you please say _why_ you care about these collisions? – J Fabian Meier Jul 29 '16 at 07:31
  • @JFMeier I use geohash as a feature in my machine learning model. The feature id could be same for different feature if using hasdCode to generate sparse feature id which may affects model metrics. – formath Jul 29 '16 at 07:38

1 Answers1

5

I think that you are misunderstanding the purpose of the Object.hashCode() method - not hashing in general, but the reason why Java objects have this method:

This method is supported for the benefit of hash tables such as those provided by HashMap.

So if you are trying to use this method as an input to a machine learning model, you're not using it for its intended purpose.

The answer is reasonably obvious: you need to design your own hashing method - or select a pre-existing one - which gives you the desired collision profile for your expected inputs. The one used by String.hashCode() can't be changed by you.

Andy Turner
  • 137,514
  • 11
  • 162
  • 243
  • 1
    As addition: I would try to understand the "subspace" of string that is covered by your geohash values, like e.g. only Strings of length 5 with only a-z0-9 as characters. Then you can design a better map to int32. – J Fabian Meier Jul 29 '16 at 07:45
  • @JFMeier this was what I meant for "for your expected inputs". But yes. – Andy Turner Jul 29 '16 at 07:46
  • @AndyTurner "you need to design your own hashing method - or select a pre-existing one". Do you have some recommendations? – formath Jul 29 '16 at 07:58
  • @formath I don't (genuinely, I'm not just being obtuse). You'll need to do some research. Just as an idea, though, why expand the geohash to its encoded latitude/longitude, and hash those together: `double lat = ...; double lng = ...; int hash = Objects.hash(lat, lng);`. – Andy Turner Jul 29 '16 at 07:59