I'm in the process of integrating a hash method (farmhash) to our software base. The hashing services seem to work appropriately. Basically, it turns a string of characters into an unique-ish integer value.
I've added an infrastructure to detect collisions (in a case where two input strings would result in the same output integer). Basically, for each string that is hashed, I keep the [hash result] -> [string]
in a map, and every time a new string is hashed, I compare it to what's in the map; if the hash is already there, I make sure that it is the same string that has generated it. I am aware that it's potentially slow and it's potentially memory consuming, but I'm performing theses checks only on a "per request" basis: they are not enabled in release mode.
Now I'd like to test that infrastructure (as in get a collision, from a unit test point of view).
I could generate a bunch of strings (random or sequential), spam my hash infrastructure and hope to see a positive collision but I feel I'll waste my time, CPU cycles and fill the memory with a load of data without success.
How would one go about generating collisions?
Not-so-relevant-facts:
- I'm using c++;
- I can generate data using python;
- The target int is uint32_t.
Update:
I have created a small naive program to brute force the detection of collision:
void
addToQueue(std::string&& aString)
{
//std::cout << aString << std::endl;
hashAndCheck( aString ); // Performs the hash and check if there is a collision
if ( mCount % 1000000 )
std::cout << "Did " << mCount << " checks so far" << std::endl;
mQueue.emplace( aString );
}
void
generateNextRound( const std::string& aBase )
{
//48 a 122 incl
for ( int i = 48; i <= 122; i++ )
{
addToQueue( std::move( std::string( aBase ).append( 1, static_cast<char>( i ) ) ) );
}
}
int main( void )
{
// These two generate a collision
//StringId id2 = HASH_SID( "@EF" ); // Hashes only, does not check
//StringId id1 = HASH_SID( "7\\:" ); // Hashes only, does not check
std::string base = "";
addToQueue( std::move( base ) );
while ( true )
{
const std::string val = mQueue.front();
mQueue.pop();
generateNextRound( val );
}
return 0;
}
I could eventually have added threading and stuff in there but I didn't need it because I found a collision in about 1 second (in debug mode).