In the code below, I have a number of strings (DNA sequences)
that I am storing in a vector. I have a struct
, read_tag
that I use to identify each string; read_tag.read_id
is the string identifier. I take 30 character substrings of each string and use it as a key in an unordered_multimap
with the read_tag
as a value; the aim is to group strings that share a 30 character sequence. Naturally, identical string will hash to the same value, and end up in the same bucket in the multi map. Offset is used to give the "shift" from index zero of the 30 character tag.
However, when I run this code, iterating through each bucket; I find that there are multiple different sequences in the same bucket. I thought collisions were resolved in unordered_mutlimap
, and therefore in a bucket, their should just be one key (string). I understand that collision can happen, but I thought that either chaining, probing etc was implemented in unordered_mutlimap
.
You should be able to run and check output to see where I am getting confused.
I also std::hash
the each key, one in a bucket, and I find that the keys in the "collisions" have a different hash value.
So, it's as if collisions are happening, resulting in values with diff. keys in the same bucket, but in contradiction, the keys hashed to different vals. Is their a way to avoid this, and differentiate values based on keys within buckets? Or do I need to do implement this?
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
#include <functional>
using namespace std;
int main() {
vector<string> reads;
reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");
reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");
reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");
reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");
struct read_tag{
unsigned int read_id; // unique string identifier
int offset; // shift of 30 character substring represented by tag
};
unordered_multimap<string, read_tag> mutation_grouper;
for(int read_id=0; read_id < reads.size(); read_id++) {
string read = reads[read_id];
for(int i=0; i < read.size()-30; i++) {
string sub_read = read.substr(i, 30);
read_tag next_tag;
pair<string, read_tag> key_val;
next_tag.read_id = read_id;
next_tag.offset = i;
key_val.first = sub_read;
key_val.second = next_tag;
mutation_grouper.insert(key_val);
}
}
cout << "mutation_grouper buckets" << endl;
std::hash<std::string> hash_er;
for(unsigned int bucket = 0; bucket < mutation_grouper.bucket_count(); bucket++) {
cout << "Bucket: " << bucket << endl;
for( auto local_it = mutation_grouper.begin(bucket);
local_it != mutation_grouper.end(bucket); ++local_it) {
cout << local_it->first << " : " << local_it->second.read_id
<< ", " << local_it->second.offset << ", " << endl;
cout << "hash value: " << local_it->first <<"::: " << hash_er(local_it->first) << endl;
}
cout << endl << endl;
}
}