I have a huge number (1.5 billion) of integer pairs, each of which is associated with a document ID. My goal is to find documents that share the same pair.
My first idea was to use a map (std::map, which is strictly an ordered tree-based map rather than a hash map) with the pair values as keys and the document IDs as associated values, i.e. map<pair<int,int>, unordered_set<int>>.
For example:
Document1
- pair1: (3, 9)
- pair2: (5,13)
Document2
- pair1: (4234, 13)
- pair2: (5,13)
map<pair<int,int>, unordered_set<int>> hashMap;
hashMap[{3, 9}].insert(1);
hashMap[{5, 13}].insert(1);
hashMap[{4234, 13}].insert(2);
hashMap[{5, 13}].insert(2);
This would result in:
Key(3,9) = Documents(1)
Key(5,13) = Documents(1,2)
Key(4234,13) = Documents(2)
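For reference, the example above can be written as a self-contained sketch. The names DocIndex, addPair, countDocs and buildExample are made up for illustration and are not part of my actual code; the container is the same map<pair<int,int>, unordered_set<int>> as above.

```cpp
#include <cstddef>
#include <map>
#include <unordered_set>
#include <utility>

// Hypothetical aliases for illustration: pair values -> set of document IDs.
using PairKey = std::pair<int, int>;
using DocIndex = std::map<PairKey, std::unordered_set<int>>;

// Record that document `docId` contains the pair (a, b).
void addPair(DocIndex& index, int a, int b, int docId) {
    index[PairKey{a, b}].insert(docId);
}

// Number of distinct documents containing the pair (a, b); 0 if the pair is unknown.
std::size_t countDocs(const DocIndex& index, int a, int b) {
    auto it = index.find(PairKey{a, b});
    return it == index.end() ? 0 : it->second.size();
}

// Build the index for the two example documents shown above.
DocIndex buildExample() {
    DocIndex index;
    addPair(index, 3, 9, 1);      // Document1, pair1
    addPair(index, 5, 13, 1);     // Document1, pair2
    addPair(index, 4234, 13, 2);  // Document2, pair1
    addPair(index, 5, 13, 2);     // Document2, pair2
    return index;
}
```

With this, countDocs(buildExample(), 5, 13) counts both documents for the shared pair (5, 13), while the other two pairs each map to a single document.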
My problem is that this takes a huge amount of memory, exceeding my available 24 GB of RAM. I therefore need an alternative that fits into memory and still performs well for inserts and lookups. In theory, the raw data needs 1.5 billion * 3 (PairVal1, PairVal2, Document-ID) * 4 (bytes per integer) = 18 GB when overhead is not taken into account. Are there any good alternatives for my problem?
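For completeness, the raw-payload estimate can be checked with a small sketch. The helper rawPayloadBytes and the constants are hypothetical names, and the figure assumes 4-byte integers; it covers only the three values per entry, not any per-node or per-set overhead of the containers. Note that the arithmetic needs a 64-bit type, since the result does not fit into 32 bits.

```cpp
#include <cstdint>

// Raw payload in bytes: entries * fields per entry * bytes per field.
// 64-bit arithmetic is used because the result exceeds the 32-bit range.
constexpr std::uint64_t rawPayloadBytes(std::uint64_t entries,
                                        std::uint64_t fieldsPerEntry,
                                        std::uint64_t bytesPerField) {
    return entries * fieldsPerEntry * bytesPerField;
}

// 1.5 billion entries, each with PairVal1, PairVal2 and Document-ID (assumed 4 bytes each).
constexpr std::uint64_t kEntries = 1500000000ULL;
constexpr std::uint64_t kRawBytes =
    rawPayloadBytes(kEntries, 3, sizeof(std::int32_t));

static_assert(kRawBytes == 18000000000ULL,
              "1.5 billion * 3 * 4 bytes = 18 GB (decimal)");
```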