
I have a huge number (1500 million) of integer pairs, each of which is associated with a document ID. My goal is to find documents that have the same pair.

My first idea was to use a hash-map (std::map) with the pair values as keys and the document IDs as mapped values, i.e. std::map<std::pair<int,int>, std::unordered_set<int>>

For example:

Document1

 - pair1: (3, 9)
 - pair2: (5, 13)

Document2

 - pair1: (4234, 13)
 - pair2: (5, 13)

    #include <map>
    #include <unordered_set>

    int main() {
        std::map<std::pair<int, int>, std::unordered_set<int>> hashMap;

        hashMap[{3, 9}].insert(1);
        hashMap[{5, 13}].insert(1);

        hashMap[{4234, 13}].insert(2);
        hashMap[{5, 13}].insert(2);
    }

would result in

    Key(3,9) = Documents(1)
    Key(5,13) = Documents(1,2)
    Key(4234,13) = Documents(2)

My problem now is that this takes a huge amount of memory, which exceeds my available 24 GB of RAM. I therefore need an alternative with good insert and lookup performance that fits into my memory. In theory I'm using 1500 million * 3 values (PairVal1, PairVal2, Document-ID) * 4 bytes per integer = 18 GB when overhead costs are not taken into account. So are there any good alternatives for my problem?
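For reference, the 18 GB figure corresponds to storing nothing but bare triples, for example in a flat array (a minimal sketch, not code from the question; the struct name is made up):

    #include <cstdint>
    #include <vector>

    // One 12-byte record per (pair, document) occurrence.
    struct Entry {
        std::int32_t val1;   // first pair value
        std::int32_t val2;   // second pair value
        std::int32_t docId;  // owning document
    };
    static_assert(sizeof(Entry) == 12, "expecting no padding here");

    // 1'500'000'000 entries * 12 bytes = 18 GB, before any container overhead.
    std::vector<Entry> entries;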

Mad A.
  • std::map is not a hash map; you may want std::unordered_map – Dieter Lücking Jun 14 '16 at 15:03
  • I cannot speak to its efficiency, but you might look at stxxl for this type of problem. http://stxxl.sourceforge.net/ – Dan Jun 14 '16 at 15:04
  • If the number of documents in the `set` is small then you could replace that by a `std::vector` – Galik Jun 14 '16 at 15:06
  • 3 billion integers is bound to require a lot of space. After you add to that your sets of document IDs, I'm not sure you can save much space by changing containers. – John Jun 14 '16 at 15:14
  • @DieterLücking Using an `unordered_map` with 1500 million entries would take 50 GB already, and that doesn't even include the document IDs. So the overhead of `unordered_map` or `map` is too big. I've read somewhere that there are an additional 32 bytes stored per hash-map entry, and that's the problem. – Mad A. Jun 14 '16 at 15:16
  • @John I was thinking about using a database, but I'm not sure how long it takes to insert new values into an already huge database and to check whether the entry exists yet. Another thought was saving the pairs as a `vector<pair<int,int>>`, sorting the vector, and checking for duplicates. – Mad A. Jun 14 '16 at 15:21
  • @MadA. Database sounds better than my file system idea, if you can get it set up. – John Jun 14 '16 at 15:22
  • @John I can set up a database, I'm just very bad at estimating whether it will work in the end, so I wanted to get some opinions first. I feel like checking whether a pair already exists would take a pretty long time with 1500 million entries. I could also use a GROUP BY in the end, but then I have a problem when inserting new pairs into the database. – Mad A. Jun 14 '16 at 15:27
  • How are these stored now? – John Jun 14 '16 at 15:28
  • Have you tested a simple vector? You can hardly get more space efficient than that. What I can't say is how long it will take to sort it. – MikeMB Jun 14 '16 at 16:54
  • How many documents do you (roughly) have per integer pair? – MikeMB Jun 14 '16 at 17:01
  • @MikeMB I have 1 million documents and each document has about 1500 pairs. The problem with the vector approach is that when new documents arrive I have to re-sort it or insert in place, which means I have to iterate over the full vector each time, which is too slow. – Mad A. Jun 14 '16 at 17:06
  • Finding the insertion point in a sorted array is an O(log n) operation and actually quite fast. If you have to insert a new number pair, however, that would indeed become quite expensive. – MikeMB Jun 14 '16 at 17:48
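For reference, a minimal sketch of the sorted flat-vector approach discussed in the comments above (the Triple layout and function name are illustrative): after filling the vector you std::sort it once, and every lookup becomes two binary searches.

    #include <algorithm>
    #include <climits>
    #include <tuple>
    #include <vector>

    using Triple = std::tuple<int, int, int>;  // (val1, val2, docId)

    // After std::sort, tuples compare lexicographically, so all documents
    // containing the pair (a, b) form one contiguous run in the vector.
    std::pair<std::vector<Triple>::const_iterator,
              std::vector<Triple>::const_iterator>
    findDocs(const std::vector<Triple>& entries, int a, int b) {
        auto lo = std::lower_bound(entries.begin(), entries.end(),
                                   Triple{a, b, INT_MIN});
        auto hi = std::upper_bound(entries.begin(), entries.end(),
                                   Triple{a, b, INT_MAX});
        return {lo, hi};
    }

Adding a new document's pairs then means either re-sorting or merging in a sorted batch, which is exactly the cost the comments point out.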

3 Answers


This might be a job for an embedded database such as SQLite, BerkeleyDB, or Tokyo Cabinet.

If the amount of data you're using exceeds your RAM then you really do need something that can work from disk.
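For example, with SQLite's C API (a minimal sketch; the table and file names are made up and error handling is elided):

    #include <sqlite3.h>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("pairs.db", &db) != SQLITE_OK) return 1;

        // One row per (pair, document); the composite primary key doubles
        // as the lookup index and makes the duplicate check implicit.
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS pairs ("
            "  v1 INTEGER, v2 INTEGER, doc INTEGER,"
            "  PRIMARY KEY (v1, v2, doc)) WITHOUT ROWID;",
            nullptr, nullptr, nullptr);

        // Batch many inserts per transaction; INSERT OR IGNORE skips duplicates.
        sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
        sqlite3_stmt* ins = nullptr;
        sqlite3_prepare_v2(db,
            "INSERT OR IGNORE INTO pairs VALUES (?, ?, ?);", -1, &ins, nullptr);
        sqlite3_bind_int(ins, 1, 5);    // PairVal1
        sqlite3_bind_int(ins, 2, 13);   // PairVal2
        sqlite3_bind_int(ins, 3, 1);    // Document-ID
        sqlite3_step(ins);
        sqlite3_finalize(ins);
        sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

        sqlite3_close(db);
    }

In a real insert loop you would sqlite3_reset() and re-bind the same prepared statement instead of re-preparing it, and wrap thousands of inserts in each transaction.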

Zan Lynx
  • I was planning on doing that, but my experience with database performance is not that good and I didn't want to rush into anything that would not work in the end. So I'm unsure how long it would take to insert new pairs and to check beforehand whether a pair already exists. I can imagine these operations will be very costly with 1500 million rows. Do you have any experience in that matter? – Mad A. Jun 14 '16 at 17:11
  • @MadA. I used Tokyo Cabinet for something similar years ago and it did 50,000 per second. – Zan Lynx Jun 14 '16 at 17:57

Can you use the file system?

Name directories after the first integer and create text files in each named for the second integer; each line of the text file can be a document ID.

You're bound to suffer significant speed penalties on all of the I/O. Get as fast a disk as you can. Storage requirements will grow significantly too, as directory names, file names, and file contents become ASCII text instead of binary integers.
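A minimal sketch of this layout, assuming C++17's <filesystem> (the function and path names are illustrative):

    #include <filesystem>
    #include <fstream>
    #include <string>

    namespace fs = std::filesystem;

    // Record docId under <root>/<val1>/<val2>.txt, one document ID per line.
    void addPair(const fs::path& root, int val1, int val2, int docId) {
        const fs::path dir = root / std::to_string(val1);  // directory per first integer
        fs::create_directories(dir);
        std::ofstream out(dir / (std::to_string(val2) + ".txt"),
                          std::ios::app);                  // file per second integer
        out << docId << '\n';
    }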

John
  • If you're using the disk, why not just have a swapfile or an `mmap()`-backed allocator? That should have much lower overhead than a file per entry. – joelw Jun 14 '16 at 15:25
  • @joelw Because `mmap()` is often significantly slower than low-level `read()`/`write()` calls. http://marc.info/?l=linux-kernel&m=95496636207616&w=2 – Andrew Henle Jun 14 '16 at 15:29
  • @AndrewHenle This solution isn't low-level `read()`/`write()`. Each read will be wrapped with an `open()`/`close()` pair. – joelw Jun 14 '16 at 15:37

One way to reduce the space is to use std::unordered_map<int, std::unordered_set<int>> instead of std::map<std::pair<int,int>, std::unordered_set<int>>.

To convert a std::pair<int, int> to an int you need a pairing function, for example:

Cantor’s Pairing Function

Obviously you're limited to using smaller numbers in your pairs.

The mapping for the two largest 16-bit signed integers, (32767, 32767), is 2147418112, which is just short of the maximum value of a signed 32-bit integer.
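A minimal sketch of that function, assuming non-negative inputs (the 64-bit intermediate avoids overflow in the multiplication):

    #include <cstdint>

    // Cantor pairing: pi(a, b) = (a + b)(a + b + 1) / 2 + b.
    // pi(32767, 32767) == 2147418112, as stated above.
    std::uint32_t cantorPair(std::uint16_t a, std::uint16_t b) {
        const std::uint64_t s = std::uint64_t(a) + b;
        return static_cast<std::uint32_t>(s * (s + 1) / 2 + b);
    }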

Another option is to create your own indexer based on a B-tree, or to use an open-source search-engine library like Xapian, which is written in C++ and is fast and easy to use.

Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications.

chema989
  • I thought about that already; the problem is that each pair value can range from 0 to 1 million. I figured I'd need 1 million * 1 million mapping possibilities, which exceeds the `unsigned int` range, and an `unsigned long long` variable would take up 8 bytes again. – Mad A. Jun 14 '16 at 15:33
  • How about an unordered_multimap? Merge the two 32-bit keys into a single 64-bit key and use an unordered map whose values are the 32-bit document IDs. This avoids needing to construct the 2-D data structure. – Nick Brown Dec 09 '22 at 16:11
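A minimal sketch of that suggestion (identifiers are illustrative; unlike Cantor pairing, the packed key is unique over the full 32-bit range):

    #include <cstdint>
    #include <unordered_map>

    // Pack the pair into one 64-bit key: first value in the high word,
    // second value in the low word.
    inline std::uint64_t packKey(std::uint32_t a, std::uint32_t b) {
        return (std::uint64_t(a) << 32) | b;
    }

    // One (key, docId) entry per occurrence; no nested set per pair.
    std::unordered_multimap<std::uint64_t, std::int32_t> pairIndex;

    // Usage:
    //   pairIndex.emplace(packKey(5, 13), 1);
    //   auto range = pairIndex.equal_range(packKey(5, 13));  // all docs for (5, 13)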