
I wrote a program that reads numbers out of a file (about 500,000 of them) and inserts them into a data structure. The numbers are distinct. I'm inserting the numbers into an unordered_map along with another struct (using std::make_pair(myNumber, emptyStruct)).
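Here is a simplified sketch of what the insertion looks like (the number type, struct contents, and file name are stand-ins for my real code):

```cpp
#include <fstream>
#include <unordered_map>
#include <utility>

struct Empty {};  // stands in for the real struct stored with each number

int main() {
    std::unordered_map<long long, Empty> table;
    std::ifstream in("numbers.txt");  // placeholder file name
    long long myNumber;
    Empty emptyStruct;
    while (in >> myNumber)
        table.insert(std::make_pair(myNumber, emptyStruct));
}
```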

And after inserting all the numbers, I use the structure to search only a couple of hundred times. I never delete it until the program finishes running.

After profiling, I've noticed that the insert operation takes about 50% of the running time. (There is also other code that runs as many times as the insertions, but it doesn't take nearly as much time.)

I thought maybe the resizing was taking time, so I called the reserve function with 500,000, but the results were the same.
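Concretely, the reserve call goes right before the insertion loop, like this (same types as the sketch above):

```cpp
std::unordered_map<long long, Empty> table;
table.reserve(500000);  // pre-allocates buckets for 500,000 elements,
                        // so no rehashing happens during the insertion loop
```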

As far as I know, this data structure should have O(1) insert and search (the trade-off being higher memory use), so I don't see why insertion takes so much time. How can I improve my results?

  • That's O(1) *for each insertion*. n insertions are still O(n). – Baum mit Augen Oct 30 '16 at 22:40
  • I agree. It seems reasonable. Inserting is going to be expensive. How about doing it backwards: load the values to compare first, and then go over the input file. – dmg Oct 30 '16 at 22:41
  • Well, you could do more other processing besides inserting into the `unordered_map`; that would bring the 50% portion down. How much exactly is "too much time"? What would be an appropriate amount of time for inserting 500,000 elements into a map? – eerorika Oct 30 '16 at 22:44
  • Have you considered using a vector? Insert them all, then sort the vector, then use `binary_search` to search them. – Marshall Clow Oct 30 '16 at 23:53
  • `unordered_map` uses a hash function for inserting. This is why it is usually slow at insertion time and fast at finding. You are doing a lot of insertions and a few reads, so `std::map` might be a better solution. See http://stackoverflow.com/questions/2196995/is-there-any-advantage-of-using-map-over-unordered-map-in-case-of-trivial-keys – Dragos Pop Oct 30 '16 at 23:58
  • Regarding how much time is a good time: for every value I insert, before the insertion, I call an API function, WriteProcessMemory, which is also time consuming, but not as much as the insertion (profiling gives me 10% on this API, and 50% on insertion). I don't believe that's a reasonable ratio, and I think better results could be achieved. – user7092994 Oct 31 '16 at 06:35

2 Answers


Unordered maps are implemented with a hash table, which has amortised constant insertion time. Reserving space in the map helps, but not by much. There is not much better you can do in terms of insertion time.

This means that you might be able to shave some time, but the gain is only going to be marginal. For instance, inserting into a vector is slightly faster, but it is also amortised constant time; you would shave some time on insertion at the cost of slower searches.
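For illustration, here is a minimal sketch of the vector alternative Marshall Clow suggested in the comments (element type assumed to be long long):

```cpp
#include <algorithm>
#include <vector>

int main() {
    std::vector<long long> values;
    values.reserve(500000);          // one allocation up front
    // ... read the 500,000 numbers and push_back each one ...
    std::sort(values.begin(), values.end());

    // each search is O(log n), traded against the cheaper insertions
    bool found = std::binary_search(values.begin(), values.end(), 12345LL);
    (void)found;
}
```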

This is where a database helps. Say you have the data in a SQLite database instead. You create the database, create a table with the search value as its primary key and the data value as another column, and insert the values once. From then on, the program simply queries the database and reads only the minimum necessary. In this case, the SQLite database takes over the role of the unordered map you are using.
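A rough sketch of the query side, using the sqlite3 C API (the file name numbers.db and the table numbers(value INTEGER PRIMARY KEY) are illustrative; the table would be populated in a separate one-time setup step):

```cpp
#include <sqlite3.h>
#include <cstdio>

// Returns true if 'key' exists in the pre-built numbers table.
bool exists(sqlite3 *db, long long key) {
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT 1 FROM numbers WHERE value = ?", -1, &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, key);
    bool found = (sqlite3_step(stmt) == SQLITE_ROW);  // primary-key lookup uses the index
    sqlite3_finalize(stmt);
    return found;
}

int main() {
    sqlite3 *db = nullptr;
    if (sqlite3_open("numbers.db", &db) != SQLITE_OK) return 1;
    std::printf("%d\n", exists(db, 12345) ? 1 : 0);
    sqlite3_close(db);
}
```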

dmg

Since you are specifically not using a value, and merely searching for existence, go with std::unordered_set. It does what you wanted when you made a dummy value to go with every key in the map.
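A minimal sketch of that change (the key type and file name are assumptions; substitute your own):

```cpp
#include <fstream>
#include <unordered_set>

int main() {
    std::unordered_set<long long> numbers;
    numbers.reserve(500000);
    std::ifstream in("numbers.txt");  // placeholder file name
    long long n;
    while (in >> n)
        numbers.insert(n);            // no dummy value needed

    bool found = numbers.count(12345) > 0;  // existence check
    (void)found;
}
```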

First, I want to reiterate what everyone said: inserting 500,000 items only to use them a few hundred times is going to take a sizable chunk of your time, and you can't really avoid that -- unless you can turn it around: build a set of the things you are searching for, then search it 500,000 times.

All that said, I was able to get some improvement on the insertion of 500,000 items in a test app by taking into account the nature of hash tables:

Reviewing http://en.cppreference.com/w/cpp/container/unordered_map, I found these:

[Insert] Complexity: Average case: O(1), worst case O(size())

By default, unordered_map containers have a max_load_factor of 1.0.

When you reserve space for 500,000 items at the default max_load_factor of 1.0, you get roughly 500,000 buckets. If you put 500,000 pieces of data in 500,000 buckets, you are going to get a lot of collisions. I reserved extra space, and it was faster.
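For example, something along these lines (the 0.5 load factor is just the kind of extra headroom I mean; tune it for your data):

```cpp
#include <unordered_map>

struct Empty {};  // dummy mapped type, as in the question

int main() {
    std::unordered_map<long long, Empty> table;
    table.max_load_factor(0.5f);  // target: buckets at most half full on average
    table.reserve(500000);        // sized for 500,000 elements at that load factor,
                                  // i.e. roughly a million buckets
    // ... insert the 500,000 items as before ...
}
```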

If you really need speed, and are willing to get some errors, look into bloom filters.
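The idea in miniature, as a toy sketch (fixed bit-array size and two probes here; a real filter derives both from the target false-positive rate):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy Bloom filter: ~8M bits and two probes per key. A "no" answer is
// definitive; a "yes" answer is only probably correct.
class BloomFilter {
    static const std::size_t kBits = std::size_t(1) << 23;  // 8,388,608 bits (~1 MB)
    std::vector<bool> bits_;

    // splitmix64-style finalizer so consecutive integers spread out
    static std::uint64_t mix(std::uint64_t x) {
        x += 0x9E3779B97F4A7C15ULL;
        x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
        x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
        return x ^ (x >> 31);
    }

public:
    BloomFilter() : bits_(kBits) {}

    void add(std::uint64_t key) {
        std::uint64_t h = mix(key);
        bits_[h % kBits] = true;
        bits_[(h >> 32) % kBits] = true;  // second probe from the high bits
    }

    bool maybeContains(std::uint64_t key) const {
        std::uint64_t h = mix(key);
        return bits_[h % kBits] && bits_[(h >> 32) % kBits];
    }
};

int main() {
    BloomFilter bf;
    bf.add(12345);
    bool hit = bf.maybeContains(12345);   // true
    bool miss = bf.maybeContains(99999);  // almost certainly false
    (void)hit; (void)miss;
}
```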

Kenny Ostrom