
First of all, let me say that I have read the previously asked question Java HashMap performance optimization / alternative, and I have a similar question.

What I want to do is take a LOT of dependencies from New York Times text, process the text with the Stanford parser to extract dependencies, and store the dependencies in a HashMap along with their scores, i.e. if I see a dependency twice I will increment its score in the HashMap by 1.

The task starts off really quickly, at about 10 sentences a second, but slows down quickly. At 30,000 sentences (assuming 10 words per sentence and about 3-4 dependencies per word, all of which I store) there are about 300,000 entries in my HashMap.

How can I improve the performance of my HashMap? What kind of hash key can I use?

Thanks a lot, Martinos

EDIT 1:

OK, maybe I phrased my question badly. The byte arrays are not used in MY project but in the similar question of another person linked above. I don't know what they are using them for, which is why I asked.

Secondly, I will not post the full code, as I think it would make things very hard to understand, but here is a sample:

With the sentence "i am going to bed" I have the dependencies: (i, am, -1) (i, going, -2) (i, to, -3) (am, going, -1) . . . (to, bed, -1). The dependencies of all sentences (1,000,000 sentences) will be stored in a HashMap. If I see a dependency twice I will get the score of the existing dependency and add 1.

And that is pretty much it. All is well, but the rate of adding sentences to the HashMap (or retrieving them) scales down on this line: `dependancyBank.put(newDependancy, dependancyBank.get(newDependancy) + 1);` Can anyone tell me why? Regards, Martinos
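For context, here is a minimal sketch of what a key class for such a HashMap could look like (the `Dependency` class and its field names are my assumption, not code from the question). The essential parts are a consistent `equals`/`hashCode` pair, and `merge()`, which does the increment in one map operation and also avoids the `NullPointerException` that `get() + 1` throws the first time a dependency is seen:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class for a (governor, dependent, offset) dependency.
final class Dependency {
    final String governor;
    final String dependent;
    final int offset;

    Dependency(String governor, String dependent, int offset) {
        this.governor = governor;
        this.dependent = dependent;
        this.offset = offset;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Dependency)) return false;
        Dependency d = (Dependency) o;
        return offset == d.offset
                && governor.equals(d.governor)
                && dependent.equals(d.dependent);
    }

    @Override
    public int hashCode() {
        // Combines all fields so equal dependencies share a hash bucket.
        return Objects.hash(governor, dependent, offset);
    }
}

class DependencyBankDemo {
    public static void main(String[] args) {
        Map<Dependency, Integer> bank = new HashMap<>();
        // merge() replaces the separate get()+put() round trip.
        bank.merge(new Dependency("i", "am", -1), 1, Integer::sum);
        bank.merge(new Dependency("i", "am", -1), 1, Integer::sum);
        System.out.println(bank.get(new Dependency("i", "am", -1))); // prints 2
    }
}
```

If `equals` and `hashCode` are missing or inconsistent on the key class, every lookup misses and the map fills with duplicates, which would also explain a slowdown as it grows.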

    It would really help if you could show more code... what are the types involved, for example? 10 sentences per second sounds very slow... – Jon Skeet Apr 01 '12 at 19:26
  • Please consider removing the extra question at the end, it would be more suited as a comment in the relevant question. – GavinCattell Apr 01 '12 at 19:26
  • You can't use a `byte[]` as a key, so I wonder what you could be using it for. `byte[]` is an object, you cannot place a primitive into a HashMap (you can only add wrappers) – Peter Lawrey Apr 01 '12 at 19:29
  • This is massively unclear without more details. – Louis Wasserman Apr 01 '12 at 19:36

5 Answers


Trove has optimized hash maps for the case where the key or value is of a primitive type.

However, much will still depend on smart choice of structure and hash code for your keys.

This part of your question is unclear: "The task starts off really quickly, about 10 sentences a second but scales off quickly. At 30 000 sentences (...) is about 300 000 entries in my hashmap." You don't say what the performance is for the larger data; your map grows, which is kind of obvious. Hashmaps are O(1) only in theory. In practice, you will see some performance change with size, due to reduced cache locality and occasional jumps caused by rehashing. So put() and get() times will not be constant, but they should stay close to constant. Perhaps you are using the hashmap in a way which doesn't guarantee fast access, e.g. by iterating over it? In that case your time will grow linearly with size, and you can't change that unless you change your algorithm.
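To illustrate the last point, here is a small sketch (with made-up keys, not the asker's data) contrasting a direct hash lookup against scanning the entries, which silently turns each lookup into O(n):

```java
import java.util.HashMap;
import java.util.Map;

class AccessPatternDemo {
    public static void main(String[] args) {
        Map<String, Integer> scores = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            scores.put("dep" + i, i);
        }

        // O(1) on average: a direct hash lookup.
        Integer direct = scores.get("dep99999");

        // O(n): scanning the entry set defeats the hash table entirely;
        // done once per insertion, it makes total time grow quadratically.
        Integer scanned = null;
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
            if (e.getKey().equals("dep99999")) {
                scanned = e.getValue();
                break;
            }
        }
        System.out.println(direct + " " + scanned); // prints 99999 99999
    }
}
```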

Michał Kosmulski
    In 2017 Trove is unsupported and has a lot of bugs (always had). fastutil, Koloboke and Eclipse collections are better alternatives. – leventov Jan 19 '17 at 16:32

Google 'fastutil' and you will find a superior solution for mapping object keys to scores.

bmargulies

A HashMap has an overloaded constructor which takes an initial capacity as input. The slowdown you see is caused by rehashing, during which the HashMap is effectively unusable. To prevent frequent rehashing, you need to start with a HashMap of greater initial capacity. You can also set a load factor, which indicates what fraction of the table may be filled before rehashing occurs.

public HashMap(int initialCapacity)

Pass the initial capacity to the HashMap during object construction. It is preferable to set the capacity to almost twice the number of elements you expect to add to the map over the course of your program's execution.
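As a sketch, sizing the map up front so that it never needs to rehash might look like this (the 300,000 figure is taken from the question; the key type is a placeholder):

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMapDemo {
    public static void main(String[] args) {
        int expectedEntries = 300_000; // figure from the question
        float loadFactor = 0.75f;      // the HashMap default

        // No rehash happens as long as
        // expectedEntries <= initialCapacity * loadFactor.
        int initialCapacity = (int) (expectedEntries / loadFactor) + 1;
        Map<String, Integer> bank = new HashMap<>(initialCapacity, loadFactor);

        for (int i = 0; i < expectedEntries; i++) {
            bank.put("dep" + i, i);
        }
        System.out.println(bank.size()); // prints 300000
    }
}
```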


Take a look at the Guava multimaps: http://www.coffee-bytes.com/2011/12/22/guava-multimaps. They are designed to keep a list of things that all map to the same key. That might solve your need.

Rick Mangi

How will i be able to increase the performance of my hashmap?

If it's taking more than 1 microsecond per get() or put(), you have a bug, IMHO. You need to determine why it's taking as long as it is. Even in the worst case, where every object has the same hashCode, you won't see performance this bad.
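A rough way to check this claim is a quick timing sketch like the one below (not a rigorous benchmark; it ignores JIT warm-up, so treat the numbers as an order-of-magnitude indication only):

```java
import java.util.HashMap;
import java.util.Map;

class MapTimingDemo {
    public static void main(String[] args) {
        int n = 1_000_000;
        Map<Integer, Integer> map = new HashMap<>();

        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            map.put(i, i);
        }
        long putNanos = System.nanoTime() - start;

        start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += map.get(i);
        }
        long getNanos = System.nanoTime() - start;

        // Average cost per operation in nanoseconds; on typical hardware
        // both should be far below 1,000 ns (1 microsecond).
        System.out.printf("avg put: %d ns, avg get: %d ns (checksum=%d)%n",
                putNanos / n, getNanos / n, sum);
    }
}
```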

What kind of hashkey can i use?

That depends on the data type of the key. What is it?

and finally what are byte[] a = new byte[2]; byte[] b = new byte[3]; in the question that was posted above?

They are arrays of bytes. They can be used as values to look up, but it's likely that you need a different value type.
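Note that `byte[]` makes a poor HashMap key in any case: arrays inherit identity-based equals()/hashCode() from Object, so two arrays with identical contents are treated as different keys. A small sketch of the pitfall and one workaround (deriving a content-based String key, an illustrative choice, not the only one):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ByteArrayKeyDemo {
    public static void main(String[] args) {
        // Identity-based equality: the second new byte[]{1, 2} is a
        // different object, so the lookup misses.
        Map<byte[], String> bad = new HashMap<>();
        bad.put(new byte[] {1, 2}, "hello");
        System.out.println(bad.get(new byte[] {1, 2})); // prints null

        // Content-based key: equal contents produce an equal key.
        Map<String, String> good = new HashMap<>();
        good.put(Arrays.toString(new byte[] {1, 2}), "hello");
        System.out.println(good.get(Arrays.toString(new byte[] {1, 2}))); // prints hello
    }
}
```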

Peter Lawrey