
I have a very large file (10^8 lines) with counts of events as follows,

A 10
B 11
C 23
A 11

I need to accumulate the counts for each event, so that my map contains

A 21
B 11
C 23

My current approach:

Read the lines, maintain a map, and update the counts in the map as follows:

void updateCount(Map<String, Long> countMap, String key, Long c) {
    if (countMap.containsKey(key)) {
        Long val = countMap.get(key);
        countMap.put(key, val + c);
    } else {
        countMap.put(key, c);
    }
}
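(If you are on Java 8, the whole if/else can be collapsed into a single `Map.merge` call, which also avoids the second lookup on the containsKey/get path. A minimal runnable sketch, with the class name being mine:)

```java
import java.util.HashMap;
import java.util.Map;

public class MergeCount {
    // merge() inserts c when the key is absent, otherwise combines the
    // existing value with c using Long::sum -- one lookup instead of two.
    static void updateCount(Map<String, Long> countMap, String key, long c) {
        countMap.merge(key, c, Long::sum);
    }

    public static void main(String[] args) {
        Map<String, Long> countMap = new HashMap<>();
        updateCount(countMap, "A", 10);
        updateCount(countMap, "B", 11);
        updateCount(countMap, "C", 23);
        updateCount(countMap, "A", 11);
        System.out.println(countMap.get("A")); // prints 21
    }
}
```

(`merge` is a default method on `Map`, so it also works on a MapDB-backed map, though there it only saves the extra lookup; each put still pays serialization cost.)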

Currently this is the slowest part of the code (it takes around 25 ms). Note that the map is backed by MapDB, but I doubt that the updates are slow because of that (are they?)

This is the MapDB config for the map:

DBMaker.newFileDB(dbFile).freeSpaceReclaimQ(3)
                .mmapFileEnablePartial()
                .transactionDisable()
                .cacheLRUEnable()
                .closeOnJvmShutdown();

Are there ways to speed this up?

EDIT:

The number of unique keys is on the order of the number of pages on Wikipedia. The data is actually page traffic data from here.

shyamupa
  • Just a note, if possible you may want to do your alterations in a `HashMap`. It will be the fastest for updating random map entries. – damian Aug 19 '14 at 15:19
  • Exactly what is your question? – user1071777 Aug 19 '14 at 15:19
  • Twenty-five milliseconds! Heavens to Murgatroyd! – David Conrad Aug 19 '14 at 15:21
  • @user1071777 I am looking for ways to speed this up. – shyamupa Aug 19 '14 at 15:23
  • The memory usage of your map depends on the number of unique keys you have, not the size of the input file. How many unique keys do you have? If it's less than a few tens of millions you should not be using MapDB, but just a plain `HashMap` in memory. That should be several orders of magnitude faster. – Jim Garrison Aug 19 '14 at 15:46
  • @JimGarrison I answered your question in an edit. – shyamupa Aug 19 '14 at 15:52
  • You will trade off speed for memory. If you want speed, add LOTS of memory and put the map in memory. If you can't fit the map in memory then it will be slow. This is one of the fundamental tradeoffs in computing. Take your pick. – Jim Garrison Aug 19 '14 at 15:54
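(Jim Garrison's suggestion in the comments, accumulating in a plain in-memory `HashMap` and persisting only once at the end, can be sketched as follows. The line-parsing and the pre-size value are assumptions about the input format:)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class InMemoryCount {
    // Accumulate counts entirely in a plain HashMap; avoids paying MapDB's
    // serialization cost on every single update.
    static Map<String, Long> accumulate(BufferedReader reader) throws IOException {
        Map<String, Long> counts = new HashMap<>(1 << 20); // pre-size to limit rehashing
        String line;
        while ((line = reader.readLine()) != null) {
            // assumed format: "<key> <count>", count after the last space
            int space = line.lastIndexOf(' ');
            String key = line.substring(0, space);
            long c = Long.parseLong(line.substring(space + 1));
            counts.merge(key, c, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader r = new BufferedReader(
                new StringReader("A 10\nB 11\nC 23\nA 11\n"));
        System.out.println(accumulate(r).get("A")); // prints 21
    }
}
```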

3 Answers


You might try

class Counter {
    long count;
}

void updateCount(Map<String, Counter> countMap, String key, int c) {
    Counter counter = countMap.get(key);
    if (counter == null) {
        counter = new Counter();
        countMap.put(key, counter);
        counter.count = c;
    } else {
        counter.count += c;
    }
}

This does not create a new Long wrapper per update; it only allocates one Counter per unique key.

Note: do not create Longs. Above I made c an int so that the long/Long boxing distinction is not overlooked.
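(For reference, the above wired into a runnable class with a usage example; the class name and `main` method are added here:)

```java
import java.util.HashMap;
import java.util.Map;

public class CounterDemo {
    static class Counter {
        long count; // mutable primitive, updated in place
    }

    // Same idea as the answer: one Counter object per unique key,
    // no Long boxing on the hot update path.
    static void updateCount(Map<String, Counter> countMap, String key, int c) {
        Counter counter = countMap.get(key);
        if (counter == null) {
            counter = new Counter();
            counter.count = c;
            countMap.put(key, counter);
        } else {
            counter.count += c;
        }
    }

    public static void main(String[] args) {
        Map<String, Counter> m = new HashMap<>();
        updateCount(m, "A", 10);
        updateCount(m, "A", 11);
        System.out.println(m.get("A").count); // prints 21
    }
}
```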

Joop Eggen

As a starting point, I'd suggest thinking about:

  • What is the yardstick by which you're saying that 25 ms is actually an unreasonable amount of time for the amount of data involved and for a generic map implementation? If you quantify that, it might help you work out whether anything is wrong.
  • How much time is being spent re-hashing the map versus other operations (e.g. calculation of hash codes on each put)?
  • What do your "events", as you call them, consist of? How many unique events -- and hence unique keys -- are there? How are the keys to the map being generated, and is there a more efficient way to do so? (In a standard hash map, for example, you create additional objects for each association and actually store the key objects, increasing the memory footprint.)
  • Depending on the answers to the previous points, you could potentially roll a more efficient map structure yourself (see this example that you might be able to adapt). Essentially, you need to look at what specifically is taking the time (e.g. hash-code calculation per put, or the cost of rehashing) and try to optimise that part.
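(The rehashing point is easy to act on: if the number of unique keys can be estimated up front -- the question says it is on the order of Wikipedia's page count -- an in-memory `HashMap` can be pre-sized so it never rehashes during the run. A sketch, with the estimate being an assumed figure:)

```java
import java.util.HashMap;
import java.util.Map;

public class PreSized {
    // If the number of unique keys is roughly known up front, sizing the map
    // so it never needs to grow removes rehashing cost from the hot loop.
    static Map<String, Long> newCountMap(int expectedKeys) {
        // HashMap resizes when size > capacity * loadFactor (0.75 by default),
        // so ask for expectedKeys / 0.75 up front.
        int capacity = (int) (expectedKeys / 0.75f) + 1;
        return new HashMap<>(capacity);
    }

    public static void main(String[] args) {
        // hypothetical estimate: one million unique page titles
        Map<String, Long> counts = newCountMap(1_000_000);
        counts.merge("Main_Page", 42L, Long::sum);
        System.out.println(counts.get("Main_Page")); // prints 42
    }
}
```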
Neil Coffey

If you are using a TreeMap, there are performance-tuning options such as:

  1. The number of entries in each node.
  2. Using specific key and value serializers, which will speed up serialization and deserialization.
  3. Using Pump mode to build the tree, which is very fast. One caveat: this is only useful when you are building a new map from scratch. You can find a full example here:

https://github.com/jankotek/MapDB/blob/master/src/test/java/examples/Huge_Insert.java

Shishya