
Given a list in which each entry is an object that looks like

class Entry {
    public String id;
    public Object value;
}

Multiple entries could have the same id. I need a map where I can access all values that belong to a certain id:

Map<String, List<Object>> map;

My algorithm to achieve this:

for (Entry entry : listOfEntries) {
    List<Object> listOfValues;
    if (map.containsKey(entry.id)) {
        listOfValues = map.get(entry.id);
    } else {
        listOfValues = new ArrayList<>();
        map.put(entry.id, listOfValues);
    }
    listOfValues.add(entry.value);
}
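(As an aside: since Java 8, the contains/get/put branch can be collapsed into a single lookup with `Map.computeIfAbsent`. A minimal runnable sketch, with made-up sample data:)

```java
import java.util.*;

public class GroupByIdDemo {
    static class Entry {
        String id;
        Object value;
        Entry(String id, Object value) { this.id = id; this.value = value; }
    }

    public static void main(String[] args) {
        List<Entry> listOfEntries = List.of(
                new Entry("a", "foo"), new Entry("a", "bar"), new Entry("b", "foobar"));

        Map<String, List<Object>> map = new HashMap<>();
        for (Entry entry : listOfEntries) {
            // creates the list on the first sight of an id, otherwise returns the existing one
            map.computeIfAbsent(entry.id, k -> new ArrayList<>()).add(entry.value);
        }

        System.out.println(map.get("a")); // [foo, bar]
    }
}
```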

Simply: I transform a list that looks like

ID | VALUE
---+------------
a  | foo
a  | bar
b  | foobar

To a map that looks like

a--+- foo 
   '- bar
b---- foobar

As you can see, contains is called for each entry of the source list. That's why I wonder if I could improve my algorithm by pre-sorting the source list and then doing this:

List<Object> listOfValues = new ArrayList<>();
String prevId = null;
for (Entry entry : listOfEntries) {
    if (prevId != null && !prevId.equals(entry.id)) {
        map.put(prevId, listOfValues);
        listOfValues = new ArrayList<>();
    }
    listOfValues.add(entry.value);
    prevId = entry.id;
}
if (prevId != null) map.put(prevId, listOfValues);
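Spelled out as a runnable sketch, with the pre-sort step made explicit (and ids compared with `equals` rather than `!=`, since these are Strings):

```java
import java.util.*;

public class SortedGroupDemo {
    static class Entry {
        String id;
        Object value;
        Entry(String id, Object value) { this.id = id; this.value = value; }
    }

    public static void main(String[] args) {
        List<Entry> listOfEntries = new ArrayList<>(List.of(
                new Entry("b", "foobar"), new Entry("a", "foo"), new Entry("a", "bar")));

        // the pre-sort the second algorithm relies on: equal ids become adjacent
        listOfEntries.sort(Comparator.comparing((Entry e) -> e.id));

        Map<String, List<Object>> map = new HashMap<>();
        List<Object> listOfValues = new ArrayList<>();
        String prevId = null;
        for (Entry entry : listOfEntries) {
            if (prevId != null && !prevId.equals(entry.id)) {
                map.put(prevId, listOfValues);   // id changed: flush the finished group
                listOfValues = new ArrayList<>();
            }
            listOfValues.add(entry.value);
            prevId = entry.id;
        }
        if (prevId != null) map.put(prevId, listOfValues); // flush the last group

        System.out.println(map.get("a")); // [foo, bar]
    }
}
```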

The second solution has the advantage that I don't need to call map.contains() for every entry, but the disadvantage that I have to sort first. Furthermore, the first algorithm is easier to implement and less error-prone, since the second requires extra code after the actual loop.

Therefore my question is: Which method has better performance?

The examples are written in Java pseudo code but the actual question applies to other programming languages as well.

fishbone
    Without actually answering your question, your data structure is call a multimap. You can get what you need with the help of Guava's [`TreeMultimap`](http://google.github.io/guava/releases/snapshot/api/docs/com/google/common/collect/TreeMultimap.html) and/or with [`MultimapBuilder`](http://google.github.io/guava/releases/snapshot/api/docs/com/google/common/collect/MultimapBuilder.html). – Petr Janeček Jul 27 '16 at 08:37
  • @Sorin's answer is largely correct. On the performance part, I have met similar problems myself. In my case (integer id; billions of entries; many duplicated ids), the second approach is significantly faster because sorting is cache efficient and associated with a tiny constant. In your case, however, sorting strings offsets the cache efficiency of sort; large `Object` may also reduce sorting performance a bit. If in addition you don't have many duplicated ids, the first approach may be faster. I can't say for sure, though. – user172818 Jul 27 '16 at 22:23

3 Answers


If you have a hash map and a very large number of entries, then inserting the items one by one will be faster than sorting first and inserting them list by list (O(n) vs. O(n log n)). If you use a tree-based map, the complexity is the same for both approaches.

However, I really doubt you have a sufficiently large number of entries, so memory access patterns and the speed of the compare and hash functions come into effect. You have two options: ignore it, since the difference is not going to be significant, or benchmark both options and see which one works better on your system. If you don't have millions of entries, I would ignore the issue and go with whichever is easier to understand.
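If you do want to measure, a crude sketch along these lines can give a first impression. (Synthetic data, `System.nanoTime`, no JIT warm-up; a serious benchmark should use JMH instead.)

```java
import java.util.*;

public class CrudeBenchmark {
    static class Entry {
        String id;
        Object value;
        Entry(String id, Object value) { this.id = id; this.value = value; }
    }

    public static void main(String[] args) {
        // synthetic data: 1M entries over ~10k distinct ids
        Random rnd = new Random(42);
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            entries.add(new Entry("id" + rnd.nextInt(10_000), i));
        }

        // option 1: single pass over the unsorted list
        long t0 = System.nanoTime();
        Map<String, List<Object>> byHash = new HashMap<>();
        for (Entry e : entries) {
            byHash.computeIfAbsent(e.id, k -> new ArrayList<>()).add(e.value);
        }
        long t1 = System.nanoTime();

        // option 2: sort first, then scan for id changes
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparing((Entry e) -> e.id));
        Map<String, List<Object>> bySort = new HashMap<>();
        List<Object> run = new ArrayList<>();
        String prev = null;
        for (Entry e : sorted) {
            if (prev != null && !prev.equals(e.id)) {
                bySort.put(prev, run);
                run = new ArrayList<>();
            }
            run.add(e.value);
            prev = e.id;
        }
        if (prev != null) bySort.put(prev, run);
        long t2 = System.nanoTime();

        System.out.printf("hash: %d ms, sort+scan: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

Since the sort is stable, both maps end up identical, which doubles as a correctness check.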

Sorin

Don't pre-sort. Even fast sorting algorithms like quicksort take, on average, O(n log n) for n items. Afterwards, you still need O(n) to walk the list. `contains` on a (hash) map takes constant time (see this question), so don't worry about it. Walk the list in linear time and use `contains`.

beatngu13
  • _"you still need O(n) to walk the list"_ Do you? If you presorted while adding, you could then use binary search, effectively reducing the linear probe to O(log n). In Java 8's `HashMap`, that's how `Comparable` values are stored when they fall into the same hash bucket. – Petr Janeček Jul 27 '16 at 09:09
  • @Slanec O(n*log(n))+O(log n) is still greater than O(n)+O(n). – Vesper Jul 27 '16 at 11:26
  • @Slanec: I was referring to "Therefore my question is: Which method has better performance?" Since the OP is using a simple foreach loop in both cases, it is O(n). However, what do you mean by "[…] presorted while adding, you could then use binary search […]"? How would you use binary search when you have to look at each value? – beatngu13 Jul 27 '16 at 12:47
  • @Vesper: It's O(n log n) + O(log n) = O(n log n) vs. O(n), you don't have to walk the list twice. – beatngu13 Jul 27 '16 at 12:50

I'd like to offer another solution using streams:

import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.toList;

Map<String, List<Object>> map = listOfEntries.stream()
    .collect(groupingBy(entry -> entry.id, mapping(entry -> entry.value, toList())));

This code is more declarative: it only specifies that the list should be transformed into a map, and leaves it to the library to perform the transformation efficiently.
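For completeness, a self-contained run of this collector (the `Entry` class and sample data are made up to mirror the question's table; note that the stream source is the entry list):

```java
import java.util.*;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.toList;

public class StreamGroupingDemo {
    static class Entry {
        String id;
        Object value;
        Entry(String id, Object value) { this.id = id; this.value = value; }
    }

    public static void main(String[] args) {
        List<Entry> listOfEntries = List.of(
                new Entry("a", "foo"), new Entry("a", "bar"), new Entry("b", "foobar"));

        // groupingBy classifies entries by id; mapping + toList collects the values per id
        Map<String, List<Object>> map = listOfEntries.stream()
                .collect(groupingBy(entry -> entry.id, mapping(entry -> entry.value, toList())));

        System.out.println(map.get("a")); // [foo, bar]
    }
}
```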

Nazarii Bardiuk
  • While I prefer declarative code as well, I disagree with you on "Then it is a library responsibility to actually perform transformation in efficient way". Sure, it's up to the library implementers that a given operation performs well. But in case of Java's `groupingBy` "There are no guarantees on the type, mutability, serializability, or thread-safety of the Map returned". So if it comes to performance, I'd say this are properties _you_ want to control and, therefore, _you_ have to take care of. – beatngu13 Jul 27 '16 at 20:03