combining list of strings with a common element

Question

Let's say I have a vector of hash sets of strings:

Vector<HashSet<String>> strSetVector = new Vector<HashSet<String>>();

I have 4 hashsets containing the following strings:

"A", "B"
"C", "D"
"B", "C"
"E", "F"

I want to combine the sets that have at least one common value so that I end up with:

"A", "B","C", "D"
"E", "F"

The obvious solution is to iterate multiple times through the vector and each hashset to find common values, but this will take a while to process with a vector size of 1000+ and HashSets of size up to 100. I would also have to go through the process again whenever I merge a hashset, to see whether there are now other hashsets that can be merged. For example, the first vector iteration would combine B,C into A,B so that I would end up with:

"A", "B", "C"
"C", "D"
"E", "F"

Next iteration of the vector/hashset:

"A", "B", "C", "D"
"E", "F"

Next iteration of the vector/hashset would not find any common strings so there would be nothing to merge and I would be done.
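
For reference, the repeated-merge procedure described above could be sketched roughly as follows. This is only an illustration of the brute-force idea, not code from the question: the class name `NaiveMerge` and the method name `mergeUntilStable` are made up for this sketch, and it uses a `List` of `Set`s instead of a `Vector` of `HashSet`s. It restarts the pairwise scan after every merge, so it is the slow baseline that the question wants to improve on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NaiveMerge
{
    // Hypothetical helper: merges any two overlapping sets, then rescans,
    // until a full scan finds no pair of sets with a common element.
    static List<Set<String>> mergeUntilStable(List<Set<String>> input)
    {
        List<Set<String>> result = new ArrayList<Set<String>>();
        for (Set<String> s : input)
        {
            // Copy each set so that the caller's sets are not modified
            result.add(new LinkedHashSet<String>(s));
        }

        boolean mergedSomething = true;
        while (mergedSomething)
        {
            mergedSomething = false;
            for (int i = 0; i < result.size() && !mergedSomething; i++)
            {
                for (int j = i + 1; j < result.size(); j++)
                {
                    // Collections.disjoint returns true when there is no common element
                    if (!Collections.disjoint(result.get(i), result.get(j)))
                    {
                        // Fold set j into set i and drop set j from the list
                        result.get(i).addAll(result.remove(j));
                        mergedSomething = true;
                        break;
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<Set<String>> sets = new ArrayList<Set<String>>();
        sets.add(new LinkedHashSet<String>(Arrays.asList("A", "B")));
        sets.add(new LinkedHashSet<String>(Arrays.asList("C", "D")));
        sets.add(new LinkedHashSet<String>(Arrays.asList("B", "C")));
        sets.add(new LinkedHashSet<String>(Arrays.asList("E", "F")));

        for (Set<String> s : mergeUntilStable(sets))
        {
            System.out.println(s);
        }
    }
}
```

Each successful merge triggers a fresh scan over all remaining pairs, which is why this approach becomes expensive as the number of sets grows.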

I would like a more elegant solution to what seems like a simple problem. Any ideas?

I'm just using a vector out of convenience to keep a list of hashsets. I don't have to use a vector though. — Micho Rizo, Mar 08 '14 at 16:25
I would suggest using a `List` (for example, an `ArrayList`) then, as `Vector` is *synchronized*, whereas `ArrayList` is not. — skiwi, Mar 08 '14 at 16:27

score 2 · Accepted Answer · answered Mar 09 '14 at 14:05

I'm not sure if I understood everything correctly, and I think that the "best" solution might also depend on the sizes of the sets and the list (that is, whether the list contains 10 sets where each set contains 100000 elements, or whether it contains 100000 sets where each set contains 10 elements).

But for the numbers mentioned so far (1000 sets with 100 elements), I think that one could use a comparatively simple solution:

1. Go through the list of sets.
2. For each set, go through its elements.
3. For each element, store a mapping from the element to the set that contains it.
4. When an element is encountered that already has an associated set, merge the current set with this existing set, and store a mapping from every element of the merged set to the merged set.

This code snippet is based on the given example, and prints some debugging information, which might make the process clearer. It additionally stores a compactMap that maps the first element of a (potentially merged) set to the set itself, to have a representation of the sets where each set occurs only once.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MergeListsTest
{
    public static void main(String[] args)
    {
        List<Set<String>> sets = new ArrayList<Set<String>>();
        sets.add(new LinkedHashSet<String>(Arrays.asList("A", "B")));
        sets.add(new LinkedHashSet<String>(Arrays.asList("C", "D")));
        sets.add(new LinkedHashSet<String>(Arrays.asList("B", "C")));
        sets.add(new LinkedHashSet<String>(Arrays.asList("E", "F")));
        //sets.add(new LinkedHashSet<String>(Arrays.asList("D")));
        //sets.add(new LinkedHashSet<String>(Arrays.asList("D", "X")));
        //sets.add(new LinkedHashSet<String>());

        Collection<Set<String>> merged = computeMerged(sets);

        System.out.println("Resulting sets:");
        for (Set<String> s : merged)
        {
            System.out.println(s);
        }
    }

    private static <T> Collection<Set<T>> computeMerged(List<Set<T>> sets)
    {
        Map<T, Set<T>> compactMap = new LinkedHashMap<T, Set<T>>();
        Map<T, Set<T>> map = new LinkedHashMap<T, Set<T>>();
        for (Set<T> set : sets)
        {
            System.out.println("Handle set "+set);

            Set<T> combinedSet = new LinkedHashSet<T>(set);
            for (T t : set)
            {
                Set<T> innerSet = map.get(t);
                if (innerSet != null && !innerSet.isEmpty())
                {
                    System.out.println("Element "+t+" was previously mapped to "+innerSet);

                    T first = innerSet.iterator().next();
                    compactMap.remove(first);
                    combinedSet.addAll(innerSet);

                    System.out.println("Combined set is now "+combinedSet);
                }
            }
            if (!combinedSet.isEmpty())
            {
                System.out.println("Store a mapping from each element in "+combinedSet+" to this set");
                T first = combinedSet.iterator().next();
                compactMap.put(first, combinedSet);
                for (T t : combinedSet)
                {
                    map.put(t, combinedSet);
                }
            }
        }
        return compactMap.values();
    }

}

score 0 · Answer 2 · edited May 23 '17 at 12:29

Maybe I end up with an equivalent algorithm, but I would think of this as a graph in which you want to find the connected components.

I would build the adjacency matrix, of size n x n (where n is the total number of distinct elements, assumed to be 100 in this case). With it you can find the sets of connected elements using a linear-time algorithm, for example the one described in this answer, which includes Java code.

To build the adjacency matrix, you have to process every HashSet and connect its elements in a chain, which takes O(n) per set. The pairing can be arbitrary, for example the first with the second, the second with the third, and so on (I don't think it is necessary to fully interconnect all of them).

The fact that your elements are Strings and not chars can complicate things a little, but you can create a HashMap (String, Integer) beforehand that maps every element to an integer index, with an initial sweep that is linear in time. Subsequent lookups in this map should have time complexity O(1).
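
A minimal sketch of this graph idea, under a few assumptions, might look like the following. It is not the code from the linked answer. The names `ComponentMerge` and `mergeByComponents` are hypothetical, adjacency lists are used in place of the n x n adjacency matrix to avoid allocating n² cells, and the index assignment is done in the same sweep that builds the edges rather than in a separate pass. The components are collected with an iterative depth-first search.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ComponentMerge
{
    // Hypothetical helper: treats every string as a graph node and returns
    // one set per connected component.
    static List<Set<String>> mergeByComponents(List<Set<String>> sets)
    {
        Map<String, Integer> indexOf = new HashMap<String, Integer>();
        List<String> names = new ArrayList<String>();
        List<List<Integer>> adjacency = new ArrayList<List<Integer>>();

        for (Set<String> set : sets)
        {
            int previous = -1;
            for (String s : set)
            {
                // Assign a dense integer index to each string on first sight
                Integer index = indexOf.get(s);
                if (index == null)
                {
                    index = names.size();
                    indexOf.put(s, index);
                    names.add(s);
                    adjacency.add(new ArrayList<Integer>());
                }
                // Chain consecutive elements of the same set: first with
                // second, second with third, and so on
                if (previous >= 0)
                {
                    adjacency.get(previous).add(index);
                    adjacency.get(index).add(previous);
                }
                previous = index;
            }
        }

        // Iterative depth-first search that collects each component once
        boolean[] visited = new boolean[names.size()];
        List<Set<String>> result = new ArrayList<Set<String>>();
        for (int start = 0; start < names.size(); start++)
        {
            if (visited[start])
            {
                continue;
            }
            Set<String> component = new LinkedHashSet<String>();
            Deque<Integer> stack = new ArrayDeque<Integer>();
            stack.push(start);
            visited[start] = true;
            while (!stack.isEmpty())
            {
                int current = stack.pop();
                component.add(names.get(current));
                for (int neighbor : adjacency.get(current))
                {
                    if (!visited[neighbor])
                    {
                        visited[neighbor] = true;
                        stack.push(neighbor);
                    }
                }
            }
            result.add(component);
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<Set<String>> sets = new ArrayList<Set<String>>();
        sets.add(new LinkedHashSet<String>(Arrays.asList("A", "B")));
        sets.add(new LinkedHashSet<String>(Arrays.asList("C", "D")));
        sets.add(new LinkedHashSet<String>(Arrays.asList("B", "C")));
        sets.add(new LinkedHashSet<String>(Arrays.asList("E", "F")));

        for (Set<String> s : mergeByComponents(sets))
        {
            System.out.println(s);
        }
    }
}
```

With adjacency lists, the work is proportional to the total number of elements across all input sets. Note that empty input sets contribute no nodes, so they do not appear in the result of this sketch.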
