I have an array consisting of 1.5 million pairs of elements (the two elements of each pair are separated by ' '):

$array {
    [0] => "element1 element2"
    [1] => "element2 element3"
    [2] => "element8 element4"
    [3] => "element8 element5"
    [4] => "element4 element5"
    [5] => "element6 element7"
    [6] => ... 
}     

Each pair of elements is unique, and the elements are strings of 15 to 20 characters.

In my pipeline, this array means: [0] "element1 is related to element2", [1] "element2 is related to element3", and so on. I would like to cluster all related elements together and get output similar to:

 $array_output {
      [0] => "element1 element2 element3"
      [1] => "element8 element4 element5"
      [2] => "element6 element7"
      [3] => ... 
 }  

I guess this task is very simple and I'm probably missing an obvious way to do it, but I haven't found a fast way to cluster my elements (i.e. one that takes from a few minutes to a few hours).

zwonoROM
  • I don't consider this task simple and don't know of an obvious way to do it. I would probably suggest exploding on space and then creating a nested hierarchy structure. Then write something to flatten that structure into the desired groups. – Jonathan Kuhn Feb 25 '15 at 17:06
  • I'd be highly disinclined to do this in PHP memory with such a large number of pairs, and handle it on a database instead – Mark Baker Feb 25 '15 at 17:18
  • I don't think it's as problematic as that. Unless I've misunderstood the question it can be done in O(n) time and space, where n is the number of pairs in the input (see my answer). – gandaliter Feb 25 '15 at 17:39
  • "Fast ... PHP" you must be joking. PHP clearly does not have the reputation of being fast, in particular not when you have complex algorithms and data structures. – Has QUIT--Anony-Mousse Feb 25 '15 at 22:49
  • Also, your problem isn't well defined. Do you want the **connected components** or the **cliques**? These need quite different algorithms (but you'll find neither as "clustering") – Has QUIT--Anony-Mousse Feb 25 '15 at 22:51

1 Answer

You have a graph represented as an edge list (each pair is one edge), and you want to convert it to a list of the connected components of the graph. A straightforward way to do this is to build sets of connected nodes, merging the two sets joined by each edge until no edges remain.

To do this in PHP:

  1. Convert your input to a multidimensional array ([["element1", "element2"],["element2","element3"]] etc.)
  2. Initialise a node list in a map representation with each node pointing to a set containing just that node (e.g. ["element1" => ["element1"],"element2" => ["element2"]] etc.)
  3. For each pair in the array from (1), merge the sets of its two elements in the map from (2), and point both elements, as well as every other element in either set, to the newly merged set
  4. Put all the sets from (3) into a set (of sets), so you get each one only once
  5. Convert each set to your desired output format

You will want to use the reference operator (&) so that (3) re-uses the same arrays instead of copying them. The algorithm would be much easier to implement in Java or another language with more explicit hash map and hash set types.
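
For illustration, the steps above might be sketched roughly as follows. This is not a tested solution for the full 1.5 million pairs, only a minimal sketch under some assumptions: PHP 7.1 or later, a hypothetical function name `clusterPairs`, and integer set ids swapped in for `&` references, so that "pointing a node at a set" means storing the set's id. Step 2 is done lazily (a node gets its own set the first time it is seen), and the smaller set is always merged into the larger one to limit how often a node is re-pointed.

```php
<?php
// Hypothetical helper (not from the original answer): groups the elements
// of "a b" pairs into connected components.
function clusterPairs(array $pairs): array
{
    $setOf  = [];  // element => id of the set it currently belongs to
    $sets   = [];  // set id  => list of elements in that set
    $nextId = 0;

    foreach ($pairs as $pair) {
        // Step 1: split "element1 element2" into its two elements.
        [$a, $b] = explode(' ', $pair, 2);

        // Step 2 (lazily): an unseen element starts in a set of its own.
        foreach ([$a, $b] as $node) {
            if (!isset($setOf[$node])) {
                $setOf[$node]  = $nextId;
                $sets[$nextId] = [$node];
                $nextId++;
            }
        }

        // Step 3: merge the two sets, moving the smaller into the larger.
        $keep = $setOf[$a];
        $drop = $setOf[$b];
        if ($keep === $drop) {
            continue; // both elements are already in the same set
        }
        if (count($sets[$keep]) < count($sets[$drop])) {
            [$keep, $drop] = [$drop, $keep];
        }
        foreach ($sets[$drop] as $node) {
            $setOf[$node]  = $keep;   // re-point the node to the merged set
            $sets[$keep][] = $node;
        }
        unset($sets[$drop]);
    }

    // Steps 4 and 5: every surviving set appears exactly once in $sets;
    // join each one back into a space-separated string.
    return array_values(array_map(function (array $set): string {
        return implode(' ', $set);
    }, $sets));
}

// Usage with the six pairs from the question.
$array = [
    "element1 element2",
    "element2 element3",
    "element8 element4",
    "element8 element5",
    "element4 element5",
    "element6 element7",
];
print_r(clusterPairs($array));
```

For these six pairs the result is the three strings shown in the question's desired output. In general the order of elements inside a group depends on the merge order, so it should not be relied on.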

gandaliter
  • Thanks a lot for your suggestion. But I'm afraid I don't understand your point 2, 'initialise a map representation'. Would you have a code example? Thanks. – zwonoROM Feb 25 '15 at 18:28
  • The idea of the `[]`s was to show arrays (I know that's not what they look like in PHP but it's easier to see). All you're really doing at step 2 is producing a node list. Each of the elements should be in it exactly once, and independently of any pairings. – gandaliter Feb 25 '15 at 20:59