Grouping numbers based on occurrences?

Question

Given the following three sequences of numbers, I would like to figure out how to group the numbers to find the closest relations between them.

1,2,3,4
4,3,5
2,1,3
...

I'm not sure what the algorithm(s) I'm looking for are called, but we can see stronger relations with some of the numbers than with others.

These numbers appear together twice:

1 & 2
1 & 3
2 & 3
3 & 4

Together once:

1 & 4
2 & 4
3 & 5
4 & 5

So for example, we can see there must be a relationship between 1, 2, & 3 since they all appear together at least twice. You could also say that 3 & 4 are closely related since they also appear twice. However, the algorithm might pick [1,2,3] (over [3,4]) since it's a bigger grouping (more inclusive).

We can form any of the following groupings if we stick the numbers used most often together in a group:

[1,2,3] & [4,5]
[1,2]   & [3,4]   & [5]
[1,2]   & [3,4,5]
[1,2]   & [3,4]   & [5]

If duplicates are allowed, you could even end up with the following groups:

[1,2,3,4] [1,2,3] [3,4] [5]

I can't say which grouping is most "correct", but all four of these combos find different ways of semi-correctly grouping the numbers. I'm not looking for a specific grouping - just a general cluster algorithm that works fairly well and is easy to understand.

I'm sure there are many other ways to use the occurrence count to group them as well. What would be a good base grouping algorithm for these? Samples in Go, Javascript, or PHP are preferred.

I see two votes to close this question because it is too broad. May I ask what is broad about this question? I'm not sure how to simplify this task any further. — Xeoncross, Apr 18 '15 at 17:59
It's called correlation clustering. Create a graph with numbers 1 .. 5 as nodes, weight the edges by number of times a pair appears together. I'm sure there are algorithms out there, but it's not such a tidy and well-defined problem. — Edward Doolittle, Apr 18 '15 at 18:07
I don't follow how you arrive at the final output. You say it 'could produce any of the following groupings'; which one is most correct and why? Do you really want an algorithm that could arbitrarily produce one of 4 conflicting results? — Patrick M, Apr 20 '15 at 20:10
I think this is an interesting question, but it's very unclear. Your input appears to be a series of lists of integers of varying size. What exactly is the output you are looking for? A number tree, another series of lists, how many times can each number appear in the output, etc.? — Elliot Nelson, Apr 21 '15 at 22:29
You need to specify your input and desired output a bit more formally before any useful answer can be given. — Asad Saeeduddin, Apr 21 '15 at 22:35
In your last example, can you clarify what exactly you mean by the `,` operator? You don't seem to be using it to mean what it usually means. It looks like you're using `,` to mean OR, which is usually represented by the symbol `|` or `+` — slebetman, Apr 22 '15 at 07:01
I updated the example. Since I do not know which algorithm to use - I cannot know what the "correct" result should be so I listed several examples. I'm not looking for an answer - *I'm looking for a method*. I would like to find a decent cluster algorithm that could give me fuzzy grouping of any kind to help make sense of the numbers and their relations to each other. — Xeoncross, Apr 22 '15 at 14:32
@Xeoncross Several people have asked you to clarify what your output means, but you've been ignoring these questions. Could you please clarify your notation? What do `[]` `,` and `&` mean here? Even if you don't know what the exact metric is by which the numbers are grouped, could you at least give a concise plain English description of what property of the input you're trying to measure? "Find the closest relations between them" is too vague. — Asad Saeeduddin, Apr 22 '15 at 15:38
Sorry, `&` means "and" while a `,` is a comma and is used when separating items in an array/list `[1,2,3]`. Square brackets are used to mark a group/list/array of items `[1....6]`. So when I said `[1,2,3] & [4,5]` I meant *"two groups of items, the first containing 1, 2, & 3 and the second group containing 4 & 5."* — Xeoncross, Apr 22 '15 at 16:32
@Xeoncross Ok, so how did you arrive at that grouping? Put another way, what question about the input is `[1,2,3] & [4,5]` an answer to? — Asad Saeeduddin, Apr 22 '15 at 16:48

Michael Laszlo · Answer 1 · 2015-04-22T06:52:21.057

Each of your three sequences can be understood as a clique in a multigraph. Within a clique, every vertex is connected to every other vertex.

The following graph represents your sample case with the edges in each clique colored red, blue, and green, respectively.

[Figure: multigraph with five vertices and three cliques]

As you have already shown, we can classify pairs of vertices according to the number of edges between them. In the illustration, we can see that four pairs of vertices are connected by two edges each, and four other pairs of vertices are connected by one edge each.

We can go on to classify vertices according to the number of cliques in which they appear. In some sense we are ranking vertices according to their connectedness. A vertex that appears in k cliques can be thought of as connected to the same degree as other vertices that appear in k cliques. In the image, we see three groups of vertices: vertex 3 appears in three cliques; vertices 1, 2, and 4 each appear in two cliques; vertex 5 appears in one clique.

The Go program below computes the edge classification as well as the vertex classification. The input to the program contains, on the first line, the number of vertices n and the number of cliques m. We assume that the vertices are numbered from 1 to n. Each of the succeeding m lines of input is a space-separated list of vertices belonging to a clique. Thus, the problem instance given in the question is represented by this input:

5 3
1 2 3 4
4 3 5
2 1 3

The corresponding output is:

Number of edges between pairs of vertices:
    2 edges: (1, 2) (1, 3) (2, 3) (3, 4)
    1 edge:  (1, 4) (2, 4) (3, 5) (4, 5)

Number of cliques in which a vertex appears:
    3 cliques: 3
    2 cliques: 1 2 4
    1 clique:  5

And here is the Go program:

package main

import (
        "bufio"
        "fmt"
        "os"
        "strconv"
        "strings"
)

func main() {
        // Set up input and output.
        reader := bufio.NewReader(os.Stdin)
        writer := bufio.NewWriter(os.Stdout)
        defer writer.Flush()

        // Get the number of vertices and number of cliques from the first line.
        line, err := reader.ReadString('\n')
        if err != nil {
                fmt.Fprintf(os.Stderr, "Error reading first line: %s\n", err)
                return
        }
        var numVertices, numCliques int
        numScanned, err := fmt.Sscanf(line, "%d %d", &numVertices, &numCliques)
        if numScanned != 2 || err != nil {
                fmt.Fprintf(os.Stderr, "Error parsing input parameters: %s\n", err)
                return
        }

        // Initialize the edge counts and vertex counts.
        edgeCounts := make([][]int, numVertices+1)
        for u := 1; u <= numVertices; u++ {
                edgeCounts[u] = make([]int, numVertices+1)
        }
        vertexCounts := make([]int, numVertices+1)

        // Read each clique and update the edge counts.
        for c := 0; c < numCliques; c++ {
                line, err = reader.ReadString('\n')
                if err != nil {
                        fmt.Fprintf(os.Stderr, "Error reading clique: %s\n", err)
                        return
                }
                // Fields splits on any run of whitespace, so repeated spaces or tabs are tolerated.
                tokens := strings.Fields(line)
                clique := make([]int, len(tokens))
                for i, token := range tokens {
                        u, err := strconv.Atoi(token)
                        if err != nil {
                                fmt.Fprintf(os.Stderr, "Atoi error: %s\n", err)
                                return
                        }
                        vertexCounts[u]++
                        clique[i] = u
                        for j := 0; j < i; j++ {
                                v := clique[j]
                                edgeCounts[u][v]++
                                edgeCounts[v][u]++
                        }
                }
        }

        // Compute the number of edges between each pair of vertices.
        count2edges := make([][][]int, numCliques+1)
        for u := 1; u < numVertices; u++ {
                for v := u + 1; v <= numVertices; v++ {
                        count := edgeCounts[u][v]
                        count2edges[count] = append(count2edges[count],
                                []int{u, v})
                }
        }
        writer.WriteString("Number of edges between pairs of vertices:\n")
        for count := numCliques; count >= 1; count-- {
                edges := count2edges[count]
                if len(edges) == 0 {
                        continue
                }
                label := "edge"
                if count > 1 {
                        label += "s:"
                } else {
                        label += ": "
                }
                fmt.Fprintf(writer, "%5d %s", count, label)
                for _, edge := range edges {
                        fmt.Fprintf(writer, " (%d, %d)", edge[0], edge[1])
                }
                writer.WriteString("\n")
        }

        // Group vertices according to the number of clique memberships.
        count2vertices := make([][]int, numCliques+1)
        for u := 1; u <= numVertices; u++ {
                count := vertexCounts[u]
                count2vertices[count] = append(count2vertices[count], u)
        }
        writer.WriteString("\nNumber of cliques in which a vertex appears:\n")
        for count := numCliques; count >= 1; count-- {
                vertices := count2vertices[count]
                if len(vertices) == 0 {
                        continue
                }
                label := "clique"
                if count > 1 {
                        label += "s:"
                } else {
                        label += ": "
                }
                fmt.Fprintf(writer, "%5d %s", count, label)
                for _, u := range vertices {
                        fmt.Fprintf(writer, " %d", u)
                }
                writer.WriteString("\n")
        }
}

*"In some sense we are ranking vertices according to their connectedness"*... While this is good related research - it doesn't actually tackle the problem of grouping the input - just calculating meta-data about the input which *could help* in actual grouping/clustering. It's a great start (+1), but [consider the following input](http://pastie.org/10107600). — Xeoncross, Apr 22 '15 at 15:17

Uvelichitel · Accepted Answer · 2015-04-25T17:38:53.593

As has already been mentioned, this is about cliques. If you want an exact answer you will face the Maximum Clique Problem, which is NP-hard (its decision version is NP-complete). So everything below makes sense only if the alphabet of your symbols (numbers) has a reasonable size. In that case a straightforward, not very optimised algorithm for the Maximum Clique Problem would be, in pseudo-code:

Function Main
    Cm ← ∅                   // the maximum clique
    Clique(∅,V)              // V vertices set
    return Cm
End function Main

Function Clique(set C, set P) // C the current clique, P the candidate set
    if (|C| > |Cm|) then
        Cm ← C
    End if
    if (|C| + |P| > |Cm|) then
        for all p ∈ P in predetermined order, do
            P ← P \ {p}
            Cp ← C ∪ {p}
            Pp ← P ∩ N(p)       // N(p) set of the vertices adjacent to p
            Clique(Cp,Pp)
        End for
    End if
End function Clique

Since Go is my language of choice, here is an implementation:

package main

import (
    "bufio"
    "fmt"
    "sort"
    "strconv"
    "strings"
)

var adjmatrix map[int]map[int]int = make(map[int]map[int]int)
var Cm []int = make([]int, 0)
var frequency int


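// Filter state: res collects the cliques found, and filter remembers the
// bitmask of every vertex set already reported so each set is kept only once.
type resoult [][]int
var res resoult
var filter map[int]bool = make(map[int]bool)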


//That's for sorting
func (r resoult) Less(i, j int) bool {
    return len(r[i]) > len(r[j])
}

func (r resoult) Swap(i, j int) {
    r[i], r[j] = r[j], r[i]
}

func (r resoult) Len() int {
    return len(r)
}
//That's for sorting


//Work done here
func Clique(C []int, P map[int]bool) {
    if len(C) >= len(Cm) {

        Cm = make([]int, len(C))
        copy(Cm, C)
    }
    if len(C)+len(P) >= len(Cm) {
        for k, _ := range P {
            delete(P, k)
            Cp := make([]int, len(C)+1)
            copy(Cp, append(C, k))
            Pp := make(map[int]bool)
            for n, m := range adjmatrix[k] {
                _, ok := P[n]
                if ok && m >= frequency {
                    Pp[n] = true
                }
            }
            Clique(Cp, Pp)

            // Record Cp only if this vertex set has not been reported yet.
            // The set is encoded as a bitmask, so this works for small vertex numbers only.
            bf := 0
            for _, v := range Cp {
                bf += 1 << uint(v)
            }
            _, ok := filter[bf]
            if !ok {
                filter[bf] = true
                res = append(res, Cp)
            }
        }
    }
}
//Work done here

func main() {
    var toks []string
    var numbers []int
    var number int


//Input parsing
    StrReader := strings.NewReader(`1,2,3
4,3,5
4,1,6
4,2,7
4,1,7
2,1,3
5,1,2
3,6`)
    scanner := bufio.NewScanner(StrReader)
    for scanner.Scan() {
        toks = strings.Split(scanner.Text(), ",")
        numbers = []int{}
        for _, v := range toks {
            number, _ = strconv.Atoi(v)
            numbers = append(numbers, number)

        }
        for k, v := range numbers {
            for _, m := range numbers[k:] {
                _, ok := adjmatrix[v]
                if !ok {
                    adjmatrix[v] = make(map[int]int)
                }
                _, ok = adjmatrix[m]
                if !ok {
                    adjmatrix[m] = make(map[int]int)
                }
                if m != v {
                    adjmatrix[v][m]++
                    adjmatrix[m][v]++
                    if adjmatrix[v][m] > frequency {
                        frequency = adjmatrix[v][m]
                    }
                }

            }
        }
    }
    //Input parsing

    P1 := make(map[int]bool)


    //Iterating for frequency of appearance in group
    for ; frequency > 0; frequency-- {
        for k, _ := range adjmatrix {
            P1[k] = true
        }
        Cm = make([]int, 0)
        res = make(resoult, 0)
        Clique(make([]int, 0), P1)
        sort.Sort(res)
        fmt.Print(frequency, "x-times ", res, " ")
    }
    //Iterating for frequency of appearing together
}

You can see it working, and play with the input data, at https://play.golang.org/p/ZiJfH4Q6GJ. But once more: this approach only suits an alphabet of reasonable size (the input data itself can be of any size).
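The filter in the program above keys each group by a bitmask of its members. A minimal standalone sketch of that idea follows, assuming every value lies in the range 0..63; `setKey` and `dedupe` are hypothetical names used for illustration and are not part of the answer's code.

```go
package main

import "fmt"

// setKey encodes a set of small non-negative integers as a bitmask, so two
// slices holding the same members in any order map to the same key.
// Assumption: every value is in the range 0..63.
func setKey(members []int) uint64 {
	var key uint64
	for _, v := range members {
		key |= 1 << uint(v)
	}
	return key
}

// dedupe keeps the first occurrence of each distinct set (hypothetical helper).
func dedupe(groups [][]int) [][]int {
	seen := make(map[uint64]bool)
	var out [][]int
	for _, g := range groups {
		k := setKey(g)
		if !seen[k] {
			seen[k] = true
			out = append(out, g)
		}
	}
	return out
}

func main() {
	groups := [][]int{{1, 2, 3}, {3, 2, 1}, {3, 4}, {1, 2, 3}}
	fmt.Println(dedupe(groups)) // [[1 2 3] [3 4]]
}
```

Unlike a true Bloom filter, this encoding is exact (no false positives), but it breaks down as soon as a value does not fit in the machine word.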

This looks like the answer I was looking for. It's a little too inclusive perhaps (allowing `1&2` to show up in 3x, 2x, and 1x clique) but it is a very nice algorithm. — Xeoncross, Apr 23 '15 at 15:04
@Xeoncross I added a kind of Bloom filter to clean up the output a bit: https://play.golang.org/p/ZiJfH4Q6GJ And of course you may choose your own representation of the results. — Uvelichitel, Apr 25 '15 at 17:33

score 3 · Answer 3 · answered Apr 23 '15 at 08:14

This problem often arises in rule mining when analyzing sales data: which items are bought together, so that they can be placed next to each other in the supermarket?

One class of algorithms I came across is association rule learning. One of its inherent steps is finding frequent itemsets, which matches your task. Apriori is one such algorithm, and you can find many more by searching for those keywords.
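To make the frequent-itemset idea concrete, here is a simplified level-wise search in Go. It is written in the spirit of Apriori (it only extends itemsets that are already frequent), but it omits Apriori's join and prune steps, so treat it as a sketch rather than a reference implementation. The names `frequentItemsets`, `support`, and `key` are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// key turns a sorted itemset into a string usable as a map key, e.g. "1,2,3".
func key(items []int) string {
	parts := make([]string, len(items))
	for i, v := range items {
		parts[i] = fmt.Sprint(v)
	}
	return strings.Join(parts, ",")
}

// support counts the transactions that contain every item of the itemset.
func support(transactions []map[int]bool, items []int) int {
	n := 0
	for _, t := range transactions {
		all := true
		for _, v := range items {
			if !t[v] {
				all = false
				break
			}
		}
		if all {
			n++
		}
	}
	return n
}

// frequentItemsets returns every itemset contained in at least minSupport
// rows, keyed by its sorted members and mapped to its support.
func frequentItemsets(rows [][]int, minSupport int) map[string]int {
	transactions := make([]map[int]bool, len(rows))
	universe := map[int]bool{}
	for i, row := range rows {
		transactions[i] = map[int]bool{}
		for _, v := range row {
			transactions[i][v] = true
			universe[v] = true
		}
	}
	var items []int
	for v := range universe {
		items = append(items, v)
	}
	sort.Ints(items)

	result := map[string]int{}
	// Level 1: frequent single items.
	var level [][]int
	for _, v := range items {
		if s := support(transactions, []int{v}); s >= minSupport {
			level = append(level, []int{v})
			result[key([]int{v})] = s
		}
	}
	// Level k+1: extend each frequent k-itemset by one larger item. A superset
	// of an infrequent set can never be frequent, so nothing is missed.
	for len(level) > 0 {
		var next [][]int
		for _, set := range level {
			last := set[len(set)-1]
			for _, v := range items {
				if v <= last {
					continue
				}
				cand := append(append([]int{}, set...), v)
				if s := support(transactions, cand); s >= minSupport {
					next = append(next, cand)
					result[key(cand)] = s
				}
			}
		}
		level = next
	}
	return result
}

func main() {
	rows := [][]int{{1, 2, 3, 4}, {4, 3, 5}, {2, 1, 3}}
	sets := frequentItemsets(rows, 2)
	fmt.Println(sets["1,2,3"]) // 2
}
```

With the question's three sequences and a minimum support of 2, the largest frequent itemset is 1,2,3, which matches the intuition in the question that 1, 2, and 3 belong together.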

skazska · Answer 4 · 2015-04-27T10:46:17.883

It would be better if you described the goal of such a grouping. If not, I can suggest what I think is the simplest approach, though it is also the most limited. It is not suitable if you need to count a huge amount of widely spread numbers (like 1, 999999, 31), or big or nonpositive numbers. You can rearrange the number sets into array positions like so:

  |1|2|3|4|5|6|  - numbers as array positions
===============
*1|1|1|1|1|0|0|  *1
*2|0|0|1|1|1|0|  *2
*4|1|1|1|0|0|0|  *4
===============
 +|2|2|3|2|1|0|  - occurrence counters
 *|5|5|7|3|2|0|  - row masks; for the first column (number 1) the mask is 1*1 + 1*4 = 5

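Here you can see in the + row that the most frequent number is [3], then [1,2] and [4], and then [5]. The masks in the * row also let you identify and distinguish the co-occurrence of different combinations: numbers with the same mask, such as 1 and 2, always occur in the same sequences.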

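    function grps(a) {
      var r = [];               // r[count][mask] -> numbers with that count and mask
      var sum = [], mask = [];  // per-number occurrence count and row mask
      var max = 0;
      var i, j, val;
      for (i = 0; i < a.length; i++) {
        for (j = 0; j < a[i].length; j++) {
          val = a[i][j];
          sum[val] = (sum[val] || 0) + 1;
          // Add 2^i to mark that val occurs in row i
          // (assumes a number is not repeated within one row).
          mask[val] = (mask[val] || 0) + Math.pow(2, i);
          if (val > max) { max = val; }
        }
      }
      for (j = 0; j <= max; j++) {
        if (!sum[j]) { continue; }  // this number never occurred
        r[sum[j]] = r[sum[j]] || [];
        r[sum[j]][mask[j]] = r[sum[j]][mask[j]] || [];
        r[sum[j]][mask[j]].push(j);
      }
      return r;
    }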
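
Since the question asked for Go samples, the same counting idea can be sketched in Go. This is only an illustration under stated assumptions: there are at most 64 input rows (one bit per row), and `countsAndMasks` and `groupByMask` are hypothetical names, not part of the answer above.

```go
package main

import (
	"fmt"
	"sort"
)

// countsAndMasks returns, for every number seen, how many rows contain it and
// a bitmask whose bit i is set when the number occurs in row i.
// Assumption: there are at most 64 rows.
func countsAndMasks(rows [][]int) (map[int]int, map[int]uint64) {
	counts := map[int]int{}
	masks := map[int]uint64{}
	for i, row := range rows {
		bit := uint64(1) << uint(i)
		for _, v := range row {
			if masks[v]&bit == 0 { // count a number only once per row
				masks[v] |= bit
				counts[v]++
			}
		}
	}
	return counts, masks
}

// groupByMask puts together the numbers that occur in exactly the same rows.
func groupByMask(masks map[int]uint64) map[uint64][]int {
	groups := map[uint64][]int{}
	for v, m := range masks {
		groups[m] = append(groups[m], v)
	}
	for _, g := range groups {
		sort.Ints(g) // map iteration order is random, so sort for stable output
	}
	return groups
}

func main() {
	rows := [][]int{{1, 2, 3, 4}, {4, 3, 5}, {2, 1, 3}}
	counts, masks := countsAndMasks(rows)
	fmt.Println(counts[3], masks[3])   // 3 7
	fmt.Println(groupByMask(masks)[5]) // [1 2]
}
```

Using maps instead of arrays indexed by value sidesteps the limitation on large or widely spread numbers, at the cost of losing the table-like layout.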
