3

I have the following complete weighted graph, where each edge weight represents the probability that its two endpoints belong to the same category. I know a priori the category to which some of the vertices belong; how can I classify every other vertex?

[image: the complete weighted graph]

In more detail, I can describe the problem as follows: of all the vertices N and clusters C, there is a set of nodes for which we know for sure the specific cluster each belongs to: P(v_n|C_n) = 1. From the given graph we also know, for each node, the probability of every other node belonging to the same cluster as it: P(v_{n1} ∩ C_{n2}). From this, how can we estimate the cluster for every other node?

  • What are the probabilities out of (100%)? Do you have some way of checking whether your answer is correct? Do you want one clustering, or is the goal to sample from probable clusterings? – templatetypedef Jan 04 '16 at 21:41
  • Seems like a variant of [page rank](https://en.wikipedia.org/wiki/PageRank) (or weighted page rank). After n->infinity steps, the probability a random surfer will be at a vertex is (I think?) the probability the vertex is in the same class. – amit Jan 04 '16 at 21:50
  • It sounds like you want something like a [multiway cut](https://courses.engr.illinois.edu/cs598csc/sp2009/lectures/lecture_7.pdf). – David Eisenstat Jan 05 '16 at 00:22
  • Not a bad question for a first try. +1 and welcome here. But beware: your task is not fully defined; it lacks large pieces of info. You'd better add them. – Gangnus Jan 05 '16 at 08:51
  • @Gangnus Thank you. I wanted to keep the question as brief as possible. I'm still reading your response, but in the meanwhile I will add a bit more detail as requested. – Eduardo Gonçalves Jan 07 '16 at 12:35
  • @templatetypedef We can assume so. The graph is just one that I found that can serve as an example. The point is to classify every other vertex based on the given set of nodes where we know the cluster each one belongs to. I have edited the question to include a bit more detail on this. – Eduardo Gonçalves Jan 07 '16 at 13:09

2 Answers

1

Let w_i be a vector where w_i[j] is the probability of node j being in the cluster at iteration i.

We define w_i:

w_0[j] =  1       j is given node in the class
          0       otherwise
w_{i}[j] = P(j | w_{i-1})

Where P(j | w_{i-1}) is the probability of j being in the cluster, assuming we know, for each other node k, the probability that it is in the cluster, namely w_{i-1}[k].

We can calculate the above probability:

P(j | w_{i-1}) = 1- (1- w_{i-1}[0]*c(0,j))*(1- w_{i-1}[1]*c(1,j))*...*(1- w_{i-1}[n-1]*c(n-1,j))

Here:

  • w_{i-1} is the output of last iteration.
  • c(x,y) is the weight of edge (x,y)
  • c(x,x) = 1

Repeat until convergence. In the converged vector (call it w), the probability of j being in the cluster is w[j].


Explanation for the probability function:

In order for a node NOT to be in the set, all the other nodes must "decide" not to share it.
So, the probability of that happening is:

(1- w_{i-1}[0]*c(0,j))*(1- w_{i-1}[1]*c(1,j))*...*(1- w_{i-1}[n-1]*c(n-1,j))
      ^                            ^                       ^
node 0 doesn't share      node 1 doesn't share     node n-1 doesn't share

In order to be in the class, at least one node needs to "share", so the probability of that happening is the complement, which is the formula we derived for P(j | w_{i-1}).
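
As a concrete illustration (my own sketch, not code from the answer), the iteration can be written as follows, assuming the graph is given as an n x n matrix `weights` with `weights[x][y] = c(x, y)` and the nodes known to be in the cluster as the index set `seeds` (both names are mine):

    # Illustrative sketch of the iteration described above; not the answer's code.
    import numpy as np

    def propagate(weights, seeds, tol=1e-9, max_iter=1000):
        """Iterate w_i[j] = 1 - prod_k (1 - w_{i-1}[k] * c(k, j)) until convergence."""
        c = np.array(weights, dtype=float)
        np.fill_diagonal(c, 1.0)               # c(x, x) = 1
        n = c.shape[0]

        w = np.zeros(n)
        w[list(seeds)] = 1.0                   # w_0[j] = 1 for the given nodes, 0 otherwise

        for _ in range(max_iter):
            # elementwise: 1 - product over k of (1 - w[k] * c(k, j))
            w_new = 1.0 - np.prod(1.0 - w[:, None] * c, axis=0)
            w_new[list(seeds)] = 1.0           # keep the known members fixed
            if np.max(np.abs(w_new - w)) < tol:
                break
            w = w_new
        return w_new                           # may not have converged if max_iter was hit

    # toy usage: 3 nodes, node 0 known to be in the cluster
    weights = [[1.0, 0.8, 0.3],
               [0.8, 1.0, 0.5],
               [0.3, 0.5, 1.0]]
    print(propagate(weights, seeds={0}))

Running this once per known class and assigning each node to the class with the highest converged value would then give a hard classification.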

amit
  • Making iterations without knowing whether the solution exists or is stable is absolutely senseless. – Gangnus Jan 05 '16 at 08:48
  • @amit Thanks for the detailed response. I'm still trying to understand the proposed solutions. Please tell me if I'm wrong; if we know that node n is in the cluster: P(v_n|C_n)=1, the probability of node j being in the cluster of n: P(v_{n+j}|C_n)=1*P(v_{n+1}∩C_n)*P(v_{n+2}∩C_{n+1})*P(v_{n+3}∩C_{n+2})... ; where P(v_x∩C_y) is what you call c(x, y). What is the reason behind this? – Eduardo Gonçalves Jan 07 '16 at 12:20
  • @amit To clarify my last comment, my question is whether what I stated is an implication of your distribution function. Also, if I have N nodes for which I know the cluster a priori, I will get different values depending on the specific node I start iterating from; how would you solve that? Additionally, my assumption is that you start with the already classified nodes and iterate from there, right? – Eduardo Gonçalves Jan 07 '16 at 12:31
1

You should start from the definition of the result: how should the probabilities of belonging be presented?

The result, IMHO, should be a set of categories and a table: rows for vertices, columns for categories, and in each cell the probability that the vertex belongs to that category.

Your graph can determine some probabilities of belonging only if you already have some known starting probabilities, i.e. that table would already be partly filled.

While filling the table according to the starting values and the edge weights, we will surely reach situations where we get different probabilities for the same cell by arriving at it along different paths. One more point should be settled: can we change the starting values in the table, or are they fixed? The same question applies to the weights of the edges.

As it stands, the task is only partly defined, and that part is very, very small. You don't even know the number of categories!

After you have set all these rules and numbers, the rest is quite straightforward: use the Gauss method of least squares. As for the iterative approach, be careful: you don't know beforehand whether a solution exists or whether it is stable. If not, the iteration won't converge, and the whole piece of code you wrote for it is for nothing. With the least-squares method you get a set of linear equations, and the standard algorithms solve them for all cases. At the end you have not only the solution but also the possible error for every final value.
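
The answer does not spell out which linear equations to set up, so the sketch below is only one possible reading of the suggestion: every edge (i, j) contributes an equation, weighted by c(i, j), saying the two membership probabilities should agree; the nodes with known labels are pinned by heavily weighted equations; and the resulting overdetermined system is solved by ordinary least squares (here with `numpy.linalg.lstsq`). The names `least_squares_labels`, `weights`, and `known` are mine:

    # Illustrative sketch only; the answer does not specify these equations.
    import numpy as np

    def least_squares_labels(weights, known):
        """weights: n x n symmetric weight matrix; known: dict node -> 0.0 or 1.0."""
        c = np.asarray(weights, dtype=float)
        n = c.shape[0]
        rows, rhs = [], []

        # soft constraint per edge: c(i, j) * (p_i - p_j) = 0
        for i in range(n):
            for j in range(i + 1, n):
                row = np.zeros(n)
                row[i], row[j] = c[i, j], -c[i, j]
                rows.append(row)
                rhs.append(0.0)

        # heavily weighted constraints pinning the known nodes
        BIG = 1e6
        for k, value in known.items():
            row = np.zeros(n)
            row[k] = BIG
            rows.append(row)
            rhs.append(BIG * value)

        solution, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
        return np.clip(solution, 0.0, 1.0)

    # toy usage: node 0 known to be in the category, node 2 known not to be
    print(least_squares_labels([[1.0, 0.8, 0.3],
                                [0.8, 1.0, 0.5],
                                [0.3, 0.5, 1.0]], known={0: 1.0, 2: 0.0}))

Pinning the known nodes with large weights keeps the sketch short; in practice you would substitute the known values and solve only for the unknowns, and standard least-squares machinery can then also provide the per-value error estimates the answer mentions.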

Gangnus
  • Thanks for the response. I'm still new at this, so let me ask you: what do you consider to be a stable solution of the linear equations, given that I don't know the correct answer? – Eduardo Gonçalves Jan 08 '16 at 07:45