4

Suppose I have a matrix of distance cost, in which the cost of destiny and the cost of origin both need to be below a certain threshold amount - say, US 100 -- to share a link. My difficulty lies in achieving a common set after classifying these localities: A1 links (cost of destiny and origin below threshold) with A2 and (same thing) A3 and A4; A2 links with A1 and A4; A4 links with A1 and A2. So A1, A2 and A4 would be classified in the same group, as being the group with highest frequency of links between themselves. Below I set a matrix as an example:

    A1  A2  A3  A4  A5  A6  A7
A1  0   90  90  90  100 100 100
A2  80  0   90  90  90  110 100
A3  80  110 0   90  120 110 90
A4  90  90  110 0   90  100 90
A5  110 110 110 110 0   90  80
A6  120 130 135 100 90  0   90
A7  105 110 120 90  90  90  0

I am programming this with Stata and I haven't placed the matrix above in matrix form, as in mata. The column listing the letters A plus the number is a variable with the rownames of the matrix and the rest of the columns are named with each locality name (e.g. A1 and so on).

I have returned the list of links between each locality with the following code, which maybe I did it very "bruteforcelly" since I was in a hurry:

    clear all

    set more off

    //inputting matrix

    input A1 A2 A3 A4 A5 A6 A7
    0 90 90 90 100 100 100
    80 0 90 90 90 100 100
    80 110 0 90 120 110 90
    90 90 110 0 90 100 90
    110 110 110 110 0 90 90
    120 130 135 100 90 0 90
    105 110 120 90 90 90 0

    end

    //generate row variable

    gen locality=""

    forv i=1/7{

        replace locality="A`i'" in `i'

    }
    *

    order locality, first


    //generating who gets below the threshold of 100

    forv i=1/7{

        gen r_`i'=0

        replace r_`i'=1 if A`i'<100 & A`i'!=0

    }
    *

    //checking if both ways (origin and destiny below threshold)

    forv i=1/7{

        gen check_`i'=.

    forv j=1/7{

            local v=r_`i'[`j']

            local vv=r_`j'[`i']

            replace check_`i'=`v'+`vv' in `j'

                }

    *
        }
    *

    //creating list of links

    gen locality_x=""

    forv i=1/7{

        preserve

        local name = locality[`i']

        keep if check_`i'==2

        replace locality_x="`name'"

        keep locality locality_x

        save "C:\Users\user\Desktop\temp_`i'", replace

        restore

    }
    *

    use "C:\Users\user\Desktop\temp_1", clear

    forv i=2/7{

        append using "C:\Users\user\Desktop\temp_`i'"
    }
    *

    //now locality_x lists if A.1 has links with A.2, A.3 etc. and so on.
    //the dificulty lies in finding a common intersection between the groups.

Which returns the following listing:

locality_x  locality
A1  A2
A1  A3
A1  A4
A2  A1
A2  A4
A3  A1
A4  A1
A4  A2
A4  A7
A5  A6
A5  A7
A6  A5
A6  A7
A7  A4
A7  A5
A7  A6

I am trying to get familiar with set-intersection, but I haven't a clue of how to do this in Stata. I want to do something in which I could reprogram the threshold and find the common set. I would be thankful if you could produce a solution in R, given that I can program a bit in it.


A similar way of obtaining the list in R (as @user2957945 put in his answer below):

structure(c(0L, 80L, 80L, 90L, 110L, 120L, 105L, 90L, 0L, 110L, 
90L, 110L, 130L, 110L, 90L, 90L, 0L, 110L, 110L, 135L, 120L, 
90L, 90L, 90L, 0L, 110L, 100L, 90L, 100L, 90L, 120L, 90L, 0L, 
90L, 90L, 100L, 110L, 110L, 100L, 90L, 0L, 90L, 100L, 100L, 90L, 
90L, 80L, 90L, 0L), .Dim = c(7L, 7L), .Dimnames = list(c("A1", 
"A2", "A3", "A4", "A5", "A6", "A7"), c("A1", "A2", "A3", "A4", 
"A5", "A6", "A7")))

# get values less than threshold
id = m < 100 
# make sure both values are less than threshold, and dont include diagonal
m_new = (id + t(id) == 2) & m !=0 
# melt data and subset to keep TRUE values (TRUE if both less than threshold and not on diagonal)
result  = subset(reshape2::melt(m_new), value)
# reorder to match question results , if needed 
result[order(result[[1]], result[[2]]), 1:2] 

   Var1 Var2
8    A1   A2
15   A1   A3
22   A1   A4
2    A2   A1
23   A2   A4
3    A3   A1
4    A4   A1
11   A4   A2
46   A4   A7
40   A5   A6
47   A5   A7
34   A6   A5
48   A6   A7
28   A7   A4
35   A7   A5
42   A7   A6     

I'm also adding the "graph theory" tag since I believe this is not exactly a intersection problem, in which I could transform the list in vectors and use the intersect function in R. The code needs to produce a new id in which some localities must be in the same new id (group). As in the example above, if the set of A.1 has A.2 and A.4, A.2 has A.1 and A.4 and A.4 has A.1 and A.2, these three localities must be in the same id (group). In other words, I need the biggest intersection grouping of each locality. I understand that there might problems with a different matrix, such as A.1 has A.2 and A.6, A.2 has A.1 and A.6 and A.6 has A.1 and A.2 (but A.6 does not have A.4, considering the first example above still). In that situation, I welcome a solution of adding A.6 to the grouping or some other arbitrary one, in which the code just groups the first set together, removes A.1, A.2 and A.4 from the listing, and leaves A.6 with no new grouping.

John Doe
  • 212
  • 1
  • 9
  • 28
  • In your example A4 links with A7, and A7 links with A5 and A6. So, the whole graph is connected. Do you then want all A in the same group? – Andrei May 15 '17 at 10:41
  • The whole graph is connected -- yes --, but what I want to find is the highest intersection set between the localities, as in the example I discussed. – John Doe May 15 '17 at 17:25
  • Why not A5, A6, and A7? They also have 3 links between themselves. Are you looking for all complete subgraphs of your graph? Or do I still miss something? – Andrei May 15 '17 at 18:34
  • A5, A6 and A7 would be grouped together too. Fiding a complete subgraph could be part of the solution, but I want the biggest grouping possible, given the links. – John Doe May 15 '17 at 19:19
  • I am still trying to get my head around what you mean exactly. Do you want a subgraph where the ratio of links to nodes is the largest? Do you want a connected subgraph where the number of nodes is the largest? When you say "grouping," what do you mean by it? – Andrei May 15 '17 at 19:41
  • I'm not really familiar with graph theory, so bear with me. I will try to put it like this: for each locality (node), I believe I want to find its biggest complete subgraph. The set of A1, A2 and A4 is the biggest subgraph for them all -- please correct me if I made any mistake. As such, they're classified together. – John Doe May 16 '17 at 11:57

2 Answers2

2

In R you can do

# get values less then threshold
id = m < 100 
# make sure both values are less then threshold, and dont include diagonal
m_new = (id + t(id) == 2) & m !=0 
# melt data and subset to keep TRUE values (TRUE if both less than threshold and not on diagonal)
result  = subset(reshape2::melt(m_new), value)
# reorder to match question results , if needed 
result[order(result[[1]], result[[2]]), 1:2] 

   Var1 Var2
8    A1   A2
15   A1   A3
22   A1   A4
2    A2   A1
23   A2   A4
3    A3   A1
4    A4   A1
11   A4   A2
46   A4   A7
40   A5   A6
47   A5   A7
34   A6   A5
48   A6   A7
28   A7   A4
35   A7   A5
42   A7   A6

.

structure(c(0L, 80L, 80L, 90L, 110L, 120L, 105L, 90L, 0L, 110L, 
90L, 110L, 130L, 110L, 90L, 90L, 0L, 110L, 110L, 135L, 120L, 
90L, 90L, 90L, 0L, 110L, 100L, 90L, 100L, 90L, 120L, 90L, 0L, 
90L, 90L, 100L, 110L, 110L, 100L, 90L, 0L, 90L, 100L, 100L, 90L, 
90L, 80L, 90L, 0L), .Dim = c(7L, 7L), .Dimnames = list(c("A1", 
"A2", "A3", "A4", "A5", "A6", "A7"), c("A1", "A2", "A3", "A4", 
"A5", "A6", "A7")))
user2957945
  • 2,353
  • 2
  • 21
  • 40
  • i'm not familiar with the `structure` function. Can you describe what it does? – John Doe May 13 '17 at 21:04
  • all r objects have a structure. But here it is used as a format for the data so that can be easily be copied and pasted from here in to others R session, without having to mess about with getting the data from your question in to a format that is easy to use. It doesnt relate to the actual answer. You can get this by using `dput` eg `dput(iris[1:3, 1:3])` – user2957945 May 13 '17 at 21:10
  • I will incorporate your answer hoping someone can answer about finding a common intersection between the listing. – John Doe May 13 '17 at 21:16
1

Assuming what you want are the largest complete subgraphs, you can use the igraph package:

# Load necessary libraries
library(igraph)

# Define global parameters
threshold <- 100

# Compute the adjacency matrix
# (distances in both directions need to be smaller than the threshold)
am <- m < threshold & t(m) < threshold

# Make an undirected graph given the adjacency matrix
# (we set diag to FALSE so as not to draw links from a vertex to itself)
gr <- graph_from_adjacency_matrix(am, mode = "undirected", diag = FALSE)

# Find all the largest complete subgraphs
lc <- largest_cliques(gr)

# Output the list of complete subgraphs as a list of vertex names
lapply(lc, (function (e) e$name))

As far as I know, there is no similar functionality in Stata. However, if you were looking for the largest connected subgraph (which is the whole graph in your case), then you could have used clustering commands in Stata (i.e. clustermat).

Andrei
  • 2,585
  • 1
  • 14
  • 17