Use unique pairs of column values to generate dyad identifiers in the dataframe

Question

I want to generate dyad identifiers for a bilateral trade flow dataframe (coded in from, to, and amount-traded format) so that I can use these identifiers in further statistical analysis.

My example data is provided below. From it I have extracted the unique country dyads that involve the US.

# load the example data
trade_flow <- readRDS(gzcon(url("https://www.dropbox.com/s/ep7xldoq9go4f0g/trade_flow.rds?dl=1")))
# extract country dyads
country_dyad <- trade_flow[, c("from", "to")]
# identify unique pairs
up <- country_dyad[!duplicated(t(apply(country_dyad, 1, sort))),]
# extract only unique pairs that involve the US
up <- up[(up$from == "USA") | (up$to == "USA"), ]

## how can I use the unique pair object (up) to generate dyad identifiers and include them as a new column in the trade_flow dataframe

The next step is to match these unique dyad pairs against the from and to columns of the original dataframe (trade_flow) and add the resulting dyad identifiers to trade_flow as a new column (say, dyad). It should look something like the format below, in which each unique dyad is coded as a unique numerical value. I would be grateful for any help with this.

from    to  trade_flow  dyad
USA   ITA      5100       2
USA   UKG      4000       1
USA   GMY     17000       3
USA   ITA      4500       2
USA   JPN      2900       4
USA   UKG      6700       1
USA   ROK      7000       5
USA   UKG      2300       1
USA   SAF      1500       6
IND   USA      2400       7

G. Grothendieck · Accepted Answer · 2019-08-11T19:31:43.183

Assuming that flows are directional, so that A/B and B/A are different flows, paste the from and to columns together and convert the result to a factor. The internal codes that a factor uses are 1, 2, ..., no_of_levels; use as.numeric to extract them.

transform(DF, dyad = as.numeric(factor(paste(from, to))))

giving:

   from  to trade_flow dyad
1   USA ITA       5100    3
2   USA UKG       4000    7
3   USA GMY      17000    2
4   USA ITA       4500    3
5   USA JPN       2900    4
6   USA UKG       6700    7
7   USA ROK       7000    5
8   USA UKG       2300    7
9   USA SAF       1500    6
10  IND USA       2400    1

Applying assignments made on subset to whole

Suppose we want to assign dyads using only a subset of the rows of DF, for example head(DF), and then apply those assignments to all of DF, using NA for flows in DF that do not appear in the subset. First assign the dyads on the subset as above, storing the result in DF0 (first line below). Then drop the trade_flow column from DF0 and take its unique rows with unique. Finally, merge that with DF on the first two columns, using all.x = TRUE so that unmatched rows of DF are not dropped.

DF0 <- transform(head(DF), dyad = as.numeric(factor(paste(from, to))))
merge(DF, unique(DF0[-3]), all.x = TRUE, by = 1:2)

giving:

   from  to trade_flow dyad
1   IND USA       2400   NA
2   USA GMY      17000    1
3   USA ITA       4500    2
4   USA ITA       5100    2
5   USA JPN       2900    3
6   USA ROK       7000   NA
7   USA SAF       1500   NA
8   USA UKG       4000    4
9   USA UKG       2300    4
10  USA UKG       6700    4
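
As an alternative sketch that is not part of the original answer, the same "assign on a subset, apply to the whole" idea can be written with match, which keeps the original row order and makes it easy to recode unmatched rows as 0 instead of NA. The toy data, the amount column, and the names key_all, key_up, and dyad0 below are hypothetical and chosen only for illustration; up stands in for the asker's table of unique US dyads.

```r
# Hypothetical toy data containing one dyad (GMY -> ITA) that does not involve the US
trade_flow <- data.frame(
  from   = c("USA", "USA", "GMY", "IND", "USA"),
  to     = c("ITA", "UKG", "ITA", "USA", "ITA"),
  amount = c(5100, 4000, 800, 2400, 4500),
  stringsAsFactors = FALSE
)

# Unique directed pairs that should receive an identifier (here: those involving the US)
us_rows <- trade_flow$from == "USA" | trade_flow$to == "USA"
up <- unique(trade_flow[us_rows, c("from", "to")])

# Build one text key per row and look it up among the keys of the subset
key_all <- paste(trade_flow$from, trade_flow$to)
key_up  <- paste(up$from, up$to)

# match returns the position of each key in key_up, or NA if it is absent
trade_flow$dyad <- match(key_all, key_up)

# Optional: code unmatched rows as group 0 instead of NA
trade_flow$dyad0 <- ifelse(is.na(trade_flow$dyad), 0L, trade_flow$dyad)

trade_flow$dyad
#[1]  1  2 NA  3  1
```

Identifiers produced this way follow the order in which pairs appear in up, not the alphabetical order used by factor, so the numbers will generally differ from those in the merge output above.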

Note

Input in reproducible form:

Lines <- "from to trade_flow
USA   ITA      5100       
USA   UKG      4000       
USA   GMY     17000       
USA   ITA      4500       
USA   JPN      2900       
USA   UKG      6700       
USA   ROK      7000       
USA   UKG      2300       
USA   SAF      1500       
IND   USA      2400"
DF <- read.table(text = Lines, header = TRUE)

Hi, thanks for your reply, but if I want to use the subsetted country dyads (`up`) to generate dyad identifiers on the "full data" (`trade_flow`) while leaving the rest of the observations (rows) as NA (or simply set them as group 0), what part of the code should I modify? Thanks. — Chris T., Aug 11 '19 at 17:57
If what you want to do is perform the assignment on a subset of rows and then apply that assignment to all the rows then see the added section in the answer. — G. Grothendieck, Aug 11 '19 at 19:26
It seems that if I treat dyad `i`-`j` as undirected and use that to match the `from`, `to` columns in the dataframe and generate a dyad identifier for all rows that involve `i`, `j` (including the `j`-`i` link), it would leave those rows where `from` = `j` & `to` = `i` unmatched. R would thus return an error message reporting a "differing number of rows" between the original dataframe and the list of unique dyads. — Chris T., Aug 12 '19 at 06:46
I guess another way to address my question is: if we assume the `i`-`j` dyad as undirected, the recommended method would count both `i`-`j` and `j`-`i` as different dyads and assign them each with a unique numeric indicator. Is there any way to assign the same dyad indicator for both `i`-`j` and `j`-`i` pairs? — Chris T., Aug 12 '19 at 07:33
Using `paste(pmin(as.character(from), as.character(to)), pmax(as.character(from), as.character(to)))` would cause the dyads to be assigned in an undirected manner. — G. Grothendieck, Aug 13 '19 at 14:00
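
The pmin/pmax suggestion in the last comment can be sketched as follows. This is a minimal illustration under assumed toy data, not code from the thread: the four rows and the name key are hypothetical, and the second row deliberately reverses the first so that the undirected behaviour is visible.

```r
# Hypothetical toy data; "ITA -> USA" is the reverse of "USA -> ITA"
DF <- data.frame(
  from       = c("USA", "ITA", "USA", "IND"),
  to         = c("ITA", "USA", "UKG", "USA"),
  trade_flow = c(5100, 4500, 4000, 2400),
  stringsAsFactors = FALSE
)

# Put the alphabetically smaller country first so that A/B and B/A give the same key;
# as.character guards against from/to being stored as factors
key <- with(DF, paste(pmin(as.character(from), as.character(to)),
                      pmax(as.character(from), as.character(to))))

# Factor codes of the order-independent key serve as undirected dyad identifiers
DF$dyad <- as.numeric(factor(key))

DF$dyad
#[1] 2 2 3 1
```

Rows 1 and 2 receive the same identifier because both reduce to the key "ITA USA".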

score 2 · Answer 2 · answered Aug 11 '19 at 16:51

Here is an option using base R

df1$dyad <- with(df1, as.integer(droplevels(interaction(from, to, 
        lex.order = TRUE))))
df1$dyad
#[1] 3 7 2 3 4 7 5 7 6 1

data

df1 <- structure(list(from = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L), .Label = c("IND", "USA"), class = "factor"), to = structure(c(2L, 
6L, 1L, 2L, 3L, 6L, 4L, 6L, 5L, 7L), .Label = c("GMY", "ITA", 
"JPN", "ROK", "SAF", "UKG", "USA"), class = "factor"), trade_flow = c(5100L, 
4000L, 17000L, 4500L, 2900L, 6700L, 7000L, 2300L, 1500L, 2400L
)), class = "data.frame", row.names = c(NA, -10L))
