I have read in a csv file in R with co-authorship data amongst other information. The Authors column of the file contains co-authorship information as follows:

Miyazaki T., Akisawa A., Saha B.B., El-Sharkawy I.I., Chakraborty A.
Saha B.B., Chakraborty A., Koyama S., Aristov Y.I.
Ali S.M., Chakraborty A.
...

I want to transform this information into an edge list with the following form:

Miyazaki T. Akisawa A.
Miyazaki T. Saha B.B.
Miyazaki T. El-Sharkawy I.I.
Miyazaki T. Chakraborty A.
Akisawa A. Saha B.B.
Akisawa A. El-Sharkawy I.I.
Akisawa A. Chakraborty A.
Saha B.B. El-Sharkawy I.I.
Saha B.B. Chakraborty A.
El-Sharkawy I.I. Chakraborty A.
Saha B.B. Chakraborty A.
Saha B.B. Koyama S.
....

Basically, the network is an undirected graph. Any help/starter code will be appreciated. Also, is there a way to maintain a count/frequency of collaboration (i.e. Saha has published with Chakraborty twice in the example)?

My code so far:

data <- read.csv(file="Citations.csv", header=TRUE)
split_authors <- strsplit(as.character(data$Authors), ',')
head(split_authors,5)

[[1]]
[1] "Miyazaki T."       " Akisawa A."       " Saha B.B."        " El-Sharkawy I.I." " Chakraborty A."  

[[2]]
[1] "Saha B.B."       " Chakraborty A." " Koyama S."      " Aristov Y.I."  

[[3]]
[1] "Ali S.M."        " Chakraborty A."

[[4]]
[1] "Myat A."         " Thu K."         " Kim Y.-D."      " Chakraborty A." " Chun W.G."      " Ng K.C."       

[[5]]
[1] "Baran S.B."       " Kandadai S."     " Anutosh C."      " Khairul H."      " Ibrahim E.-S.I." " Shigeru K."
anandg112

1 Answer

Assuming your input data (dat in my example) is padded with NAs for papers that have fewer than the maximum number of authors, you can use the following R code:

# data 
dat <- rbind(c("Miyazaki T.", "Akisawa A.", "Saha B.B.", "El-Sharkawy I.I.", "Chakraborty A."),
             c("Saha B.B.", "Chakraborty A.", "Koyama S.", "Aristov Y.I.", NA),
             c("Ali S.M.", "Chakraborty A.", NA, NA, NA))

# loop through all rows of dat (all papers, I presume)
transformed.dat <- lapply(1:nrow(dat), function(row.num) {

  row.el <- dat[row.num, ] # the row element that will be used in this loop

  # authors on this paper, with the NA padding dropped
  authors <- row.el[!is.na(row.el)]
  n.authors <- length(authors)

  # a single-author paper has no pairs, and combn() would fail on it
  if (n.authors < 2) return(NULL)

  # 2-row matrix of all index pairs (play around with n.authors to see what it does)
  pairings <- combn(n.authors, 2)

  # look up the two author names for each index pair (2-row character matrix)
  res <- apply(pairings, 2, function(vec) authors[vec])

  # create a data.frame with names aut1 and aut2
  res <- data.frame(aut1 = res[1, ],
                    aut2 = res[2, ])

  return(res)
})

# use data.table's rbindlist to bind the list of combinations together
final.dat <- data.table::rbindlist(transformed.dat)

final.dat
#         aut1             aut2
# 1:      Miyazaki T.       Akisawa A.
# 2:      Miyazaki T.        Saha B.B.
# 3:      Miyazaki T. El-Sharkawy I.I.
# 4:      Miyazaki T.   Chakraborty A.
# 5:       Akisawa A.        Saha B.B.
# 6:       Akisawa A. El-Sharkawy I.I.
# 7:       Akisawa A.   Chakraborty A.
# 8:        Saha B.B. El-Sharkawy I.I.
# 9:        Saha B.B.   Chakraborty A.
# 10: El-Sharkawy I.I.   Chakraborty A.
# 11:        Saha B.B.   Chakraborty A.
# 12:        Saha B.B.        Koyama S.
# 13:        Saha B.B.     Aristov Y.I.
# 14:   Chakraborty A.        Koyama S.
# 15:   Chakraborty A.     Aristov Y.I.
# 16:        Koyama S.     Aristov Y.I.
# 17:         Ali S.M.   Chakraborty A.

Does that answer your question? The key is the combn function, which generates all possible pairs of authors for each paper.
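To see what combn does on its own, here is a small sketch on toy input. The variable names idx.pairs and name.pairs are only illustrative and do not appear in the code above.

```r
# combn(n, 2) on a single number enumerates all pairs of the indices 1..n,
# one pair per column
idx.pairs <- combn(4, 2)
idx.pairs
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4

# combn() also accepts a vector directly, so the author names can be paired
# without going through indices
name.pairs <- combn(c("Ali S.M.", "Chakraborty A.", "Saha B.B."), 2)
name.pairs[1, ]  # first author of each pair
name.pairs[2, ]  # second author of each pair
```

Each column is one pair, which is why the answer's code reads the first row as aut1 and the second row as aut2.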

David
  • Hi David, thanks for your kind help. I have 8474 observations of co-authorship, so it's a big column, and I don't have NA values. Should I extract the Authors column first, split the author field with strsplit, and add the NA values? Is there a function to add the NA values? – anandg112 Nov 05 '15 at 14:16
  • What exactly do you mean by observations? Do you mean papers (you would have three observations in the example above), or do you mean authors? Can you post something like `head(data)`, where `data` is the loaded `csv` file in `R`? – David Nov 05 '15 at 15:06
  • Hi David, I added the code that I have written so far above. Please advise. – anandg112 Nov 05 '15 at 15:22
  • Each row has author names who have collaborated on a paper, so each row is unique. – anandg112 Nov 05 '15 at 15:28
  • I would first find the maximum number of authors with `max.authors <- max(sapply(split_authors, length))`, then create the data.frame `dat` and pad each row with `NA`s up to `max.authors`. – David Nov 05 '15 at 16:02
  • Or even better, as your initial data structure is a list, use `lapply` to loop over the list elements directly, as in `lapply(split_authors, function(row.el) { ...` – David Nov 05 '15 at 16:04
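
The `lapply` suggestion in the last comment, together with the collaboration count asked about in the question, could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not code from the thread: `split_authors` is rebuilt here from the three example rows rather than read from Citations.csv, the names `pairs_per_paper`, `edge.list` and `edge.counts` are made up for illustration, and sorting the names within each paper is a choice made here so that the two orderings of a pair are counted as the same undirected edge.

```r
# Toy stand-in for strsplit(as.character(data$Authors), ',') from the question
split_authors <- list(
  c("Miyazaki T.", " Akisawa A.", " Saha B.B.", " El-Sharkawy I.I.", " Chakraborty A."),
  c("Saha B.B.", " Chakraborty A.", " Koyama S.", " Aristov Y.I."),
  c("Ali S.M.", " Chakraborty A.")
)

# One data.frame of author pairs per paper
pairs_per_paper <- lapply(split_authors, function(row.el) {
  # trimws() strips the leading space left behind by strsplit();
  # sort() fixes the order within a pair (assumption: the graph is undirected)
  authors <- sort(unique(trimws(row.el)))
  if (length(authors) < 2) return(NULL)  # single-author papers have no edges
  pairs <- combn(authors, 2)             # 2-row character matrix, one column per pair
  data.frame(aut1 = pairs[1, ], aut2 = pairs[2, ], stringsAsFactors = FALSE)
})

# Drop the NULLs from single-author papers, then stack everything into one edge list
edge.list <- do.call(rbind, Filter(Negate(is.null), pairs_per_paper))

# Count how many papers each pair shares by summing a column of ones per pair
edge.counts <- aggregate(n ~ aut1 + aut2, data = cbind(edge.list, n = 1), FUN = sum)

# Most frequent collaborations first; here Chakraborty A. / Saha B.B. has n = 2
edge.counts[order(-edge.counts$n), ]
```

No NA padding is needed with this approach, because each list element already holds exactly the authors of one paper. The `n` column can then serve as an edge weight when building the graph.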