0

I'm sure this is very simple, and I think it's a case of using separate and gather.

I have a single field in a dataframe, authorlist,an edited export of a pubmed search. It contains the authors of the publications. It can, obviously, contain either a single author or a collaboration of authors.

For example this is just a selection of the options available:

Author
Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.

What I'd like to do is create a single list of ALL authors so that I'd have something like

Author
Drijgers RL
Verhey FR
Leentjens AF
Kahler S
Aalten P

How do I do that? I thought it would be something like

authSpread<-authorlist%>%separate(Author,sep =",",extra ="drop")

But it's not working. If I put into = "NA" I get just the first authors listed in a single column. What I'd like to do is replicate the text to columns function in excel, where you can specify the character to split at and the contents of the cell are cast/spread to new cells. And then regather them into one column. I don't know the maximum number of authors, and therefore don't know the number of columns to split by (or how to label them) programatically.

Edit: clarification I don't know if I want to make a long dataframe wide AND then gather - because I don't know how many fields would be generated. Is this a sensible thing? I would think I could write the output of the separate by "," to a list and then write the contents of that list as single data frame. Does that sound more efficient?

IRTFM
  • 258,963
  • 21
  • 364
  • 487
damo
  • 463
  • 4
  • 14

1 Answers1

2

You're looking for separate_rows.

Input:

df <- data.frame(authors = c("Drijgers RL, Verhey FR, Leentjens AF, Köhler S, Aalten P."))

                                                     authors
1 Drijgers RL, Verhey FR, Leentjens AF, Köhler S, Aalten P.

Function:

library(tidyverse)

df %>% separate_rows(authors, sep = ", ")

Output:

       authors
1  Drijgers RL
2    Verhey FR
3 Leentjens AF
4    Köhler S
5    Aalten P.

You can save them in a list like that:

authors_list <- df %>% separate_rows(authors, sep = ", ") %>% pull(authors)

Output:

[1] "Drijgers RL"  "Verhey FR"    "Leentjens AF" "Köhler S"    "Aalten P."   

If you have authors of multiple articles in your list and you want only unique occurences, just add unique() at the end:

authors_list <- df %>% separate_rows(authors, sep = ", ") %>% pull(authors) %>% unique()
arg0naut91
  • 14,574
  • 2
  • 17
  • 38
  • Thanks for the excellent solution. I can do this simply in excel: split out the first author, split out the last author. And then count the number of times an author appears as first or last author. How can I replicate this in R? I can count the total number of times an author has a publication from the separate rows answer above. (sorry for squeezing an extra question in) – damo Nov 15 '18 at 09:16
  • Final additional question (I hope). How would I split out the first and last authors to separate columns. That might be useful to know in the future. In this answer https://stackoverflow.com/questions/46717210/splitting-column-by-separator-from-right-to-left-in-r?rq=1 the number of columns is known. How do say "split this string at commas, and throw them into an unknown number of columns based on the number of names in the author list"? – damo Nov 15 '18 at 09:55
  • @damo, you're welcome, consider accepting the answer if it was helpful. I think it's best if you turn your second question into another question here on SO, as the content is different. – arg0naut91 Nov 15 '18 at 10:52
  • 1
    OK. Will do. Thanks again. – damo Nov 15 '18 at 11:09