Separate value in field by character, create multiple columns to the right based on the number of splits possible

Question

I've asked a series of questions, starting with "Separate variable in field by character", which I think contained multiple questions around the same topic.

I've had excellent answers on how to use separate_rows and then a great answer on how to separate the first and last authors from a character vector.

What I'd like to know now is the final bit:

In the answer to "Splitting column by separator from right to left in R", the number of columns is known. How do I say "split this string at commas, and throw the pieces into an unknown number of columns, based on the number of names in the author list, to the right of the original field"?

So that each author becomes a value in a separate field. Initially, I thought it would be cast/spread. BUT!

While this is the example I've used: Author

Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.

in many cases the number of authors on the paper (not including et al.) will be >1 and could be somewhere around 30 at the most.

So, the final question in this three-part saga: how do I separate each author out into a new field, and can I title the new fields something like First author, Second author, and so on up to Last author?

Is that sensible/clear?

I appreciate that there are two or three people who are helping very quickly.

That's helpful and I appreciate feedback. I'm trying to work out how I could have done this better. Initially, I asked a question, which had a very fast response, and then a suggestion I ask the second part of my question as a new question. I guess because the solution was very simple. Then this next question was split again into a third question. Which has given this answer. What should I have done here? — damo, Nov 15 '18 at 14:14
You could have provided a [mcve] of at least what you've tried already. And maybe not writing your question like a comic book ;) Sticking more to facts instead of writing cadences of prose helps us find the relevant info faster. — LuckyLikey, Nov 19 '18 at 05:57
Ok. Thanks. I don't want to argue, but when the initial problem gets split out into multiple questions - what then? I'll take the point about writing style on board. — damo, Nov 19 '18 at 06:31
Always intended to help you, lad. To answer this: each question should fit the Q&A format. It is usual in the development process for a problem to split into multiple smaller problems, and those small problems together compose the big problem. When asking your small questions, you should simplify each one to a [mcve] that is not linked to anything but the things that you read and tried about that specific problem. Anyway, you'll learn these things once you try to answer questions yourself. Have a nice day :) — LuckyLikey, Nov 19 '18 at 06:56
Ah OK. So rather than "here's the links", it should have been "here's what's been done so far". Would it make sense/be helpful to put it all back together in one post and answer? (Again, thanks for taking the time) — damo, Nov 19 '18 at 09:13
everyone is allowed to put multiple questions on SO, but IMHO they should not depend on each other and instead be standalone and follow these guidelines [ask] [mcve] — LuckyLikey, Nov 19 '18 at 09:42
I agree, frustratingly, the answer to the first question (with additional requests) is given in the answer to this question.... — damo, Nov 19 '18 at 09:55

score 1 · Accepted Answer · answered Nov 15 '18 at 13:28

You can split your author column into a list column with str_split (from the stringr package) and then use unnest to get a long-format data frame with one author per row. Then you add an ID column that numbers the authors within each publication and use spread to get the data into wide format.

library(dplyr)
library(tidyr)
library(stringr)  # provides str_split
df <- data.frame(publication = c("pub1","pub2","pub3"),author = c("Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P","test author","test arthur, another author"))
df
#  publication                                                   author
#1        pub1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P
#2        pub2                                              test author
#3        pub3                              test arthur, another author


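df %>%
  group_by(publication) %>%
  # split each author string into a list of names (str_split is from stringr)
  mutate(author = str_split(author, ", ")) %>%
  # one row per author
  unnest(author) %>%
  # number the authors within each publication; this becomes the column key
  mutate(ID = paste0("author_", row_number())) %>%
  # one column per ID, filled with NA where a publication has fewer authors
  spread(ID, author)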
# A tibble: 3 x 6
# Groups:   publication [3]
#  publication author_1    author_2       author_3     author_4 author_5
#  <fct>       <chr>       <chr>          <chr>        <chr>    <chr>   
#1 pub1        Drijgers RL Verhey FR      Leentjens AF Kahler S Aalten P
#2 pub2        test author NA             NA           NA       NA      
#3 pub3        test arthur another author NA           NA       NA
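On newer tidyr versions (1.0.0 and later), spread has been superseded by pivot_wider, so the same reshaping could be sketched as below. This is only a sketch under a few assumptions: the toy data is the same as above, stringsAsFactors = FALSE is added so the columns are plain character, and separate_rows (mentioned in the question) replaces the str_split/unnest pair. One reason to consider it for author lists of ten or more names: as far as I can tell, spread sorts the new columns by key, so author_10 would land before author_2, whereas pivot_wider keeps the columns in order of first appearance.

```r
library(dplyr)
library(tidyr)

# Same toy data as above; stringsAsFactors = FALSE is an added assumption
# so that both columns are character on any R version.
df <- data.frame(
  publication = c("pub1", "pub2", "pub3"),
  author = c("Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P",
             "test author",
             "test arthur, another author"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # one row per author; sep is treated as a regular expression
  separate_rows(author, sep = ",\\s*") %>%
  group_by(publication) %>%
  # position of each author within its publication
  mutate(ID = paste0("author_", row_number())) %>%
  ungroup() %>%
  # one column per position; shorter author lists are padded with NA
  pivot_wider(names_from = ID, values_from = author)

wide
```

The number of author_ columns is not specified anywhere; it falls out of the longest author list in the data.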

So it's mutate with paste Author+row number that does the "split by however many authors in the author list". Nice. Thank you! — damo, Nov 15 '18 at 13:39
The mutate simply adds an ID for each row in the group. A consistent labeling of the IDs for each group allows the spread function to turn them into columns. — jasbner, Nov 15 '18 at 13:42
Yes. Sorry, I wasn't sure what to say there. I could see it was generating "the key" to allow the spread and label. Thanks again! — damo, Nov 15 '18 at 13:44
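To make the role of the ID key from the comments above concrete, here is a small sketch that stops before the reshaping step, so the intermediate long table can be inspected. The two-publication data frame and the name long are made up for illustration, and separate_rows is used for brevity in place of str_split plus unnest.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row example, just to show the intermediate long format.
long <- data.frame(
  publication = c("pub2", "pub3"),
  author = c("test author", "test arthur, another author"),
  stringsAsFactors = FALSE
) %>%
  # one row per author
  separate_rows(author, sep = ", ") %>%
  group_by(publication) %>%
  # the numbering restarts at 1 for every publication
  mutate(ID = paste0("author_", row_number())) %>%
  ungroup()

long
```

Because the numbering restarts for each publication, every first author shares the key author_1, every second author shares author_2, and so on. Those shared keys are what the wide step turns into column names.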
