Separate variable in field by character

Question

I recently asked a related question, "Separate contents of field", and got a very quick and very simple answer.

Something I can do simply in Excel is look in a cell, find the first instance of a character and then return all the characters to the left of that.

For example

Author

Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.

I can extract Drijgers RL and Aalten P into separate columns in Excel. This lets me count how many times someone is the first author and how many times the last author.

How can I replicate this in R? I can already count the total number of publications per author using the separate-rows answer to the question linked above.

How would I split out the first and last authors into separate columns? The general case might also be useful to know. In the answer to "Splitting column by separator from right to left in R", the number of columns is known. How do I say "split this string at commas, and put the pieces into an unknown number of columns, based on the number of names in the author list, to the right of the original field"?

arg0naut91 · Answer 1 · 2018-11-15T12:04:02.887

Try this function:

extract_authors <- function(df, authors) {

  df[["FirstAuthor"]] <- ifelse(
    grepl(",", df[[authors]]), trimws(gsub(",.*", "", df[[authors]])), df[[authors]]
  )


  df[["LastAuthor"]] <- ifelse(
    grepl(",", df[[authors]]), trimws(gsub(".*,", "", df[[authors]])), "No last author"
  )

  return(df)

}

Works with the other example from this topic:

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

You can call it like:

extract_authors(sample_df, "authors")

In the output, you get 2 new columns, FirstAuthor and LastAuthor:

                                                    authors FirstAuthor     LastAuthor
1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL      Aalten P.
2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL       Kahler S
3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL   Leentjens AF
4                                    Drijgers RL, Verhey FR Drijgers RL      Verhey FR
5                                               Drijgers RL Drijgers RL No last author
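
The second part of the question (an unknown number of columns) is not covered by the function above. A minimal sketch of one way to do it in base R, assuming the separator is always comma plus space; the names `wide_df`, `parts`, `padded` and the `author_` prefix are made up for illustration:

```r
# Small sample in the same shape as the example above
sample_df <- data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
)

# One character vector of names per row
parts <- strsplit(sample_df$authors, ", ", fixed = TRUE)

# The longest author list decides how many columns are needed
n_max <- max(lengths(parts))

# Pad each vector with NA up to n_max, then fill a matrix row by row
padded <- matrix(
  unlist(lapply(parts, function(x) c(x, rep(NA_character_, n_max - length(x))))),
  ncol = n_max,
  byrow = TRUE
)
colnames(padded) <- paste0("author_", seq_len(n_max))

# Put the new columns to the right of the original field
wide_df <- cbind(sample_df, as.data.frame(padded, stringsAsFactors = FALSE))
```

Rows with fewer names than the longest list get NA in the trailing columns, so the last author is not in a fixed column with this layout.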

Cool. I hoped you would augment your solution since it's ~6x faster than the one I showed :-) — hrbrmstr, Nov 15 '18 at 12:48
take a look at the microbenchmark in my answer now. even with a new `stringi` solution yours is still blazingly faster. Serious props! — hrbrmstr, Nov 15 '18 at 13:45
Thanks @hrbrmstr! I'm surprised to see that as mine was supposed to be a pure convenience function, and its robustness would be questionable with further requirements, e.g. if OP wants to start extracting second, third,.. etc. authors. But good old `ifelse` and `grepl` may not be such performance bottlenecks after all, depending on the use case of course. — arg0naut91, Nov 15 '18 at 13:51

hrbrmstr · Accepted Answer · 2018-11-15T13:45:11.103

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

cbind.data.frame( # add the columns to the original data frame after the do.call() completes
  sample_df,
  do.call( # turn the list created with lapply below into a data frame
    rbind.data.frame, 
    lapply(
      strsplit(sample_df$authors, ", "), # split at comma+space
      function(x) {
        data.frame( # pull first/last into a data frame
          first = x[1],
          last = if (length(x) < 2) NA_character_ else x[length(x)], # NA last if only one author
          stringsAsFactors = FALSE
        )
      }
    )
  )
)
##                                                     authors       first         last
## 1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL    Aalten P.
## 2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL     Kahler S
## 3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL Leentjens AF
## 4                                    Drijgers RL, Verhey FR Drijgers RL    Verhey FR
## 5                                               Drijgers RL Drijgers RL         <NA>

The above is terrible performance-wise. I made a stringi match-group extraction version, but arg0naut91's is still faster. I also optimized arg0naut91's a bit, since the whitespace only needs to be stripped on the left:

library(stringi)

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

# make some copies since we're modifying in-place now
s1 <- s2 <- sample_df

microbenchmark::microbenchmark(

  stri_regex = {
    s1$first <-  stri_match_first_regex(s1$authors, "^([^,]+)")[,2]
    s1$last <- stri_trim_left(stri_match_last_regex(s1$authors, "([^,]+)$")[,2])
    s1$last <- ifelse(s1$last == s1$first, NA_character_, s1$last)
  },

  extract_authors = {
    s2[["first"]] <- ifelse(
      grepl(",", s2[["authors"]]), gsub(",.*", "", s2[["authors"]]), s2[["authors"]]
    )
    s2[["last"]] <- ifelse(
      grepl(",", s2[["authors"]]), trimws(gsub(".*,", "", s2[["authors"]]), "left"), NA_character_
    )

  }

)

Results:

## Unit: microseconds
##             expr     min       lq     mean   median       uq      max neval
##       stri_regex 236.948 265.8055 331.5695 291.6610 334.1685 1002.921   100
##  extract_authors 127.584 150.8490 217.1192 162.4625 227.9995 1130.913   100

identical(s1, s2)
## [1] TRUE

s1
##                                                     authors       first         last
## 1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL    Aalten P.
## 2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL     Kahler S
## 3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL Leentjens AF
## 4                                    Drijgers RL, Verhey FR Drijgers RL    Verhey FR
## 5                                               Drijgers RL Drijgers RL         <NA>
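
The question's end goal was counting how often someone is first or last author. A rough sketch of that step with base R `table()`, assuming columns named `first` and `last` as above; the data frame `s` and its values are invented for illustration, and stripping a trailing period is my assumption about how names like "Aalten P." should be normalised:

```r
# Hypothetical result in the same shape as s1 above
s <- data.frame(
  first = c("Drijgers RL", "Drijgers RL", "Aalten P"),
  last  = c("Aalten P.", NA, "Verhey FR"),
  stringsAsFactors = FALSE
)

# Drop a trailing period so "Aalten P." and "Aalten P" count as the same name
s$first <- sub("\\.$", "", s$first)
s$last  <- sub("\\.$", "", s$last)

# table() ignores NA by default, so single-author rows do not add to the last-author counts
first_counts <- sort(table(s$first), decreasing = TRUE)
last_counts  <- sort(table(s$last), decreasing = TRUE)
```

If single-author papers should count as both first and last author, replace the NA values in `last` with the value of `first` before tabulating.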
