31

I have the dataframe below and want to extract the first word of each value and insert it into a new column.

Dataframe1:

COL1
Nick K Jones
Dave G Barros
Matt H Smith

Convert it to this:

Dataframe2:
COL1              COL2
Nick K Jones      Nick
Dave G Barros     Dave
Matt H Smith      Matt
zx8754
Nick

3 Answers

49

We can use the function stringr::word:

library(stringr)

Dataframe1$COL2 <- word(Dataframe1$COL1, 1)
zx8754
Colibri
  • 2
    This works well but is very slow for larger data. I'm working with half a million rows and `str_extract(Dataframe1$COL1, '[A-Za-z]+')` (also from the `stringr` package) is at least ten times faster. – nJGL Nov 14 '19 at 14:19
  • clearly the best answer for a problem that is meant to be simple – Garini Feb 23 '21 at 22:47
32

You can use a regex ("([A-Za-z]+)" or "([[:alpha:]]+)" or "(\\w+)") to grab the first word:

Dataframe1$COL2 <- gsub("([A-Za-z]+).*", "\\1", Dataframe1$COL1)
Rorschach
  • 1
    why use `gsub` when you need to replace just first occurrence. use `sub` – Saksham Aug 11 '15 at 18:12
  • 1
    @Saksham you're right `sub` would be better here, thanks – Rorschach Aug 12 '15 at 05:53
  • What if the first word is a number: 495 or Q1? When I try this formula it just keeps "Q" and not Q1, and for 495, it takes all the numbers after it: "495 3Be" @nongkrong – Nick Aug 12 '15 at 21:52
  • 2
    @Nick try the option `"(\\w+)"`, or you can add into the brackets the options for matching numbers, ie. `[0-9A-Za-z]+` and `[[:digit:]]` – Rorschach Aug 12 '15 at 21:55
  • That didn't work unfortunately. I basically just want to grab the first word (whether that be characters or numbers before a space). So if I have P1 Media, in the past it would print out P. For 495 54, it would print out everything instead of just 495. @nongkrong – Nick Aug 12 '15 at 22:10
  • @Nick Specifically, did you try `sub("(\\w+).*", "\\1", Dataframe1$COL1)` – Rorschach Aug 12 '15 at 22:14
  • That worked thanks! and last question: is there a way to make all of them lowercase? @nongkrong – Nick Aug 13 '15 at 12:22
  • @Nick: you can set the argument `ignore.case = TRUE` to not worry about case sensitivity anymore. Or use `tolower()` – andschar Apr 11 '17 at 11:17
  • The above worked. I'm looking for a solution that, in addition to the one above, creates a new column with the remaining variables after the split. e.g. in the above example, we have a new column, COL3 that has values as `K Jones`, `G Barros` – andy Aug 16 '22 at 11:25
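
The suggestions in this comment thread can be pulled together in a minimal base R sketch. The sample rows `P1 Media` and `495 54` are the cases raised in the comments. The `COL2_lower` column name and the pattern used for `COL3` are assumptions made for illustration and are not taken from the answer.

```r
# Sample data: one name from the question plus the two cases from the comments
Dataframe1 <- data.frame(
  COL1 = c("Nick K Jones", "P1 Media", "495 54"),
  stringsAsFactors = FALSE
)

# First word: \\w+ matches letters, digits and underscores,
# so "P1" and "495" are kept whole
Dataframe1$COL2 <- sub("(\\w+).*", "\\1", Dataframe1$COL1)

# Lowercased copy of the first word (hypothetical column name)
Dataframe1$COL2_lower <- tolower(Dataframe1$COL2)

# Remainder after the first word (assumed pattern): drop the leading
# run of non-space characters and the whitespace that follows it
Dataframe1$COL3 <- sub("^\\S+\\s*", "", Dataframe1$COL1)
```

`sub` is enough here because each string needs only one replacement; `gsub` gives the same result since the pattern already consumes the rest of the string.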
14

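The function strsplit can be useful. It returns a list with one element per row, so take the first piece of each element:

Dataframe1$COL2 <- sapply(strsplit(as.character(Dataframe1$COL1), " "), `[`, 1)

Then you can change the final argument (the 1) to select other parts of the string too.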
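
Since `strsplit` returns a list holding one character vector per row, the selection has to be applied to every element of that list. Here is a minimal sketch of that idea, assuming the sample data from the question; `nth_word` is a hypothetical helper introduced only for this example.

```r
# Sample data from the question
Dataframe1 <- data.frame(
  COL1 = c("Nick K Jones", "Dave G Barros", "Matt H Smith"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per row
parts <- strsplit(Dataframe1$COL1, " ", fixed = TRUE)

# Hypothetical helper: pick the n-th piece from every row
# (rows with fewer than n pieces yield NA)
nth_word <- function(pieces, n) {
  vapply(pieces, function(p) p[n], character(1))
}

Dataframe1$COL2 <- nth_word(parts, 1)  # first word of each row
Dataframe1$COL3 <- nth_word(parts, 3)  # third word of each row
```

`vapply` is used instead of `sapply` so that the result is always a character vector, even for empty input.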

mattbawn