R: Extract last N words from character column in data.table

Question

I was hoping for some help with extracting the last N words from a column in a data.table and then assigning the result to a new column.

 library(data.table)
 test <- data.table(original = c('the green shirt totally brings out your eyes'
                                , 'ford focus hatchback'))

The original data.table looks like this:

original
1: the green shirt totally brings out your eyes
2: ford focus hatchback

I want to subset out (up to) the last 5 words into a new column, so the output looks like:

original                                        extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2: ford focus hatchback                         ford focus hatchback

I tried:

  test <- test[, extracted := paste0(tail(strsplit(original, ' ')[[1]], 5)
                                   , collapse = ' ')]

and it almost works, except that the 1st value in the 'extracted' column is repeated throughout the new column:

original                                        extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2: ford focus hatchback                         totally brings out your eyes

For the life of me I can't figure this out. I tried the 'word' function from 'stringr' which gives me the last word, but I can't seem to count backwards.

Any help would be greatly appreciated!

Frank · Accepted Answer · 2016-04-20T20:10:47.567

I would probably use

n = 5
patt = sprintf("\\w+( \\w+){0,%d}$", n-1)

library(stringi)
test[, ext := stri_extract(original, regex = patt)]

                                       original                          ext
1: the green shirt totally brings out your eyes totally brings out your eyes
2:                         ford focus hatchback         ford focus hatchback

Comments:

This breaks if you set n=0, but there's probably no good reason to do that.
This is vectorized, in case you have n differing across rows (e.g., n=3:4).
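
The two comments above can be sketched on a plain character vector. `last_n_words` is a hypothetical helper name, not part of stringi, and the `stopifnot` guard is an added assumption that simply rejects `n = 0` instead of letting the pattern become `{0,-1}`.

```r
library(stringi)

# Hypothetical helper: last `n` words of each element of `x`.
# `n` may be a single number or one value per element of `x`.
last_n_words <- function(x, n) {
  stopifnot(all(n >= 1))  # n = 0 would build the invalid quantifier {0,-1}
  patt <- sprintf("\\w+( \\w+){0,%d}$", n - 1L)
  stri_extract(x, regex = patt)
}

x <- c("the green shirt totally brings out your eyes", "ford focus hatchback")

last_n_words(x, 5)    # same n for every element
last_n_words(x, 3:4)  # n = 3 for the first string, n = 4 for the second
```

Because `sprintf` returns one pattern per value of `n`, and stringi recycles the pattern against the strings, each row can get its own word count.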

@eddi provided a base analogue (for fixed n):

test[, ext := sub('.*?(\\w+( \\w+){4})$', '\\1', original)]
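
The fixed `{4}` can be built from `n` in the same way as `patt`. This is a minimal sketch: `last_n_base` is a hypothetical name, and `perl = TRUE` is an assumption made here so that the lazy `.*?` and a `{0}` quantifier are handled by PCRE; the line above uses the default engine. Since `sub` takes a single pattern, this version only supports one `n` for all rows.

```r
# Hypothetical helper: base-R version for a single n >= 1.
last_n_base <- function(x, n) {
  stopifnot(length(n) == 1, n >= 1)
  # Require exactly n words at the end of the string and capture them.
  patt <- sprintf(".*?(\\w+( \\w+){%d})$", n - 1L)
  # When a string has fewer than n words nothing matches,
  # and sub() returns that string unchanged.
  sub(patt, "\\1", x, perl = TRUE)
}

x <- c("the green shirt totally brings out your eyes", "ford focus hatchback")

last_n_base(x, 5)
last_n_base(x, 2)
```

The "up to n words" behaviour comes from `sub` leaving non-matching input untouched, not from the pattern itself.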

bgoldst · Answer 2 · 2016-04-20T19:22:06.890

Base R solution:

test[, extracted := sapply(strsplit(original, '\\s+'), function(v) paste(tail(v, 5L), collapse = ' '))]
##                                        original                    extracted
## 1: the green shirt totally brings out your eyes totally brings out your eyes
## 2:                         ford focus hatchback         ford focus hatchback
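
This differs from the attempt in the question in one respect: there, `[[1]]` keeps only the first row's words, so `paste0` returns a single string and `:=` recycles it down the whole column. The sketch below shows the difference on a plain character vector, so it runs without data.table. It uses `vapply` in place of `sapply` as a stylistic choice, and the variable names are illustrative.

```r
x <- c("the green shirt totally brings out your eyes", "ford focus hatchback")

# strsplit() returns a list with one character vector of words per element of x.
words <- strsplit(x, "\\s+")

# Attempt from the question: [[1]] drops every row except the first,
# so the result is one string regardless of how many rows there are.
first_only <- paste0(tail(words[[1]], 5), collapse = " ")
length(first_only)

# Per-row version: apply tail() and paste() to each element of the list.
per_row <- vapply(words, function(v) paste(tail(v, 5L), collapse = " "),
                  character(1))
per_row
```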

edited Apr 20 '16 at 19:22

answered Apr 20 '16 at 19:07
