
I was hoping for some help with extracting the last N words from a column in a data.table and then assigning them to a new column.

 library(data.table)
 test <- data.table(original = c('the green shirt totally brings out your eyes'
                                , 'ford focus hatchback'))

The original data.table looks like this:

original
1: the green shirt totally brings out your eyes
2: ford focus hatchback

I want to subset out (up to) the last 5 words into a new column, so the output looks like:

original                                        extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2: ford focus hatchback                         ford focus hatchback

I tried:

  test <- test[, extracted := paste0(tail(strsplit(original, ' ')[[1]], 5)
                                   , collapse = ' ')]

and it almost works, except that the 1st value in the 'extracted' column is repeated throughout the new column:

original                                        extracted
1: the green shirt totally brings out your eyes totally brings out your eyes
2: ford focus hatchback                         totally brings out your eyes

For the life of me I can't figure this out. I tried the 'word' function from 'stringr', which gives me the last word, but I can't seem to count backwards from the end.

Any help would be greatly appreciated!

AlexP

2 Answers


I would probably use

n = 5
patt = sprintf("\\w+( \\w+){0,%d}$", n-1)

library(stringi)
test[, ext := stri_extract(original, regex = patt)]

                                       original                          ext
1: the green shirt totally brings out your eyes totally brings out your eyes
2:                         ford focus hatchback         ford focus hatchback

Comments:

  • This breaks if you set n=0 (the quantifier becomes {0,-1}), but there's probably no good reason to do that.
  • This is vectorized, in case you have n differing across rows (e.g., n=3:4).
  • @eddi provided a base analogue (for fixed n):

    test[, ext := sub('.*?(\\w+( \\w+){4})$', '\\1', original)]
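
To illustrate the "vectorized" comment, here is a minimal sketch with a different n per row. The per-row values in `n` are made up for illustration, and `stri_extract_first_regex()` is used on a plain character vector in place of the `stri_extract(..., regex = )` call above; it assumes the stringi package is installed.

```r
library(stringi)

original <- c('the green shirt totally brings out your eyes',
              'ford focus hatchback')

# Hypothetical per-row word counts: last 3 words for row 1, last 2 for row 2
n <- c(3L, 2L)

# sprintf() is vectorized over n, so this builds one pattern per row
patt <- sprintf("\\w+( \\w+){0,%d}$", n - 1L)

# Each string is matched against the pattern in the same position
ext <- stri_extract_first_regex(original, patt)
print(ext)
```

Inside a data.table, the same idea should carry over by passing a column of patterns (or of n values) instead of a single value.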
    
Frank

Base R solution:

test[, extracted := sapply(strsplit(original, '\\s+'), function(v) paste(tail(v, 5L), collapse = ' '))]
##                                        original                    extracted
## 1: the green shirt totally brings out your eyes totally brings out your eyes
## 2:                         ford focus hatchback         ford focus hatchback
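
For context on why the attempt in the question repeated the first value: `strsplit()` returns a list with one element per row, and `[[1]]` keeps only the first of them, so a single string is produced and then recycled down the column. The sketch below shows the difference on a plain character vector; the variable names are only illustrative.

```r
original <- c('the green shirt totally brings out your eyes',
              'ford focus hatchback')

# One list element (a character vector of words) per input string
words <- strsplit(original, ' ')

# What the question's code computes: only the first element is used,
# so the result is a single string that := recycles to every row
first_only <- paste0(tail(words[[1]], 5), collapse = ' ')

# Applying the same tail/paste step to every element gives one value per row
per_row <- vapply(words, function(v) paste(tail(v, 5), collapse = ' '),
                  character(1))

print(first_only)
print(per_row)
```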
bgoldst