I am looking to split a string into n-grams of 3 characters, e.g. HelloWorld would become "Hel", "ell", "llo", "loW", etc. How would I achieve this using R?

In Python it would take a loop using the range function, e.g. [myString[i:i+3] for i in range(len(myString) - 2)]

Is there a neat way to loop through the letters of a string using stringr (or another suitable function/package) to tokenize the word into a vector?

e.g.

dfWords <- c("HelloWorld", "GoodbyeMoon", "HolaSun") %>% 
              data.frame()
names(dfWords)[1] = "Text"

I would like to generate a new column which would contain a vector of the tokenized Text variable (preferably using dplyr). This can then be split later into new columns.

Jaroslav Bezděk
Brisbane Pom
  • There is the rarely used `substring` as opposed to `substr`, which loops over all inputs - `substring("HelloWorld",1:8,3:10)` - but this will only be suitable for a length 1 vector - as `substring(c("HelloWorld","ABC"),1:8,3:10)` doesn't work as expected. Will that be good enough? – thelatemail Aug 27 '19 at 02:24
  • @thelatemail looks good as an answer to me. – Ronak Shah Aug 27 '19 at 02:35

2 Answers

For others who come here, as I did, looking for the R function equivalent to Python's range(): I have found the answer.

It is the seq() function. A few examples will be better than words. The usage is similar to Python's, but note that R counts from 1 and that the "to" endpoint is inclusive:

> seq(from = 1, to = 5, by = 1)
[1] 1 2 3 4 5
> seq(from = 1, to = 6, by = 2)
[1] 1 3 5
> seq(5)
[1] 1 2 3 4 5
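
One pitfall is worth a short illustration: seq(n) counts down when n is 0, so seq_len() and seq_along() are usually the safer choice for building loop indices. The variable names below are only illustrative.

```r
# Python's range(5) yields 0..4; the 1-based counterpart in R is seq_len(5)
idx <- seq_len(5)                 # 1 2 3 4 5

# With a length of zero, seq() counts down instead of returning nothing
idx_bad   <- seq(0)               # 1 0
idx_empty <- seq_len(0)           # integer(0), so a for loop over it never runs

# seq_along() gives the indices of an existing vector
idx_vec <- seq_along(c("a", "b", "c"))   # 1 2 3
```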
Jaroslav Bezděk

In base R you could do something like this

ss <- "HelloWorld"

len <- 3
lapply(seq_len(nchar(ss) - len + 1), function(x) substr(ss, x, x + len - 1))
#[[1]]
#[1] "Hel"
#
#[[2]]
#[1] "ell"
#
#[[3]]
#[1] "llo"
#
#[[4]]
#[1] "loW"
#
#[[5]]
#[1] "oWo"
#
#[[6]]
#[1] "Wor"
#
#[[7]]
#[1] "orl"
#
#[[8]]
#[1] "rld"

Explanation: The approach is a basic sliding window method to extract substrings from ss. The return object is a list.
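
If a plain character vector is preferred over a list, the same sliding window can be sketched with the vectorised substring() mentioned in the comments. The helper name char_ngrams is hypothetical, not part of any package, and the guard for short strings is an added assumption about the desired behaviour.

```r
# Hypothetical helper: all character n-grams of a single string
char_ngrams <- function(s, n = 3) {
  k <- nchar(s) - n + 1             # number of window positions
  if (k < 1) return(character(0))   # string shorter than n: no n-grams
  starts <- seq_len(k)
  # substring() is vectorised over the start and end positions
  substring(s, starts, starts + n - 1)
}

trigrams <- char_ngrams("HelloWorld")

# For several strings, apply the helper to each element in turn
words <- c("HelloWorld", "GoodbyeMoon", "HolaSun")
ngram_list <- lapply(words, char_ngrams)
```

Looping with lapply() avoids the recycling problem that occurs when a vector of several strings is passed to substring() directly.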


Another (sliding window) alternative could be zoo::rollapply with strsplit

library(zoo)
len <- 3
rollapply(unlist(strsplit(ss, "")), len, paste, collapse = "")
#[1] "Hel" "ell" "llo" "loW" "oWo" "Wor" "orl" "rld"

In response to your comment/edit, here's a tidyverse option

# Sample data
df <- data.frame(words = c("HelloWorld", "GoodbyeMoon", "HolaSun"))

library(tidyverse)
library(zoo)
df %>% mutate(lst = map(str_split(words, ""), function(x) rollapply(x, len, paste, collapse = "")))
#        words                                         lst
#1  HelloWorld      Hel, ell, llo, loW, oWo, Wor, orl, rld
#2 GoodbyeMoon Goo, ood, odb, dby, bye, yeM, eMo, Moo, oon
#3     HolaSun                     Hol, ola, laS, aSu, Sun
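
The question also mentions splitting the tokens into separate columns later. A minimal base R sketch of that step follows, assuming every word has at least len characters; the column names ngram_1, ngram_2, ... and the object df_wide are made up for illustration.

```r
# Sample data (same words as above)
df <- data.frame(words = c("HelloWorld", "GoodbyeMoon", "HolaSun"),
                 stringsAsFactors = FALSE)
len <- 3

# List column: one character vector of trigrams per row
df$lst <- lapply(df$words, function(w) {
  starts <- seq_len(nchar(w) - len + 1)
  substring(w, starts, starts + len - 1)
})

# Pad every vector with NA up to the longest one, then bind as columns
width <- max(lengths(df$lst))
wide <- t(sapply(df$lst, function(v) v[seq_len(width)]))
colnames(wide) <- paste0("ngram_", seq_len(width))
df_wide <- cbind(df["words"], as.data.frame(wide, stringsAsFactors = FALSE))
```

Shorter words end up with NA in the trailing columns, which keeps the result rectangular.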
Maurits Evers