Consider the following data.frame:

df <- structure(list(
  sufix = c("atizado", "atoria", "atório", "auta", "áutico", "ável"),
  min_stem_len = c(4, 5, 3, 5, 4, 2),
  replacement = c("", "", "", "", "", ""),
  exceptions = list(character(0), character(0), character(0),
                    character(0), character(0),
                    c("afável", "razoável", "potável", "vulnerável"))
), .Names = c("sufix", "min_stem_len", "replacement", "exceptions"),
row.names = 21:26, class = c("tbl_df", "tbl", "data.frame"))

I have a vector of strings in the sufix column of this data.frame. Given a word, word <- "amável", I want to extract, for each element of df$sufix, the ending of the word that has the same length as that element.

I'm using the following code:

library(stringr)
word <- "amável"
str_sub(word, start = -stringr::str_length(df$sufix))

But this outputs this:

> str_sub(word, start = -stringr::str_length(df$sufix))
[1] "amável" "mável"  "mável"  "vel"    "mável"  "vel"   
> df$sufix
[1] "atizado" "atoria"  "atório"  "auta"    "áutico"  "ável"

I was expecting the last element of the resulting vector to be "ável", since:

> str_length("ável")
[1] 4
> str_sub(word, start = -4)
[1] "ável"

Here is a simpler reproducible example:

set.seed(100)
a <- sample(1:10, 10000, replace = TRUE)
res <- rep("ábc", 10000) %>% str_sub(start = -a)
sum(ifelse(a > 3, 3, a) != str_length(res))
[1] 2504
Daniel Falbel

2 Answers

Note that all the results except the first one are wrong, not only the last.

They should have been

[1] "amável" "amável" "amável" "ável"   "amável" "ável" 

This could be solved easily by

library(stringi)
stri_sub(rep(word, 6), from = -stri_length(df$sufix))

I bet you could reuse your stringr code in the same way, by wrapping word in rep().

### EDIT TO ADD ###

I now understand what you mean. There is definitely some strange behavior, most likely related to the special character á. See the example below:

df <- data.frame(suffix = c("Lorem","ipsum","dolor","sit","amet","consectetur","adipiscing", "elit","Donec","arcu")) 
df$len <- stri_length(df$suffix)

Then compare the 7th element of the two results below. The 7th suffix, "adipiscing", has 10 characters, so the whole word should be returned:

stri_sub("amavel", from = -df$len)
##  [1] "mavel"  "mavel"  "mavel"  "vel"    "avel"   "amavel" "amavel" "avel"  
##  [9] "mavel"  "avel" 

# Compared to
stri_sub("amável", from = -df$len)
##  [1] "mável"  "mável"  "mável"  "vel"    "ável"   "amável" "mável"  "ável"  
##  [9] "mável"  "ável"

Weirdly enough, the last case gives the correct result if rep is used:

stri_sub(rep("amável", 10), from = -df$len)
## [1] "mável"  "mável"  "mável"  "vel"    "ável"   "amável" "amável" "ável"  
## [9] "mável"  "ável"

# note how the 7th element is now correct.

So even though it's a bit hacky, the solution provided above should work.

I tried looking at the code of stri_sub, where it refers to C_stri_sub, but that was a dead end for me. Perhaps somebody more knowledgeable of C and/or string manipulation can come and lend a hand?

### SECOND EDIT ###

It seems to me the problem is with how the string is recycled inside the call to stri_sub. Look at this alternative to the code in your edit:

set.seed(100)
a <- sample(1:10, 10000, replace = TRUE)
res <- stri_sub(rep("ábc", 10000), from = -a)
sum(ifelse(a > 3, 3, a) != stri_length(res))
## [1] 0
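
As a cross-check that does not rely on how stringi handles negative indices, the same endings can be computed in base R from the word length. This is only a minimal sketch: tail_chars is a hypothetical helper name, not a function from stringr or stringi, and the word is written with a \u escape so the result does not depend on the file encoding.

```r
# Hypothetical helper (not part of stringr or stringi):
# return the last n characters of x, or all of x when n exceeds its length.
tail_chars <- function(x, n) {
  len <- nchar(x, type = "chars")
  # Clamp the start position at 1 so that a large n returns the whole word.
  substring(x, pmax(1L, len - n + 1L), len)
}

word <- "am\u00e1vel"              # "amável"
suffix_len <- c(7, 6, 6, 4, 6, 4)  # character lengths of df$sufix
tail_chars(word, suffix_len)
## [1] "amável" "amável" "amável" "ável"   "amável" "ável"
```

substring() recycles word against the vector of start positions, so no explicit rep() is needed here.
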
PavoDive
  • I don't think this solves the problem. I'll add a simpler reproducible example. – Daniel Falbel Aug 29 '16 at 20:18
  • What's your expected output? My code produces the result I pasted on top of the answer, which is the one I would expect. – PavoDive Aug 29 '16 at 20:21
  • I know it solves that case. Look at the new, simpler example. Your solution does not work for bigger vectors. It also makes no sense that your solution works at all, since the stringr documentation says it recycles all arguments to the length of the longest one. – Daniel Falbel Aug 29 '16 at 20:25

This has been fixed in the development branch of stringi; see https://github.com/gagolews/stringi/issues/227 (str_sub from stringr relies on stri_sub in stringi). Once an update is available on CRAN, everyone should get the correct behavior, that is:

str_sub(word, start = -stringr::str_length(df$sufix))
## [1] "amável" "amável" "amável" "ável"   "amável" "ável"  
gagolews