I am working on a language parser and want to count occurrences of certain string elements (say "</i>") in a larger string. Since the string has already been cleansed (str.trim), there is no content after the last element. I am getting some odd behavior from strsplit: it seems to behave differently depending on whether the separator sep (called split in the manual) is at the beginning or at the end of the string.
Below is an example:
str1 = "<i>hello friend</i>";
str2 = paste0(" ",str1);
str3 = paste0(str1, " ");
sep1="<i>";
sep2="</i>";
str = c(str1, str2, str3); n = length(str);
sep = c(sep1, sep2); ns = length(sep);
base = matrix("", nrow=n, ncol=ns);
rownames(base) = str; colnames(base) = sep;
for(i in 1:n)
	{
	for(j in 1:ns)
		{
		base[i, j] = paste0(base::strsplit(str[i], sep[j], fixed=TRUE)[[1]], collapse="|");
		}
	}
base;
stringi = matrix("", nrow=n, ncol=ns);
rownames(stringi) = str; colnames(stringi) = sep;
for(i in 1:n)
	{
	for(j in 1:ns)
		{
		stringi[i, j] = paste0(stringi::stri_split_fixed(str[i], sep[j])[[1]], collapse="|");
		}
	}
stringi;
stopifnot(identical(base,stringi));
The output for base:
> base;
                     <i>                  </i>
<i>hello friend</i>  "|hello friend</i>"  "<i>hello friend"
 <i>hello friend</i> " |hello friend</i>" " <i>hello friend"
<i>hello friend</i>  "|hello friend</i> " "<i>hello friend| "
The output for stringi:
> stringi;
                     <i>                  </i>
<i>hello friend</i>  "|hello friend</i>"  "<i>hello friend|"
 <i>hello friend</i> " |hello friend</i>" " <i>hello friend|"
<i>hello friend</i>  "|hello friend</i> " "<i>hello friend| "
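The core difference is in COL=2 (separator "</i>"), ROW=1 and ROW=2: when the separator sits at the very end of the string, base drops the trailing empty string while stringi keeps it. When the separator sits at the very beginning (COL=1, ROW=1 and ROW=3), both keep the leading empty string.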
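The asymmetry can be reduced to a much smaller case. The sketch below uses base R only; the "|" separator, the inputs, and the helper name split_keep_trailing are made up for illustration and are not part of base or stringi. It relies on the algorithm described in ?strsplit, where a match at the end of the string does not add a trailing "" to the result.

```r
# Separator at the beginning: the leading empty string is kept.
lead  <- strsplit("|a|b", "|", fixed = TRUE)[[1]]

# Separator at the end: no trailing empty string is produced.
trail <- strsplit("a|b|", "|", fixed = TRUE)[[1]]

# Hypothetical helper (an assumption, not a library function):
# appending the separator once makes base keep the trailing empty field,
# which lines up with what stri_split_fixed returned in the tables above.
split_keep_trailing <- function(x, sep)
	{
	strsplit(paste0(x, sep), sep, fixed = TRUE);
	}

kept  <- split_keep_trailing("a|b|", "|")[[1]]
plain <- split_keep_trailing("a|b", "|")[[1]]

print(lead);
print(trail);
print(kept);
print(plain);
```

This only covers the fixed=TRUE case used in the example; regex separators would need more care.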
Question: What is the expected behavior of strsplit, i.e. E[strsplit]?
Is base a FEATURE and stringi a BUG? Or vice versa?
Should not EOS (end of string) splits behave the same as BOS (beginning of string) splits?
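For the original goal of counting occurrences, one way to sidestep the edge cases of splitting altogether is to count matches directly. This is a minimal sketch in base R, assuming a fixed (non-regex) pattern; count_fixed is a hypothetical helper name, and the last test string is invented to show the zero-match case.

```r
# Hypothetical helper: count non-overlapping fixed matches of `pattern` in each element of `x`.
count_fixed <- function(x, pattern)
	{
	m = gregexpr(pattern, x, fixed = TRUE);
	# regmatches() yields character(0) where gregexpr() found nothing, so the count is 0 there
	lengths(regmatches(x, m));
	}

x = c("<i>hello friend</i>", " <i>hello friend</i>", "<i>hello friend</i> ", "no tags here");
counts = count_fixed(x, "</i>");
print(counts);
```

The count does not depend on whether the match sits at the beginning, middle, or end of the string, so the leading/trailing asymmetry never comes into play.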
> R.version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
crt ucrt
system x86_64, mingw32
status
major 4
minor 2.1
year 2022
month 06
day 23
svn rev 82513
language R
version.string R version 4.2.1 (2022-06-23 ucrt)
nickname Funny-Looking Kid
and
> packageVersion("stringi")
[1] ‘1.7.8’
>