How can I return a std::vector<std::string> of UTF-8 strings to R and keep the UTF-8 encoding, the way Rcpp::String does by default?
I have a std::vector<std::string> of UTF-8 strings that I want to return to R. Rcpp::wrap() converts it to a character vector as expected, but the UTF-8 encoding appears to be dropped (on Windows). I assume this is caused by R's underlying string handling, but if an Rcpp::CharacterVector is built from Rcpp::String objects, the behavior is correct.
Here's an example using a std::vector<std::string>:
#include <Rcpp.h>

// [[Rcpp::export]]
std::vector<std::string> cpp_foo() {
    std::string let1("ف");
    std::string let2("خ");
    std::vector<std::string> out;
    out.push_back(let1);
    out.push_back(let2);
    return out;  // converted to an R character vector by Rcpp::wrap()
}
This mangles the strings:
cpp_foo()
# [1] "Ù\u0081" "Ø®"
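The mangled output above looks consistent with the UTF-8 bytes being read one at a time in the session's single-byte native encoding (Windows-1252 per the locale shown in the session info). The sketch below is plain C++ with no Rcpp dependency and only illustrates that reading; the helper names `bytes_as_hex` and `latin1_misread_as_utf8` are made up for this illustration, and Latin-1 is used as a stand-in for Windows-1252 (the two agree on 0xD8, 0xD9 and 0xAE, while 0x81 is unassigned in Windows-1252).

```cpp
#include <string>

// Hex dump of a string's raw bytes, separated by spaces.
std::string bytes_as_hex(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : s) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Simulate the misreading: treat every byte as its own Latin-1 code point
// (U+0000..U+00FF) and re-encode that code point as UTF-8.
std::string latin1_misread_as_utf8(const std::string& s) {
    std::string out;
    for (unsigned char b : s) {
        if (b < 0x80) {
            out += static_cast<char>(b);                // ASCII is unchanged
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));  // two-byte UTF-8 lead
            out += static_cast<char>(0x80 | (b & 0x3F)); // continuation byte
        }
    }
    return out;
}
```

With the UTF-8 bytes written as explicit escapes, `bytes_as_hex("\xD9\x81")` gives `D9 81` for "ف" and `bytes_as_hex("\xD8\xAE")` gives `D8 AE` for "خ". Passing the same inputs to `latin1_misread_as_utf8` yields the UTF-8 encodings of U+00D9 U+0081 and U+00D8 U+00AE, which are the characters R printed.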
Here's an example of the desired behavior, using Rcpp::String:
#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::CharacterVector rcpp_foo() {
    Rcpp::String let1("ف");
    Rcpp::String let2("خ");
    Rcpp::CharacterVector out;
    out.push_back(let1);
    out.push_back(let2);
    return out;
}
This preserves the strings:
rcpp_foo()
# [1] "ف" "خ"
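Since building the result from Rcpp::String gives the desired output, one possible bridge from existing std::vector<std::string> code is to copy each element into an Rcpp::CharacterVector through Rcpp::String with the encoding stated explicitly. This is a minimal sketch, not a confirmed fix: `utf8_wrap` and `cpp_foo_utf8` are hypothetical names, and it assumes the std::string contents really are valid UTF-8 bytes. It needs R and Rcpp to compile (for example via Rcpp::sourceCpp()).

```cpp
#include <Rcpp.h>
#include <string>
#include <vector>

// Hypothetical helper: copy each std::string into a CharacterVector element
// via Rcpp::String, explicitly marking the bytes as UTF-8 (assumption: the
// input strings already hold valid UTF-8).
Rcpp::CharacterVector utf8_wrap(const std::vector<std::string>& x) {
    Rcpp::CharacterVector out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = Rcpp::String(x[i], CE_UTF8);
    }
    return out;
}

// [[Rcpp::export]]
Rcpp::CharacterVector cpp_foo_utf8() {
    std::vector<std::string> v;
    v.push_back("\xD9\x81");  // UTF-8 bytes of "ف", independent of source-file encoding
    v.push_back("\xD8\xAE");  // UTF-8 bytes of "خ"
    return utf8_wrap(v);
}
```

If this works as intended, calling Encoding(cpp_foo_utf8()) in R should report "UTF-8" for both elements, which is a quick way to compare it against cpp_foo() and rcpp_foo().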
sessionInfo()
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 18362)
#
# Matrix products: default
#
# locale:
# [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
# [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
# [5] LC_TIME=English_United States.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# loaded via a namespace (and not attached):
# [1] compiler_3.6.1 tools_3.6.1 Rcpp_1.0.2 packrat_0.5.0