I want to force the input to be interpreted as Windows-1252 when reading a file with Rcpp. I need this because I switch between Linux and Windows environments, while the files are consistently in 1252 encoding.
How do I adapt this to work:
String readFile(std::string path) {
  std::ifstream t(path.c_str());
  if (!t.good()){
    std::string error_msg = "Failed to open file ";
    error_msg += "'" + path + "'";
    ::Rf_error(error_msg.c_str());
  }
  const std::locale& locale = std::locale("sv_SE.1252");
  t.imbue(locale);
  std::stringstream ss;
  ss << t.rdbuf();
  return ss.str();
}
The above fails with:
Error in eval(expr, envir, enclos) :
locale::facet::_S_create_c_locale name not valid
I've also tried "Swedish_Sweden.1252", which is the default for my system, to no avail. I've tried #include <boost/locale.hpp>, but that seems to be unavailable in Rcpp (v. 0.12.0)/BH boost (v. 1.58.0-1).
Update:
After digging a little deeper into this, I'm not sure whether the gcc (v. 4.6.3) in RTools (v. 3.3) is built with locale support; this SO question points to that possibility. If any argument other than "" or "C" works with std::locale(), it would be interesting to know. I've tried a few more alternatives, but nothing seems to work.
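To check which names a given toolchain accepts, a small probe can help. This is only a sketch in plain standard C++ (no Rcpp), and the helper name localeIsAvailable is made up for illustration. It relies on std::locale throwing std::runtime_error for a name it cannot construct, which is the same failure reported above.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if std::locale accepts `name`
// on this toolchain, false if construction throws.
bool localeIsAvailable(const std::string& name) {
    try {
        std::locale loc(name.c_str());
        (void)loc;  // only the construction matters here
        return true;
    } catch (const std::runtime_error&) {
        // Same condition that surfaces as "name not valid" above.
        return false;
    }
}
```

Calling it with "C", "sv_SE.1252" and "Swedish_Sweden.1252" on both Linux and Windows should show quickly whether any named locale is usable at all.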
Fallback solution
I'm not entirely satisfied, but it seems that using base::iconv() fixes any issues with characters regardless of the original format, thanks largely to the from="WINDOWS-1252" argument, which forces the chars to be interpreted correctly. If we want to stay in Rcpp, we can simply do:
String readFile(std::string path) {
  std::ifstream t(path.c_str());
  if (!t.good()){
    std::string error_msg = "Failed to open file ";
    error_msg += "'" + path + "'";
    ::Rf_error(error_msg.c_str());
  }
  // No imbue() here: std::locale("sv_SE.1252") is what throws above.
  // The raw bytes are read as-is and re-encoded by iconv below.
  std::stringstream ss;
  ss << t.rdbuf();
  Rcpp::StringVector ret = ss.str();
  // Look up base::iconv and call it from C++
  Environment base("package:base");
  Function iconv = base["iconv"];
  ret = iconv(ret, Named("from","WINDOWS-1252"),Named("to","UTF-8"));
  return ret;
}
Note that it is preferable to wrap the function in R rather than fetching iconv from C++ and calling it from there: it is less code and improves performance by a factor of 2 (checked with microbenchmark):
readFileWrapper <- function(path){
  ret <- readFile(path)
  iconv(ret, from = "WINDOWS-1252", to = "UTF-8")
}
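If the goal is to stay entirely in C++ without depending on either std::locale or iconv, the conversion can also be done by hand, since Windows-1252 is a single-byte encoding. The following is a minimal sketch of that idea, not something I have benchmarked against the iconv versions. The names appendUtf8 and cp1252ToUtf8 are invented for illustration, and the five byte values that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are assumed to map to the C1 control code points of the same value.

```cpp
#include <string>

// Append one code point (at most U+FFFF, which covers Windows-1252) as UTF-8.
static void appendUtf8(std::string& out, unsigned int cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: interpret every byte of `in` as Windows-1252
// and return the same text encoded as UTF-8.
std::string cp1252ToUtf8(const std::string& in) {
    // Code points for bytes 0x80..0x9F, the only range where
    // Windows-1252 differs from ISO-8859-1.
    static const unsigned short high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        unsigned int cp = (c >= 0x80 && c <= 0x9F) ? high[c - 0x80] : c;
        appendUtf8(out, cp);
    }
    return out;
}
```

In readFile this would be applied to ss.str() before returning. The result would presumably still need to be marked as UTF-8 on the R side (for example with Encoding<- in R, or set_encoding(CE_UTF8) on an Rcpp::String, if the installed Rcpp version provides it), otherwise R may treat the bytes as being in the native encoding.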