
I want to force the input encoding to Windows-1252 when reading a file with Rcpp. I need this because I switch between Linux and Windows environments, while the files are consistently encoded in Windows-1252.

How do I adapt this to work:

String readFile(std::string path) {
  std::ifstream t(path.c_str());
  if (!t.good()){
    std::string error_msg = "Failed to open file ";
    error_msg += "'" + path + "'";
    ::Rf_error(error_msg.c_str());
  }

  const std::locale& locale = std::locale("sv_SE.1252");
  t.imbue(locale); 
  std::stringstream ss;
  ss << t.rdbuf();
  return ss.str();
}

The above fails with:

Error in eval(expr, envir, enclos) : 
  locale::facet::_S_create_c_locale name not valid

I've also tried "Swedish_Sweden.1252", which is the default for my system, to no avail. I've tried #include <boost/locale.hpp> as well, but that seems to be unavailable in Rcpp (v 0.12.0)/BH boost (v. 1.58.0-1).

Update:

After digging a little deeper into this, I'm not sure whether the gcc (v. 4.6.3) in RTools (v. 3.3) is built with locale support; this SO question points to that possibility. If any argument other than "" or "C" works with std::locale(), it would be interesting to know. I've tried a few more alternatives, but nothing seems to work.
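One way to check which names the toolchain accepts is to try constructing each candidate and catch the exception. The following is a minimal sketch in plain C++ (no Rcpp); the helper names `localeIsAvailable` and `availableLocales` are made up for illustration, and the candidate names are simply the ones mentioned above.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns true if std::locale can be constructed
// from `name`. The constructor throws std::runtime_error when the name
// is not recognised by the C++ runtime.
bool localeIsAvailable(const std::string& name) {
  try {
    std::locale loc(name.c_str());
    (void)loc;  // only the construction matters here
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}

// Hypothetical helper: filters a list of candidate names down to the
// ones that the runtime accepts.
std::vector<std::string> availableLocales(
    const std::vector<std::string>& candidates) {
  std::vector<std::string> accepted;
  for (std::vector<std::string>::size_type i = 0; i < candidates.size(); ++i) {
    if (localeIsAvailable(candidates[i])) {
      accepted.push_back(candidates[i]);
    }
  }
  return accepted;
}
```

Calling `availableLocales` with "C", "sv_SE.1252" and "Swedish_Sweden.1252" would show which of them, if any, the compiler's runtime supports. If only "C" comes back, the runtime was presumably built without named locale support.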

Fallback solution

I'm not entirely satisfied, but it seems that using base::iconv() fixes any issues with characters regardless of the original format, largely thanks to the from="WINDOWS-1252" argument forcing the chars to be interpreted in the correct form. If we want to stay in Rcpp, we can simply do:

String readFile(std::string path) {
  std::ifstream t(path.c_str());
  if (!t.good()){
    std::string error_msg = "Failed to open file ";
    error_msg += "'" + path + "'";
    ::Rf_error(error_msg.c_str());
  }

  // No locale is imbued here: std::locale("sv_SE.1252") is the call that
  // throws above. The raw bytes are read as-is and converted by iconv() below.
  std::stringstream ss;
  ss << t.rdbuf();
  Rcpp::StringVector ret = ss.str();

  Environment base("package:base");
  Function iconv = base["iconv"];

  ret = iconv(ret, Named("from","WINDOWS-1252"),Named("to","UTF8"));

  return ret;
}

Note that it is preferable to wrap the function in R rather than getting the function from C++ and calling it from there; it is both less code and about twice as fast (checked with microbenchmark):

readFileWrapper <- function(path){
   ret <- readFile(path)
   iconv(ret, from = "WINDOWS-1252", to = "UTF8")
}
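For completeness, the conversion could also be done in C++ itself without any locale or iconv support, since Windows-1252 is a single-byte encoding. The sketch below is not what the code above does: it swaps iconv() for a hand-written lookup table. The names `kCp1252High`, `appendUtf8` and `cp1252ToUtf8` are made up for illustration, and mapping the five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) to the control characters with the same value is an assumption.

```cpp
#include <string>

// Unicode code points for the Windows-1252 bytes 0x80..0x9F.
// Assumption: the five unassigned bytes map to the C1 control
// character with the same numeric value.
static const unsigned int kCp1252High[32] = {
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Appends the UTF-8 encoding of `cp` to `out`. Only code points up to
// U+FFFF are handled, which covers everything Windows-1252 can produce.
void appendUtf8(std::string& out, unsigned int cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Converts a Windows-1252 byte string to UTF-8.
std::string cp1252ToUtf8(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (std::string::size_type i = 0; i < in.size(); ++i) {
    unsigned char b = static_cast<unsigned char>(in[i]);
    // Bytes outside 0x80..0x9F have the same value as their code point.
    unsigned int cp = (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
    appendUtf8(out, cp);
  }
  return out;
}
```

Inside an Rcpp function, the string read from the file could be passed through `cp1252ToUtf8` before being returned. The result would probably still need to be marked as UTF-8 on the R side so that R does not interpret it in the native encoding.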
Max Gordon
  • Can you try Rcpp 0.12.0 which was just released? It added some encoding support. – Dirk Eddelbuettel Jul 28 '15 at 11:46
  • @DirkEddelbuettel: Thanks, but unfortunately I get the same error. I'm not sure if the syntax is at all correct, i.e. is it any of the above suggested alternatives? As I wrote, there is a [boost::locale](http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/default_encoding_under_windows.html) that should be built for handling this type of problems, but it seems like the locale isn't included in Windows since I get _fatal error: boost/locale.hpp: No such file or directory_ while other parts load fine (e.g. `#include `) – Max Gordon Jul 28 '15 at 19:24
