Remove byte-order-mark in R/C

Question

This SO post has an example of a server that generates JSON with a byte order mark. RFC 7159 says:

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

Currently yajl, and hence jsonlite, chokes on the BOM. I would like to follow the RFC's suggestion and strip the BOM from the UTF-8 string if present. What is an efficient way to do this? A naive implementation:

if(substr(json, 1, 1) == "\uFEFF"){
  json <- substring(json, 2)
}

However, substr is a bit slow for large strings, and I am not sure this is the correct way to do it. Is there a more efficient way in R or C to remove the BOM if present?

The UTF-8 representation of the BOM will be EF BB BF. – borrible Nov 04 '14 at 22:50
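
To see where those three bytes come from: U+FEFF lies in the range that UTF-8 encodes as three bytes (1110xxxx 10xxxxxx 10xxxxxx). A minimal sketch in C; the helper name `utf8_encode3` is made up for illustration and only covers the three-byte range:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes.
   Returns 3 on success, 0 if cp is outside that range or is a surrogate. */
size_t utf8_encode3(uint32_t cp, unsigned char out[3]) {
  if (cp < 0x0800 || cp > 0xFFFF) return 0;
  if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
  out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* 1110xxxx: top 4 bits    */
  out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F)); /* 10xxxxxx: middle 6 bits */
  out[2] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx: low 6 bits    */
  return 3;
}
```

Calling `utf8_encode3(0xFEFF, out)` fills `out` with EF BB BF, which is the byte sequence a parser has to look for at the start of the input.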

score 5 · Accepted Answer · edited Dec 07 '19 at 05:28

A simple solution:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::string stripBom(std::string x) {
   if (x.size() < 3)
      return x;

   if (x[0] == '\xEF' && x[1] == '\xBB' && x[2] == '\xBF')
      return x.substr(3);

   return x;
}

/*** R
x <- "\uFEFFabcdef"
print(x)
print(stripBom(x))
identical(x, stripBom(x))
utf8ToInt(x)
utf8ToInt(stripBom(x))
*/

gives

> x <- "\uFEFFabcdef"

> print(x)
[1] "abcdef"

> print(stripBom(x))
[1] "abcdef"

> identical(x, stripBom(x))
[1] FALSE

> utf8ToInt(x)
[1] 65279    97    98    99   100   101   102

> utf8ToInt(stripBom(x))
[1]  97  98  99 100 101 102

EDIT: What might also be useful is seeing how R does it internally -- there are a number of situations where R strips BOM (e.g. for its scanners and file readers). See:

https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/scan.c#L455-L458

https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/connections.c#L3950-L3957
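
The `stripBom` function above carries over to plain C fairly directly. A minimal sketch, assuming the input is a NUL-terminated UTF-8 string; the name `skip_utf8_bom` is made up for illustration. Rather than copying like `substr(3)`, it returns a pointer into the original string, so the cost does not depend on the string length:

```c
#include <string.h>

/* Return a pointer past a leading UTF-8 BOM (EF BB BF), or s itself if
   there is none. No bytes are copied; the result points into s. */
const char *skip_utf8_bom(const char *s) {
  /* strncmp stops at the first NUL, so strings shorter than three
     bytes are handled without reading past the terminator. */
  if (strncmp(s, "\xEF\xBB\xBF", 3) == 0)
    return s + 3;
  return s;
}
```

Because the returned pointer aliases the input, the caller must keep the original string alive for as long as the result is in use.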

edited Dec 07 '19 at 05:28 by MichaelChirico
answered Nov 05 '14 at 02:09 by Kevin Ushey

This does not address the problem in either of the languages the OP tagged. – Vality Nov 05 '14 at 02:13
1) Rcpp is a very important part of the R ecosystem and is useful for demonstrating / prototyping such problems, and 2) it is trivial to translate this from C++ to C. Note that this example can be executed _in R_ with the Rcpp package, through `Rcpp::sourceCpp()`. – Kevin Ushey Nov 05 '14 at 02:16
Perhaps you would be kind enough to add the C translation then? Or if it is preferable I could probably provide the code for one and make an edit? – Vality Nov 05 '14 at 02:21
@KevinUshey I added commit refs as the `trunk` links will refer to static line numbers not code... `scan.c` appeared to still be correct but `connections.c` definitely moved around since this post, please have a quick check to confirm I picked the right updated lines – MichaelChirico Dec 07 '19 at 05:30
Those look correct to me. Thank you very much for taking a look! – Kevin Ushey Dec 08 '19 at 00:18

Jeroen Ooms · Answer 2 · 2014-11-05T23:17:09.153

Based on Kevin's Rcpp example I used the following C function to check for the BOM:

SEXP R_parse(SEXP x) {
  /* get data from R */
  const char* json = translateCharUTF8(asChar(x));

  /* ignore BOM as suggested by RFC */
  if(json[0] == '\xEF' && json[1] == '\xBB' && json[2] == '\xBF'){
    warning("JSON string contains UTF8 byte-order-mark!");
    json = json + 3;
  }

  /* parse json */
  char errbuf[1024];
  yajl_val node = yajl_tree_parse(json, errbuf, sizeof(errbuf));
}
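
The check above is safe on short inputs only because the string is NUL-terminated: `&&` short-circuits, so `json[1]` is never read when `json[0]` is the terminator. For a buffer that comes with an explicit length instead (for example raw bytes that are not NUL-terminated, which is an assumption beyond the code above), a length-aware variant might look like this; `utf8_bom_length` is a hypothetical helper name:

```c
#include <stddef.h>
#include <string.h>

/* Number of leading bytes to skip: 3 if buf starts with EF BB BF, else 0.
   buf need not be NUL-terminated; len is its length in bytes. */
size_t utf8_bom_length(const void *buf, size_t len) {
  static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};
  /* The length test runs first, so memcmp never reads past the buffer. */
  if (len >= sizeof bom && memcmp(buf, bom, sizeof bom) == 0)
    return sizeof bom;
  return 0;
}
```

The caller would then advance the pointer and shrink the length by the returned amount before handing the data to the parser.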
