
Using httr to obtain a report from a site via OAuth 2.0, I am unable to convert the raw content into a character string within R.

 > req <- GET("https://www.blah.com/blah/v2/blah", config(token = token))

My response indicates no issue:

 Response [https://www.blah.com/blah/v2/blah]
 Date: 2018-09-21 15:55
 Status: 200
 Content-Type: text/tab-separated-values; charset=utf-16le
 Size: 21.1 MB

When attempting to convert my raw data to char I get:

> rawToChar(req$content)
Error in rawToChar(req$content) : 
embedded nul in string:

I also obtain the following error when checking content via content():

> content(req)
Error in guess_header_(datasource, tokenizer, locale) :
Incomplete multibyte sequence

Any thoughts? I've found limited resources on this on the web...

Redeyes10
  • I know you can use the `skipNul` flag with `read.table`. Without seeing the data, it's hard to help. Maybe `read.table` first, then convert the raw to char. – Anonymous coward Sep 21 '18 at 16:45
  • Not sure if it will be helpful here. Basically the data is something like this: 2d 00 31 00 33 00. I think the 00 is the actual tab space causing the problem in the raw data – Redeyes10 Sep 21 '18 at 17:02
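
A sketch of the `skipNul` route from the first comment (an untested illustration; `rawConnection()` lets `read.table()` consume the raw payload directly, and `header = TRUE` is a guess about the report's layout):

 con <- rawConnection(req$content)                 # treat the raw vector as a file
 df  <- read.table(con, sep = "\t", header = TRUE, # header = TRUE is an assumption
                   skipNul = TRUE)                 # ignore embedded NUL bytes
 close(con)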

2 Answers


For reference: in a raw vector, `00` indicates a NUL byte. The solution is to remove all NUL bytes and then convert to character.

 > dat <- req$content               # raw vector from the response
 > up_dat <- dat[dat != as.raw(0)]  # drop the NUL (0x00) bytes
 > rawToChar(up_dat)                # conversion now succeeds

Removing the NULs had no effect on the overall data structure once transformed.

In this case,

  readr::read_tsv()

worked just fine.
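
For example (a sketch; the exact call is not shown above, and in newer readr versions literal strings must be wrapped in `I()`):

 report <- readr::read_tsv(rawToChar(up_dat))  # parse the NUL-stripped text as TSV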

Redeyes10
  • The source data is being sent in UTF-16LE encoding, as stated by the `Content-Type` header. Blindly stripping off NUL bytes will mangle Unicode characters. Why not just use `iconv()` to convert the raw UTF-16 bytes to UTF-8? Something like `dat <- iconv(list(req$content), "UTF-16LE", "UTF-8")`. According to the [documentation](https://www.rdocumentation.org/packages/httr/versions/1.3.1/topics/content), `content()` already recognizes `text/tab-separated-values` and should handle the UTF-16LE for you. – Remy Lebeau Sep 25 '18 at 00:04
  • Unfortunately this method did not work. Interestingly enough, it returned Kanji characters... The other odd thing in my case is that `content()` failed despite its support for tab-delimited files. Additionally, I am seeing no loss of data integrity with the above strip. – Redeyes10 Sep 25 '18 at 12:06
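
For completeness, a sketch of the `iconv()` route suggested in the comment above (note that `iconv()` wants raw input wrapped in a list, and that the encoding name is "UTF-16LE"):

 txt    <- iconv(list(req$content), from = "UTF-16LE", to = "UTF-8")  # raw bytes -> UTF-8 string
 report <- readr::read_tsv(txt)                                       # then parse as TSV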

You could also use readBin() to read in your raw vector. The only catch is that you need to know or guess the size to use for `n`, but you can estimate that by counting the NUL bytes.

count_nul <- sum(dat == as.raw(0))               # number of NUL (0x00) bytes
readBin(dat, what = "character", n = count_nul)  # read NUL-terminated strings
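
If the payload is plain ASCII stored as UTF-16LE (one text byte followed by one NUL, as in the hex dump in the comments), each string returned is a single character, so the text could be reassembled like this (an illustrative guess, not part of the original answer):

txt <- paste(readBin(dat, what = "character", n = count_nul), collapse = "")
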
phiver