
I have a file that contains NUL characters.

This file is generated by another program I have no control over, but I have to read it in order to get some crucial information.

Unfortunately, readChar() truncates the output with this warning:

In readChar("output.txt", 1e+05) :   
  truncating string with embedded nuls

Is there a way around this problem?

Dan Chaltiel

1 Answer


By convention, a text file cannot contain non-printable characters (including NUL). If a file contains such characters, it isn’t a text file — it’s a binary file.

R strictly[1] adheres to this convention, and completely disallows NUL characters. You really need to read and treat the data as binary data. This means using readBin and the raw data type:

n = file.size(filename)
buffer = readBin(filename, 'raw', n = n)
# Unfortunately the above has a race condition, so check that the size hasn’t changed!
stopifnot(n == file.size(filename))

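Now we can fix the buffer by removing the embedded zero bytes. This assumes ASCII or UTF-8 encoding, where a zero byte never forms part of another character! Other encodings, such as UTF-16 or UTF-32, contain zero bytes as part of ordinary characters, and these need to be interpreted rather than dropped!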

buffer = buffer[buffer != 0L]
text = rawToChar(buffer)
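
If the file turns out to be in an encoding where zero bytes are meaningful, dropping them is the wrong fix. As a rough sketch, assuming the data is UTF-16LE (an assumption for illustration only; the question does not say which encoding the file uses), the raw buffer could be handed to iconv instead of being filtered:

```r
# Assumption: the bytes are UTF-16LE text. These four bytes stand in for
# the contents of a real file read with readBin(filename, 'raw', n = n).
buffer = as.raw(c(0x68, 0x00, 0x69, 0x00))

# iconv accepts a list of raw vectors, so the zero bytes are decoded as
# part of each character instead of being removed.
text = iconv(list(buffer), from = 'UTF-16LE', to = 'UTF-8')
```

Which encoding names iconv supports depends on the platform, so it is worth checking iconvlist() first.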

[1] Maybe too strictly …

Konrad Rudolph
  • Note that, for very large files, reading the entire file at once using `readBin` might not be a great idea. In that case, it’s probably a better idea to work with a fixed-sized buffer of a few MiB, process data in batches, and write the sanitised data back to disk into a text file. – Konrad Rudolph Dec 14 '22 at 08:47
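
The batch approach from the comment above might look roughly like the following. This is only a sketch: strip_nuls_chunked is a hypothetical helper name, the 4 MiB default is an arbitrary choice, and, like the answer, it assumes an encoding such as ASCII or UTF-8 in which zero bytes carry no meaning.

```r
# Hypothetical helper: copy `input` to `output`, dropping every zero byte,
# while holding at most `chunk_size` bytes in memory at a time.
strip_nuls_chunked = function(input, output, chunk_size = 4 * 1024^2) {
  con_in = file(input, open = 'rb')
  on.exit(close(con_in), add = TRUE)
  con_out = file(output, open = 'wb')
  on.exit(close(con_out), add = TRUE)

  repeat {
    chunk = readBin(con_in, 'raw', n = chunk_size)
    # readBin returns an empty vector once the end of the file is reached.
    if (length(chunk) == 0L) break
    writeBin(chunk[chunk != as.raw(0)], con_out)
  }
  invisible(output)
}
```

The sanitised file can then be read with the usual text functions, for example readLines(output). Because bytes are filtered one at a time, it does not matter where the chunk boundaries fall.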