I'm trying to convert a file System.Web.WebPages.Razor.dll.refresh from ASCII to UTF-16LE. When I run the file -i command on other refresh files in the directory, I get something like:

System.Web.Optimization.dll.refresh: text/plain; charset=utf-16le

And when I run it on my target file I get:

System.Web.WebPages.Razor.dll.refresh: text/plain; charset=us-ascii

I think this encoding difference is causing an error in my build pipeline, so I'm trying to convert this ASCII file to UTF-16LE so it's like the other refresh files. However, iconv doesn't seem to be giving me the output I'm looking for.

My command:

iconv -f US-ASCII -t UTF-16LE "System.Web.WebPages.Razor.dll.refresh" > "System.Web.WebPages.Razor.dll.refresh.new" && mv -f "System.Web.WebPages.Razor.dll.refresh.new" "System.Web.WebPages.Razor.dll.refresh"

There are two issues with the output.

1) It spaces the file out (i.e. from this to t h i s).

2) When I run file -i on this new file, I get the following output:

System.Web.WebPages.Razor.dll.refresh: application/octet-stream; charset=binary

Why am I getting this binary output, and why is it spacing out the text? Is there a better way to convert this file to the proper encoding?

Ben Bynum
    If you're looking at a UTF-16 encoded file in something that expects one-byte code units instead of its 2-byte units, yeah, you're going to get funny results. – Shawn Aug 19 '19 at 18:55

1 Answer

file is showing your new file as binary data because it relies on a leading byte order mark (BOM) to tell whether the contents are encoded in UTF-16. When you specify the endianness explicitly, iconv leaves out the BOM:

$ iconv -f us-ascii -t utf16le <<<test | xxd
00000000: 7400 6500 7300 7400 0a00                 t.e.s.t...

However, if you let it use the native endianness (which on typical modern hardware is going to be LE 99% of the time):

$ iconv -f us-ascii -t utf16 <<<test | xxd
00000000: fffe 7400 6500 7300 7400 0a00            ..t.e.s.t...

the mark is there, and file -i will report it as foo.txt: text/plain; charset=utf-16le.

I'm not aware of a way to force iconv to always add the BOM when an explicit UTF-16 endianness is given. Instead, here's a Perl one-liner that will convert to explicit UTF-16LE and add the BOM:

perl -0777 -pe 'BEGIN{binmode STDOUT,":encoding(utf16le)"; print "\x{FEFF}"}' in.txt > out.txt

Or alternatively using printf to print the LE-encoded BOM and iconv for the rest:

(printf "\xFF\xFE"; iconv -f us-ascii -t utf-16le in.txt) > out.txt
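
To check that the result really starts with the BOM, here is a rough sketch that wraps the printf/iconv approach in a function and dumps the output bytes. The names to_utf16le_bom and hex_bytes are made up for this example, and it assumes an iconv that accepts the US-ASCII, UTF-16LE and UTF-16 encoding names. It writes the BOM with octal escapes, since \x escapes are not guaranteed in every printf:

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): write the UTF-16LE
# BOM, then the converted contents of the file given as $1, to stdout.
to_utf16le_bom() {
    # \377\376 is octal for the bytes FF FE (the little-endian BOM).
    printf '\377\376'
    iconv -f US-ASCII -t UTF-16LE "$1"
}

# Hypothetical helper: print a file's bytes as one continuous hex string.
hex_bytes() {
    od -An -v -tx1 "$1" | tr -d ' \n'
}

tmpdir=$(mktemp -d)
printf 'test\n' > "$tmpdir/in.txt"
to_utf16le_bom "$tmpdir/in.txt" > "$tmpdir/out.txt"

# Show the raw bytes; the dump should begin with fffe.
hex_bytes "$tmpdir/out.txt"; echo

# Convert back to ASCII; decoding as plain UTF-16 lets iconv read the BOM
# to pick the byte order.
iconv -f UTF-16 -t US-ASCII "$tmpdir/out.txt"

rm -rf "$tmpdir"
```

The hex dump should match the xxd output shown earlier for the BOM case. The 00 byte after each character is also why the text looks spaced out in a viewer that treats every byte as one character.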
Shawn