7

I am creating UTF16 text files with Matlab, which I am later reading in using Java. In Matlab, I open a file called fileName and write to it as follows:

fid = fopen(fileName, 'w','n','UTF16-LE');
fprintf(fid,"Some stuff.");

In Java, I can read the text file using the following code:

FileInputStream fileInputStream = new FileInputStream(fileName);
Scanner scanner = new Scanner(fileInputStream, "UTF-16LE"); 
String s = scanner.nextLine();

Here is the hex output:

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13
00000000  73 00 6F 00 6D 00 65 00 20 00 73 00 74 00 75 00 66 00 66 00  s.o.m.e. .s.t.u.f.f.

The above approach works fine, but I want to write the file as UTF-16 with a BOM so that I don't have to worry about big- or little-endian byte order. In Matlab, I've coded:

fid = fopen(fileName, 'w','n','UTF16');
fprintf(fid,"Some stuff.");

In Java, I change the code to:

FileInputStream fileInputStream = new FileInputStream(fileName);
Scanner scanner = new Scanner(fileInputStream, "UTF-16");
String s = scanner.nextLine();

In this case, the string s is garbled because Matlab is not writing the BOM, and Java's "UTF-16" decoder assumes big-endian byte order when no BOM is present. If I add the BOM manually, the Java code works fine. With the added BOM, the following file is read correctly.

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15
00000000  FF FE 73 00 6F 00 6D 00 65 00 20 00 73 00 74 00 75 00 66 00 66 00  ÿþs.o.m.e. .s.t.u.f.f.
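The effect of the BOM on the Java side can be checked without a file. The following is a minimal sketch, not the code from my application: the class and method names (BomDecodeDemo, withLittleEndianBom, decodeUtf16) are made up for illustration, and the byte array copies the first hex dump above. It relies on the documented behavior of Java's UTF-16 charset, which honors a leading BOM when decoding and otherwise assumes big-endian order.

```java
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    // Little-endian UTF-16 bytes for "some stuff", copied from the first hex dump (no BOM).
    static final byte[] NO_BOM = {
        0x73, 0, 0x6F, 0, 0x6D, 0, 0x65, 0, 0x20, 0,
        0x73, 0, 0x74, 0, 0x75, 0, 0x66, 0, 0x66, 0
    };

    // Hypothetical helper: prepend the little-endian BOM (FF FE) to a byte array.
    static byte[] withLittleEndianBom(byte[] body) {
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    // Decode with the generic "UTF-16" charset, the same one passed to Scanner above.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // With the BOM, the byte order is detected and the BOM itself is consumed.
        System.out.println(decodeUtf16(withLittleEndianBom(NO_BOM)));
        // Without the BOM, the bytes are read as big-endian, so the text does not match.
        System.out.println(decodeUtf16(NO_BOM).equals("some stuff"));
    }
}
```

Without the BOM, the first two bytes 73 00 are read as the single code unit U+7300 rather than as 's', which is the garbling described above.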

How can I get Matlab to write out the BOM? I know I could write the BOM out separately, but I'd rather have Matlab do it automatically.

Addendum

I selected the answer below from Amro because it exactly solves the question I posed.

One key discovery for me was the difference between the Unicode Standard and a UTF (Unicode transformation format); see http://unicode.org/faq/utf_bom.html. The Unicode Standard provides unique identifiers (code points) for characters. UTFs provide mappings of every code point "to a unique byte sequence." Since all but a handful of the characters I am using are in the first 128 code points, I'm going to switch to UTF-8, as Romeo suggests. UTF-8 is supported by both Matlab and Java, so the warning shown below won't need to be suppressed, and for my application it will generate smaller text files.

I suppress the Matlab warning

Warning: The encoding 'UTF-16LE' is not supported.

with

warning off MATLAB:iofun:UnsupportedEncoding;
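The UTF-8 versus UTF-16 point in this addendum can be checked from the Java side. This is only a sketch under my own assumptions: the class and method names (Utf8VersusUtf16, utf8Length, utf16Length, roundTripsInUtf8) are hypothetical, and the two sample characters (a with macron, U+0101, and e with breve, U+0115) stand in for the kind of characters I need.

```java
import java.nio.charset.StandardCharsets;

class Utf8VersusUtf16 {
    // Number of bytes the text occupies when encoded as UTF-8 (no BOM is written).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies when encoded as UTF-16LE (no BOM is written).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    // True if encoding to UTF-8 and decoding again gives back the same string.
    static boolean roundTripsInUtf8(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        return new String(encoded, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        String ascii = "Some stuff.";
        String accented = "\u0101\u0115"; // a with macron, e with breve

        System.out.println(utf8Length(ascii) + " vs " + utf16Length(ascii));       // 11 vs 22
        System.out.println(utf8Length(accented) + " vs " + utf16Length(accented)); // 4 vs 4
        System.out.println(roundTripsInUtf8(accented));                            // true
    }
}
```

For mostly-ASCII text the UTF-8 file is about half the size, and the accented vowels still survive the round trip, since both encodings cover the same code points.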
Richard Povinelli

3 Answers

4

On my system MATLAB reports that UTF-16 is not supported. I think it will be safer to use UTF-8. Besides, UTF-8 will solve your problem with Little Endian/Big Endian.

vharavy
  • I would normally use UTF-8, but I need characters that are only in the UTF-16 character set, such as vowels with macrons and breves. So UTF-8 won't work for what I'm doing. – Richard Povinelli Nov 10 '11 at 02:53
  • 2
    @Richard: What on earth are you talking about? UTF-8 and UTF-16 are different encodings for exactly the same characters. Without any understanding of Unicode at all you are unlikely to encounter success. – Hugh Nov 10 '11 at 04:47
  • 2
    @Huw: +1. Both UTF-8 and UTF-16 are capable of representing every character in the Unicode character set (there are 1112064 code points according to Wikipedia). You shouldn't blame him for the misunderstanding though, Unicode is a really complex standard after all :) – Amro Nov 10 '11 at 12:45
  • @Romeo: You are correct. I now understand that UTF-8 is a variable length encoding. The [Wikipedia UTF-8](http://en.wikipedia.org/wiki/UTF-8) page clarified that for me. – Richard Povinelli Nov 10 '11 at 14:05
  • @Huw: What on earth am I talking about? An incorrect understanding of UTF-8. You are correct. I thought that UTF-8 was a renaming of the ASCII standard, which is incorrect. – Richard Povinelli Nov 10 '11 at 14:08
  • @Romeo: Just a side note: Matlab supports UTF-16LE and UTF-16BE, but they generate a warning. I'm going to follow your suggestion to use UTF-8, as it completely avoids the big/little endian problem. – Richard Povinelli Nov 10 '11 at 14:14
  • @RichardPovinelli: Have you tried my solution? It works correctly for UTF-16. – Amro Nov 10 '11 at 16:23
2

Try the following code (I am using UNICODE2NATIVE and NATIVE2UNICODE functions to do the conversions):

%# convert string and write as bytes
str = 'Some stuff.';
b = unicode2native(str,'UTF-16');
fid = fopen('utf16.txt','wb');
fwrite(fid, b, 'uint8');
fclose(fid);

We can even check the hex values of the bytes written (first two being the BOM):

>> cellstr(dec2hex(b))'
ans = 
  Columns 1 through 10
    'FF'    'FE'    '53'    '00'    '6F'    '00'    '6D'    '00'    '65'    '00'
  Columns 11 through 20
    '20'    '00'    '73'    '00'    '74'    '00'    '75'    '00'    '66'    '00'
  Columns 21 through 24
    '66'    '00'    '2E'    '00'

>> char(b)
ans =
ÿþS o m e   s t u f f . 

Now we can read the created file using MATLAB's own methods:

%# read bytes and convert back to Unicode string
fid = fopen('utf16.txt', 'rb');
b = fread(fid, '*uint8')';          %# transpose to a row vector
fclose(fid);
str = native2unicode(b,'UTF-16')

Or use Java methods directly if you prefer:

scanner = java.util.Scanner(java.io.FileInputStream('utf16.txt'), 'UTF-16');
str = scanner.nextLine()
scanner.close()

Both should read the string correctly.
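For a standalone Java check outside MATLAB, something like the following could be used. It is only a sketch under assumptions: the class and method names (ReadBomFile, byteOrder, firstLine, roundTrip) are hypothetical, the byte array repeats the values listed above, and a temporary file stands in for utf16.txt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

class ReadBomFile {
    // Bytes listed above: BOM FF FE, then "Some stuff." in little-endian order.
    static final byte[] SAMPLE = {
        (byte) 0xFF, (byte) 0xFE,
        0x53, 0, 0x6F, 0, 0x6D, 0, 0x65, 0, 0x20, 0,
        0x73, 0, 0x74, 0, 0x75, 0, 0x66, 0, 0x66, 0, 0x2E, 0
    };

    // Report which UTF-16 byte order mark, if any, starts the data.
    static String byteOrder(byte[] b) {
        if (b.length >= 2) {
            int first = b[0] & 0xFF;
            int second = b[1] & 0xFF;
            if (first == 0xFF && second == 0xFE) return "little-endian";
            if (first == 0xFE && second == 0xFF) return "big-endian";
        }
        return "no BOM";
    }

    // Read the first line with the generic "UTF-16" charset, which honors the BOM.
    static String firstLine(Path file) throws IOException {
        try (Scanner scanner = new Scanner(file, "UTF-16")) {
            return scanner.nextLine();
        }
    }

    // Write SAMPLE to a temporary file, read it back, and clean up.
    static String roundTrip() {
        try {
            Path tmp = Files.createTempFile("utf16", ".txt");
            try {
                Files.write(tmp, SAMPLE);
                return firstLine(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(byteOrder(SAMPLE)); // little-endian
        System.out.println(roundTrip());       // Some stuff.
    }
}
```

Because the BOM tells the decoder the byte order, the reader never has to name UTF-16LE or UTF-16BE explicitly.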

Amro
0

When I try your command:

fid = fopen(fileName, 'w', 'n', 'UTF16');

This is what I see:

>> fid = fopen('foo.txt', 'w', 'n', 'UTF16');
Warning: The encoding 'UTF-16' is not supported.
See the documentation for FOPEN.

Are you sure that you're successfully opening the file the way that you want to? Are you maybe swallowing a warning message somewhere?

  • warning off MATLAB:iofun:UnsupportedEncoding; – Richard Povinelli Nov 10 '11 at 03:30
  • However, the file does get opened as a UTF16-LE text file. I get the same warning when I use 'UTF16-LE' as when I use 'UTF16'. Using 'UTF16-BE' gives a big endian encoding. The warning persists even though Matlab does support both UTF16-BE and UTF16-LE. – Richard Povinelli Nov 10 '11 at 03:30