24

Hello and thank you for reading my post.

My problem is the following: I want to compile a Java source file with "javac", and the file is UTF-8 encoded with a BOM (the OS is WinXP).

Below is what I do:

1) Create a file with "Notepad" and choose the UTF-8 encoding

dos> notepad Test.java
"File -> Save as..."
File name   : Test.java
Save as type: All Files
Encoding    : UTF-8
Save

2) Create a Java class in that file and save the file as in 1)

public class Test
{
    public static void main(String [] args)
    {
        System.out.println("This is a test.");
    }
}

3) Visualize the hexadecimal version of the file (first line)

dos> xxd Test.java | head -1
0000000: efbb bf70 7562 6c69 6320 636c 6173 7320  ...public class

Note: ef bb bf is the UTF-8 encoded BOM (the UTF-16 encoded BOM is FE FF in big-endian order and FF FE in little-endian order).

4) Try to compile this code with "javac"

dos> javac -encoding utf8 Test.java
Test.java:1: illegal character: \65279
?public class Test
^
1 error

Note: 65279 is the decimal value of the BOM code point (U+FEFF).
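
As a rough illustration of the two notes above, here is a minimal sketch that checks for the EF BB BF prefix and decodes it the way Java's UTF-8 decoder does. The class name `BomInspect` and the method `hasUtf8Bom` are made up for this example and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

public class BomInspect {
    // The three bytes Notepad writes at the start of a file saved as "UTF-8".
    public static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Hypothetical helper: true if the data starts with EF BB BF.
    public static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    public static void main(String[] args) {
        // Java's UTF-8 decoder does not drop the BOM; it decodes to the char U+FEFF.
        String decoded = new String(UTF8_BOM, StandardCharsets.UTF_8);
        int codePoint = decoded.codePointAt(0);

        System.out.println(decoded.length());               // 1
        System.out.println(codePoint);                      // 65279
        System.out.println(Integer.toHexString(codePoint)); // feff
    }
}
```

This is consistent with the `\65279` in the compiler message: the decoded BOM reaches the compiler as the first character of the source.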

My question is the following: how can I make the compilation work while:

  • keeping it UTF-8 encoded
  • and keeping the BOM?

Thank you for helping and best regards.

Léa

Léa Massiot
  • 4
    That’s right: you have to remove the BOM. It has no business in UTF-8, so of course it is an error. This is a long-standing Microsoft bug. Never ever put a BOM in UTF-8!!!!! – tchrist Mar 21 '12 at 20:56
  • Hello. Thank you for your answer. I used "Notepad++" to encode the file as "UTF8 without BOM". Compiling the code with "javac" now works. – Léa Massiot Mar 22 '12 at 09:20
  • 8
    @tchrist [The Unicode Standard (page 30)](http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf) allows for a BOM in UTF-8 so you have every right to put it there if you so wish. Why you'd want to is another story, but `javac` should handle it. – Sled Jul 09 '13 at 18:46
  • possible duplicate of [How to compile a java source file which is encoded as "UTF-8"?](http://stackoverflow.com/questions/1726174/how-to-compile-a-java-source-file-which-is-encoded-as-utf-8) – Joe Jan 20 '15 at 11:01

3 Answers

36

Trim the BOM and then use `javac -encoding utf8 x.java`
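
If no editor with a "UTF-8 without BOM" option is at hand, the trimming can also be done with a few lines of Java. The following is only a sketch under some assumptions: the class name `StripBom` and the method `stripUtf8Bom` are invented for this example, it needs Java 7 or later for `java.nio.file.Files`, and it overwrites the file in place, so keep a backup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class StripBom {
    // Hypothetical helper: returns the data without a leading EF BB BF,
    // or the same array unchanged if no UTF-8 BOM is present.
    public static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StripBom <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        byte[] original = Files.readAllBytes(file);
        byte[] stripped = stripUtf8Bom(original);

        if (stripped.length != original.length) {
            Files.write(file, stripped); // overwrites the file in place
            System.out.println("BOM removed from " + file);
        } else {
            System.out.println("No UTF-8 BOM found in " + file);
        }
    }
}
```

For the file from the question, running `java StripBom Test.java` and then `javac -encoding utf8 Test.java` would be the intended sequence.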

el fuego
  • This solved my javac compiling problem. But now the Windows 10 console still shows unknown characters like "???????????". – Manishoaham Jun 16 '19 at 17:26
  • Afaiu, `chcp 65001` should help you with console. – el fuego Jun 17 '19 at 19:54
  • Tried this also, issue not resolved. Open question marks "?????" converted into boxed question marks. Windows console still not recognizing text. Here that shows correct like: लोकसभा के चुनावी रण में सत्तारूढ़ भाजपा की ओर से सिर्फ नरेन्द्र मोदी ही दिखाई दे रहे हैं। – Manishoaham Jun 18 '19 at 11:17
  • This is what I haven't been able to solve for at least 3 months. Thanks for the stackoverflow! – Petr Fořt Fru-Fru Dec 02 '21 at 21:26
19

This isn't a problem with your text editor, it's a problem with javac! The Unicode spec says the BOM is optional in UTF-8; it doesn't say it's forbidden! If a BOM can be there, then javac HAS to handle it, but it doesn't. Actually, using the BOM in UTF-8 files IS useful to distinguish an ANSI-coded file from a Unicode-coded file.

The proposed solution of removing the BOM is only a workaround and not the proper solution.

This bug report indicates that this "problem" will never be fixed: https://web.archive.org/web/20160506002035/http://bugs.java.com/view_bug.do?bug_id=4508058

Since this thread is in the top 2 Google results for the "javac BOM" search, I'm leaving this here for future readers.

mklement0
Etienne Delavennat
  • 2
    Actually, the bug you reference has to do with the UTF-8 decoder; it has nothing to do with whether the *compiler* can be altered to detect and discard any BOM on a Java source file, which it can, and should. – Lawrence Dol Jun 06 '18 at 00:01
-1

https://stackoverflow.com/a/28043356/7050261

Actually, using the BOM in UTF-8 files IS useful to distinguish an ANSI-coded file from a Unicode-coded file.

Actually

  • The BOM is not about distinguishing ANSI and Unicode. Do not use a feature for a purpose it was not designed for.

  • UTF-8 was intentionally designed to be backward-compatible with ANSI, so a lot of code written to process formatted text that relies only on bytes 0..127 (XML, JSON, etc.) should work correctly with UTF-8 encoded text without any modification.

Community
  • Note: this is byte-level compatibility only; char-level calculations become wrong when UTF-8 is used in place of ANSI. – Nashev May 14 '21 at 09:37
  • 1
    UTF-8 is only backward-compatible with _ASCII_ (7-bit range, `0x0 - 0x7F`), not also with _ANSI_ (an ASCII _extension_ that also defines characters in the 8-bit range, `0x80 - 0xFF`, and that range is _not_ compatible with UTF-8). Yes, a BOM in a UTF-8 file serves to distinguish it from an ANSI (or OEM, ....) file. – mklement0 Nov 20 '21 at 20:00
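
The ASCII versus ANSI point in the last comment can be checked with a short experiment. This is only a sketch: the class `EncodingCompat` and its method `sameBytes` are invented for the example, and ISO-8859-1 is used as a stand-in for a Windows ANSI code page because every Java runtime is guaranteed to provide it (windows-1252 encodes "é" as the same single byte, 0xE9).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCompat {
    // Stand-in for an "ANSI" code page; an assumption made for portability.
    public static final Charset ANSI_LIKE = StandardCharsets.ISO_8859_1;

    // Hypothetical helper: true if the text has identical bytes in both encodings.
    public static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(ANSI_LIKE),
                             text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String ascii = "Test";
        String accented = "L\u00e9a"; // "Léa", escaped so this source file stays pure ASCII

        // Characters in the 7-bit range encode identically.
        System.out.println(sameBytes(ascii));    // true
        // Characters in the 8-bit range do not: one byte (E9) versus two (C3 A9).
        System.out.println(sameBytes(accented)); // false

        // Reading the single-byte form as UTF-8 yields a replacement character (U+FFFD).
        String misread = new String(accented.getBytes(ANSI_LIKE), StandardCharsets.UTF_8);
        System.out.println(misread.equals(accented)); // false
    }
}
```

A file that contains only ASCII characters therefore looks the same in both encodings, which is why a marker such as the BOM is one way to tell them apart.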