What is the protocol / relationship between encodings and programming languages?

Question

As a test I created a file called Hello.java and the contents are as follows:

public class Hello{
    public static void main(String[] args){
        System.out.println("Hello world!");
    }
}

I saved this file with UTF-8 encoding.

Anyway, compiling and running the program was no problem. This file was 103 bytes long.

I then saved the file with UTF-16 BE encoding. This time the file was 206 bytes long, since UTF-16 (usually) needs more space, so no surprise here.

I tried compiling the file from my terminal and got errors like this:

Hello.java:4: error: illegal character: '\u0000'
    }
    ^

So does javac work only with UTF-8 encoded source files? Is that like a standard?

javac -version
javac 1.8.0_45

Also, I only know Java, but let's say you are running Python code or code in any interpreted programming language. (Sorry if I am mistaken in thinking Python is interpreted.) Would the encoding be a problem? If not, would it have any effect on performance?

Ok so the word "true" is a reserved keyword (for a given programming language..) but in what encoding is it reserved? ASCII - UTF-8 only?

How "true" is stored in the hard drive or in memory depends on the encoding the file is saved in, so must a programming language expect always to work with a particular encoding for source files?

Sami Kuhmonen · Accepted Answer · 2016-01-31T19:23:40.960

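Regarding javac, you can set the source encoding with the -encoding parameter, for example javac -encoding UTF-16BE Hello.java. Without it, javac 8 falls back to the platform default encoding. Internally Java handles strings in UTF-16, so the compiler converts everything to that.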
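As a rough sketch of the -encoding parameter in action, the following drives the compiler through the standard javax.tools API instead of the command line. The class name EncodingFlagDemo and the helper compileUtf16Source are made up for illustration, and it assumes it runs on a JDK (on a plain JRE there is no system compiler).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class EncodingFlagDemo {
    // The program from the question, held as a Java string.
    static final String SOURCE =
        "public class Hello{\n"
        + "    public static void main(String[] args){\n"
        + "        System.out.println(\"Hello world!\");\n"
        + "    }\n"
        + "}\n";

    // Hypothetical helper: writes Hello.java as UTF-16BE bytes, then compiles it
    // while telling javac to decode the file as `declaredEncoding`.
    // Returns the compiler's exit code (0 means success).
    static int compileUtf16Source(String declaredEncoding) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler; run on a JDK, not a JRE");
        }
        try {
            Path dir = Files.createTempDirectory("encoding-demo");
            Path file = dir.resolve("Hello.java");
            Files.write(file, SOURCE.getBytes(StandardCharsets.UTF_16BE));
            // Capture diagnostics so failed runs do not spam the console.
            ByteArrayOutputStream diagnostics = new ByteArrayOutputStream();
            return compiler.run(null, diagnostics, diagnostics,
                "-encoding", declaredEncoding, "-d", dir.toString(), file.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("declared UTF-16BE: exit " + compileUtf16Source("UTF-16BE"));
        System.out.println("declared UTF-8:    exit " + compileUtf16Source("UTF-8"));
    }
}
```

The same bytes should compile when the declared encoding matches how they were written and fail when it does not, which is the situation in the question.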

The compiler must know the encoding so it can process the source code. It doesn't matter what compiler, interpreter or language it is. It's just like how a person can't take text in an arbitrary language and assume it's German.
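To see where the '\u0000' in the error message comes from, here is a small sketch that encodes a line of source as UTF-16BE and then decodes those bytes as UTF-8. It assumes javac read the file with an ASCII-compatible charset such as UTF-8; the class name WrongCharsetDemo and the helper misread are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    // Encode text as UTF-16BE, then (wrongly) decode those bytes as UTF-8.
    static String misread(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        return new String(utf16, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String line = "    }";
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = line.getBytes(StandardCharsets.UTF_16BE);
        // For ASCII-only text, UTF-16 uses two bytes per character.
        System.out.println(utf8.length + " bytes vs " + utf16.length + " bytes");

        // Print each character the mis-decoding produced as a \\uXXXX escape.
        for (char c : misread(line).toCharArray()) {
            System.out.printf("\\u%04x ", (int) c);
        }
        System.out.println();
    }
}
```

Every ASCII character in UTF-16BE is a zero byte followed by the ASCII byte, so a decoder expecting UTF-8 sees a NUL character in front of each real one.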

Keywords aren't reserved in any specific encoding. They are keywords: sequences of characters, not bytes. Once the source has been decoded, there is only one way to write a given word, no matter what encoding the file used. The words are the same.

A programming language doesn't care about encoding. The compiler or interpreter does.
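A minimal illustration of that point, assuming nothing beyond the standard library: the keyword true has different byte representations in different encodings, but decoding each with the matching charset gives back the same four characters. The names KeywordEncodingDemo and roundTrip are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeywordEncodingDemo {
    // Encode with the given charset, then decode with that same charset.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String keyword = "true";
        Charset[] charsets = {
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            byte[] bytes = keyword.getBytes(cs);
            // The bytes differ per charset; the decoded text does not.
            System.out.println(cs + ": " + Arrays.toString(bytes)
                + " -> " + roundTrip(keyword, cs));
        }
    }
}
```

The lexer only ever compares decoded characters, which is why the keyword itself is not tied to ASCII or UTF-8.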

edited Jan 31 '16 at 19:23

answered Jan 31 '16 at 19:17

Will .class files be always utf-8? – Koray Tugay Jan 31 '16 at 19:22
@KorayTugay Added mention of that. Java uses UTF-16 internally – Sami Kuhmonen Jan 31 '16 at 19:24
What if an interpreter needs to process files in different encodings? Like one file is utf8 but the dependency is utf16? – Koray Tugay Jan 31 '16 at 19:24
@KorayTugay Depends on the meaning of process. If source files, it has to be told what encoding they are in – Sami Kuhmonen Jan 31 '16 at 19:26
I think Java uses UTF-16 for Strings and Characters at runtime, or do you mean all .class files are UTF-16 encoded as well? – Koray Tugay Feb 01 '16 at 06:54
@KorayTugay Actually, I checked. The class files have everything saved as UTF-8; only at runtime are strings stored as UTF-16 – Sami Kuhmonen Feb 01 '16 at 07:29
So I guess we can say that even if I have source code in UTF-16, it will be converted to UTF-8 class files. I see. But I still wonder what happens when you run a Python program with UTF-16 encoded code that depends on a UTF-8 encoded file. – Koray Tugay Feb 01 '16 at 07:32
