Is it a good idea to use unicode symbols as Java identifiers?

Question

I have a snippet of code that looks like this:

double Δt = lastPollTime - pollTime;
double α = 1 - Math.exp(-Δt / τ);
average += α * (x - average);

Just how bad an idea is it to use unicode characters in Java identifiers? Or is this perfectly acceptable?

I'm not sure whether I just upvoted that, or downvoted it... — Thomas, May 08 '10 at 12:00
On a side note, you may be interested in checking out the Fortress language, developed at Sun by (among others) Guy L Steele. It supports a wide range of Unicode operators and even the ASCII ones can be 'pretty-printed' into Unicode -- see http://projectfortress.sun.com/Projects/Community/wiki/MathSyntaxInFortress — Cowan, May 08 '10 at 12:46
It reminds me of [APL](http://en.wikipedia.org/wiki/APL_%28programming_language%29). Tell me how comfortable you would be using that as a programming language? — bart, May 08 '10 at 13:23

Thomas · Accepted Answer · 2010-05-08T13:45:11.650

It's a bad idea, for various reasons.

Many people's keyboards do not support these characters. If I were to maintain that code on a qwerty keyboard (or any other without Greek letters), I'd have to copy and paste those characters all the time.
Some people's editors or terminals might not display these characters properly. For example, some editors (unfortunately) still default to some ISO-8859 (Latin) variant. The main reason why ASCII is still so prevalent is that it nearly always works.
Even if the characters can be rendered properly, they may cause confusion. Straight from Sun (emphasis mine):

Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK CAPITAL LETTER ALPHA (A, \u0391), CYRILLIC SMALL LETTER A (a, \u0430) and MATHEMATICAL BOLD ITALIC SMALL A (a, \ud835\udc82) are all different.

...

Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting, but these are different in identifiers.

This is in no way an imaginary problem: α (U+03b1 GREEK SMALL LETTER ALPHA) and ⍺ (U+237a APL FUNCTIONAL SYMBOL ALPHA) are different characters!
There is no easy way to tell which characters are valid. The characters from your code work, but when I use the APL FUNCTIONAL SYMBOL ALPHA, my Java compiler complains about "illegal character: \9082", even though the functional symbol would arguably be more appropriate in this code. There seems to be no solid rule about which characters are acceptable, short of asking Character.isJavaIdentifierPart().
Even though you may get it to compile, it seems doubtful that all Java virtual machine implementations have been rigorously tested with Unicode identifiers. If these characters are only used for variables in method scope, they should get compiled away, but if they are class members, they will end up in the .class file as well, possibly breaking your program on buggy JVM implementations.
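
To make the points about lookalike and invalid characters concrete, here is a minimal sketch, not part of the original answer. The class name `IdentifierCheck` and its helper methods are hypothetical; only `Character.isJavaIdentifierStart`, `Character.isJavaIdentifierPart` and `java.text.Normalizer` come from the standard library. The characters are written as Unicode escapes inside string literals so the example does not depend on the encoding of the source file.

```java
import java.text.Normalizer;

public class IdentifierCheck {

    // True if every code point of name is allowed at its position in a Java
    // identifier. Reserved words such as "if" are deliberately not rejected here.
    static boolean isValidIdentifier(String name) {
        if (name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        int i = Character.charCount(first);
        while (i < name.length()) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // True if the two strings become identical under NFC normalization.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String deltaT = "\u0394t";        // GREEK CAPITAL LETTER DELTA followed by t
        String aplAlpha = "\u237a";       // APL FUNCTIONAL SYMBOL ALPHA
        String composed = "\u00c1";       // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = "A\u0301";    // A followed by COMBINING ACUTE ACCENT

        System.out.println(isValidIdentifier(deltaT));          // true
        System.out.println(isValidIdentifier(aplAlpha));        // false
        System.out.println(composed.equals(decomposed));        // false
        System.out.println(sameAfterNfc(composed, decomposed)); // true
    }
}
```

The last two lines illustrate the quoted warning: the composed and decomposed forms render alike and normalize to the same string, yet they are different sequences of characters, so they would name different identifiers.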

To expand on the last point: you're dependent on the default file encoding of the underlying platform. Although this is controllable using `-Dfile.encoding` on Sun JVM's (yes, JVM implementation dependent...), you *really* don't want to be dependent on that. That's the major showstopper imo. Great answer btw, +1. — BalusC, May 09 '10 at 02:22
@BalusC: Thanks, but I think you misunderstood. In the internals of `.class` files, only one encoding is used, and it's something similar to UTF-8. http://en.wikipedia.org/wiki/Class_%28file_format%29 As far as I could determine, `file.encoding` is only used to specify the default encoding for classes like `InputStreamReader`. — Thomas, May 09 '10 at 08:55
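
As an illustration of the "something similar to UTF-8" remark, here is a rough sketch under stated assumptions. It relies on `DataOutputStream.writeUTF`, which is documented to write a two-byte length followed by "modified UTF-8", the same variant that class files use for their string constants. The class name `ModifiedUtf8Demo` and its helper methods are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    // Encodes s with DataOutputStream.writeUTF and strips the two-byte length
    // prefix, leaving only the modified UTF-8 bytes.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buffer.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // GREEK SMALL LETTER ALPHA: same bytes as in standard UTF-8.
        System.out.println(hex(modifiedUtf8("\u03b1"))); // ce b1
        // The NUL character is where the two encodings differ:
        // modified UTF-8 uses two bytes instead of a single zero byte.
        System.out.println(hex(modifiedUtf8("\u0000"))); // c0 80
    }
}
```

For characters such as α the bytes are identical to ordinary UTF-8, which is consistent with the point that the `.class` format itself does not depend on the platform's default encoding.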

score 9 · Answer 2 · answered May 08 '10 at 11:04

It looks good, as it uses the correct symbols, but how many people on your team will know the keystrokes for those symbols?

I would use an English representation just to make it easier to type. Others might also not have a character set that supports those symbols set up on their PC.

score 7 · Answer 3 · answered May 08 '10 at 11:22

That code is fine to read, but horrible to maintain. I suggest using plain English identifiers, like so:

double deltaTime = lastPollTime - pollTime;
double alpha = 1 - Math.exp(-delta....

Crozin


JUST MY correct OPINION · Answer 4 · 2010-05-09T02:10:58.090

It is perfectly acceptable if it is acceptable in your working group. A lot of the answers here operate on the arrogant assumption that everybody programs in English. Non-English programmers are by no means rare these days and they're getting less rare at an accelerating rate. Why should they restrict themselves to English versions when they have a perfectly good language at their disposal?

Anglophone arrogance aside, there are other legitimate reasons for using non-English identifiers. If you're writing mathematics packages, for example, using Greek is fine if your target is fellow mathematicians. Why should people type out "delta" in your workgroup when everybody can understand "Δ" and likely type it more quickly? Almost any problem domain will have its own jargon and sometimes that jargon is expressed in something other than the Latin alphabet. Why on Earth would you want to try and jam everything into ASCII?

edited May 09 '10 at 02:10

answered May 08 '10 at 11:30


Absolutely agree; I think if the working group considers it acceptable, easy to type, and more clear, go for it. The only weird thing about doing this is that it is, in a way, a 'fluke' that a character like Δ is a valid Java identifier start, because it's a 'letter'. Other characters with similar uses don't happen to be 'letters', and hence are invalid. – Cowan May 08 '10 at 12:42
-1 for "you suck because you only know English". Until someone invents a spoken language like Python I will not have any reason to learn it. Although everyone in the world should only speak one language. Language is a basic need, not a game, like programming. It's okay to use algebraic symbols though _when you're in a specific domain_. – L̲̳o̲̳̳n̲̳̳g̲̳̳p̲̳o̲̳̳k̲̳̳e̲̳̳ May 08 '10 at 14:07
@Longpoke: Please point to where I said "you suck because you only know English". (Hint: This is not possible.) Hell, point to where I even *inferred* this. (Hint: This, too, is not possible.) What I am pointing out, however, is that the people saying "don't use Unicode in identifiers because it makes things difficult to read" are taking the **very** arrogant attitude that only English-speaking programmers count. Hence "anglophone arrogance". – JUST MY correct OPINION May 08 '10 at 14:56
The problem is that the _keywords_ in Java are English. `if`, `while`, `public`, `class` etc, as well as all methods in the runtime library. By using another language for identifiers and methods, you have a situation where the reader must mentally switch continuously between two languages when reading the code. That is simply harder than having only one language, even if the reader is proficient in both. – Thorbjørn Ravn Andersen May 08 '10 at 17:46
@ttmrichter (unrelated to this answer) could you undelete your answer here - http://stackoverflow.com/questions/2707516/is-javaee-really-portable :) – Bozho May 08 '10 at 20:39
@Thorbjørn: The keywords in Java are pseudo-English. The "if" of Java is not the "if" of English. It is the "if" of formal logic which bears only a passing resemblance to English. The same is true of "while", "public", "class", et al. These are not words. They are symbols. We do not process them as English words. We process them as symbols which have a specified meaning in Java only (and often a completely different meaning in another programming language!). So we're ALREADY switching continuously between two languages. By using identifiers in our native tongue this is explicit. – JUST MY correct OPINION May 09 '10 at 02:08
@Bozho: I don't even remember deleting that nor why I did. Mysterious. It's undone. – JUST MY correct OPINION May 09 '10 at 02:08
@Thorbjørn @ttmrichter: It would probably make more sense to encode keywords such as `if` and `while` in the source code as some symbol, or even just leave them as they are now, then let the IDE translate them to the user's language. Yes they don't directly map onto spoken language, but they are very close, when I see `if (x == 2) { f() }` I think, "if `x` is equal to 2, call `f()`", maybe it's not like this in other human languages, who knows. – L̲̳o̲̳̳n̲̳̳g̲̳̳p̲̳o̲̳̳k̲̳̳e̲̳̳ May 09 '10 at 02:56
@Longpoke: It is, in fact, not like this in several other human languages. The things most people think they know about grammar are completely wrong. SVO, for example, is not only not universal, the very notions of "subject" and "object" are not universal. (Linguists use the terms "agent", "experiencer" and "patient" and describe linguistic cases in terms of these.) Conditional structures are not the same across languages. Double-negatives are not positives in many languages, they're emphasizers. "Not not red" means "very not-red" instead of "red". That kind of stuff. – JUST MY correct OPINION May 09 '10 at 03:15
@ttmrichter, you may be somewhat right in terms of the keywords, but not in terms of the identifiers used in the runtime library. It is close to impossible to write any non-trivial Java program without referring to the runtime library, and that contains tons of camel-cased English words. And, yes, I speak from personal experience. The attempts we have made so far to write Danish words into Java programs did not go very well, and I've concluded the language switching is the cause. The sole exception would be domain-specific concepts with no reasonable English translation. – Thorbjørn Ravn Andersen May 09 '10 at 09:06
@LongPoke, too many symbols also make programs unreadable. Case in point: APL. The COBOL language is old and looked down upon, but it is so English-like that you can frequently understand what it does by just reading the words. Readability is probably the most important aspect of programs besides doing what they are intended to do. – Thorbjørn Ravn Andersen May 09 '10 at 09:11
@Thorbjørn: First, readability is in the eye of the beholder. A Chinese user is going to have a different idea of what 'readable' means than is a German or Swedish or English user. Second, the (standard) runtime library is one of my complaints with Java, precisely because it's a huge, chatty mess of English. – JUST MY correct OPINION May 09 '10 at 09:35
@ttmrichter. I acknowledge you know a lot more about the Chinese mindset than me, and that it may work differently for very-non-English speakers. Will you, in turn, acknowledge that the two-language mindset, at least for Danish speakers, makes it less readable than just one? The runtime library is as it is: English. What would you suggest instead? – Thorbjørn Ravn Andersen May 09 '10 at 13:26
The problem is I'm also a near-native German and a semi-competent French speaker. I have no personal difficulty switching from German to English and back when reading code written by Germans. Indeed I find Germans writing English in code/comments more distracting than their writing German because their English is usually so non-idiomatic. So, from personal experience, I'm going to have to say I still disagree. Of course I disagree from the perspective of a native ENGLISH speaker dealing with foreign writers of code. I'm not sure how it would feel were I a German writing code. – JUST MY correct OPINION May 09 '10 at 14:49
I'm not a native English speaker, but I disagree. Adding accents or other complicated things in ids is just calling for bugs. Maybe not in Java, but in JSP and EL for sure. And what if I'm French, one guy is Chinese and the other is from Saudi Arabia? We should find a language that everybody is comfortable with... And then, what if the French guy goes away and an English one comes in? – Paolo Apr 22 '14 at 15:24

score 2 · Answer 5 · answered May 08 '10 at 11:39

It's an excellent idea. Honest. It's just not easily practicable at this time. Let's keep a reference to it for the future. I would love to see triangles, circles, squares, etc. as part of program code. But for now, please do try to rewrite it the way Crozin suggests.

score 1 · Answer 6 · answered May 08 '10 at 11:04

Why not? If the people working on that code can type those easily, it's acceptable.

But god help those who can't display unicode, or who can't type them.

LukeN


Anybody who can't display Unicode by this point needs to get out of the '80s and into the 21st century. I mean flipping RSTS/E had the beginnings of i18n in place! – JUST MY correct OPINION May 08 '10 at 11:25
@ttmrichter: You would be right if there weren't a huge number of misconfigured machines and outdated software around... – Thomas May 08 '10 at 12:05
Also in the unix and linux world there's a lot of people using vim or emacs inside the console to do their stuff, and there's no guarantee they can see or write unicode characters. – LukeN May 08 '10 at 12:32
If vim and emacs can't display characters from a standard that's been around for almost two decades, perhaps their reputation as a productive developer tool is drastically overrated. Or if it's the Unix systems' fault, perhaps Unix isn't the be-all/end-all system it's cracked up to be. Seriously. Get with the 21st century. It's lovely up here. (Thankfully my Linux box seems to cope with the 21st century just fine, given where I live and all that.) – JUST MY correct OPINION May 08 '10 at 12:58

score 1 · Answer 7 · answered May 08 '10 at 12:00

In a perfect world, this would be the recommended way.

Unfortunately, you run into character-encoding issues as soon as you move outside plain 7-bit ASCII (UTF-8 differs from ISO-Latin-1, which differs from UTF-16, etc.), so you will eventually run into problems. This happened to me when moving from Windows to Linux. Our national Scandinavian characters broke in the process, but fortunately only in strings. We then used the \u encoding for all of those.

If you can be absolutely certain that you will never, ever run into such a thing (for instance, if your files contain a proper BOM), then by all means, do this. It will make your code more readable. If there is even the smallest amount of doubt, then don't.

(Please note that "use non-English languages" is a different matter. I'm only thinking of using symbols instead of letters.)
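
The `\u` workaround mentioned above can be sketched roughly as follows. This is an illustrative example, not the answerer's actual code: the class `UnicodeEscapeDemo` and its methods are hypothetical. Java translates Unicode escapes before tokenizing the source, so they work in identifiers as well as in string literals, and the file itself stays pure ASCII.

```java
public class UnicodeEscapeDemo {

    // The escape below spells GREEK CAPITAL LETTER DELTA, so this declares
    // and uses a local variable whose name is Delta followed by t.
    static double doubledInterval() {
        double \u0394t = 2.5;
        return \u0394t * 2;
    }

    // A string with two Scandinavian letters written as escapes,
    // so it survives any change of source file encoding.
    static String danishWord() {
        return "Sm\u00f8rrebr\u00f8d";
    }

    public static void main(String[] args) {
        System.out.println(doubledInterval());     // 5.0
        System.out.println(danishWord().length()); // 10
    }
}
```

The trade-off is readability: the escaped identifier is safe from encoding problems but loses exactly the visual benefit that motivated using the symbol in the first place.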

Those symbols *are* non-English languages. Delta and alpha are Greek. That's a language. That isn't English. — JUST MY correct OPINION, May 08 '10 at 12:59
@ttmricher, I was referring to using identifiers in your native language as opposed to use the English terms. (Like Cheval instead of Horse if French). This is different from using "Δ" in the _mathematical_ sense as asked. — Thorbjørn Ravn Andersen, May 08 '10 at 13:41
