I am implementing an application that calculates the readability of Java files using the readability formula proposed by Posnett, Hindle and Devanbu (here).

The formula is: z = 8.87 - 0.033 * Volume + 0.40 * Lines - 1.5 * Entropy
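As a quick sanity check of the arithmetic, here is a minimal Python sketch of the formula exactly as quoted above. The function name `readability_z` and its parameter names are hypothetical, and the three inputs are assumed to have been computed elsewhere in the way the paper defines them.

```python
def readability_z(volume, lines, entropy):
    """Evaluate z = 8.87 - 0.033 * Volume + 0.40 * Lines - 1.5 * Entropy.

    The coefficients are copied from the formula quoted in the question;
    how volume, lines and entropy are measured is left to the caller.
    """
    return 8.87 - 0.033 * volume + 0.40 * lines - 1.5 * entropy


# Made-up inputs, only to show the call: 8.87 - 3.3 + 4.0 - 3.0
print(readability_z(volume=100, lines=10, entropy=2))
```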

They say that Entropy is calculated from the counts of terms (tokens or bytes) as well as the number of unique terms and bytes.

I did some research, but couldn't find a definition of a "term" in Java. The only thing I found was this, which lists some "useful Java terms", but I doubt these are the only terms in Java.

So, what should I consider as Java terms? Can anyone give me an explanation?

João Alves

2 Answers

You're confusing different usages of the word "term". Two relevant definitions are:

  • A word/phrase that has a special meaning in a particular context. A biology teacher might say "make sure to study the terms from Chapter 14 for the quiz tomorrow". This is the usage of "term" in your list of "useful Java terms".
  • One element in a sequence of things. For instance, if you have the sequence of characters `qwerty`, then `w` is a term because it's one of those characters. This is the definition used in the entropy calculation. Specifically, "term" can mean an individual character (byte) in the source code, or a "token" in Java, which is the smallest piece of code that has a meaning of its own in the Java syntax (`int foo = bar-3;` contains the tokens `int`, `foo`, `=`, `bar`, `-`, `3`, and `;`).

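Note: When dealing with programming, "byte" is sometimes used synonymously with "character", because in ASCII-compatible encodings each ASCII character is stored in one byte of memory. Non-ASCII characters can take more than one byte (for example, in UTF-8).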
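To make the second definition concrete, here is a minimal Python sketch that computes entropy over a sequence of terms. It assumes the standard Shannon entropy in bits, H = -Σ p(x) · log2 p(x), where p(x) is a term's count divided by the total number of terms; check the paper for the exact variant it uses. The function name `entropy` is hypothetical, and the token list is typed in by hand from the example above.

```python
import math
from collections import Counter


def entropy(terms):
    """Shannon entropy (in bits) of any sequence of hashable terms."""
    counts = Counter(terms)          # occurrences of each unique term
    total = sum(counts.values())     # total number of terms
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


source = "int foo = bar-3;"

# Terms as bytes: iterating over a bytes object yields one int per byte.
byte_terms = source.encode("utf-8")

# Terms as tokens: listed manually here, following the example above.
token_terms = ["int", "foo", "=", "bar", "-", "3", ";"]

print(entropy(byte_terms))
print(entropy(token_terms))  # seven distinct tokens, one occurrence each
```

The same function works for either choice of term because it only looks at counts, so switching between bytes and tokens is just a matter of what sequence you pass in.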

k_ssb
  • Thank you very much for your answer! That is exactly what I needed. But, just to clarify: given, for example, [this](https://pastebin.com/PmvXkWHb) code, there are 17 tokens and 83 bytes, right? I counted these tokens: `public`, `class`, `Example`, `{}`, `{}`, `public`, `void`, `exampleMethod`, `()`, `System`, `.`, `out`, `.`, `println`, `()`, `"Hello, world!"` and `;`. Did I make a mistake when counting the tokens? – João Alves May 14 '18 at 00:25
  • I would treat `{` and `}` as separate tokens, and similarly for `(` and `)`. So I count 21 tokens. As for the number of bytes, it's not clear if we should count whitespace (indentation) -- I'd just pick something reasonable (either exclude whitespace or not). There's no hard answer to this. – k_ssb May 14 '18 at 00:29
  • If I may, I would like to ask you another question about this subject. I was searching for a way to count the tokens and found this answer. Can I use this method to count the number of Java tokens, or do I have to parse the code with a parser? @pkpnd – João Alves May 14 '18 at 22:27
  • Counting tokens should be done with a parser. This is getting pretty off-topic from your posted question, so if you need more help with using a parser, please post a new question. – k_ssb May 15 '18 at 04:25
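To illustrate the token count discussed in these comments, here is a rough Python sketch. It is a regex-based approximation, not a real Java lexer or parser, so it will get edge cases wrong (Unicode identifiers, text blocks, and so on); for real measurements a proper parser is the safer route, as the comment above says. The names `TOKEN_RE`, `java_tokens` and `EXAMPLE` are hypothetical, and `EXAMPLE` is reconstructed from the token list in the comment, since the pastebin contents are not reproduced in this thread.

```python
import re

# Alternatives are tried in order; comments are matched so they can be skipped.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)        # line and block comments
  | (?P<string>"(?:\\.|[^"\\\n])*")        # string literals
  | (?P<char>'(?:\\.|[^'\\\n])+')          # character literals
  | (?P<number>\d[\w.]*)                   # numeric literals (very rough)
  | (?P<word>[A-Za-z_$][\w$]*)             # keywords and identifiers
  | (?P<op>>>>=|>>>|<<=|>>=|->|::|\+\+|--|&&|\|\||<<|>>|[=!<>+\-*/&|^%]=)
  | (?P<punct>[^\s\w])                     # any other single symbol
""", re.VERBOSE | re.DOTALL)


def java_tokens(source):
    """Return an approximate list of Java tokens, ignoring comments."""
    return [m.group() for m in TOKEN_RE.finditer(source)
            if m.lastgroup != "comment"]


# Assumed reconstruction of the example from the comments.
EXAMPLE = """
public class Example {
    public void exampleMethod() {
        System.out.println("Hello, world!");
    }
}
"""

tokens = java_tokens(EXAMPLE)
print(tokens)
print(len(tokens))  # 21, with each brace and parenthesis counted separately
```

Whitespace is dropped here because it is not a token; whether to include it in the byte count is a separate choice, as noted above.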
It's not specific to Java. There is such a thing as a 'term' in Java, and you will find it in the JLS, but that's not what they're talking about. They are talking about tokens or bytes in general, not anything language-specific. In one place they write tokens *and* bytes, which appears to be a mistake.

The terms here can be bytes or tokens, and we use both in this paper. [emphasis added]

user207421
  • So, do you know what exactly I should consider as a term in this specific case? Should it be Java keywords? Operands? Operators? – João Alves May 13 '18 at 23:52
  • Java tokens: keywords, identifiers, literals, operators, punctuation characters, ... *Or* bytes. – user207421 May 13 '18 at 23:55