How can I safely use a Java byte as an unsigned char?

Question

I am porting some C code that uses a lot of bit manipulation into Java. The C code operates under the assumption that int is 32 bits wide and char is 8 bits wide. There are assertions in it that check whether those assumptions are valid.

I have already come to terms with the fact that I'll have to use long in place of unsigned int. But can I safely use byte as a replacement for unsigned char?

They merely represent bytes, but I have already run into this bizarre incident: (data is an unsigned char * in C and a byte[] in Java):

/* C */
uInt32 c = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3];

/* Java */
long a = ((data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]) & 0xffffffff;
long b = ((data[0] & 0xff) << 24) | ((data[1] & 0xff) << 16) |
          ((data[2] & 0xff) << 8) | (data[3] & 0xff) & 0xffffffff;

You would think a left shift operation is safe. But due to the strange unary promotion rules in Java, a and b are not going to be the same if some of the bytes in data are "negative" (b gives the correct result).

What other "gotchas" should I be aware of? I really don't want to use short here.

There is a non-sign-extending version of the downshift operator, '>>>'. Java doesn't have an actual unsigned type. — keshlam, Jul 04 '15 at 05:23
@keshlam I don't think that would have helped in this particular case. — Confluence, Jul 04 '15 at 05:25
Left-shift does the right thing either way, though the bitwise result is expressed as a negative number if the top bit is set. You may need parens to control the order of operations, though. — keshlam, Jul 04 '15 at 05:39
Use a `ByteBuffer` for your use case; also, do you perform arithmetic on the results? Or is it simply to display it? — fge, Jul 04 '15 at 06:30
Assuming `data[0]` is an `unsigned char`, `data[0] << 24` is probably a *bad idea* because of [6.5.7p3](http://www.iso-9899.info/n1570.html#6.5.7p3)... `data[0]` would be promoted to an `int` (a C `int`), which doesn't necessarily have 32 bits. — autistic, Jul 04 '15 at 06:33
@fge I was referring to the C code, hence the reason I cited the C standard... *Hmmm, a question about C and Java... Java has no UB, so there couldn't be any UB here!* — autistic, Jul 04 '15 at 06:34
@undefinedbehaviour But that's UB iff `int` doesn't have at least 32 bits, correct? — Confluence, Jul 04 '15 at 15:28
@fge Yes, I could use `ByteBuffer`. The priority right now is to get the code to work, and a more 1:1 translation helps. The arithmetic is performed on the ints built using these bytes; and that worries me more. There is also pointer typecasting, which worries me the most. — Confluence, Jul 04 '15 at 15:43

score 5 · Accepted Answer · answered Jul 04 '15 at 06:03

You can safely use a byte to represent a value between 0 and 255 if you make sure to bitwise-AND its value with 255 (or 0xFF) before using it in computations. This promotes it to an int, and ensures the promoted value is between 0 and 255.

Otherwise, integer promotion would result in an int value between -128 and 127, using sign extension. -127 as a byte (hex 0x81) would become -127 as an int (hex 0xFFFFFF81).
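The sign extension described above can be checked directly. This is a minimal sketch, assuming a hypothetical class name `SignExtensionDemo` and helper names `promoted` and `unsigned` that are not part of the question. `Byte.toUnsignedInt` is the standard-library equivalent of `b & 0xFF` (Java 8 and later).

```java
final class SignExtensionDemo {
    // Widen a byte the way Java does implicitly: with sign extension.
    static int promoted(byte b) {
        return b; // implicit widening conversion keeps the sign
    }

    // Widen a byte as if it were an 8-bit C unsigned char.
    static int unsigned(byte b) {
        return b & 0xFF; // mask clears the sign-extended upper 24 bits
    }

    public static void main(String[] args) {
        byte b = (byte) 0x81;
        System.out.println(Integer.toHexString(promoted(b)));     // ffffff81
        System.out.println(Integer.toHexString(unsigned(b)));     // 81
        System.out.println(unsigned(b) == Byte.toUnsignedInt(b)); // true
        // Comparison gotcha: b is promoted to -127 before comparing with 129.
        System.out.println(b == 0x81);                            // false
    }
}
```

The last line shows a related pitfall: comparing an unmasked byte against a literal above 127 never matches, because the byte is promoted to a negative int first.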

So you can do this:

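long a = (((data[0] & 255) << 24) | ((data[1] & 255) << 16) | ((data[2] & 255) << 8) | (data[3] & 255)) & 0xffffffffL;

Note the L suffix on the final mask: without it, 0xffffffff is the int value -1, so the AND does nothing and the int result is sign-extended when it is assigned to the long. Also note that the first & 255 is unnecessary here, since the shift by 24 pushes the sign-extension bits out of the int and the final mask (& 0xffffffffL) clears whatever the widening to long adds. But it's probably simplest to just always include it.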
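The same read can be wrapped in a helper so the masking lives in one place. This is a rough sketch, assuming big-endian input as in the question. The class name `UInt32Reader` and both method names are made up for illustration. The second method uses `ByteBuffer`, as suggested in the comments on the question, together with `Integer.toUnsignedLong` (Java 8 and later).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class UInt32Reader {
    // Manual version: mask each byte to 0..255 before shifting, then mask the
    // widened result with a long literal so it stays in the range 0..2^32-1.
    static long readManual(byte[] data, int offset) {
        int bits = ((data[offset] & 0xFF) << 24)
                 | ((data[offset + 1] & 0xFF) << 16)
                 | ((data[offset + 2] & 0xFF) << 8)
                 | (data[offset + 3] & 0xFF);
        return bits & 0xFFFFFFFFL;
    }

    // Library version: getInt reads a signed 32-bit value at the given index,
    // and Integer.toUnsignedLong reinterprets it as unsigned.
    static long readWithBuffer(byte[] data, int offset) {
        int bits = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN).getInt(offset);
        return Integer.toUnsignedLong(bits);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xFF, (byte) 0x80, 0x01, (byte) 0xFE};
        System.out.println(Long.toHexString(readManual(data, 0)));          // ff8001fe
        System.out.println(readManual(data, 0) == readWithBuffer(data, 0)); // true
    }
}
```

Both methods return a non-negative long even when the top bit of the first byte is set, which is the case where an int mask would silently give a negative value.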

Isn't that exactly what the OP did for b? — MikeMB, Jul 04 '15 at 07:14

autistic · Answer 2 · 2015-07-04T06:37:14.757

-1

... can I safely use byte as a replacement for unsigned char?

As you've discovered, not really... No.

According to Oracle Java documentation, byte is a signed integer type, and though it has 256 distinct values (due to the explicit range specification "It has a minimum value of -128 and a maximum value of 127 (inclusive)" from the documentation) there are values that an unsigned char from C can store, that a byte from Java can't (and vice-versa).

That explains the problem you've experienced. However, the extent of the problem hasn't been fully demonstrated on your 8-bit-byte implementation.

What other "gotchas" should I be aware of?

Whilst a byte in Java is required to support only values between (and including) -128 and 127, C's unsigned char has a maximum value (UCHAR_MAX) that depends upon the number of bits used to represent it (CHAR_BIT; at least 8). So when CHAR_BIT is greater than 8, there will be extra values beyond 255 that unsigned char can store.

In summary, in the world of Java a byte should really be called an octet (a group of eight bits), whereas in C a byte (char, signed char, unsigned char) is a group of at least (possibly more than) eight bits.

No. They are not equivalent. I don't think you'll find an equivalent type in Java, either; they're all rather fixed-width. You could safely use byte in Java as an equivalent for int8_t in C, however (except that int8_t isn't required to exist in C unless CHAR_BIT == 8).

As for pitfalls, there are some in your C code too. Assuming data[0] is an unsigned char, data[0] << 24 is undefined behaviour on any system for which INT_MAX == 32767.

edited Jul 04 '15 at 06:37

answered Jul 04 '15 at 06:16

The OP is porting (apparently working) C code that obviously assumes 8-bit chars and >= 32-bit ints to Java, which has fixed data types anyway, and you are arguing under which circumstances the original code would not work (which are completely irrelevant for the Java code)? – MikeMB Jul 04 '15 at 07:36
@MikeMB Where do you get such *obvious assumptions* from? If he is operating under the assumption that `int` has 32 bits, then why does he use `uInt32` instead of `unsigned int`? That aside, the OP asked two questions and I answered them both, correlating the questions asked to responses underneath. I feel as though I've been wrongfully downvoted, but -shrugs- what goes around has its way of coming around, when you establish a reputation of writing comments along the lines of *stackoverflow isn't forever*. – autistic Jul 04 '15 at 07:46
I said greater than or equal to 32 bits, and I'm making that assumption because, as you pointed out yourself, the C code wouldn't work otherwise in the first place. I've no problem with the first part of your answer (although one could explain that you can use byte for some operations as shown by the OP and immibis), but everything else is - in my opinion - just irrelevant for porting the code at hand. For a question with the aim of writing correct Java code, I think your answer focuses far too much on how char and int could be implemented in C. – MikeMB Jul 04 '15 at 08:32
@MikeMB Undefined behaviour is undefined, which means you can't define it as "wouldn't work otherwise in the first place". The point of pointing out that "gotcha" (which is what was asked for, btw) is saving the O.P. an awkward bug report or transition to a new (or old, whatever) system, particularly after *a lot* of code has been written under non-portable assumptions. On that note, can you tell me anything that I *didn't* answer, that was asked? Or are you simply complaining because you don't like something I wrote, regardless of how true it might be? – autistic Jul 04 '15 at 11:25
I didn't want to complain. I downvoted your answer and explained why I did it. Your point about UB would be important if the OP intended to port this code from one architecture to another or to a different compiler. What he asked was what to keep in mind when translating that code to Java. Whether the C code works under all circumstances is beside the point here, as long as its intended functionality is clear, because you usually can't do a 1 to 1 translation anyway. Even if it were possible, a construct that is UB in one language might give you exactly what you want in the other. – MikeMB Jul 04 '15 at 14:51
You don't have to agree with me. It's not like I flagged your post or anything - just expressing my opinion on its usefulness in the context of this question. – MikeMB Jul 04 '15 at 14:55
I am sorry if I didn't make this clear in the question, but the C code operates under the assumption that `int` is 32 bits wide and `char` is 8 bits wide. There are assertions in the C code that check whether those assumptions are valid. This is still a good answer nonetheless, and might be helpful to someone else reading it. +1 – Confluence Jul 04 '15 at 15:33
