Why is the internal data of BitSet in Java stored as long[] instead of int[]?

Question

In Java, the internal data of BitSet is stored as a long[] instead of an int[]. I want to know why. Here is the code in the JDK:

 /**
 * The internal field corresponding to the serialField "bits".
 */
 private long[] words;

If it's all about performance, I wonder why long[] storage will get better performance.

I want to know why not? What exactly would be better about storing it as an int[]? – user207421, Aug 20 '15 at 07:14
If I had to guess, I'd say it's probably related to the fact that nowadays most people are on 64-bit machines / operating systems, and thus operations on longs tend to be better supported / faster. I don't really buy Santi's argument. – devoured elysium, Aug 20 '15 at 07:16
@devouredelysium I think "most people are on 64 bits" is a very bold statement to make indeed. Do you have any evidence other than anecdotal to back this up? Given that Java is designed to run on myriad platforms (including many embedded systems), I really don't think that would be the reason behind the design decision. – daiscog, Aug 20 '15 at 08:08
@daiscog: yet they chose longs over ints. If you look at the source code of BitSet, it is clearly stated that: "BitSets are packed into arrays of "words."". – devoured elysium, Aug 20 '15 at 09:07
@devouredelysium Sorry, I think you may have misunderstood me. I know they chose longs over ints, but I'm saying that I don't believe the reason you gave is why they did so. – daiscog, Aug 20 '15 at 09:44

score 12 · Answer 1 · edited Aug 20 '15 at 14:25

On 64-bit machines, a bitwise operation on a single long value is significantly faster than the same operation on two int values, because 64-bit values are directly supported by the hardware. On 32-bit machines the difference is probably not very significant.

edited Aug 20 '15 at 14:25

Willi Mentzel

answered Aug 20 '15 at 07:22

Tagir Valeev

On 32-bit machines, would the performance be better with int[] or long[]? – displayName Aug 20 '15 at 13:43
@displayName, you can write your own int[]-based `BitSet` and test it by yourself :-) See also @Holger answer. – Tagir Valeev Aug 20 '15 at 14:13
@displayName I would say int[] cause it's native :D but Tagir is right... we should implement and see – Willi Mentzel Aug 20 '15 at 14:26
@TagirValeev: Your response to my comment is best. :) I still asked because I thought you might know it already. I have been working with C# for the past ~2 years. Starting Eclipse on my laptop will require a lot of 'warming up'. :D – displayName Aug 20 '15 at 14:47
@displayName: keep in mind that even 32-bit CPUs may have a 64-bit memory bus as well as a few appropriate 64-bit operations. It’s not about general arithmetic but just certain bit manipulations. Further, the JVM may have knowledge about the standard `BitSet` class and replace certain operations if they aren’t “native enough”. The bigger obstacle is to find an appropriate test machine with a *true* 32-bit architecture… – Holger Aug 20 '15 at 18:42

score 12 · Accepted Answer · answered Aug 20 '15 at 10:23

When querying or manipulating a single bit, there is no significant difference. You have to calculate the word index and read that word and, in case of an update, manipulate one bit of that word and write it back. That’s all the same for int[] and long[].
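
To illustrate the point, here is a minimal sketch of single-bit access for both storage types. The class and method names are hypothetical and this is not the JDK source; it only assumes the usual shift-and-mask layout and non-negative, in-range indices.

```java
// Illustrative sketch only; names are hypothetical, not taken from the JDK.
final class SingleBitSketch {
    // long[] storage: 64 bits per word, so the word index is bitIndex / 64.
    static boolean get(long[] words, int bitIndex) {
        // For a long operand, Java uses only the low 6 bits of the shift count.
        return (words[bitIndex >> 6] & (1L << bitIndex)) != 0;
    }

    static void set(long[] words, int bitIndex) {
        words[bitIndex >> 6] |= 1L << bitIndex;
    }

    // int[] storage: 32 bits per word, so the word index is bitIndex / 32.
    static boolean get(int[] words, int bitIndex) {
        // For an int operand, Java uses only the low 5 bits of the shift count.
        return (words[bitIndex >> 5] & (1 << bitIndex)) != 0;
    }

    static void set(int[] words, int bitIndex) {
        words[bitIndex >> 5] |= 1 << bitIndex;
    }

    public static void main(String[] args) {
        long[] asLongs = new long[2]; // room for 128 bits
        int[] asInts = new int[4];    // room for 128 bits
        set(asLongs, 70);
        set(asInts, 70);
        System.out.println(get(asLongs, 70) + " " + get(asInts, 70));
    }
}
```

Either way the work is one index computation, one array read, one mask, and (for an update) one array write; only the shift amount and the word width differ.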

One could argue that doing it using a long instead of int could raise the amount of memory that has to be transferred for a single bit operation if you have a real 32 bit memory bus, but since Java was designed in the nineties of the last century, the designers decided that this is not an issue anymore.

On the other hand, you get a big win when processing multiple bits at once. When you perform operations like and, or, or xor on an entire BitSet, you can process an entire word, i.e. 64 bits, at once when using a long array.
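
A rough sketch of such a bulk operation, assuming two bit sets of equal length and a hypothetical helper class (not the JDK code). The methods return the number of loop iterations so the difference in word count is visible.

```java
// Illustrative sketch only; names are hypothetical, not taken from the JDK.
final class BulkAndSketch {
    // ANDs 'other' into 'target', 64 bits per iteration. Assumes equal lengths.
    static int andInPlace(long[] target, long[] other) {
        for (int i = 0; i < target.length; i++) {
            target[i] &= other[i];
        }
        return target.length; // number of iterations performed
    }

    // Same operation on int[] storage, 32 bits per iteration. Assumes equal lengths.
    static int andInPlace(int[] target, int[] other) {
        for (int i = 0; i < target.length; i++) {
            target[i] &= other[i];
        }
        return target.length; // number of iterations performed
    }

    public static void main(String[] args) {
        // 256 bits either way: 4 long words versus 8 int words.
        int longIterations = andInPlace(new long[4], new long[4]);
        int intIterations = andInPlace(new int[8], new int[8]);
        System.out.println(longIterations + " vs " + intIterations + " iterations");
    }
}
```

For the same number of bits, the long[] loop runs half as many iterations as the int[] loop. Whether that translates into a proportional speedup depends on the hardware and the JIT.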

Similarly, when searching for the next set bit, if the bit is not within the word of the start position, subsequent words are first tested against zero. That is an intrinsic operation, even for most 32-bit CPUs, so you can skip 64 zero bits at once. The first non-zero word will definitely contain the next set bit, so only one bit extraction operation is needed for the entire iteration.
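
The search described above can be sketched as follows. This is a simplified illustration with hypothetical names, assuming a non-negative start index; it uses the standard `Long.numberOfTrailingZeros` for the single bit extraction.

```java
// Illustrative sketch only; names are hypothetical, not taken from the JDK.
final class NextSetBitSketch {
    // Returns the index of the first set bit at or after fromIndex, or -1 if none.
    static int nextSetBit(long[] words, int fromIndex) {
        int w = fromIndex >> 6;
        if (w >= words.length) {
            return -1;
        }
        // Mask off the bits below fromIndex in the starting word.
        long word = words[w] & (-1L << fromIndex);
        while (true) {
            if (word != 0) {
                // The only bit extraction of the whole search.
                return (w << 6) + Long.numberOfTrailingZeros(word);
            }
            if (++w == words.length) {
                return -1;
            }
            word = words[w]; // a zero word skips 64 bits with one comparison
        }
    }

    public static void main(String[] args) {
        long[] words = new long[4];
        words[3] = 1L << 10; // bit 3 * 64 + 10 = 202
        System.out.println(nextSetBit(words, 0));
    }
}
```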

These benefits for bulk operations will outweigh any single-bit related drawbacks, if there are any. As said, most of today’s CPUs are capable of doing all operations on 64-bit words directly.

Since BitSet is a data structure, when talking about performance, we should consider the operations it supports. You are the first one who analyzed the performance in this way. Great job! – jerry_sjtu, Aug 21 '15 at 01:16

score 6 · Answer 3 · answered Aug 20 '15 at 07:56

Based on a cursory reading of the source, it seems the main reason is purely performance. This is the comment retrieved from the source:

BitSets are packed into arrays of "words." Currently a word is a long, which consists of 64 bits, requiring 6 address bits. The choice of word size is determined purely by performance concerns.

answered Aug 20 '15 at 07:56

kucing_terbang

If it's all about performance, I wonder why long[] storage will get better performance. – jerry_sjtu Aug 20 '15 at 09:57
It will be faster for the class to resize its capacity by 64 than by 32. – kucing_terbang Aug 20 '15 at 10:02

score 1 · Answer 4 · answered Aug 20 '15 at 07:10

Surely it is an optimization issue: a single long value stores up to 64 bits, and an int only 32. So any user length of 64 bits or fewer needs only one entry in the array. If it were an array of int, it could need two entries, which is slower and heavier to maintain.
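
The arithmetic behind this can be sketched with a hypothetical helper that rounds a bit count up to whole words (this is an illustration, not how the JDK names or structures it):

```java
// Illustrative sketch only; names are hypothetical, not taken from the JDK.
final class WordCountSketch {
    // Number of array entries needed to hold nbits, rounding up.
    // Uses long arithmetic so that large nbits values do not overflow.
    static int wordsNeeded(int nbits, int bitsPerWord) {
        return (int) (((long) nbits + bitsPerWord - 1) / bitsPerWord);
    }

    public static void main(String[] args) {
        System.out.println("64 bits as long[]: " + wordsNeeded(64, 64));
        System.out.println("64 bits as int[]:  " + wordsNeeded(64, 32));
    }
}
```

Note that for 32 bits or fewer both layouts need a single entry, so the difference only shows up for lengths from 33 bits upward.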

answered Aug 20 '15 at 07:10

Little Santi

Why is it slower and why is it harder to maintain? – devoured elysium Aug 20 '15 at 07:19
Obviously, storing two items in an array takes twice as long as storing just one. – Little Santi Aug 20 '15 at 07:27
Not necessarily. It will largely depend on your hardware. – devoured elysium Aug 20 '15 at 07:52
I don't mean storing values inline, but _within a loop_, which is the scenario in a generic solution like BitSet: `for (int i=0;i – Little Santi Aug 20 '15 at 08:00
This is precisely it, for me. If I were designing the class, I'd choose an array of longs for this very reason (except "heavier to maintain" which I don't agree with). – daiscog Aug 20 '15 at 08:13
Still, not necessarily. You generally want to work with words, as they tend to be faster. – devoured elysium Aug 20 '15 at 09:08
True, but looking at the implementation of the class, the way the array is accessed and how all the various methods work, I think this is the correct answer. – daiscog Aug 20 '15 at 09:47

score 1 · Answer 5 · answered Aug 20 '15 at 07:24

I might be wrong, but with long[] the number of bits a BitSet can hold is much bigger than with int[], because the maximum array size is about the same for both (and limited by heap size).

answered Aug 20 '15 at 07:24

Lemonov

This superficially makes sense; the max elements of an array means an array of longs gives you more possible bits. However, I don't think that the use case of someone wanting to have that many flags is a realistic enough concern for this to be the real reason. – daiscog Aug 20 '15 at 08:19
The methods of `BitSet` use `int` index parameters and return values, hence, are limited to 2³¹ bits due to the API. So it doesn’t really matter whether the theoretical limit imposed by its internal array is 32 times higher or 64 times higher. Even a `byte` array could store more bits than the API supports. – Holger Aug 20 '15 at 09:43
Yes the bitSet internal array is indexed with integer but the values are long so we get more possible bits held than when using int. – Lemonov Aug 20 '15 at 09:49
I’m not talking about the array index but about the *API* of `BitSet`. Try to [set](http://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html#set-int-) a bit with a number higher than 2³¹—it’s impossible as there is no method offering that operation. Similarly, [`size()`](http://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html#size--) returns an `int`. So you don’t get more bits by using `long[]` instead of `int[]` internally. It’s still at most 2³¹ with the current API. – Holger Aug 20 '15 at 18:31
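
A small sketch of the capacity argument made in the comments above. The helper names are hypothetical, and `Integer.MAX_VALUE` is used as an idealized maximum array length (real JVMs may allow slightly less). The only `BitSet` behavior relied on is that `set(int)` rejects a negative index with an `IndexOutOfBoundsException`.

```java
// Illustrative sketch only; names are hypothetical, not taken from the JDK.
final class ApiLimitSketch {
    // Bits addressable through an int index: 0 .. Integer.MAX_VALUE.
    static final long API_BITS = 1L << 31;

    // Idealized upper bound on the bits an array can hold (assumes max length Integer.MAX_VALUE).
    static long arrayCapacityBits(int bitsPerElement) {
        return (long) Integer.MAX_VALUE * bitsPerElement;
    }

    // An index beyond Integer.MAX_VALUE wraps to a negative int, which set(int) rejects.
    static boolean rejectsNegativeIndex() {
        try {
            new java.util.BitSet().set(-1);
            return false;
        } catch (IndexOutOfBoundsException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("byte[] exceeds API limit: " + (arrayCapacityBits(8) > API_BITS));
        System.out.println("int[] exceeds API limit:  " + (arrayCapacityBits(32) > API_BITS));
        System.out.println("long[] exceeds API limit: " + (arrayCapacityBits(64) > API_BITS));
        System.out.println("negative index rejected:  " + rejectsNegativeIndex());
    }
}
```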
