14

In Java, BitSet stores its internal data as a long[] rather than an int[]. Why is that? Here is the relevant code from the JDK:

 /**
 * The internal field corresponding to the serialField "bits".
 */
 private long[] words;

If it is all about performance, why would long[] storage perform better?

Tagir Valeev
jerry_sjtu
  • I want to know why not? What exactly would be better about storing it as an int[]? – user207421 Aug 20 '15 at 07:14
  • If I had to guess, I'd say it's probably related to the fact that nowadays most people are on 64-bit machines / operating systems, and thus operations on longs tend to be better supported / faster. I don't really buy Santi's argument. – devoured elysium Aug 20 '15 at 07:16
  • @devouredelysium I think "most people are on 64 bits" is a very bold statement to make indeed. Do you have any evidence other than anecdotal to back this up? Given that Java is designed to run on myriad platforms (including many embedded systems) I really don't think that would be the reason behind the design decision. – daiscog Aug 20 '15 at 08:08
  • @daiscog: yet they chose longs over ints. If you look at the source code of BitSet it is clearly stated that: "BitSets are packed into arrays of "words."". – devoured elysium Aug 20 '15 at 09:07
  • @devouredelysium Sorry, I think you may have misunderstood me. I know they chose longs over ints, but I'm saying that I don't believe the reason you gave is why they did so. – daiscog Aug 20 '15 at 09:44

5 Answers

12

On 64-bit machines, a bitwise operation on a single long value is significantly faster than the same operation on two int values, because 64-bit values are directly supported by the hardware. On 32-bit machines the difference is probably not very significant.

Willi Mentzel
Tagir Valeev
  • On 32-bit machines, would the performance be better with int[] or long[]? – displayName Aug 20 '15 at 13:43
  • @displayName, you can write your own int[]-based `BitSet` and test it yourself :-) See also @Holger's answer. – Tagir Valeev Aug 20 '15 at 14:13
  • @displayName I would say int[] cause it's native :D but Tagir is right... we should implement and see – Willi Mentzel Aug 20 '15 at 14:26
  • @TagirValeev: Your response to my comment is best. :) I still asked because I thought you might know it already. I have been working with C# for the past ~2 years. Starting Eclipse on my laptop will require a lot of 'warming up'. :D – displayName Aug 20 '15 at 14:47
  • @displayName: keep in mind that even 32-bit CPUs may have a 64-bit memory bus as well as a few appropriate 64-bit operations. It’s not about general arithmetic but just certain bit manipulations. Further, the JVM may have knowledge about the standard `BitSet` class and replace certain operations if they aren’t “native enough”. The bigger obstacle is to find an appropriate test machine with a *true* 32-bit architecture… – Holger Aug 20 '15 at 18:42
12

When querying or manipulating a single bit, there is no significant difference. You have to calculate the word index and read that word and, in case of an update, manipulate one bit of that word and write it back. That’s all the same for int[] and long[].
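To make the steps concrete, here is a minimal sketch of single-bit access on a long[]. It is not the JDK source; the class name `SingleBitOps` and its methods are hypothetical. An int[] version would look the same with a shift of 5 instead of 6 and `1` instead of `1L`.

```java
// Hypothetical sketch, not the JDK implementation.
final class SingleBitOps {
    // 64 bits per word means 6 address bits: bitIndex >> 6 selects the word.
    static final int ADDRESS_BITS_PER_WORD = 6;

    static int wordIndex(int bitIndex) {
        return bitIndex >> ADDRESS_BITS_PER_WORD;
    }

    static void set(long[] words, int bitIndex) {
        // Java uses only the low 6 bits of a long shift count, so
        // 1L << bitIndex addresses the bit inside the selected word.
        words[wordIndex(bitIndex)] |= (1L << bitIndex);
    }

    static void clear(long[] words, int bitIndex) {
        words[wordIndex(bitIndex)] &= ~(1L << bitIndex);
    }

    static boolean get(long[] words, int bitIndex) {
        return (words[wordIndex(bitIndex)] & (1L << bitIndex)) != 0;
    }

    public static void main(String[] args) {
        long[] words = new long[2];          // room for 128 bits
        set(words, 70);                      // lands in words[1], bit 6
        System.out.println(get(words, 70));  // true
        clear(words, 70);
        System.out.println(get(words, 70));  // false
    }
}
```

Each operation touches exactly one array element regardless of the word type, which is why the choice of word size hardly matters here.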

One could argue that using a long instead of an int could raise the amount of memory that has to be transferred for a single-bit operation if you have a true 32-bit memory bus, but since Java was designed in the 1990s, the designers decided that this was no longer an issue.

On the other hand, you get a big win when processing multiple bits at once. When you perform operations like and, or, or xor on an entire BitSet backed by a long array, each step processes an entire word, that is, 64 bits, at once.
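A rough sketch of such bulk operations follows. The class `BulkWordOps` is hypothetical and, to keep the example short, assumes both arrays have the same length, which the real BitSet does not require.

```java
// Hypothetical sketch, not the JDK implementation.
// Assumption: target and other have the same length.
final class BulkWordOps {
    static void and(long[] target, long[] other) {
        for (int i = 0; i < target.length; i++) {
            target[i] &= other[i]; // 64 bits combined per iteration
        }
    }

    static void or(long[] target, long[] other) {
        for (int i = 0; i < target.length; i++) {
            target[i] |= other[i];
        }
    }

    static void xor(long[] target, long[] other) {
        for (int i = 0; i < target.length; i++) {
            target[i] ^= other[i];
        }
    }
}
```

With an int[] the same loops would need twice as many iterations to cover the same number of bits.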

Similarly, when searching for the next set bit, if the bit is not within the word at the start position, subsequent words are first tested against zero. That test is an intrinsic operation, even on most 32-bit CPUs, so you can skip 64 zero bits at once. The first non-zero word will definitely contain the next set bit, so only one bit extraction operation is needed for the entire iteration.
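The search described above can be sketched as follows. The class `NextSetBitSketch` is hypothetical, assumes a non-negative `fromIndex`, and is written for illustration rather than copied from the JDK; `Long.numberOfTrailingZeros` is the standard library method that performs the single bit extraction.

```java
// Hypothetical sketch, not the JDK implementation.
// Assumption: fromIndex is non-negative.
final class NextSetBitSketch {
    /** Returns the index of the first set bit at or after fromIndex, or -1 if none. */
    static int nextSetBit(long[] words, int fromIndex) {
        int u = fromIndex >> 6;
        if (u >= words.length) {
            return -1;
        }
        // Ignore the bits below fromIndex in the starting word.
        long word = words[u] & (-1L << fromIndex);
        while (true) {
            if (word != 0) {
                // The only bit extraction in the whole search.
                return (u << 6) + Long.numberOfTrailingZeros(word);
            }
            if (++u == words.length) {
                return -1;
            }
            word = words[u]; // a zero word skips 64 bits in one comparison
        }
    }
}
```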

These benefits for bulk operations will outweigh any single-bit-related drawbacks, if there are any at all. As said, most of today’s CPUs are capable of performing all operations on 64-bit words directly.

Holger
  • Since BitSet is a data structure, when talking about performance we should consider the operations it supports. You are the first person to analyze the performance in this way. Great job! – jerry_sjtu Aug 21 '15 at 01:16
6

Based on a cursory reading of the source here, it seems the choice was made purely for performance. This is the comment from the source:

BitSets are packed into arrays of "words." Currently a word is a long, which consists of 64 bits, requiring 6 address bits. The choice of word size is determined purely by performance concerns.

kucing_terbang
1

This is surely an optimization issue: a single long value stores up to 64 bits, while an int stores only 32. So any requested length of up to 64 bits needs only one entry in the array. If it were an array of int, it would need two entries for anything longer than 32 bits, which is slower and heavier to maintain.
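The entry counts can be checked with a little arithmetic. The helper class `WordCount` below is hypothetical and only rounds the bit count up to whole words for each word size.

```java
// Hypothetical helper for illustration only.
final class WordCount {
    // Number of 64-bit words needed to hold nbits (nbits >= 0).
    static int longWordsFor(int nbits) {
        return (nbits + 63) >>> 6;
    }

    // Number of 32-bit words needed to hold nbits (nbits >= 0).
    static int intWordsFor(int nbits) {
        return (nbits + 31) >>> 5;
    }

    public static void main(String[] args) {
        System.out.println(longWordsFor(64)); // 1
        System.out.println(intWordsFor(64));  // 2
    }
}
```

Up to 32 bits both layouts need a single entry; the difference only shows from 33 bits onward.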

Little Santi
1

I might be wrong, but with long[] the maximum number of bits a BitSet can hold is much larger than with int[], because the maximum array length is about the same for both (and is also limited by heap size).

Lemonov
  • This superficially makes sense; the max elements of an array means an array of longs gives you more possible bits. However, I don't think that the use case of someone wanting to have that many flags is a realistic enough concern for this to be the real reason. – daiscog Aug 20 '15 at 08:19
  • The methods of `BitSet` use `int` index parameters and return values, hence, are limited to 2³¹ bits due to the API. So it doesn’t really matter whether the theoretical limit imposed by its internal array is 32 times higher or 64 times higher. Even a `byte` array could store more bits than the API supports. – Holger Aug 20 '15 at 09:43
  • Yes, the BitSet's internal array is indexed with an int, but the values are longs, so it can hold more bits than when using int. – Lemonov Aug 20 '15 at 09:49
  • I’m not talking about the array index but about the *API* of `BitSet`. Try to [set](http://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html#set-int-) a bit with a number higher than 2³¹—it’s impossible as there is no method offering that operation. Similarly, [`size()`](http://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html#size--) returns an `int`. So you don’t get more bits by using `long[]` instead of `int[]` internally. It’s still at most 2³¹ with the current API. – Holger Aug 20 '15 at 18:31