
My problem is to get huge text files (UTF-8, one byte per character, i.e. ANSI/ASCII) containing unsigned integers without duplicates, in ascending order, into an array. Fast! So I was going for something like:

while(scan.hasNextInt()) x.add(scan.nextInt());

But whether I go with an ArrayList, a Vector, or a plain array, with files containing millions of integers it would be wise to determine the maximum capacity needed up front, to avoid growing the array later.

With File.length() I get the number of digits plus line feeds in the file.

In the worst case the numbers would start at 0 and increment by only 1 per line.
I think the maximum capacity is somehow calculable using combinatorics, but I am at a dead end. The fact that smaller numbers are not padded with zeros (002) somehow throws me off.

Taking the digit count of the first integer into consideration, I think one might also be able to approximate the real amount a little more closely (since the numbers are ascending, none of the following numbers can have fewer digits).

So my most important question is how to calculate an approximate maximum capacity, ideally in O(1).

In addition, I am asking myself whether scan.hasNextInt() and scan.nextInt() are the fastest options for this rather specific problem, and whether parallelization via threads could speed up the process even more (given the characteristics of reading from a hard drive, probably not).
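For reference, a minimal sketch of the direction I have in mind instead of Scanner: BufferedReader plus Integer.parseInt into a preallocated int[] (readInts and the capacity parameter are just illustrative names; one number per line is assumed):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Arrays;

    // Sketch: read one unsigned integer per line into a preallocated int[].
    // "capacity" would come from the upper-bound estimate asked about above.
    static int[] readInts(String path, int capacity) throws IOException {
        int[] values = new int[capacity];   // no autoboxing, no resizing
        int n = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    values[n++] = Integer.parseInt(line);
                }
            }
        }
        return Arrays.copyOf(values, n);    // trim to the actual count
    }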

regards Halo

  • Honestly, I wouldn't worry about it. You're likely to be limited by I/O bandwidth first. – Oliver Charlesworth Jan 13 '13 at 20:25
  • Try the easy way first. Don't optimize unless you find a problem. – Bohemian Jan 13 '13 at 20:27
  • If you store `Integer`s instead of `int`s, which will be the case if you use `ArrayList` or `Vector`, you already waste so much memory compared to storing them in an `int[]` that computing an optimal initial capacity for these collections is a waste of time. Find an upper bound for the number of values (it does not need to be very sharp) and then use an `int[]`. – MrSmith42 Jan 13 '13 at 20:34
  • That's not an option for me. – Halo Camper Jan 13 '13 at 20:34
  • Have you measured that it will make a measurable difference? What makes a difference in theory and what really matters are often very different. – Peter Lawrey Jan 13 '13 at 20:35
  • Using `Integer` instead of `int` makes so much difference that tinkering around the edges is unlikely to make much difference. – Peter Lawrey Jan 13 '13 at 20:36
  • @MrSmith42 int[] would also be my preferred choice – Halo Camper Jan 13 '13 at 20:37
  • Still, I don't think I should make the upper bound as big as the RAM... – Halo Camper Jan 13 '13 at 20:40

1 Answer


Assuming there is only one byte used to separate two numbers (e.g. a '\n'), we have

  • 10 numbers with 1 digit -> 20 bytes
  • 90 numbers with 2 digits -> 270 bytes
  • 900 numbers with 3 digits -> 3600 bytes
  • ... you get the pattern

If your file size is now 1000 bytes, the most you can have is the 10 one-digit numbers and the 90 two-digit numbers, which together take 290 bytes, leaving 710 bytes for three-digit numbers at 4 bytes each. 710/4 = 177.5, which makes at most 10 + 90 + 177 = 277 numbers.
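A sketch of this greedy calculation in Java (maxNumbers is just an illustrative name; one separator byte per number is assumed, as above):

    // Upper bound on how many distinct ascending unsigned integers can fit
    // in a file of the given size, assuming one separator byte per number.
    // Fine for realistic file sizes; exabyte-range inputs would need
    // overflow guards on groupBytes.
    static long maxNumbers(long fileSize) {
        long count = 0;
        long numbersInGroup = 10;                   // 1-digit numbers: 0..9
        for (int digits = 1; fileSize > 0; digits++) {
            long bytesPerNumber = digits + 1;       // digits + separator
            long groupBytes = numbersInGroup * bytesPerNumber;
            if (fileSize >= groupBytes) {           // the whole group fits
                count += numbersInGroup;
                fileSize -= groupBytes;
            } else {                                // partial group, then done
                count += fileSize / bytesPerNumber;
                fileSize = 0;
            }
            numbersInGroup = (digits == 1) ? 90 : numbersInGroup * 10;
        }
        return count;
    }

For the example above, maxNumbers(1000) returns 277. The loop runs once per digit length, so the cost is logarithmic in the file size.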

Henry
  • This looks logarithmic. Is there a faster approach? I love O(1) ^^ – Halo Camper Jan 13 '13 at 20:48
  • @HaloCamper: You could achieve O(1) with a lookup table, for example. Also, consider that you need to do this **once** per file; who cares what its complexity is? – Oliver Charlesworth Jan 13 '13 at 20:52
  • Frankly, I don't think this matters. The O(log n) effort for this calculation is way less than the O(n) needed to read the numbers. – Henry Jan 13 '13 at 20:52