Why is java.io.Reader#skip implemented the way that it is?

Question

I'm still learning object-oriented programming in Java. I was looking at the Java implementation of java.io.Reader.skip and I'm wondering why exactly it's implemented the way that it is. In particular I have questions about these things that I have noticed:

The buffer used for the skip(long) is a field of the Reader object, rather than a normal variable in the method.
The maximum buffer length is much less than Integer.MAX_VALUE (2,147,483,647). In particular, Java's implementation uses 8192.
java.io.InputStream implements skip the same exact way.

Now, the reason why I personally think the buffer is a field is so that it won't have to be reallocated and garbage collected on every call. This might make skipping faster.

The buffer length being smaller I think has to do with making it so that the Reader blocks for shorter periods, but since the Reader is synchronized, would that really make a difference?

Byte streams implementing it the same way might be for consistency. Are my assumptions correct on these three things?

To summarise, my questions are: About how much of a difference in speed on average does it make to use a field rather than a variable for character arrays? Wouldn't it be just the same to use Integer.MAX_VALUE as the maximum buffer length? And isn't it better and easier to use the no-parameter read method in a for-loop for byte streams since the other read methods just call the no-parameter read?

Sorry if my question's a strange question, but I think that I can learn a lot about object-oriented programming through this question.

A reason for a smaller buffer is to reduce the amount of memory consumed. If the buffer was 2 GB, it would read 2 GB into memory then flush it to wherever the stream was writing to, versus something smaller like 8K. — vcsjones, Jun 02 '11 at 20:07
@vcsjones I see =) That does make sense. Thank you very much. That explains the buffer size being exactly 8k. I have to keep in mind that a character costs a byte in memory, but I guess I am still learning ^_^; — user766413, Jun 03 '11 at 18:44

score 2 · Answer 1 · answered Jun 02 '11 at 20:12

Reading one char at a time would be much less efficient: you'd have one method call per char skipped, which is usually bad for large skips (a lot of overhead).

The scratch buffer size is simple to answer: would you really want to allocate an Integer.MAX_VALUE chunk of RAM if you're going to skip 2G from a file?

As for the exact size, and whether or not to use an instance variable, that's an implementation-dependent compromise. You're reading an implementation that chose an 8192-element member buffer. Some implementations have smaller, local ones (512 can be seen here).

Nothing in the standard requires any of these implementation details, so don't rely on them at all.

If you're planning on doing something similar, benchmark the different approaches and pick the best compromise in your specific circumstances.

I understand, and you're right that the standard doesn't specify any of these details, so I suppose different implementations may use different buffer sizes, which probably explains the reason why the maximum buffer length field is a field instead of a variable as well. (In order to allow hiding of the field, but in my opinion, a protected method would do better, because I have a thing against field hiding.) Thank you on the benchmarking suggestion, I will attempt to look up how to benchmark my programs =) — user766413, Jun 03 '11 at 19:06

score 2 · Answer 2 · edited Apr 30 '18 at 07:42


About how much of a difference in speed on average does it make to use a field rather than a variable for character arrays?

This would definitely vary from JVM to JVM, but repeatedly allocating an 8K array is probably not as cheap as keeping one around. Of course, the hidden lesson here is that one should not hold onto readers, even closed ones, because they carry an 8K penalty.
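To make the "keep one around" idea concrete, here is a minimal sketch of a skip loop that reuses a lazily allocated field. The class name `ReusableSkipper` and the constant `MAX_SKIP_BUFFER_SIZE` are made up for illustration. It is modeled on the approach described in the question, not copied from any particular JDK.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper: skips characters using one lazily allocated, reused scratch buffer. */
class ReusableSkipper {
    // Assumed cap, mirroring the 8192 mentioned in the question.
    private static final int MAX_SKIP_BUFFER_SIZE = 8192;

    // A field rather than a local: allocated on first use, then reused by later calls.
    private char[] skipBuffer;

    long skip(Reader in, long n) throws IOException {
        if (n < 0) {
            throw new IllegalArgumentException("skip value is negative");
        }
        int chunk = (int) Math.min(n, MAX_SKIP_BUFFER_SIZE);
        if (skipBuffer == null || skipBuffer.length < chunk) {
            skipBuffer = new char[chunk]; // grow only when a larger chunk is needed
        }
        long remaining = n;
        while (remaining > 0) {
            int read = in.read(skipBuffer, 0, (int) Math.min(remaining, chunk));
            if (read == -1) {
                break; // end of stream before n characters were skipped
            }
            remaining -= read;
        }
        return n - remaining; // number of characters actually skipped
    }
}
```

Whether the saved allocations matter in practice is something only a benchmark on your JVM can tell you. The sketch is also not thread-safe, so a real implementation would need the kind of locking the question mentions.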

Wouldn't it be just the same to use Integer.MAX_VALUE as the maximum buffer length?

The buffer has to be pre-allocated, and allocating a 2 GB array seems like overkill. Remember, the reason for reading in pages is to amortize the cost of the read call, which sometimes turns into native operations.

Isn't it better and easier to use the no-parameter read method in a for-loop for byte streams since the other read methods just call the no-parameter read?

It is not guaranteed that the underlying stream is buffered, so this may incur heavy per-call overhead.
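As a rough illustration of the per-call overhead, the sketch below counts how many calls reach an unbuffered source when skipping byte by byte versus through an 8192-byte scratch buffer. `CountingSource` and `SkipCallCount` are hypothetical names, and the source just hands out zero bytes. A real stream would pay for a system call or similar on each of those calls.

```java
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical unbuffered source that counts how often it is asked for data. */
class CountingSource extends InputStream {
    private long remaining;
    long calls; // number of read invocations that reach this source

    CountingSource(long size) {
        this.remaining = size;
    }

    @Override
    public int read() {
        calls++;
        if (remaining <= 0) {
            return -1;
        }
        remaining--;
        return 0;
    }

    // Overriding the bulk read lets one call serve many bytes.
    @Override
    public int read(byte[] b, int off, int len) {
        calls++;
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1;
        }
        int n = (int) Math.min(len, remaining);
        Arrays.fill(b, off, off + n, (byte) 0);
        remaining -= n;
        return n;
    }
}

class SkipCallCount {
    /** Skips n bytes one read() at a time and returns the number of calls made. */
    static long skipByteByByte(long n) {
        CountingSource in = new CountingSource(n);
        for (long i = 0; i < n && in.read() != -1; i++) {
            // discard the byte
        }
        return in.calls;
    }

    /** Skips n bytes through an 8192-byte scratch buffer and returns the number of calls made. */
    static long skipInChunks(long n) {
        CountingSource in = new CountingSource(n);
        byte[] scratch = new byte[8192];
        long left = n;
        while (left > 0) {
            int got = in.read(scratch, 0, (int) Math.min(left, scratch.length));
            if (got == -1) {
                break;
            }
            left -= got;
        }
        return in.calls;
    }
}
```

For 100,000 bytes, `skipByteByByte` makes 100,000 calls while `skipInChunks` makes 13.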

Finally, keep in mind that the java.io classes have many, many deficiencies, so don't assume that everything there is there with good reasons.

edited Apr 30 '18 at 07:42 by user207421 · answered Jun 02 '11 at 20:13 by Dilum Ranatunga

curious what the "many, many deficiencies" are in the java.io classes? – jtahlborn Jun 02 '11 at 22:53
Start with non-final protected members, whose access model with respect to threading is completely undefined. There's the use of the magic -1 instead of a public constant. By the time Java 1.1 came around, it was too late; the public API was hosed. – Dilum Ranatunga Jun 03 '11 at 01:27
i assume you mean "-1" for the read() method? okay, nothing stopping them from making that a constant at any point in time, but i wouldn't consider that a "deficiency". I do agree that the protected members are kind of ugly and not well thought out, but i _rarely_ need to muck w/ the io classes in such a way that that is a major issue. for the "average" developer, i would say that the io package is actually in pretty good shape. certainly not in the category of "many, many deficiencies". (i certainly accept that java has its warts, this just seems fairly minor to me). – jtahlborn Jun 03 '11 at 16:44
(especially considering that the `skip()` implementation in question seems fairly reasonable). – jtahlborn Jun 03 '11 at 16:48
@jtahlborn -- Java is one of my favorite languages, but I think we'll have to agree to disagree on "for the average developer, i would say that the io package is actually in pretty good shape". Case in point -- what is the correct idiom for reading the contents of a file and closing the stream? – Dilum Ranatunga Jun 03 '11 at 18:26
@Dilum Ranatunga: I see what you're saying =) The one thing about my third question though, is that in the implementation for byte streams, `read(byte[], int, int)` already does multiple calls to `read()` in one for-loop. But I understand how my idea may be a bad implementation for character streams or even specialised byte streams. I suppose that I have to keep in mind how `InputStream` is an abstract byte stream so it has to use the best possible but most abstract implementations. – user766413 Jun 03 '11 at 19:02
@user766413, `read(byte[], int, int)` is overridden by any serious reader implementation. Concrete implementations will figure out how much to read from whatever underlying source. For example, a reader on top of a UCS2 encoded codepoint stream would need to read one or two bytes per character. So the reader would read `length` bytes, knowing that it may only fill half of the allotted character buffer (array). – Dilum Ranatunga Jun 03 '11 at 19:21
@Dilum Ranatunga Yes, I keep forgetting that, but you're right, and the same thing applies to `InputStream.read(byte[], int, int)` so that means that if that's overridden, then the more efficient implementation is sort of passed onto `skip` as well. Thanks for answering that =) – user766413 Jun 04 '11 at 16:35

score 2 · Answer 3 · edited Oct 10 '19 at 20:11

For InputStream, you often have subclasses which allow much more efficient skipping, and these override the skip method appropriately. But for those subclasses which do not have an efficient way of skipping (like a compressing or decompressing input stream), the skip method is implemented based on reading, so not every subclass has to do the same.

There are several strategies on how to implement this in the java.io package:

Skipping the Base Stream:

FilterInputStream.skip() simply delegates to the source stream. I'm not so sure how useful this is, though.
DataInputStream does not override skip(), but has another method named skipBytes() which does the same thing (only for int arguments, though). It delegates to the underlying source stream.
BufferedInputStream.skip() overrides this, skipping first the existing contents in its own buffer, then calling skip() on the base stream (if there is no mark() set - if there is a mark, it has to read everything into the buffer to support reset()).
PushbackInputStream.skip() skips first over its pushback buffer, and then calls super.skip() (which is FilterInputStream.skip(), see above).

Resetting an Index:

ByteArrayInputStream can trivially support skipping, simply by setting the position where to read next.
StringBufferInputStream (which is a deprecated version of StringReader) supports skipping simply by resetting the index.

Native Magic:

FileInputStream has skip() as a native method. I think this would be the canonical example where it is most useful.

Read Everything and Throw it Away:

LineNumberInputStream.skip() has to read everything to count the lines. (I did not know that this class existed. Use LineNumberReader instead.)
ObjectInputStream does not override skip(), but has another method named skipBytes() which does the same thing (only for int arguments, though). It delegates to an inner class (BlockDataInputStream.skip()), which in turn reads from the underlying stream, respecting the Object stream protocol for block data.

Default implementation in `InputStream`:

This is also used by SequenceInputStream and PipedInputStream.
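To illustrate the "resetting an index" strategy from the list above, here is a minimal sketch of an in-memory stream whose skip() only moves a position and never reads. `IndexedByteStream` is a hypothetical class written for this example. It only mirrors the idea behind ByteArrayInputStream and is not its actual source.

```java
import java.io.InputStream;

/** Hypothetical in-memory stream: skip() advances an index, nothing is read or copied. */
class IndexedByteStream extends InputStream {
    private final byte[] data;
    private int pos; // index of the next byte to return

    IndexedByteStream(byte[] data) {
        this.data = data;
    }

    @Override
    public int read() {
        return pos < data.length ? (data[pos++] & 0xff) : -1;
    }

    // Overrides the read-and-discard default inherited from InputStream.
    @Override
    public long skip(long n) {
        if (n <= 0) {
            return 0;
        }
        long skipped = Math.min(n, data.length - pos);
        pos += (int) skipped;
        return skipped;
    }
}
```

A stream that cannot jump ahead like this simply leaves skip() alone and inherits the scratch-buffer default.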

Let's have a look at the Reader classes. In principle, the same strategies apply:

Skip the Base Reader/Stream:

FilterReader.skip() does this.
PushbackReader first skips its own pushback buffer, then the base reader.

Reset Some Index:

StringReader (this one actually supports backwards skipping)
CharArrayReader

Read Everything and Throw it Away:

The default Reader.skip(), which is also used by PipedReader.
For InputStreamReader, the "simply skip the base stream" approach would work only for charsets with a fixed number of bytes per char (e.g. the ISO-8859 series, UTF-16 and some similar ones), not for UTF-8, UTF-32 or other charsets with a variable number of bytes per character, since we would have to read all the bytes to know how many characters they represent. This also applies to its subclass FileReader.
BufferedReader (it does not call its own read(), but fills its internal buffer, which reads from the base stream).
LineNumberReader (it has to do this to keep track of the line numbers)
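Two of the points above can be tried out with a small sketch that uses only standard JDK classes. The class name `ReaderSkipDemo` and its method names are made up for this example. The first method shows that skipping through an InputStreamReader counts characters rather than bytes, and the second shows StringReader skipping backwards.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

class ReaderSkipDemo {
    /** Skips two characters of UTF-8 text and returns the next one. */
    static char afterSkippingTwoChars() throws IOException {
        // "h\u00e9llo" is 5 characters but 6 bytes in UTF-8 (the accented e takes two bytes).
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            r.skip(2);              // counted in characters: the two skipped chars span 3 bytes
            return (char) r.read(); // the first 'l'
        }
    }

    /** StringReader accepts a negative count and moves its index backwards. */
    static char rereadAfterBackwardSkip() throws IOException {
        StringReader r = new StringReader("abcdef");
        r.read();               // 'a'
        r.read();               // 'b'
        r.skip(-2);             // index goes back to the start
        return (char) r.read(); // 'a' again
    }
}
```

Skipping 2 bytes on the underlying stream in the first case would have landed in the middle of the text at the wrong character, which is why the decoder has to read through the bytes.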

I see =) I suppose I have to keep in mind that above all, `java.io.InputStream` and `java.io.Reader` are just abstract classes and that more specialised classes will override the `skip` method to provide more efficient or appropriate behaviours. — user766413, Jun 03 '11 at 19:13

score 1 · Answer 4 · answered Jun 02 '11 at 20:14


You are forgetting that a buffer of 2^31 - 1 is 2 GB of memory that has to be allocated and then cannot be used for anything else.

Allocating a large contiguous block of 2 gigabytes is overkill for reading in bytes, and it could cause out-of-memory situations.

A maximum memory buffer of 8 kB is a much better trade-off, as it will only be allocated once (and it will be reused on each skip operation).

By the way, in java.io.InputStream the skip buffer is static and only ever allocated once, but as there are no reads from it (it's just used as write-only memory) there is no need to worry about races.
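The shared write-only buffer idea can be sketched like this. `SharedScratchSkipper` and `SCRATCH` are hypothetical names for illustration. This is not the InputStream source, just the same pattern under the assumption that nothing ever reads the buffer back.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical utility: every caller discards bytes into one shared, never-read buffer. */
final class SharedScratchSkipper {
    // Shared by all callers and threads. Its contents are garbage by design and are
    // never read back, so concurrent writes cannot corrupt anything a caller depends on.
    private static final byte[] SCRATCH = new byte[8192];

    private SharedScratchSkipper() {
    }

    static long skip(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            int got = in.read(SCRATCH, 0, (int) Math.min(remaining, SCRATCH.length));
            if (got < 0) {
                break; // end of stream
            }
            remaining -= got;
        }
        return n - remaining; // bytes actually skipped (0 for a non-positive n)
    }
}
```

The pattern stops being safe the moment any code starts trusting what is in the buffer, so it only fits pure discard loops like this one.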

answered Jun 02 '11 at 20:14 by ratchet freak

Yeah, you're right. I had forgotten that each character will be at least one byte. So an 8 kB (or similar) buffer is a lot better. I'm not sure what you meant by races by the way, but I think you may have made a typo when you said that the skipbuff is static. At least in the implementations that I've looked at it wasn't. I had even considered asking why it wasn't, and I reasoned that it would be because you'd have to synchronize on it if you had multiple threads using byte streams, and that would slow down skips significantly. I may be wrong about that. – user766413 Jun 03 '11 at 19:11
@user check http://www.docjar.com/html/api/java/io/InputStream.java.html skipBuf is used in skip and it's static because if you are only writing (using the buffer as scrap memory) you can share the buffer without consequence – ratchet freak Jun 03 '11 at 22:55
Well, actually I was talking about character input stream Readers, but you're right, byte input streams do use a static buffer and a local copy, which brings up the question of why Reader doesn't do that too, since like you said, it's only being used for scrap memory. But thank you for explaining why it's shareable. I was a bit confused about that since it was passed to the `read(byte[], int, int)` method, but now I'm not =) – user766413 Jun 04 '11 at 16:42
