0

I have implemented my own class to read pcap files. (Binary files, i.e. tcpdump, wireshark)

public class PcapReader implements Iterator<PcapPacket> {
    private InputStream is;

    public PcapReader (File file) throws FileNotFoundException, IOException {
        is = this(new DataInputStream(
             new BufferedInputStream(
                 new FileInputStream(file))));
    }

    @Override
    public boolean hasNext () {
        try {
            return (is.available() > 0);
        } catch (IOException e) {
            return false;
        }
    }

    //pseudo code!
    @Override
    public PcapPacket next () {
        is.read(header);
        is.read(body);

        return new PcapPacket(header, body);
    }

    //more code here
}

Then I use it like this:

PcapReader reader = new PcapReader(file);
while (reader.hasNext()) {
    PcapPacket pcapPacket = reader.next();
    //process packet
}

The file under test has 190 Mb. And I also use JVisualVM to profile.

  • hasNext() is called 1.7 million times and time is 7.7 seconds

  • next() is called same number of times and time is 3.6 seconds

My main question is why hasNext() is so time consuming in absolute value and also twice greater than next?

Nikolay Kuznetsov
  • 9,467
  • 12
  • 55
  • 101
  • 1
    I would avoid available() like a pest - its broken by design (returns an int!) for starters, and you have to deal with IOException while actually reading the data anyway. Its very rare one actually *needs* to use available(). – Durandal Mar 06 '13 at 15:36

3 Answers3

2

When you call is.available(), in your hasNext() method, it goes down to FileInputStream.available() implementation. This is a native method, as one may see from FileInputStream source code.

In the end, this is indeed a time-consumming operation, as the Operating System implementation of the file operations will have to check ahead if more data is available to be read. So, it will actually do a read operation without updating the file pointer (or updating it back to the original position), just to check if there is a "next" byte.

Filipe Fedalto
  • 2,540
  • 1
  • 18
  • 21
  • Indeed, you just passed ahead me with this explanation for 20 seconds! +1 – Andremoniy Mar 06 '13 at 14:24
  • +1, but why it goes down to `FileInputStream.available()` if there is a buffered wrapper? How can I make `hasNext()` effictive? – Nikolay Kuznetsov Mar 06 '13 at 14:32
  • 1
    @NikolayKuznetsov You can implement this more effective by blindly attempting to read in the *hasNext* method. It either fails (EOF / IOException), then hasNext = false. If it succeeds, *remember* the packet read and return it on the *next()* call. No need for available() anywhere. – Durandal Mar 06 '13 at 18:13
  • @Durandal, well, in that case invoking `hasNext()` several times would skip some packets. I think that it is better to drop `hasNext()` and `Iterator` at all. – Nikolay Kuznetsov Mar 07 '13 at 03:07
  • @NikolayKuznetsov That was a rough outline. I expect the reader to fill in the missing state transitions. Obviously hasNext() would need to check if its memory slot is already occupied. Same with next(), obviously it needs to call hasNext() internally and throw NoSuchElementException if the memory slot is empty... doh! Drop the iterator if you want, but don't blame it on not being possible to implement properly. – Durandal Mar 07 '13 at 14:41
1

I'm sure, that internal (native) implementation of available() method is not something like just returning some return availableSize;, but more complicated. Stream counts available data using OS API; especially, for example, for log files, which are written due Stream reads them.

Andremoniy
  • 34,031
  • 20
  • 135
  • 241
  • +1, but why it goes down to `FileInputStream.available()` if there is a buffered wrapper? How can I make `hasNext()` effictive? – Nikolay Kuznetsov Mar 06 '13 at 14:33
  • 1
    Because BufferedInputStream.available() implementation actually calls the underlying InputStream one: `return getInIfOpen().available() + (count - pos);`. Please note that the `available()` method does not just check for availability, but actually returns the number of available bytes. So it can't count only on what's left in the buffer. – Filipe Fedalto Mar 06 '13 at 14:45
1

I have implemented my own class to read pcap files.

Because you're not using jNetPcap, or because you are using jNetPcap but need something that can read from a File?

If the latter, you probably want to use a pattern other than one that has a "more data is available" method and a separate "so read that data" method; something that reads the data and either returns a "packet available"/"end of file"/"error" indication or throws an exception for one or both of the latter conditions (DataInputStream appears to throw exceptions for both I/O errors and EOF, so it might make sense to do the same for your class).

Yeah, that means it can't be an Iterator, but maybe Iterators weren't originally intended to represent records in a sequential file (besides, if you really want it to be an Iterator, what are you going to do about the remove method?).

And if you can avoid needing to read from a File, you could then use jNetPcap's own routines for reading capture files, which, in libpcap 1.1.0 and later, can also read some pcap-ng files.