
I've written the following test code:

@Test
public void testLeakWithGrizzly() throws Throwable {
    ExecutorService executor = Executors.newFixedThreadPool(N_THREADS);
    Set<Future<Void>> futures = new HashSet<>();
    InetSocketAddress inetSocketAddress = new InetSocketAddress(localhostAddress, 111);
    for (int i = 0; i < N_THREADS; i++) {
        Future<Void> future = executor.submit(new GrizzlyConnectTask(inetSocketAddress, requests, bindFailures, successfulOpens, failedOpens, successfulCloses, failedCloses));
        futures.add(future);
    }
    for (Future<Void> future : futures) {
        future.get(); //block
    }
    Thread.sleep(1000); //let everything calm down
    reporter.report();
    throw causeOfDeath;
}

private static class GrizzlyConnectTask implements Callable<Void> {
    private final InetSocketAddress address;
    private final Meter requests;
    private final Meter bindFailures;
    private final Counter successfulOpens;
    private final Counter failedOpens;
    private final Counter successfulCloses;
    private final Counter failedCloses;

    public GrizzlyConnectTask(InetSocketAddress address, Meter requests, Meter bindFailures, Counter successfulOpens, Counter failedOpens, Counter successfulCloses, Counter failedCloses) {
        this.address = address;
        this.requests = requests;
        this.bindFailures = bindFailures;
        this.successfulOpens = successfulOpens;
        this.failedOpens = failedOpens;
        this.successfulCloses = successfulCloses;
        this.failedCloses = failedCloses;
    }

    @Override
    public Void call() throws Exception {
        while (!die) {
            TCPNIOTransport transport = null;
            boolean opened = false;
            try {
                transport = TCPNIOTransportBuilder.newInstance().build();
                transport.start();
                transport.connect(address).get(); //block
                opened = true;
                successfulOpens.inc(); //successful open
                requests.mark();
            } catch (Throwable t) {
                //noinspection ThrowableResultOfMethodCallIgnored
                Throwable root = getRootCause(t);
                if (root instanceof BindException) {
                    bindFailures.mark(); //ephemeral port exhaustion.
                    continue;
                }
                causeOfDeath = t;
                die = true;
            } finally {
                if (!opened) {
                    failedOpens.inc();
                }
                if (transport != null) {
                    try {
                        transport.shutdown().get(); //block
                        successfulCloses.inc(); //successful close
                    } catch (Throwable t) {
                        failedCloses.inc();
                        System.err.println("while trying to close transport");
                        t.printStackTrace();
                    }
                } else {
                    //no transport == successful close
                    successfulCloses.inc();
                }
            }
        }
        return null;
    }
}

on my linux laptop, this crashes in ~5 minutes with the following exception:

java.io.IOException: Too many open files
    at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
    at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:130)
    at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:68)
    at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
    at org.glassfish.grizzly.nio.Selectors.newSelector(Selectors.java:62)
    at org.glassfish.grizzly.nio.SelectorRunner.create(SelectorRunner.java:109)
    at org.glassfish.grizzly.nio.NIOTransport.startSelectorRunners(NIOTransport.java:256)
    at org.glassfish.grizzly.nio.NIOTransport.start(NIOTransport.java:475)
    at net.radai.LeakTest$GrizzlyConnectTask.call(LeakTest.java:137)
    at net.radai.LeakTest$GrizzlyConnectTask.call(LeakTest.java:111)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

the success/fail counters look like this:

-- Counters --------------------------------------------------------------------
failedCloses
             count = 0
failedOpens
             count = 40999
successfulCloses
             count = 177177
successfulOpens
             count = 136178

-- Meters ----------------------------------------------------------------------
bindFailures
             count = 40998
         mean rate = 153.10 events/second
     1-minute rate = 144.61 events/second
     5-minute rate = 91.12 events/second
    15-minute rate = 39.56 events/second
requests
             count = 136178
         mean rate = 508.54 events/second
     1-minute rate = 547.38 events/second
     5-minute rate = 442.76 events/second
    15-minute rate = 391.53 events/second

which tells me that (see the sanity-check sketch right after this list):

  • there were no close failures
  • all connections either failed to create or were successfully closed (136178 + 40999 = 177177)
  • all failures to open were ephemeral port exhaustion except for the last one (40999 = 40998 + 1)
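
A quick way to sanity-check that arithmetic in code (a hypothetical helper, not part of the test above; it assumes the Counter/Meter fields are Dropwizard Metrics instances, which the reporter output suggests) would be something like:

private static void checkInvariants(Counter successfulOpens, Counter failedOpens,
                                    Counter successfulCloses, Counter failedCloses,
                                    Meter bindFailures) {
    //no close ever failed
    assert failedCloses.getCount() == 0;
    //every attempt either failed to open, or was opened and later closed
    assert successfulOpens.getCount() + failedOpens.getCount() == successfulCloses.getCount();
    //every open failure was ephemeral port exhaustion, except the single fatal one
    assert failedOpens.getCount() == bindFailures.getCount() + 1;
}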

the complete code is up on github here - https://github.com/radai-rosenblatt/oncrpc4j-playground/blob/master/src/test/java/net/radai/LeakTest.java

so, am i somehow misusing the grizzly API, or is this a real leak? (note - im using grizzly 2.3.12, which i know is not the latest. upgrading would require convincing people, which is why i want to be positive this is not user error on my end)

EDIT - this thing leaks even when nothing is thrown. cutting back to a single thread and putting a 2ms sleep in there still leaks 800 pipes over 50 minutes.
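
for reference, the single-threaded variant from the EDIT looked roughly like this (a sketch, not the exact code - the iteration count and address are placeholders, the real run just went for ~50 minutes):

private static void singleThreadedRepro(InetSocketAddress address, int iterations) throws Exception {
    for (int i = 0; i < iterations; i++) {
        TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance().build();
        transport.start();
        try {
            transport.connect(address).get(); //block until connected (or failed)
        } catch (Exception ignored) {
            //connect failures (ephemeral port exhaustion etc.) are irrelevant to the leak
        } finally {
            transport.shutdown().get(); //block until shutdown completes
        }
        Thread.sleep(2); //the 2ms pause mentioned above
    }
}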

radai
  • You probably forgot to use pipe thread compound. – Hot Licks Aug 15 '15 at 16:04
  • Reuse the same transport instance, it's not supposed to be per connection. – alexey Aug 20 '15 at 06:31
  • @alexey - this is a simplified contrived example. the real code opens very short lived connections to various computers where connection (transport) pooling makes little sense (i think?) – radai Aug 20 '15 at 07:29

2 Answers

I found the problem deep inside Grizzly. It is an internal multi-threading issue (a race condition). File descriptors are leaked by the sun.nio.ch.EPollSelectorImpl class: each instance holds 3 file descriptors (2 for its wakeup pipe and 1 from the epoll_create syscall). Grizzly initiates the close/shutdown in the SelectorRunner class:

    public synchronized void stop() {
        stateHolder.set(State.STOPPING);
        wakeupSelector();

        // we prefer Selector thread shutdown selector
        // but if it's not running - do that ourselves.
        if (runnerThreadActivityCounter.compareAndSet(0, -1)) {
             // The thread is not running
            shutdownSelector();
        }
    }

Usually everything is fine, but sometimes the selector never wakes up. The wakeup method sends an interrupt via the native method sun.nio.ch.EPollArrayWrapper#interrupt(int), which has a simple implementation:

JNIEXPORT void JNICALL
Java_sun_nio_ch_EPollArrayWrapper_interrupt(JNIEnv *env, jobject this, int fd)
{
    int fakebuf[1];
    fakebuf[0] = 1;
    if (write(fd, fakebuf, 1) < 0) {
        JNU_ThrowIOExceptionWithLastError(env,"write to interrupt fd failed");
    }
}

So it just writes one byte to wake up the waiting selector. But you close the transport immediately after creating it. That is a rare situation in real life, but it happens regularly in your test case. Sometimes Grizzly calls NIOConnection.enableIOEvent after the close and the wakeup/interrupt, and I think in that case the selector never wakes up and never frees its file descriptors.
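
If you want to watch this happen from the test side, a crude way (not part of the fix, and Linux-specific) is to poll the process's open file descriptor count between iterations:

//Linux-specific: count the JVM's open file descriptors by listing /proc/self/fd.
//Each leaked EPollSelectorImpl should show up as roughly 3 extra entries here.
static int openFdCount() {
    String[] fds = new java.io.File("/proc/self/fd").list();
    return fds == null ? -1 : fds.length;
}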

For now I can only suggest a hotfix for this situation: use a timer task to call selector.close() directly after a short timeout:

//hotfix code below
private static final Timer timer = new Timer();
//hotfix code above
protected synchronized void stopSelectorRunners() {
    if (selectorRunners == null) {
        return;
    }

    for (int i = 0; i < selectorRunners.length; i++) {
        SelectorRunner runner = selectorRunners[i];
        if (runner != null) {
            runner.stop();
            //hotfix code below
            final Selector selector = runner.getSelector();
            if (selector != null) {
                timer.schedule(new TimerTask() {
                    @Override
                    public void run() {
                        try {
                            selector.close();
                        } catch (IOException e) {
                            //best effort: the selector is being discarded anyway
                        }
                    }
                }, 100);
            }
            //hotfix code above
            selectorRunners[i] = null;
        }
    }

    selectorRunners = null;
}

I was able to stop the leak after adding this to org.glassfish.grizzly.nio.NIOTransport#stopSelectorRunners.

sibnick
  • Thank you very much for your work in tracking this down. I'll post your workaround to the grizzly bug i've opened in the meantime (https://java.net/jira/browse/GRIZZLY-1797). best SO rep. points I've ever spent :-D – radai Aug 22 '15 at 08:47
  • "It is rarely situation in real life" - sadly not true. i wrote this code after tracking down a real leak in a real product the company i work for is developing. granted, the real leak was much much slower, but very real. – radai Aug 22 '15 at 08:52
  • You are welcome. It was an interesting question. FYI I also found a minor bug in sun/nio/ch/EPollSelectorImpl: it allocates three file descriptors in the constructor, and does not close two of them if the third descriptor can't be opened. I posted a report to Oracle, but currently I have only a Review ID. – sibnick Aug 26 '15 at 06:51
  • Not sure I understand what exactly doesn't work as expected. First you say "sometimes the Selector never wakes up" - does that mean there should be lots of threads blocked at Selector.select()? Could you pls. check if that's the case? Then you mentioned that NIOConnection.enableIOEvent could cause the problem if we call it after closing/wakeup... how is this related to the first statement? Thanks! – alexey Sep 01 '15 at 00:22

We found the actual underlying problem within Grizzly and have fixed it.

The root of the issue, based on the test case, is that Transport.stop() was being called early enough in the execution of SelectorRunner.run() to make the run method terminate early (because the StateHolder was already in a stopped state at that point).

Additionally, because SelectorRunner.run() compare-and-sets (CAS) a selector-activity state change at the beginning of the run() method, the thread calling Transport.stop() sees the selector as active. Because of these two conditions, SelectorRunner.shutdownSelector() is never called, and therefore we'd leak selectors.
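
To make the interleaving concrete, here is a stripped-down illustration of that race (made-up names and a simplified structure, not Grizzly's actual classes):

import java.util.concurrent.atomic.AtomicInteger;

//Illustration only: a runner whose run() claims "active" via CAS before checking
//the stop flag, and a stop() that skips cleanup whenever the runner looks active.
class RunnerRaceSketch {
    private final AtomicInteger activity = new AtomicInteger(0); // 0 = idle, 1 = running, -1 = stopped
    private volatile boolean stopRequested = false;

    void run() {
        if (!activity.compareAndSet(0, 1)) {
            return; //someone else owns the runner
        }
        try {
            if (stopRequested) {
                return; //stop() arrived early: bail out WITHOUT closing the selector
            }
            // ... normal select loop would go here ...
        } finally {
            activity.set(0);
        }
    }

    void stop() {
        stopRequested = true;
        //stop() only cleans up if the runner looks idle; if run() has already
        //CAS'd activity to 1 (but not yet reached its select loop), neither
        //side ends up closing the selector -> it leaks.
        if (activity.compareAndSet(0, -1)) {
            // shutdownSelector() would be called here
        }
    }
}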

The fix will be available in tonight's nightly build.

rlubke