1

I have a maven build that I'm running in a docker container using the official maven image from the Docker hub. The .m2 directory is mounted to an NFS share.

This works in one environment, but in another identical environment it will always hang after writing the lock file. It never completes the download, but just hangs there forever. Since maven debug didn't give me any details after the point where it hung, I decided to watch the .m2 directory to see what happened.

ubuntu@kubernetes-dev-nfs-pv:/nfs-shares/jenkins/.m2$ inotifywait -m -r .
Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
./ CREATE,ISDIR repository
./ OPEN,ISDIR repository
./ CLOSE_NOWRITE,CLOSE,ISDIR repository
./repository/ CREATE,ISDIR org
./repository/ OPEN,ISDIR org
./repository/ CLOSE_NOWRITE,CLOSE,ISDIR org
./repository/org/ CREATE,ISDIR springframework
./repository/org/ OPEN,ISDIR springframework
./repository/org/ CLOSE_NOWRITE,CLOSE,ISDIR springframework
./repository/org/springframework/ CREATE,ISDIR boot
./repository/org/springframework/ OPEN,ISDIR boot
./repository/org/springframework/ CLOSE_NOWRITE,CLOSE,ISDIR boot
./repository/org/springframework/boot/ CREATE,ISDIR spring-boot-starter-parent
./repository/org/springframework/boot/ OPEN,ISDIR spring-boot-starter-parent
./repository/org/springframework/boot/ CLOSE_NOWRITE,CLOSE,ISDIR spring-boot-starter-parent
./repository/org/springframework/boot/spring-boot-starter-parent/ CREATE,ISDIR 1.3.7.RELEASE
./repository/org/springframework/boot/spring-boot-starter-parent/ OPEN,ISDIR 1.3.7.RELEASE
./repository/org/springframework/boot/spring-boot-starter-parent/ CLOSE_NOWRITE,CLOSE,ISDIR 1.3.7.RELEASE
./repository/org/springframework/boot/spring-boot-starter-parent/1.3.7.RELEASE/ CREATE spring-boot-starter-parent-1.3.7.RELEASE.pom.part.lock

Maven seems to be working, it creates a number of folders and even the lock file, but then it hangs. How can I get maven to complete or find some additional information to help me solve this.

By the way, if I use ephemeral storage inside the container, it downloads the package as expected.

UDPATE: one of the comments suggested a thread dump. Below you can see I attached to the running container. I confirm that the container can modify files in the .m2 directory and then I use jstack to get a thread dump of the process.

ec2-user@murano-necrhj0ld3vkx4-kube-3-gm5dmsfnftxn:~$ sudo docker ps
CONTAINER ID        IMAGE                                                     COMMAND                  CREATED             STATUS              PORTS               NAMES
c7d1f4c91559        maven:alpine                                              "cat"                    About an hour ago   Up About an hour                        agitated_cori
ec2-user@murano-necrhj0ld3vkx4-kube-3-gm5dmsfnftxn:~$ sudo docker exec -ti c7d1f4c91559 /bin/bash
bash-4.3$ ps
PID   USER     TIME   COMMAND
    1 1000       0:00 cat
    6 1000       0:00 sh -c echo $$ > '/var/jenkins_home/workspace/api-product@tmp/durable-ca9825bd/pid'; jsc=durable-04ba6b757bca34373f180bd01ef64ca1; JENKINS_SERVER_COOKIE=$jsc '/var/jenkins_home/workspace/api-product@tmp/durable-ca
   12 1000       0:00 {script.sh} /bin/sh -xe /var/jenkins_home/workspace/api-product@tmp/durable-ca9825bd/script.sh
   13 1000       0:07 /usr/lib/jvm/java-1.8-openjdk/bin/java -classpath /usr/share/maven/boot/plexus-classworlds-2.5.2.jar -Dclassworlds.conf=/usr/share/maven/bin/m2.conf -Dmaven.home=/usr/share/maven -Dmaven.multiModuleProjectDirecto
 1584 1000       0:00 /bin/bash
 1589 1000       0:00 ps
bash-4.3$ cat /var/jenkins_home/workspace/api-product@tmp/durable-ca9825bd/script.sh
#!/bin/sh -xe
mvn -Dmaven.repo.local="$PWD"/../../.m2/repository clean compile
bash-4.3$ ls -la /var/jenkins_home/.m2/
total 16
drwxr-xr-x    3 1000     1000          4096 May  2 21:14 .
drwxrwxr-x   23 1000     1000          4096 May  3 11:55 ..
-rw-r--r--    1 1000     1000             6 May  2 21:14 file.txt
drwxr-xr-x    3 1000     1000          4096 May  2 20:50 repository
bash-4.3$ cat /var/jenkins_home/.m2/file.txt
hello
bash-4.3$ vi /var/jenkins_home/.m2/file.txt
bash-4.3$ cat /var/jenkins_home/.m2/file.txt
hello
another

bash-4.3$ jstack 13
2017-05-03 13:04:37
Full thread dump OpenJDK 64-Bit Server VM (25.121-b13 mixed mode):

"Attach Listener" #11 daemon prio=9 os_prio=0 tid=0x00007fc4a4956800 nid=0x6a7 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Service Thread" #8 daemon prio=9 os_prio=0 tid=0x00007fc4a4343000 nid=0x2c runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread2" #7 daemon prio=9 os_prio=0 tid=0x00007fc4a4311800 nid=0x2b waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007fc4a4302000 nid=0x2a waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007fc4a42ff000 nid=0x29 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007fc4a42fc800 nid=0x28 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007fc4a42d5000 nid=0x27 in Object.wait() [0x00007fc48ba4b000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000dab108d8> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
    - locked <0x00000000dab108d8> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007fc4a42ca800 nid=0x26 in Object.wait() [0x00007fc48bb4c000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000dab18178> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:502)
    at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
    - locked <0x00000000dab18178> (a java.lang.ref.Reference$Lock)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

"main" #1 prio=5 os_prio=0 tid=0x00007fc4a4179800 nid=0x20 runnable [0x00007fc4a3426000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90)
    at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115)
    at org.eclipse.aether.connector.basic.PartialFile$LockFile.tryLock(PartialFile.java:135)
    at org.eclipse.aether.connector.basic.PartialFile$LockFile.lock(PartialFile.java:80)
    at org.eclipse.aether.connector.basic.PartialFile$LockFile.<init>(PartialFile.java:67)
    at org.eclipse.aether.connector.basic.PartialFile$Factory.newInstance(PartialFile.java:219)
    at org.eclipse.aether.connector.basic.BasicRepositoryConnector$GetTaskRunner.runTask(BasicRepositoryConnector.java:441)
    at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:359)
    at org.eclipse.aether.util.concurrency.RunnableErrorForwarder$1.run(RunnableErrorForwarder.java:76)
    at org.eclipse.aether.connector.basic.BasicRepositoryConnector$DirectExecutor.execute(BasicRepositoryConnector.java:590)
    at org.eclipse.aether.connector.basic.BasicRepositoryConnector.get(BasicRepositoryConnector.java:258)
    at org.eclipse.aether.internal.impl.DefaultArtifactResolver.performDownloads(DefaultArtifactResolver.java:529)
    at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:430)
    at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts(DefaultArtifactResolver.java:255)
    at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifact(DefaultArtifactResolver.java:232)
    at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveArtifact(DefaultRepositorySystem.java:303)
    at org.apache.maven.project.ProjectModelResolver.resolveModel(ProjectModelResolver.java:193)
    at org.apache.maven.project.ProjectModelResolver.resolveModel(ProjectModelResolver.java:243)
    at org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:1051)
    at org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:829)
    at org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:331)
    at org.apache.maven.project.DefaultProjectBuilder.build(DefaultProjectBuilder.java:429)
    at org.apache.maven.project.DefaultProjectBuilder.build(DefaultProjectBuilder.java:398)
    at org.apache.maven.project.DefaultProjectBuilder.build(DefaultProjectBuilder.java:361)
    at org.apache.maven.graph.DefaultGraphBuilder.collectProjects(DefaultGraphBuilder.java:400)
    at org.apache.maven.graph.DefaultGraphBuilder.getProjectsForMavenReactor(DefaultGraphBuilder.java:391)
    at org.apache.maven.graph.DefaultGraphBuilder.build(DefaultGraphBuilder.java:78)
    at org.apache.maven.DefaultMaven.buildGraph(DefaultMaven.java:511)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:221)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:194)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:107)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:993)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:345)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:191)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)

"VM Thread" os_prio=0 tid=0x00007fc4a42c0000 nid=0x25 runnable

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007fc4a4190800 nid=0x21 runnable

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007fc4a4192000 nid=0x22 runnable

"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007fc4a4194000 nid=0x23 runnable

"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007fc4a4195800 nid=0x24 runnable

"VM Periodic Task Thread" os_prio=0 tid=0x00007fc4a4382800 nid=0x2d waiting on condition

JNI global references: 235

bash-4.3$

Inside the container I have just confirmed that I can access the pom and I show the debug output, which just hangs at downloading.

https://gist.github.com/dwatrous/34e1edc1db5e4756d4b33c83a9c2ccd0

Daniel Watrous
  • 3,467
  • 2
  • 36
  • 48
  • I just confirmed that I can write files from within the maven container to the .m2 directory, so I really doubt this has anything to do with NFS or file permissions. This appears to be primarily a maven problem. – Daniel Watrous May 02 '17 at 21:39
  • Does it work when not using NFS? As it works in another environment something has to be different. – Erik Dannenberg May 03 '17 at 01:11
  • A thread dump of the hang will be useful for diagnosing the problem. – Steve C May 03 '17 at 03:54
  • You are using the .m2/repository on a NFS share. Does that mean you are accessing this share from several instances running mvn builds ? – khmarbaise May 03 '17 at 17:33
  • Yes, there are potentially multiple maven processes running at the same time, but that isn't the case right now. – Daniel Watrous May 03 '17 at 18:04

2 Answers2

1

It is highly likely related to NFS, and file locking bugs and/or semantics.

Others have reported similar problems with FileChannel#tryLock over NFS; see e.g. JDK-8156026 and JDK-8065927.

The contract of that method says that tryLock does not block, so any blocking which occurs is due to a native system call not returning when it should. It is possible Maven could try to work around such bugs somehow, but I think any attempt to do so would be quite hacky and maybe introduce more bugs than it avoids.

You could try different versions of Java, including both Oracle and OpenJDK, at different release versions...

ctrueden
  • 6,751
  • 3
  • 37
  • 69
  • I suspect this answer is the right direction, but I'm not sure how to further troubleshoot. Between two environments, one works and the other does not. I'm interested in figuring out what is wrong with this NFS server so it will work like the other one. I didn't see a resolution cited in the two issues you linked. – Daniel Watrous May 03 '17 at 18:11
  • You say your two environments are "identical", but _something_ must be different, no? Is the hang deterministically reproducible on one system, and deterministically not present on the other? Are you running builds concurrently across multiple systems? I am certainly no expert on NFS, but maybe this article has something useful: https://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch11_03.htm ? – ctrueden May 03 '17 at 20:57
  • One difference that I'm looking at is that the working system uses ephemeral disk while the system that hangs at the file lock uses a cinder volume for NFS data. – Daniel Watrous May 03 '17 at 21:02
  • 1
    I have confirmed that when I host the NFS files on ephemeral disk, I don't get this filelock problem, and maven doesn't hang. I'm still looking at why. – Daniel Watrous May 03 '17 at 21:15
0

In our case the problem was definitely related to out NetApp ONTAP NFS and was caused by not having port 4045 for nlockmgr open in the firewall. Writing files worked fine, locking files before writing, like maven does for its cache did not.

Hard to debug because nothing timed out, the process just hung until the port was opened. Really curious why it does not time out if the port is closed.

Which Network File System (NFS) TCP and NFS UDP ports are used on the storage system?