We have two TypeScript apps, both created through CRA (Create React App), and a CI pipeline that runs a series of npm commands to test and lint the apps and build them for later stages:
time npm install --no-optional --unsafe-perm
npm test -- --coverage
npm run tsc
npm run lint
export REACT_APP_VERSION=$VERSION
export REACT_APP_COMMIT=$GIT_COMMIT
npm run build
npm run build-storybook
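(A side note on the two exports: CRA only exposes environment variables prefixed with `REACT_APP_`, and it inlines their values into the JS bundle at build time, so they have to be set before `npm run build`. A quick sanity check that the values actually made it into a build, assuming CRA's default output layout:)

```sh
# List the compiled chunks that contain the version string; $VERSION is
# whatever the pipeline exported before `npm run build`.
grep -l "$VERSION" build/static/js/*.js
```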
Our CI pipeline runs in Jenkins, and we're using the kubernetes plugin to get executors on demand. The script is run in parallel for `app1` and `app2` via the following logic in our `Jenkinsfile`:
stage('Frontend - App1') {
    agent {
        kubernetes {
            label 'node'
            defaultContainer 'jnlp'
            yamlFile 'infrastructure/scripts/ci/pod-templates/node.yaml'
            idleMinutes 30
        }
    }
    environment {
        CI = 'true'
        NPMRC_SECRET_FILE_PATH = credentials('verdaccio-npmrc')
    }
    steps {
        dir('frontend/app1') {
            container('node') {
                sh 'cp $NPMRC_SECRET_FILE_PATH ~/.npmrc'
                sh 'chmod u+rw ~/.npmrc'
                sh '../../infrastructure/scripts/ci/build-frontend.sh'
            }
            publishHTML(target: [
                allowMissing         : false,
                alwaysLinkToLastBuild: false,
                keepAll              : true,
                reportDir            : 'coverage',
                reportFiles          : 'index.html',
                reportName           : "Coverage Report (app1)"
            ])
            junit 'testing/junit.xml'
            stash includes: 'build/**/*', name: 'app1-build'
            stash includes: 'storybook-static/**/*', name: 'app1-storybook-build'
        }
    }
}
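(Not shown: `node.yaml`, which is where the container's CPU/memory requests and limits would live. For anyone trying to reproduce this: CPU limits aren't visible to most tools inside a pod — Jest, for example, sizes its worker pool from `os.cpus()`, which reports the host's core count — so the two numbers can disagree. They can be compared from inside the agent container like this, assuming cgroup v1 paths:)

```sh
# What Node (and therefore Jest's default worker pool) thinks is available
node -p "require('os').cpus().length"

# What the cgroup actually allows: quota / period = the CPUs' worth of
# time the container may use (-1 means no limit). cgroup v1 paths assumed.
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
```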
So, onto what we're seeing. Repeatedly yesterday we saw the same symptoms: the frontend stage for `app1` (the smaller of the two) would complete, whilst `app2` would mysteriously stop in the middle of running tests (the last line of logging in Jenkins was always `PASS src/x/y/file.test.ts`, but not always the same test). It would remain in this state for a full hour before being killed by our pipeline timeout (or a bored developer).
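(Under the hood `npm test` here runs Jest via `react-app-rewired`, so the worker pool can be taken out of the picture with standard Jest flags — we haven't confirmed whether this changes anything, but it's a cheap bisection:)

```sh
# Run all tests in the main process -- no worker children at all
npm test -- --coverage --runInBand

# Or just cap the pool size instead
npm test -- --coverage --maxWorkers=2
```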
We ran `kubectl exec -it node-blah sh` to get onto the pod that was running the stuck stage and get some diagnostics. Running `ps aux | cat` gives us this:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
node 1 0.0 0.0 4396 720 ? Ss+ 08:51 0:00 cat
node 17 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 32 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 47 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 664 0.0 0.0 0 0 ? Z 09:04 0:00 [sh] <defunct>
.
.
.
node 6760 0.0 0.0 4340 108 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6761 0.0 0.0 4340 1060 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6762 0.0 0.0 4340 812 ? S 10:36 0:00 sh -xe /home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh
node 6764 0.0 0.0 20096 2900 ? S 10:36 0:00 /bin/bash ../../infrastructure/scripts/ci/build-frontend.sh
node 6804 0.0 0.5 984620 38552 ? Sl 10:37 0:00 npm
node 6816 0.0 0.0 4356 836 ? S 10:37 0:00 sh -c react-app-rewired test --reporters default --reporters jest-junit "--coverage"
node 6817 0.0 0.4 877704 30220 ? Sl 10:37 0:00 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/.bin/react-app-rewired test --reporters default --reporters jest-junit --coverage
node 6823 0.4 1.3 1006148 97108 ? Sl 10:37 0:06 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/react-app-rewired/scripts/test.js --reporters default --reporters jest-junit --coverage
node 6881 2.8 2.6 1065992 194076 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6886 2.8 2.6 1067004 195748 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6898 2.9 2.5 1058872 187360 ? Sl 10:37 0:43 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6905 2.8 2.4 1054256 183492 ? Sl 10:37 0:42 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6910 2.8 2.6 1067812 196344 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6911 2.7 2.6 1063680 191088 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6950 0.8 1.9 1018536 145396 ? Sl 10:38 0:11 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 7833 0.0 0.0 4340 804 ? Ss 10:59 0:00 sh
node 7918 0.0 0.0 4240 652 ? S 11:01 0:00 sleep 3
node 7919 0.0 0.0 17508 2048 ? R+ 11:01 0:00 ps aux
node 7920 0.0 0.0 4396 716 ? S+ 11:01 0:00 cat
From the manual on ps:
S interruptible sleep (waiting for an event to complete)
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
So I think what this shows is that the tests started running fine and spawned child processes to run them in parallel, and then for whatever reason those worker processes all went into interruptible sleep and are no longer doing anything: they started at 10:37, but had each accumulated only 40 seconds or so of CPU time by the 11:01 snapshot above.
We're pretty stumped as to how to investigate this further, particularly given the awkwardness of not easily being able to install whatever we like into the pod (no easy root access)... but any suggested theories / next steps would be welcomed!
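(For concreteness, here's the kind of thing we can still do from inside the pod without root, using one of the worker PIDs from the `ps` listing above:)

```sh
# Are the worker's threads actually asleep, and blocked on what in the
# kernel? Both files are readable by the process's own user.
grep -E 'State|Threads' /proc/6881/status
cat /proc/6881/wchan; echo

# Node starts its inspector when it receives SIGUSR1 (default port 9229),
# so a stuck worker can be attached to -- e.g. via `kubectl port-forward`
# plus chrome://inspect -- to capture a JS stack trace.
kill -USR1 6881
```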
**EDIT**
It seems `idleMinutes` wasn't the culprit: we've seen the issue happen several times today since reverting it. I've been able to verify that the script was running on a brand-new Kubernetes node which hadn't been used by any other builds previously. So now I have no idea what's even changed recently to make this start happening :(