We have two TypeScript apps, both created with Create React App (CRA), and a CI pipeline that runs a series of npm commands to test, lint, and build the apps for later stages:

time npm install --no-optional --unsafe-perm
npm test -- --coverage

npm run tsc
npm run lint

export REACT_APP_VERSION=$VERSION
export REACT_APP_COMMIT=$GIT_COMMIT

npm run build
npm run build-storybook
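
As an aside, CRA inlines REACT_APP_* variables into the JavaScript bundle at build time, so a quick sanity check that the exported values actually made it in is to grep the build output for them (purely an illustration, and it assumes $VERSION is a reasonably distinctive string):

# the exported version string should appear verbatim in one of the compiled chunks
grep -rl "$VERSION" build/static/js/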

Our CI pipeline runs in Jenkins, and we're using the Kubernetes plugin to get executors on demand. The script above is run in parallel for app1 and app2 via the following logic in our Jenkinsfile:

stage('Frontend - App1') {
    agent {
        kubernetes {
            label 'node'
            defaultContainer 'jnlp'
            yamlFile 'infrastructure/scripts/ci/pod-templates/node.yaml'
            idleMinutes 30
        }
    }
    environment {
        CI = 'true'
        NPMRC_SECRET_FILE_PATH = credentials('verdaccio-npmrc')
    }
    steps {
        dir('frontend/app1') {
            container('node') {
                sh 'cp $NPMRC_SECRET_FILE_PATH ~/.npmrc'
                sh 'chmod u+rw ~/.npmrc'
                sh '../../infrastructure/scripts/ci/build-frontend.sh'
            }
            publishHTML(target: [
                    allowMissing         : false,
                    alwaysLinkToLastBuild: false,
                    keepAll              : true,
                    reportDir            : 'coverage',
                    reportFiles          : 'index.html',
                    reportName           : "Coverage Report (app1)"
            ])
            junit 'testing/junit.xml'
            stash includes: 'build/**/*', name: 'app1-build'
            stash includes: 'storybook-static/**/*', name: 'app1-storybook-build'
        }
    }
}

So, on to what we're seeing. Repeatedly yesterday we saw the same symptoms: the frontend stage for app1 (the smaller of the two apps) would complete, whilst app2 would mysteriously stall in the middle of running tests (the last line of logging in Jenkins was always PASS src/x/y/file.test.ts, but not always the same test). It would remain in this state for a full hour before getting killed by our pipeline timeout (or a bored developer).

We ran kubectl exec -it node-blah sh to get onto the pod that was running the stuck stage and gather some diagnostics. Running ps aux | cat gives us this:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
node           1  0.0  0.0   4396   720 ?        Ss+  08:51   0:00 cat
node          17  0.0  0.0      0     0 ?        Z    08:51   0:00 [sh] <defunct>
node          32  0.0  0.0      0     0 ?        Z    08:51   0:00 [sh] <defunct>
node          47  0.0  0.0      0     0 ?        Z    08:51   0:00 [sh] <defunct>
node         664  0.0  0.0      0     0 ?        Z    09:04   0:00 [sh] <defunct>
.
.
.
node        6760  0.0  0.0   4340   108 ?        S    10:36   0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node        6761  0.0  0.0   4340  1060 ?        S    10:36   0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node        6762  0.0  0.0   4340   812 ?        S    10:36   0:00 sh -xe /home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh
node        6764  0.0  0.0  20096  2900 ?        S    10:36   0:00 /bin/bash ../../infrastructure/scripts/ci/build-frontend.sh
node        6804  0.0  0.5 984620 38552 ?        Sl   10:37   0:00 npm                                       
node        6816  0.0  0.0   4356   836 ?        S    10:37   0:00 sh -c react-app-rewired test --reporters default --reporters jest-junit "--coverage"
node        6817  0.0  0.4 877704 30220 ?        Sl   10:37   0:00 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/.bin/react-app-rewired test --reporters default --reporters jest-junit --coverage
node        6823  0.4  1.3 1006148 97108 ?       Sl   10:37   0:06 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/react-app-rewired/scripts/test.js --reporters default --reporters jest-junit --coverage
node        6881  2.8  2.6 1065992 194076 ?      Sl   10:37   0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6886  2.8  2.6 1067004 195748 ?      Sl   10:37   0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6898  2.9  2.5 1058872 187360 ?      Sl   10:37   0:43 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6905  2.8  2.4 1054256 183492 ?      Sl   10:37   0:42 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6910  2.8  2.6 1067812 196344 ?      Sl   10:37   0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6911  2.7  2.6 1063680 191088 ?      Sl   10:37   0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        6950  0.8  1.9 1018536 145396 ?      Sl   10:38   0:11 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node        7833  0.0  0.0   4340   804 ?        Ss   10:59   0:00 sh
node        7918  0.0  0.0   4240   652 ?        S    11:01   0:00 sleep 3
node        7919  0.0  0.0  17508  2048 ?        R+   11:01   0:00 ps aux
node        7920  0.0  0.0   4396   716 ?        S+   11:01   0:00 cat

From the manual on ps:

S    interruptible sleep (waiting for an event to complete)
l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)

So I think what this shows is that the tests started running fine and spawned child processes to run in parallel, but for whatever reason, after accumulating 40 seconds or so of CPU time, those worker processes have all gone to sleep and are no longer doing anything.
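
For completeness, from that same shell we can also ask the kernel what the sleeping workers are actually blocked on (the PIDs below are the jest-worker processes from the listing above):

# which kernel function each jest-worker is sleeping in (futex wait, pipe wait, etc.)
ps -o pid,stat,wchan:30,etime,cmd -p 6881,6886,6898,6905,6910,6911

# more detail for a single worker: scheduler state and resident memory
grep -E 'State|VmRSS' /proc/6881/status
cat /proc/6881/wchan; echo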

We're pretty stumped as to how to investigate this further, particularly as we can't easily install whatever we like into the pod for diagnostics (no easy root access)... but any suggested theories / next steps would be welcomed!
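
For what it's worth, this is the kind of thing we can still run without root or any extra tooling; the cgroup paths below assume cgroup v1, and node-blah is just the placeholder pod name from above:

# from inside the pod: how close is the container to its memory limit?
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/memory.failcnt        # number of times the limit has been hit

# from outside the pod: live usage plus any recent events (OOM kills, evictions, restarts)
kubectl top pod node-blah
kubectl describe pod node-blah | grep -A 10 'Events:'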

** EDIT **

It seems idleMinutes wasn't the culprit, as we've seen the issue happen again several times today since reverting it. I've been able to verify that the script was running on a brand-new node in Kubernetes which hadn't been used by any other builds previously. So now I have no idea what's even changed recently to make this start happening :(

Alyssa

1 Answer

Having banged my head against this some more, I'm pretty confident that the root cause was the tests using excessive memory in the pod. We got lucky that for a few builds yesterday an ENOMEM error was printed amongst the logging before the build got stuck in the identical way. I can't explain why we weren't always seeing this (we went back and checked previous examples and it wasn't there), but that's what put us onto the right track.

Doing some more digging around, I happened to run kubectl top pods in time to catch one of the node pods going crazy: you can see that node-thk0r-5vpzk is using 3131Mi at this particular moment, and we'd set the memory limit on the pod to 3Gi:

[Screenshot: kubectl top pods output]
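
If you want to double-check what limit a pod actually ended up with (e.g. to confirm it matches the resources block in the node.yaml pod template), something like this should show it; the pod name is just the one from the screenshot:

# memory limit as seen in the live pod spec
kubectl get pod node-thk0r-5vpzk -o jsonpath='{.spec.containers[*].resources.limits.memory}'; echo

# or the human-readable version
kubectl describe pod node-thk0r-5vpzk | grep -A 3 'Limits:'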

Looking back at the corresponding build in Jenkins, I saw that it was now in the stuck state, but with no ENOMEM logging. Subsequent kubectl top pods commands showed the memory in node-thk0r-5vpzk had dropped back to a reasonable level, but clearly the damage was already done, as all the child processes were now sitting in the weird sleep state doing nothing.

This also (potentially) explains why the problem became far more common after I introduced the idleMinutes behaviour: if there's any sort of memory leak, then re-using the same pod over and over for npm test makes it more and more likely to hit the memory ceiling and freak out. Our fix for now has been to limit the number of Jest workers with the --maxWorkers setting, which keeps us well below our 3Gi limit. We're also planning to dig into the memory usage with --detectLeaks to see if there's something crazy in our tests that we can fix.
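
In case it helps anyone, the simplest place to apply this is the test invocation in the CI script; something along these lines (the worker count of 2 is illustrative; tune it to your pod's memory limit):

# cap Jest's worker processes so the pod stays well under its memory limit
npm test -- --coverage --maxWorkers=2

# run locally/occasionally: Jest's (experimental) leak detection to find misbehaving tests
npm test -- --detectLeaks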

Hoping this can help someone else if they see a similar problem. Just another day in the crazy DevOps world...

Alyssa