
Big title, I know, but it is a very specific issue.

I'm setting up a new Jenkins cluster and trying to use Docker-in-Docker containers to build images, unlike the current Jenkins cluster, which mounts that ugly-as-hell /var/run/docker.sock. The builds come from a monorepo containing several Dockerfiles, and they run in parallel.
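
To make the difference concrete, the change is roughly this (illustrative snippet; my-agent is a placeholder name):

    # Old cluster: the host's Docker socket is mounted straight into the agent container
    docker run -v /var/run/docker.sock:/var/run/docker.sock my-agent

    # New cluster: a privileged DinD sidecar exposes the daemon over TCP instead
    export DOCKER_HOST=tcp://localhost:2375
    docker build .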

The problem is, when building huge layers (for example, after a yarn install that downloads half of the internet), the step hangs at its final "Done in XX.XXs." line and never moves on to the next step, whatever that step is.

Sometimes a build passes (usually right after I change something in the cluster), but the following ones hang forever. When it passes, I can build 8 Node.js images in ~28 min; when it doesn't, the build hits the 60-minute pipeline timeout.
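
I haven't been able to tell what the daemon is doing while the build hangs. Something along these lines (the pod name is a placeholder) is how I'd inspect the DinD sidecar from outside while a build is stuck:

    # Run these while a build is hanging; <agent-pod> is a placeholder
    kubectl exec -it <agent-pod> -c docker -- docker ps
    kubectl exec -it <agent-pod> -c docker -- docker system df
    # BusyBox top, one batch iteration: is dockerd still burning CPU/IO?
    kubectl exec -it <agent-pod> -c docker -- top -b -n 1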

Here is some code showing how I'm doing this. All the other images follow the same template as the one provided.

  • Jenkins pod template:

    apiVersion: "v1"
    kind: "Pod"
    metadata:
      labels:
        name: "jnlp"
        jenkins/jenkins-jenkins-agent: "true"
    spec:
      containers:
      - env:
        - name: "DOCKER_HOST"
          value: "tcp://localhost:2375"
        image: "12345678910.dkr.ecr.us-east-1.amazonaws.com/kubernetes-agent:2.0" # internal image
        imagePullPolicy: "IfNotPresent"
        name: "jnlp"
        resources:
          limits:
            cpu: "1000m"
            memory: "1Gi"
          requests:
            cpu: "500m"
            memory: "500Mi"
        tty: true
        volumeMounts:
        - mountPath: "/home/jenkins"
          name: "workspace-volume"
          readOnly: false
        workingDir: "/home/jenkins"
      - args:
        - "--tls=false"
        env:
        - name: "DOCKER_BUILDKIT"
          value: "1"
        - name: "DOCKER_TLS_CERTDIR"
          value: ""
        - name: "DOCKER_DRIVER"
          value: "overlay2"
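        # NOTE: dockerd state (/var/lib/docker) lives on the "docker" emptyDir below,
        # so the local layer cache does not survive pod restarts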
        image: "docker:20.10.12-dind-alpine3.15"
        imagePullPolicy: "IfNotPresent"
        name: "docker"
        resources:
          limits:
            memory: "4Gi"
            cpu: "2"
          requests:
            memory: "1Gi"
            cpu: "500m"
        securityContext:
          privileged: true
        tty: true
        volumeMounts:
        - mountPath: "/var/lib/docker"
          name: "docker"
          readOnly: false
        - mountPath: "/home/jenkins"
          name: "workspace-volume"
          readOnly: false
        workingDir: "/home/jenkins"
      nodeSelector:
        spot: "true"
      restartPolicy: "Never"
      volumes:
      - emptyDir:
          medium: ""
        name: "docker"
      - emptyDir:
          medium: ""
        name: "workspace-volume"
    
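Connectivity to the daemon itself seems fine, since the builds do start. A quick sanity check from the jnlp container (where DOCKER_HOST is already set) would be something like:

    # Confirm the DinD daemon answers on tcp://localhost:2375
    docker version
    docker info --format '{{.ServerVersion}} / storage driver: {{.Driver}}'
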
  • Dockerfile

    # We don't use the Alpine image due to dependency issues
    FROM node:12.14.1-stretch-slim as base
    
    RUN apt-get update \
      && DEBIAN_FRONTEND=noninteractive apt-get -y install --no-install-recommends \
        apt-utils build-essential bzip2 ca-certificates cron curl g++ git libfontconfig make python \
      && update-ca-certificates \
      && apt-get autoremove -y \
      && apt-get clean \
      && rm -rf /tmp/* /var/tmp/* \
      && rm -f /var/log/alternatives.log /var/log/apt/* \
      && rm -rf /var/lib/apt/lists/* \
      && rm /var/cache/debconf/*-old
    
    ENV NODE_ENV development
    
    # Placed early to optimize layer caching
    EXPOSE 8043
    
    WORKDIR /opt/app
    RUN chown -R node:node /opt/app
    
    USER node
    
    COPY --chown=node:node package.json yarn.lock .yarnclean /opt/app/
    COPY 100-wkhtmltoimage-special.conf /etc/fonts/conf.d/
    
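    # Generous network timeout for the huge install; autoclean + cache clean shrink the layer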
    RUN yarn config set network-timeout 600000 -g && \
        yarn --frozen-lockfile && \
        yarn autoclean --force && \
        yarn cache clean
    
    FROM base as dev
    
    # Legacy --debug port and V8 inspector port
    EXPOSE 5858 9229
    COPY --chown=node:node . /opt/app
    RUN npx gulp build && sh ./app-ssl
    
    FROM base as prod
    
    COPY --from=dev /opt/app /opt/app
    
    # Like `npm prune --production`
    RUN yarn --production --ignore-scripts --prefer-offline
    
    CMD ["yarn", "start"]
    
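Judging by the log at the end of this post, yarn itself finishes inside the big RUN of the base stage, and the hang starts right after, which looks more like the layer commit/export than the install itself. A way to isolate that stage would be to build only the base target (the tag is a placeholder):

    # Build just the heavy stage on its own
    docker build --target base --tag name-of-my-image:base-debug .
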
  • The command:

    docker build \
      --network host --force-rm \
      --build-arg BUILDKIT_INLINE_CACHE=1 \
      --cache-from 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:latest \
      --cache-from 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:latest-dev \
      --cache-from 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:${VERSION} \
      --cache-from 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:${VERSION}-dev \
      --tag 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:${VERSION}-dev \
      --tag 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:latest-dev \
      --target dev .
    
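One thing I'm not sure about: DOCKER_BUILDKIT=1 is set on the DinD container in the pod template, but as far as I know it is a client-side variable, so it may not affect the docker build that runs from the jnlp container. A check like this, run where the build happens, would show which builder is active:

    # BuildKit prefixes output lines with "#<n>"; the classic builder prints "Step X/Y"
    DOCKER_BUILDKIT=1 docker build --target dev . 2>&1 | head -n 5
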
  • The end of the log:

    ...
    [2022-01-18T19:37:19.928Z] [4/5] Building fresh packages...
    [2022-01-18T19:37:19.928Z] [5/5] Cleaning modules...
    [2022-01-18T19:37:34.774Z] Done in 486.04s.
    [2022-01-18T19:37:34.774Z] yarn autoclean v1.21.1
    [2022-01-18T19:37:34.774Z] [1/1] Cleaning modules...
    [2022-01-18T19:37:46.952Z] info Removed 0 files
    [2022-01-18T19:37:46.952Z] info Saved 0 MB.
    [2022-01-18T19:37:46.952Z] Done in 12.85s.
    [2022-01-18T19:37:46.952Z] yarn cache v1.21.1
    [2022-01-18T19:38:13.453Z] success Cleared cache.
    [2022-01-18T19:38:13.453Z] Done in 24.21s.
    [2022-01-18T20:28:51.170Z] make: *** [Makefile:21: build-dev] Terminated <=== Pipeline timeout! Note the ~50-minute gap since the previous log line.
    script returned exit code 2
    

If anyone needs any more information, please let me know. Thanks!

  • Which Kubernetes version are you using? Do you have any issues running Kubernetes pods - are they creating properly, are they in running state, finishing properly? – Mikolaj S. Jan 20 '22 at 13:03
  • Using K8s 1.21 (EKS). The Pods run just fine, this Kubernetes cluster only has some managing apps (cluster-autoscaler, prometheus, cert-manager, etc.) and Jenkins. All agent pods are finishing properly as well. I'm testing right now if it works without parallelism to see if this is the issue. – Igor Brites Jan 20 '22 at 14:32
  • It works without parallel jobs =/ So I'll build that way while I figure out why the parallel jobs don't work. If anyone can help, please let me know. – Igor Brites Jan 21 '22 at 19:48
