A Hands-on Tekton CI/CD Pipeline Troubleshooting Journey

Pradipta Banerjee
Published in ITNEXT
14 min read · Nov 14, 2023

In this blog, I will share my experience debugging an issue related to Tekton CI/CD pipelines.

I faced this issue while working on the confidential containers project using the Kata/remote-hypervisor approach.

Before delving further, let me provide a quick primer on Tekton pipelines. The open-source Tekton project provides a Kubernetes-native framework to design and run your CI/CD pipelines.

An essential aspect of Tekton pipelines is that each CI/CD pipeline step runs as a container, allowing each step to be customised as required.

A typical CI/CD pipeline is a set of tasks, each task a set of steps. Further, a task runs as a Kubernetes pod, and each task step is a separate container in the pod.

(Diagram: the relationship between pipeline, task, step, and pod.)

The task running as a Kubernetes pod makes it possible to customise each step w.r.t. container resource requirements, runtime configuration, security policies, attached sidecars, etc. You can read more about using this flexibility to create secure pipelines in an earlier blog.
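For illustration, here is a minimal, hypothetical Task showing per-step customisation. The step names, images, and settings below are made up for this example and are not from the pipeline discussed in this post:

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: example-task
spec:
  steps:
    # step 1: one container in the task pod, with its own resource requests
    - name: run-tests
      image: docker.io/library/bash
      resources:
        requests:
          memory: 512Mi
          cpu: 250m
      script: |
        echo "running tests"
    # step 2: a different container with a stricter security context
    - name: package
      image: docker.io/library/bash
      securityContext:
        allowPrivilegeEscalation: false
      script: |
        echo "packaging artefacts"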

If you are wondering why I was interested in verifying Tekton pipelines using Kata or confidential containers, let me briefly state the reason here.

Isolated and secure CI/CD pipelines are essential use cases for Kata and confidential containers.

Let’s consider a pipeline task requiring privileged capabilities for executing some test cases. Using Kata containers to run this task protects the host (worker node) from the workload (pod) and provides the guardrail to run privileged operations safely in the cluster.

Kata container protecting the host from the workload

Similarly, if your pipeline task uses artefact signing keys and you want to protect the keys from the host (worker node), then you need to run this task as a confidential container. Running the pipeline task as a confidential container technically assures that the host or any entity outside of the confidential environment will not have access to the keys. Confidential containers based on Kata containers protect both the host from the workload and the workload from the host.

Confidential container protecting the workload from the host

With this background, let’s dive into the problem and start the debugging journey.

Problem

The Tekton pipeline never completes when using Kata/remote-hypervisor.

I executed the sample pipeline described in the next section to test with Kata/remote-hypervisor, and it never completed.

There were no logs and absolutely no clue as to what was happening (or not happening).

I typically take the following approach to debug such scenarios where an application works fine with one container runtime but doesn’t work with another.

  1. Understand the relationship between different components involved with the application.
  2. Create a working scenario for the application.
  3. Understand and map the working scenario to the failure scenario to gain insights.

Let’s go through each of the steps in this debugging journey.

Understanding the different components involved

The key component here is the Tekton pipeline, which is built from Tekton tasks. Let’s start with the Tekton task definition.

As the spec shows, the kaniko Tekton task definition consists of two steps.

Src: https://github.com/bpradipt/kata-demos/blob/main/pipelines/kaniko-task.yaml

Steps in the kaniko task definition:

  1. build-and-push
  2. write-url

The build-and-push step uses the container image “gcr.io/kaniko-project/executor”. The write-url step uses the container image “docker.io/library/bash”.
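For reference, here is a condensed sketch of the steps section. The parameter and result names follow the upstream Tekton catalog kaniko task; see the linked file above for the exact definition:

steps:
  - name: build-and-push
    image: gcr.io/kaniko-project/executor
    args:
      - --dockerfile=$(params.DOCKERFILE)
      - --context=$(params.CONTEXT)
      - --destination=$(params.IMAGE)
      - --digest-file=$(results.IMAGE_DIGEST.path)
    # no "command" attribute: the image's own entrypoint (/kaniko/executor) applies
  - name: write-url
    image: docker.io/library/bash
    script: |
      set -e
      echo "$(params.IMAGE)" | tee "$(results.IMAGE_URL.path)"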

The kaniko task is part of the build-push pipeline, as can be inferred from the pipeline spec.

Src: https://github.com/bpradipt/kata-demos/blob/main/pipelines/build-push-pipeline.yaml
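A condensed sketch of the pipeline follows. The pipeline task name kata-build-push is inferred from the pod name shown later, and the parameter wiring is abbreviated; check the linked spec for the exact definition:

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-push
spec:
  params:
    - name: image-reference
      type: string
  tasks:
    # the kaniko task from above, run as a pipeline task
    - name: kata-build-push
      taskRef:
        name: kaniko
      params:
        - name: IMAGE
          value: $(params.image-reference)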

And the pipeline is executed by creating a pipelinerun object. The pipelinerun spec is shown below.

Src: https://github.com/bpradipt/kata-demos/blob/main/pipelines/pipelinerun.yaml

Note the use of runtimeClassName, which points to kata-remote-cc, the runtimeClassName configured for Kata/remote-hypervisor in my setup.
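A condensed sketch of the pipelinerun spec (the registry-credential and workspace wiring is omitted for brevity; the parameter name is illustrative):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: build-push-run
  namespace: kata-pipelines
spec:
  pipelineRef:
    name: build-push
  # podTemplate applies to every task pod created by this pipelinerun
  podTemplate:
    runtimeClassName: kata-remote-cc
  params:
    - name: image-reference
      value: quay.io/bpradipt/test-build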

Let’s see how this maps to the actual deployment by creating the pipelinerun object.

kubectl apply -f ns.yaml
kubectl apply -f registry-secret.yaml
kubectl apply -f kaniko-task.yaml
kubectl apply -f build-push-pipeline.yaml
kubectl apply -f pipelinerun.yaml

The following output shows the pipelinerun pod.

$ kubectl get pods -n kata-pipelines

NAME                                 READY   STATUS    RESTARTS   AGE
build-push-run-kata-build-push-pod   2/2     Running   0          24s

Looking at the pod spec showed the following containers constituting the pod:

$ kubectl get pod -n kata-pipelines build-push-run-kata-build-push-pod -o yaml
  • prepare (init)
  • place-scripts (init)
  • step-build-and-push
  • step-write-url

So, step-build-and-push and step-write-url are the two steps. Tekton seems to add the prefix “step-” to the steps.

Following is the snippet of the pipelinerun pod spec (kept only relevant entries for brevity).

pipelinerun pod spec
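Roughly, the relevant parts look as follows (a reconstructed sketch; in the real spec the bash image is also pinned by digest):

initContainers:
  - name: prepare        # Tekton-injected: entrypoint initialisation
  - name: place-scripts  # Tekton-injected: decodes step scripts into /tekton/scripts
containers:
  - name: step-build-and-push
    image: gcr.io/kaniko-project/executor@sha256:01531afa...
  - name: step-write-url
    image: docker.io/library/bash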

With this understanding, let’s move to the next step of creating a working scenario.

Create a working scenario

I deployed the pipeline using runc containers by commenting out the podTemplate and the runtimeClassName entry in the pipelinerun spec.

Src: https://github.com/bpradipt/kata-demos/blob/main/pipelines/pipelinerun.yaml

Post-deployment, the first step I followed was to observe the container logs to understand what was happening.

The output below shows that the init containers performed some initialisation work, followed by executing the step-build-and-push.

$ kubectl logs -n kata-pipelines build-push-run-kata-build-push-pod -c prepare

Entrypoint initialization

$ kubectl logs -n kata-pipelines build-push-run-kata-build-push-pod -c place-scripts

Decoded script /tekton/scripts/script-1-ks46d


$ kubectl logs -n kata-pipelines build-push-run-kata-build-push-pod -c step-build-and-push

Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.

[snip]

I wanted to exec a shell into the container to examine the processes. However, the default gcr.io/kaniko-project/executor image didn’t have a shell. So I went to the kaniko project website to determine available debug options. I found a reference to the debug image, which provides a busybox shell.

So, I switched the image in the kaniko task definition to use the debug image and recreated the pipelinerun object.

kaniko task definition using debug builder image
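The change is just the image reference of the build-and-push step; the kaniko project publishes the debug variant under the :debug tag:

steps:
  - name: build-and-push
    image: gcr.io/kaniko-project/executor:debug  # debug variant includes a busybox shell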

Looking at the processes inside the step-build-and-push container shows the following:

$ kubectl exec -it -n kata-pipelines build-push-run-kata-build-push-pod -c step-build-and-push -- ps

PID USER TIME COMMAND
1 0 0:00 /tekton/bin/entrypoint -wait_file /tekton/downward/ready -wait_file_content -post_file /tekton/run/0/out -termination_path /tekton/termination -step_metadata_dir /tek
14 0 0:00 /kaniko/executor --dockerfile=Dockerfile --context=git://github.com/bpradipt/container-build.git --destination=quay.io/bpradipt/test-build --digest-file=/tekton/result
22 0 0:00 ps
$ kubectl exec -it -n kata-pipelines build-push-run-kata-build-push-pod -c step-build-and-push -- pstree -p

entrypoint(1)-+-executor(14)-+-{executor}(15)
              |              |-{executor}(16)
              |              |-{executor}(17)
              |              |-{executor}(18)
              |              |-{executor}(19)
              |              `-{executor}(20)
              |-{entrypoint}(7)
              |-{entrypoint}(8)
              |-{entrypoint}(9)
              |-{entrypoint}(10)
              |-{entrypoint}(11)
              |-{entrypoint}(12)
              `-{entrypoint}(13)

You can see two relevant processes:

  1. “/tekton/bin/entrypoint”
  2. “/kaniko/executor”

The PID 1 process indicates that the associated program (“/tekton/bin/entrypoint”) must either be explicitly mentioned in the pod spec or must be the entrypoint of the container image used in the pod.

I went back to the pod spec and searched for the “/tekton/bin/entrypoint” program, and indeed it was there. The following snippet shows the relevant entries from the pod spec.
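This is a reconstruction based on the ps output above; flags truncated there are truncated here too:

containers:
  - name: step-build-and-push
    command:
      - /tekton/bin/entrypoint
    args:
      - -wait_file
      - /tekton/downward/ready
      - -wait_file_content
      - -post_file
      - /tekton/run/0/out
      - -termination_path
      - /tekton/termination
      # ...remaining flags truncated in the ps output...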

There is also another program, “/kaniko/executor”, which is spawned by the entrypoint program (“/tekton/bin/entrypoint”).

I remembered seeing the “/kaniko/executor” program in the pod spec. So I went back to look at the pod spec again and searched for “/kaniko/executor” and found that “/kaniko/executor” is mentioned as part of the “TEKTON_PLATFORM_COMMANDS” env variable.

env:
- name: TEKTON_PLATFORM_COMMANDS
  value: '{"linux/amd64":["/kaniko/executor"],"linux/arm64":["/kaniko/executor"],"linux/s390x":["/kaniko/executor"]}'
- name: SSL_CERT_DIR
  value: /tekton-custom-certs:/etc/ssl/certs:/etc/pki/tls/certs:/system/etc/security/cacerts
image: gcr.io/kaniko-project/executor@sha256:01531afa95baf57abf975f7fd794b8ddac453f10a1b5e4878a9df2a666724205

Further, the arguments to “/kaniko/executor”, as seen in the “ps” output, match with some of the options mentioned under the args attribute in the pod spec. The following snippet shows the relevant options that were also present in the “ps” output.

pipelinerun pod spec snippet showing the options for kaniko-executor
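Reconstructed from the pod spec and the ps output (the entrypoint wrapper flags are omitted here; more on those shortly):

args:
  # ...entrypoint wrapper flags...
  - --dockerfile=Dockerfile
  - --context=git://github.com/bpradipt/container-build.git
  - --destination=quay.io/bpradipt/test-build
  # the --digest-file path is truncated in the ps output above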

As a logical next step, I wanted to understand the pipeline code, the entrypoint program and discover the purpose of the TEKTON_PLATFORM_COMMANDS environment variable.

The pipeline code is available in the tektoncd/pipeline repository: https://github.com/tektoncd/pipeline

And here you experience the sheer beauty of open source. It’s a liberating experience to view and understand any code without barriers. It speeds up debugging and problem resolution.

Searching the code for TEKTON_PLATFORM_COMMANDS pointed out two important places: pkg/pod/entrypoint_lookup.go and cmd/entrypoint/main.go.

Looking at the first result, the comment for the method where the environment variable is appended is interesting. Also, you can see how the TEKTON_PLATFORM_COMMANDS environment variable is prepared.

https://github.com/tektoncd/pipeline/blob/f975869e7da8ca128aafe3df9a69c3f7450ab46b/pkg/pod/entrypoint_lookup.go#L94

So I went back to the steps defined in the kaniko task, and indeed, no command is specified in the spec, as can be seen from the task definition linked below. There is only an args attribute.

Src: https://github.com/bpradipt/kata-demos/blob/main/pipelines/kaniko-task.yaml#L44-L58

So, it’s confirmed (from the code and the task definition) that Tekton adds this environment variable to the pod spec.

However, this still doesn’t explain how the program mentioned as part of the environment variable TEKTON_PLATFORM_COMMANDS gets executed by the entrypoint program.

The next logical debug target was the entrypoint program. Also, the second search result for the TEKTON_PLATFORM_COMMANDS variable points to the file “cmd/entrypoint/main.go”, which at first glance looks to be the code for the entrypoint program.

Looking at the README under the “cmd/entrypoint” directory fills in a lot of the missing pieces:

The entrypoint binary is used to override the entrypoint of a container by wrapping it and executing the original entrypoint command in a subprocess.

Tekton uses this to make sure TaskRuns’ steps are executed in order, only after sidecars are ready and previous steps have completed successfully.

The following flags are available:

-entrypoint: “original” command to be executed (as entrypoint). This will be executed as a sub-process on entrypoint

-post_file: file path to write once the sub-process has finished. If the sub-process failed, it will write to {{post_file}}.err instead of {{post_file}}.

-wait_file: file path to watch before starting the sub-process. It watches for {{wait_file}} and {{wait_file}}.err presence and will either execute the sub-process (in case of {{wait_file}}) or skip the execution, write to {{post_file}}.err and return an error (exitCode >= 0)

-wait_file_content: expects the wait_file to contain actual contents. It will continue watching for wait_file until it has content.

-stdout_path: If specified, the stdout of the sub-process will be copied to the given path on the local filesystem.

-stderr_path: If specified, the stderr of the sub-process will be copied to the given path on the local filesystem. It can be set to the same value as {{stdout_path}} so both streams are copied to the same file. However, there is no ordering guarantee on data copied from both streams.

-enable_spire: If set will enable signing of the results by SPIRE. Signing results by SPIRE ensures that no process other than the current process can tamper the results and go undetected.

-spire_socket_path: This flag makes sense only when enable_spire is set. When enable_spire is set, spire_socket_path is used to point to the SPIRE agent socket for SPIFFE workload API.

Now, mapping this description to the actual command for the pipeline task from the pod spec, you can infer the following:

  • step-1 (container step-build-and-push) waits for file content in the “/tekton/downward/ready” file.
  • step-1 (container step-build-and-push) writes the output to “/tekton/run/0/out”.
  • step-2 (container step-write-url) waits for the file “/tekton/run/0/out”. Note that it doesn’t wait for the file content; there is no wait_file_content option, as shown in the step-2 snippet further below.
  • step-2 (container step-write-url) writes the output to “/tekton/run/1/out”.

With this information, you get an idea about the execution order of the two steps in the kaniko task.

Also, I expected to see the -entrypoint option under args in the pod spec, since the -entrypoint argument specifies the original command that the entrypoint program executes as part of the kaniko task.

For step-1 (container step-build-and-push), I don’t see any -entrypoint option. This is the relevant snippet from the pod spec.

pipelinerun pod spec snippet for container step-build-and-push
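A reconstructed sketch of the step-1 args:

args:
  - -wait_file
  - /tekton/downward/ready
  - -wait_file_content
  - -post_file
  - /tekton/run/0/out
  - -termination_path
  - /tekton/termination
  # note: no -entrypoint flag for this step
  - --
  - --dockerfile=Dockerfile
  - --context=git://github.com/bpradipt/container-build.git
  - --destination=quay.io/bpradipt/test-build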

However, for step-2 (container step-write-url) an entrypoint option is present, as can be seen in the snippet below:

pipelinerun pod spec snippet for container step-write-url
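Again a reconstructed sketch; the script file suffix is random and varies per run:

args:
  - -wait_file
  - /tekton/run/0/out
  # note: no -wait_file_content flag for this step
  - -post_file
  - /tekton/run/1/out
  - -termination_path
  - /tekton/termination
  - -entrypoint
  - /tekton/scripts/script-1-vwnl2  # decoded by the place-scripts init container
  - --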

So what is happening here?

From the previous steps, you know that TEKTON_PLATFORM_COMMANDS contains the actual program that gets executed, but what connects the “/tekton/bin/entrypoint” to “/kaniko/executor”, i.e., the value for TEKTON_PLATFORM_COMMANDS in step-1?

This is where, again, the entrypoint code comes to the rescue. From the code, you see that if the entrypoint option is missing, then the value of TEKTON_PLATFORM_COMMANDS is used.

Src: https://github.com/tektoncd/pipeline/blob/main/cmd/entrypoint/main.go#L124

Further, the following lines from the entrypoint code indicate how arguments get processed for the entrypoint option.

Src: https://github.com/tektoncd/pipeline/blob/main/cmd/entrypoint/main.go#L91-L92

So “/kaniko/executor” executes with the options that appear in the pod spec after the “--” separator, as visible in the step-1 args snippet above.

So, to summarise, a working flow of the build-push Tekton pipeline execution with kaniko task consists of the following:

  1. step-1 (container step-build-and-push) waits for file content in the “/tekton/downward/ready” file.
  2. step-1 (container step-build-and-push) executes “/kaniko/executor” with specific arguments when file content is available. This is the actual execution of the step.
  3. step-1 (container step-build-and-push) writes the output to “/tekton/run/0/out”.
  4. step-2 (container step-write-url) waits for the file “/tekton/run/0/out”. Note that it doesn’t wait for the file content.
  5. step-2 (container step-write-url) executes a script “/tekton/scripts/script-1-vwnl2” when the file “/tekton/run/0/out” is available.
  6. step-2 (container step-write-url) writes the output to “/tekton/run/1/out”.

You can easily generalise the above flow for other pipelines with different tasks.

With this understanding, let’s look at the problematic scenario — running with Kata/remote-hypervisor (i.e., runtimeClassName: kata-remote-cc) and the cause of the issue.

Understand and map the working scenario to the failure scenario to gain insights

Starting with the logs, the step-build-and-push container logs were empty.

$ kubectl logs -n kata-pipelines build-push-run-kata-build-push-pod -c step-build-and-push

<empty>

Let’s look at the processes inside the step-build-and-push container.

$ kubectl exec -it -n kata-pipelines build-push-run-kata-build-push-pod -c step-build-and-push -- ps

PID USER TIME COMMAND
1 0 0:00 /tekton/bin/entrypoint -wait_file /tekton/downward/ready -wait_file_content -post_file /tekton/run/0/out -termination_path /tekton/termination -step_metadata_dir /tek
6 0 0:00 ps

Compared to the working scenario, you see that the “/kaniko/executor” process is not running. Either it’s not started, or it died.

From the previous descriptions, you know that “/kaniko/executor” will be started once there is content in the file “/tekton/downward/ready”.

So, I returned to the working and non-working cases to look at “/tekton/downward/ready” content.

In the working case, the text “READY” was written to the “/tekton/downward/ready” file. However, in the case of Kata/remote-hypervisor (peer-pods), the “/tekton/downward/ready” file remains empty, with no text written to it. As a result, the “wait_file_content” never completes, preventing the execution of step-1.

The following section from the entrypoint README also confirms this.

Waiting for Sidecars

In cases where the TaskRun’s Pod has sidecar containers — including, possibly, injected sidecars that Tekton itself didn’t specify — the first step should also wait until all those sidecars have reported as ready. Starting before sidecars are ready could lead to flaky errors if steps rely on the sidecar being ready to succeed.

To account for this, the Tekton controller starts TaskRun Pods with the first step’s entrypoint binary configured to wait for a special file provided by the Kubernetes Downward API. This allows Tekton to write a Pod annotation when all sidecars report as ready, and for the value of that annotation to appear to the Pod as a file in a Volume. To the Pod, that file always exists, but without content until the annotation is set, so we instruct the entrypoint to wait for the -wait_file to contain contents before proceeding.

So, the entrypoint program waits for contents in a special file provided by the Kubernetes Downward API.

Our previous understanding shows that the entrypoint binary waits for contents in the “/tekton/downward/ready” file.

Looking at the pod spec confirms that this file is provided by the Kubernetes Downward API, as shown in the snippet below:
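The volume definition looks roughly like this (a sketch; verify the exact volume name and annotation key against your own pod spec):

volumes:
  - name: tekton-internal-downward
    downwardAPI:
      items:
        - path: ready
          fieldRef:
            # the file gets content only when this annotation is set on the pod
            fieldPath: metadata.annotations['tekton.dev/ready']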

So, “/tekton/downward/ready” should have the content “READY”. But it was empty. This points to an issue with the Kubernetes Downward API, as the file contents were not visible inside the pod.

I decided to create a reproducer to verify this line of thought.

I used a sample from the Kubernetes docs and modified it to add runtimeClassName, as shown below.

apiVersion: v1
kind: Pod
metadata:
  name: kubernetes-downwardapi-volume-example
  labels:
    zone: us-est-coast
    cluster: test-cluster1
    rack: rack-22
  annotations:
    build: two
    builder: john-doe
spec:
  runtimeClassName: kata-remote-cc
  containers:
    - name: client-container
      image: registry.k8s.io/busybox
      command: ["sh", "-c"]
      args:
        - while true; do
          if [[ -e /etc/podinfo/labels ]]; then
          echo -en '\n\n'; cat /etc/podinfo/labels; fi;
          if [[ -e /etc/podinfo/annotations ]]; then
          echo -en '\n\n'; cat /etc/podinfo/annotations; fi;
          sleep 5;
          done;
      volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: "labels"
            fieldRef:
              fieldPath: metadata.labels
          - path: "annotations"
            fieldRef:
              fieldPath: metadata.annotations

The annotation data was visible in the file inside the pod, meaning that the Kubernetes downward API was working. What was the issue with the pipeline task?

I returned to the entrypoint program documentation and re-read the whole thing. The following section from the documentation provided the required insights:

To account for this, the Tekton controller starts TaskRun Pods with the first step’s entrypoint binary configured to wait for a special file provided by the Kubernetes Downward API. This allows Tekton to write a Pod annotation when all sidecars report as ready, and for the value of that annotation to appear to the Pod as a file in a Volume. To the Pod, that file always exists, but without content until the annotation is set, so we instruct the entrypoint to wait for the -wait_file to contain contents before proceeding.

According to the documentation, the pod annotation is only written when all the sidecars report being ready. Therefore, to reproduce the issue, I updated the annotation value after creating the pod and checked if the changes were visible inside the pod.

During my testing, I edited the running pod and modified the annotation value, but found that the changes made to pod labels/annotations were not being propagated when using Kata/remote-hypervisor. For regular pods, however, the changes were propagated.
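The check itself was simple; something along the following lines, assuming the sample pod above is running in the default namespace:

$ kubectl annotate pod kubernetes-downwardapi-volume-example build=three --overwrite
$ kubectl exec kubernetes-downwardapi-volume-example -- cat /etc/podinfo/annotations

With runc, the updated value shows up in the annotations file after the kubelet’s next sync; with kata-remote-cc, it never appeared.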

This verification confirmed that the root cause of the issue was that the changes to annotations and labels were not propagated when using Kata/remote-hypervisor.

The next step was working on a fix. Working on the fix was comparatively easier since the cause was identified. I submitted a PR in the Kata repo to fix this issue, and it’s fixed now.

Conclusion

I hope the approach described here helps you debug issues with Tekton pipelines.

I have learnt that when debugging issues, there is no substitute for understanding the complete technology stack. Navigating the technology stack to uncover its details and finally resolving an issue is a one-of-a-kind experience. I hope you’ll like it as well :-).

If you have any questions, please feel free to contact me.
