Reproducibility of complete build instructions

This proposal outlines what information is required in the provenance to reproduce complete build instructions for taskruns. It also suggests where in the provenance to store this information.
tektoncd · Sep 20, 2022 · 406d659 · 406d659
1 parent 5630eaf
commit 406d659

Showing 2 changed files with 280 additions and 0 deletions.
diff --git a/teps/0122-reproducibility-of-complete-build-instructions.md b/teps/0122-reproducibility-of-complete-build-instructions.md
@@ -0,0 +1,279 @@
+---
+status: proposed
+title: Reproducibility of Complete Build Instructions
+creation-date: '2022-09-14'
+last-updated: '2022-09-14'
+authors:
+- '@chitrangpatel'
+see-also:
+---
+
+# TEP-0122: Reproducibility of Complete Build Instructions
+
+<!-- toc -->
+- [Summary](#summary)
+- [Motivation](#motivation)
+- [Requirements](#requirements)
+- [Proposal](#proposal)
+- [Out-of-scope](#out-of-scope)
+- [Future Work](#future-work)
+<!-- /toc -->
+
+## Summary
+This proposal outlines what information is required in the [provenance](#provenance-for-executed-taskrun) to reproduce complete build instructions for TaskRuns. It also suggests where in the `SLSA v0.2` [provenance](https://slsa.dev/provenance/v0.2) to store this information.
+
+## Motivation
+
+[Tekton Chains will report provenance as reproducible if a specific annotation is included in the TaskRun](https://docs.google.com/document/d/12yYP0M_zz5Tc9ftGnLwhDZ_tdZBq4mEod70ngWtHmLs/edit#bookmark=id.1sm6isih1d0n), and it includes Task steps and parameter values in the provenance it generates. However, it ignores much of the Task spec (e.g. sidecar definitions) and omits runtime information (e.g. workspaces provided at runtime).
+
+For example, the following provenance snippet was generated by Tekton Chains ([v0.8.0](https:/tektoncd/chains/releases/tag/v0.8.0)) from [this chains example Task](https:/tektoncd/chains/blob/main/examples/kaniko/kaniko.yaml). It contains step information (including the shas of the images run) but is missing other information, such as the workspaces and the volumes that backed them:
+
+```yaml
+invocation:
+  configSource: {}
+  parameters:
+    BUILDER_IMAGE: gcr.io/kaniko-project/executor:v1.5.1@sha256:c6166717f7fe0b7da44908c986137ecfeab21f31ec3992f6e128fff8a94be8a5
+    CONTEXT: ./
+    DOCKERFILE: ./Dockerfile
+    EXTRA_ARGS: '[]'
+    IMAGE: '{string us.gcr.io/christiewilson-catfactory/kaniko-chains []}'
+buildConfig:
+  steps:
+  - entryPoint: |
+      set -e
+      echo "FROM alpine@sha256:69e70a79f2d41ab5d637de98c1e0b055206ba40a8145e7bddb55ccc04e13cf8f" | tee $(params.DOCKERFILE)
+    arguments: null
+    environment:
+      container: add-dockerfile
+      image: docker-pullable://bash@sha256:fc742d0c3d9d8f5fb2681062398c04b710cd08c46dac1a8f0a5515687018acb9
+    annotations: null
+  - entryPoint: ""
+    arguments:
+    - $(params.EXTRA_ARGS)
+    - --dockerfile=$(params.DOCKERFILE)
+    - --context=$(workspaces.source.path)/$(params.CONTEXT)
+    - --destination=$(params.IMAGE)
+    - --digest-file=$(results.IMAGE_DIGEST.path)
+    environment:
+      container: build-and-push
+      image: docker-pullable://gcr.io/kaniko-project/executor@sha256:68bb272f681f691254acfbdcef00962f22efe2f0c1e287e6a837b0abe07fb94b
+    annotations: null
+  - entryPoint: |
+      set -e
+      echo $(params.IMAGE) | tee $(results.IMAGE_URL.path)
+    arguments: null
+    environment:
+      container: write-url
+      image: docker-pullable://bash@sha256:fc742d0c3d9d8f5fb2681062398c04b710cd08c46dac1a8f0a5515687018acb9
+    annotations: null
+```
+
+## Requirements
+- It must be possible to use the provenance generated by Tekton Chains for an artifact to create a bit-for-bit identical (best effort) artifact.
+ - Much of this will depend on the Task itself; note that the SLSA L4 requirement is "best effort".
+ - The provenance should contain all of the Task definition and runtime information required to reproduce the build.
+- The shas of the images run should continue to be included regardless (as this is part of the dependencies / build instructions).
+- It must be possible for users to construct policies based on the provenance that would allow them to determine if the build is ok - for example if they have policies around what pipeline tasks are acceptable to use, what parameters are acceptable, whether or not sidecars are allowed etc.
+ - It is acceptable that a policy might need to fetch the source tasks and pipelines from version control (or wherever they are stored) but regardless it must be possible for a policy engine to make the decisions it needs from the provenance (and values referenced by the provenance)
+
+## Proposal
+In order to reproduce the build, we need to be able to extract enough information from the provenance to recreate the task run or pipeline run under the same configuration that was used when creating the build. The list below shows which fields are required in the provenance for re-creating the task run.
+- [Task Specification](#task-specification)
+- [TaskRef](#taskref)
+- [TaskRun Specification](#taskrun-specification)
+- [Step Specification](#step-specification)
+- [Configuration feature flags](#configuration-feature-flags)
+- [Reproducibility Reason Field if not reproducible](#reproducibility-reason-field-if-not-reproducible) 
+- [Tekton Pipelines and Chains Version](#tekton-pipelines-and-chains-version)
+
+This will allow us to create a provenance for a task run as shown in [Provenance for executed TaskRun](#provenance-for-executed-taskrun).
+
+**Generating provenance from the pipeline run is still a [WIP](https:/tektoncd/chains/pull/436). Therefore, this proposal only covers reproducibility of task runs.**
+
+### Task Specification
+The table below provides the name and description of the API fields that the Task requires. It also lists whether that information is required, not required, or already provided by the provenance. If it is required, it also shows where in the provenance the information should be stored.
+
+| | **Metadata** | | |
+| ------ | ------------- | ----- | ----- |
+| **Field Name** | **Description** | **Not Required / Required / Provided by provenance** | **Insert in provenance** |
+| name | Name of the task | Required (for pipeline/task run when referenced) | buildConfig |
+| | **Spec** | | |
+| resources | resources used by steps | Provided (can be extracted from steps/entrypoint) | |
+| description | description of the task | Not Required | |
+| params | parameters required by task | Provided | |
+| results | results produced by the task | Required (at least name, type, and properties in the case of object results) | buildConfig |
+| volumes | volume mounted on the container | Required | buildConfig |
+| workspaces | workspace bindings used by the task | Required | buildConfig |
+| steps | Steps performed by the task | Provided | |
+| sidecars | Sidecars running alongside tasks | Required | buildConfig |
+| step template | Specifies a container configuration that will be used as the starting point for all of the Steps in your Task | Required | buildConfig |
+
+### TaskRef
+
+TaskRef is useful for references to existing tasks on the cluster or remote tasks (git, OCI bundles and catalog). However, if users need to reproduce the build on a different cluster, then those tasks are not guaranteed to be the same. Additionally, we have access to the complete taskSpec from the resolver for the referenced task.
+
+For this reason, we choose to store the taskSpec in the buildConfig in order to accurately reproduce the task run. If the reference is made to a remote task (i.e. git, OCI bundles or catalog), then we also store the reference information (e.g. git url, commit-sha, etc.) in invocation.configSource’s URI, Digest and Entrypoint fields.
+
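+As a rough illustration of the remote-task case, the snippet below sketches how a git reference might be recorded in `invocation.configSource`. The field spellings follow the SLSA v0.2 schema (`uri`, `digest`, `entryPoint`); the repository URL, commit sha and file name are hypothetical placeholders, not values produced by Chains.
+
+```yaml
+invocation:
+  configSource:
+    # Hypothetical repository that holds the referenced Task.
+    uri: git+https://example.com/org/tasks.git
+    digest:
+      # Hypothetical git commit sha that the resolver fetched.
+      sha1: 0123456789abcdef0123456789abcdef01234567
+    # Hypothetical path of the Task definition inside the repository.
+    entryPoint: task/kaniko/kaniko.yaml
+```
+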
+### TaskRun Specification
+
+The table below provides the name and description of the API fields that the TaskRun requires. It also lists whether that information is required, not required, or already provided by the provenance. If it is required, it also shows where in the provenance the information should be stored.
+
+| | **Metadata** | | |
+| ------ | ------------- | ----- | ----- |
+| **Field Name** | **Description** | **Not Required / Required / Provided by provenance** | **Insert in provenance** |
+| name | Name of the task run | Required (provides an audit trail) | |
+| | **Spec** | | |
+| resources | resources used by task | Provided (can be extracted from steps/entrypoint) | |
+| service account name | name of the service account | Required | buildConfig |
+| params | parameter values provided by taskrun | Provided | |
+| workspaces | workspaces used by the task | Required | buildConfig |
+| pod template | | Required | buildConfig |
+| Timeout | time in which a task should complete | Not Required (can probably use default?) | |
+| StepOverrides | Override Step configuration specified in a Task. Currently we can override compute resources. | Not Required | |
+| SidecarOverrides | Override Sidecar configuration specified in a Task. Currently we can override compute resources. | Not Required | |
+| ComputeResources | Configure compute resources required by the steps in the task. | Required (because this could also indicate a minimum amount of resources required for the task run.) | buildConfig |
+| [TaskSpec](#task-specification) | Specification of the resolved task | See [spec](#task-specification) | buildConfig |
+| [TaskRef](#taskref) | Details of the referenced task | Not Required (Since we save the complete TaskSpec even that of a remote task) | |
+| | **Status** | | |
+| Task Results | Results produced by the task run | Required (for comparing with the recreated results) | buildConfig |
+
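+To make the table concrete, here is a minimal sketch of a TaskRun that sets several of the runtime fields marked Required above (service account name, workspaces, pod template, compute resources). All names and values are hypothetical and only illustrate which runtime settings would need to be captured in buildConfig.
+
+```yaml
+apiVersion: tekton.dev/v1beta1
+kind: TaskRun
+metadata:
+  name: kaniko-run-example        # hypothetical name
+spec:
+  serviceAccountName: build-bot   # hypothetical service account
+  taskRef:
+    name: kaniko-chains           # hypothetical Task on the cluster
+  params:
+    - name: IMAGE
+      value: registry.example.com/app   # hypothetical image reference
+  workspaces:
+    - name: source
+      persistentVolumeClaim:
+        claimName: source-pvc     # hypothetical PVC backing the workspace
+  podTemplate:
+    securityContext:
+      runAsNonRoot: true
+  computeResources:
+    requests:
+      cpu: "1"
+```
+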
+### Step Specification
+
+The table below provides the name and description of the API fields that the Step requires. It also lists whether that information is required, not required, or already provided by the provenance. If it is required, it also shows where in the provenance the information should be stored. **Deprecated fields have not been included in the table below.**
+
+
+| | **Metadata** | | |
+| ------ | ------------- | ----- | ----- |
+| **Field Name** | **Description** | **Not Required / Required / Provided by provenance** | **Insert in provenance** |
+| name | Name of the step | Required | buildConfig |
+| image | Image name for this step | Provided | |
+| command | Entrypoint array | Provided | |
+| args | Arguments to the entrypoint | Provided | |
+| working Dir | Step’s working directory | Required | buildConfig |
+| EnvFrom | List of sources to populate env variables | Required | buildConfig |
+| Env | List of env variables to set in the container | Required | buildConfig |
+| Resources | Compute resources required by this step | Required | buildConfig |
+| VolumeMounts | Volumes to mount in the step’s filesystem | Required | buildConfig |
+| VolumeDevices | List of block devices to be used by the step | Required | buildConfig |
+| ImagePullPolicy | Image Pull Policy | Required | buildConfig |
+| Security Context | security options the step should run with | Required | buildConfig |
+| Script | Contents of an executable file to execute | Provided | |
+| Timeout | Time after which the step times out | Required | buildConfig |
+| Workspaces | list of workspaces from the task that the step wants exclusive access to | Required | buildConfig |
+| OnError | behavior of the container on error | Required | buildConfig |
+| StdOut Config | config of the stdout stream | Required | buildConfig |
+| StdErr Config | config of the stderr stream | Required | buildConfig |
+
+### Configuration Feature Flags
+
+The feature flags that a user/operator specified during installation of Tekton Pipelines and that led to the build are also required. These can be added to the **configSource** section of the [provenance](#provenance-for-executed-taskrun).
+
+### Reproducibility Reason Field if not reproducible
+
+Currently, the metadata section of the provenance includes a boolean field `reproducible`. For SLSA L4, if the task run is not reproducible with the information provided in the provenance, then a reason should be included as well. Since the provenance [schema](https://slsa.dev/provenance/v0.2#schema) has no field for a non-reproducibility reason, we will use the completeness section in metadata along with the `reproducible` boolean field to indicate why a build was not reproducible. For example, if a build is not reproducible due to missing workspace information, the provenance should mark `metadata.completeness.parameters` as false.
+
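+For the missing-workspace case described above, the metadata section might look like the sketch below. The `environment` and `materials` values are illustrative assumptions; only `reproducible` and `completeness.parameters` follow directly from the example in the text.
+
+```yaml
+metadata:
+  reproducible: false
+  completeness:
+    parameters: false    # workspace information is missing
+    environment: true    # assumed complete for this illustration
+    materials: true      # assumed complete for this illustration
+```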
+
+### Tekton Pipelines and Chains Version
+
+To ensure that the task run is reproduced by the same controller version, we need to save the versions of Tekton Pipelines and Chains used for generating the task run and the provenance, respectively. This information can be added to the `invocation.environment` section of the [provenance](#provenance-for-executed-taskrun).
+
+
+### Provenance for executed TaskRun
+
+The generated provenance for the executed **task run** should contain the following information:
+
+
+```
+invocation:
+  configSource:
+    feature-flags: # feature flags that led to the build
+      - enable-alpha-api: "alpha"
+      - feature-foo: "bar"
+    URI: str # e.g. "https:/test/tekton-test.git"
+    Digest:
+      sha256: "123fdf35b4e7b1a56a84b2796aab2827edd65c25" # e.g. could be the commit sha
+    Entrypoint: "task.yaml" # the yaml config to run
+  parameters:
+    ...
+  environment:
+    tekton_version: # Version of the tekton installation used for task run and provenance
+      pipeline: # str e.g. 0.38
+      chains: # str e.g. 0.3
+metadata:
+  reproducible: bool
+  completeness:
+    parameters: bool
+    environment: bool
+    ...
+buildConfig:
+  steps:
+    ...
+  sidecars: # Similar to steps
+    - entryPoint:
+      arguments:
+      environment:
+        container:
+        image:
+      annotations:
+  stepTemplate: # Use step template defined in tektoncd/pipelines
+    env:
+      - name: FOO
+        value: "bar"
+  podTemplate: # Use pod template defined in tektoncd/pipelines
+    schedulerName: volcano
+    securityContext:
+      runAsNonRoot: true
+      runAsUser: 1001
+    volumes:
+      - name: my-cache
+        persistentVolumeClaim:
+          claimName: my-volume-claim
+          spec: # PVC Spec or in general, any Kubernetes Volume spec
+            ...
+  serviceAccountName:
+  computeResources: # minimum compute resources for the task run
+  volumes: # any kubernetes volume that you can specify
+    - name:
+      configMap:
+        name:
+      ...
+  workspaces:
+    - workspaceDeclaration: # workspace declaration as defined in tektoncd/pipelines
+        name:
+        mountPath:
+        readonly:
+        optional:
+      workspaceBinding: # workspace binding as defined in tektoncd/pipelines
+        name:
+        subpath:
+        volumeClaimTemplate:
+        persistentVolumeClaim:
+        emptyDir:
+        configMap:
+        secret:
+        csi:
+      workspaceSpec: # Complete spec of a workspace if not defined in workspaceBinding
+        kind: # kind of workspace being used
+        spec: # any kubernetes volume that you can specify
+        data: # for configMap, secrets
+        ... # Any additional fields used by other workspace kinds
+
+```
+
+
+
+## Out-of-scope
+
+* A **Version** field for Task and TaskRun is required for SLSA L1 but not really required for reproducibility of the build. For SLSA L1, it will be useful to version the tasks and task runs but that work likely merits its own TEP. Similar requirements are expected to emerge for Pipeline and PipelineRun, but this is again out-of-scope future work for SLSA and not needed for reproducibility.
+* Generating provenance from the pipeline run is still a [WIP](https:/tektoncd/chains/pull/436). Therefore, this proposal only covers reproducibility of task run.
+* A **Resolved Inputs** field for TaskRun where “resolved inputs” is a direct encoding of reproducibility (and verified during the build pod itself via the entrypointer). In the long term, it is useful to have reproducibility metadata built into pipelines so that the changes to the TaskRun will need to consider reproducibility and chains does not have to be constantly updated whenever there is a drift. This field will require its own TEP to spec-out the details and therefore it is currently out-of-scope.
+* **Level of reproducibility** in the provenance: 
+ * **Full** -- A user can rerun the exact steps with the exact inputs and get the exact results, with all &lt;choose your favorite hash> signatures matching for inputs and artifacts. Until we have provenance that captures the full toolchain version stack -- possibly even down to the OS kernel version(?) -- we're going to be very hard-pressed to _guarantee_ full reproducibility in the initial implementation.
+ * **Partial** -- We can guarantee that a user can run the exact steps with the exact inputs -- but the output artifacts could change due to toolchain changes. For example, imagine all inputs are identical but there is a new compiler optimization. The output artifact is functionally identical but the signature doesn't match.
+
+
+## Future Work
+
+Since generating provenance from the pipeline run is still a [WIP](https:/tektoncd/chains/pull/436), reproducibility of pipeline runs can be addressed in the future, once we can generate provenance for complete pipeline runs.
+
diff --git a/teps/README.md b/teps/README.md
@@ -286,3 +286,4 @@ This is the complete list of Tekton teps:
 |[TEP-0118](0118-matrix-with-explicit-combinations-of-parameters.md) | Matrix with Explicit Combinations of Parameters | implementable | 2022-08-08 |
 |[TEP-0119](0119-add-taskrun-template-in-pipelinerun.md) | Add taskRun template in PipelineRun | implementable | 2022-09-01 |
 |[TEP-0120](0120-canceling-concurrent-pipelineruns.md) | Canceling Concurrent PipelineRuns | proposed | 2022-08-19 |
+|[TEP-0122](0122-reproducibility-of-complete-build-instructions.md) | Reproducibility of Complete Build Instructions | proposed | 2022-09-20 |