ADR-0038: Vulnerability provenance attribution — vuln.provenance as a query-time join over gather-time SBOMs
Status: Accepted
Date: 2026-05-18
Tags: vuln · provenance · sbom · routing · query-primitive · adapter · gather · assessment · phase-7 · phase-10
Related: ADR-0008, ADR-0011, ADR-0028, ADR-0029, ADR-0030, ADR-0031, ADR-0032, ADR-0037, ADR-0039, ADR-0042
Context
A vulnerability finding from Stage-2 gather (grype against syft SBOM, per Phase 2 S5-04) reports a CVE against a specific package version present in the analyzed repo's built image. What that finding does not answer is the routing question every downstream stage needs: where is the vulnerable package coming from? The same libxml2-2.9.10-1.amzn2.x86_64 finding has very different remediation paths depending on whether the package is:
- A direct dependency declared in the app's manifest (package.json / pom.xml / requirements.txt): the fix is a version bump, owned by the vulnerability-remediation--{lang}--{build} plugin.
- A transitive dependency several edges deep in the resolved dep tree: the fix is potentially an override / resolution constraint at the manifest level, owned by the same plugin but with a different recipe.
- A vendored copy sitting in the source tree without a manifest entry: the fix is a source replacement or vendor refresh, often owned by HITL.
- A package installed by the base image's package manager (apk add, apt-get install, yum install): the fix is a base-image swap, owned by the distroless-migration--{lang}--{build} plugin (Phase 7), or a patch to the base image itself.
- A library bundled by the language runtime (a JRE's copy of xerces, a Node distribution's bundled npm): the fix is a runtime upgrade, frequently HITL.
- Present in both the app layer and the base image (e.g., glibc shipped by Debian plus a transitive dep that statically links a copy): the fix coordinates across plugins.
- Genuinely unattributable (dynamic loading at runtime, plugin systems, JNI bindings): the fix is HITL with the evidence laid out for the reviewer.
ADR-0028 commits Phase 3 (vuln remediation) to the app-layer task class first and Phase 7 (Chainguard distroless migration) to the base-image task class second. ADR-0031 shapes plugins around the (task × language × build-tool) tuple. ADR-0010 splits work across Discovery → Assessment → Deep Scan → Planning → Execution → Validation → Handoff → Learning. The routing decision — which plugin should handle this CVE? — naturally lives at Stage 1 (Assessment, "is this repo eligible for this task class?") or Stage 3 (Planning, "given the chosen workflow, how do we proceed?"). Neither stage can do its job without the provenance attribution this ADR commits to.
Phase 2's existing SbomProbe (S5-04) already writes the raw syft JSON to <raw_dir>/syft-sbom.json alongside the aggregated sbom.json slice — and syft's raw output preserves per-package layer attribution (locations[].layerID per the syft JSON schema). The data substrate is therefore already gathered. What the architecture lacks is (a) the primitive that consumes it, (b) the adapter contract that lets per-language and per-distro implementations plug in, and (c) the commitment that provenance attribution is a query-time join against gather-time evidence — not a precomputed fact that goes stale every time the CVE feed ticks.
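As a rough illustration of the substrate described above, the sketch below groups the layer IDs recorded under locations[].layerID by package. It assumes the raw syft JSON exposes a top-level artifacts array whose entries carry name, version and locations. The function name layers_by_package and the sample document are hypothetical and are not part of Phase 2.

```python
import json


def layers_by_package(syft_doc: dict) -> dict[tuple[str, str], set[str]]:
    """Map (name, version) to the set of layer IDs recorded for that package."""
    layers: dict[tuple[str, str], set[str]] = {}
    for artifact in syft_doc.get("artifacts", []):
        key = (artifact.get("name", ""), artifact.get("version", ""))
        # Register the package even when no location carries a layerID, so
        # callers can tell "no layer attribution" apart from "package absent".
        found = layers.setdefault(key, set())
        for location in artifact.get("locations", []):
            layer_id = location.get("layerID")
            if layer_id:
                found.add(layer_id)
    return layers


# Hypothetical, hand-written fragment shaped like syft JSON output.
SAMPLE = json.loads("""
{
  "artifacts": [
    {"name": "libxml2", "version": "2.9.10-1.amzn2", "type": "rpm",
     "locations": [{"path": "/var/lib/rpm/Packages", "layerID": "sha256:aaa"}]},
    {"name": "lodash", "version": "4.17.20", "type": "npm",
     "locations": [{"path": "/app/node_modules/lodash/package.json"}]}
  ]
}
""")

attribution = layers_by_package(SAMPLE)
```

A package that maps to an empty set here is the situation the Unknown variant's sbom_layer_attribution_absent reason is meant to surface.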
Options considered
- Option A: leave routing implicit; let each plugin's Applicability check its own manifest. Phase 3's vulnerability-remediation--node--npm plugin already decides whether a recipe applies; extending that to "look only at the npm dep graph" implicitly routes by exclusion (if the CVE's affected package isn't in package.json's resolved tree, no recipe matches and the plugin returns Applicability.NotApplicable). Pattern: implicit dispatch via exclusion. Cost: every plugin reimplements partial provenance privately; Stage 1 Assessment can't pre-score eligibility per task class without invoking every plugin's matcher; the Both case (app-layer + base-image) can never be detected because no plugin sees both layers; the Unknown case escalates only as "no plugin matched," losing the reviewer-actionable evidence about why.
- Option B: precompute provenance at gather time and store per-CVE attribution in RepoContext. When Phase 2's CveProbe runs, also resolve each finding's provenance against the SBOM's layer attribution and write a vulnerability_provenance slice. Pattern: precomputation as a probe output. Gain: provenance is durable, content-addressed, and free to consume downstream. Cost: provenance is a join with CVE data, and CVE feeds update faster than gather. Every NVD/GHSA/OSV tick invalidates the precomputed slice; either gather re-runs (cost) or the slice goes stale silently (wrong), and neither is acceptable. The slice also grows linearly with CVE count for repos with hundreds of findings, most of which no workflow will ever address.
- Option C: vuln.provenance as a query-time primitive, joined lazily against the gather-time SBOM. Introduce a new primitive vuln.provenance(cve_id, package_id, image_ref?) → Provenance in the same architectural slot as the ADR-0030 graph-aware queries, but in a new query family (vuln.*, distinct from dep_graph.* / import_graph.* / scip.* / test_inventory.* because it's a join across evidence sources rather than a graph traversal). Implementations are plugin-contributed adapters in the ADR-0032 family. The primitive reads the raw syft-sbom.json artifact Phase 2 already writes; it is computed on demand at Stage 1 (per-task-class eligibility scoring) or Stage 3 (per-workflow planning), against whatever VulnIndex snapshot the workflow committed to. Pattern: query-time join over content-addressed evidence + lazy evaluation. This is the option this ADR adopts.
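The staleness argument that separates Option B from Option C can be made concrete with a toy join. In this sketch the VulnIndex snapshot is modeled as a plain dict from package ID to CVE IDs, which is an assumption for illustration only. The package IDs and CVE IDs are placeholders, and open_findings is a hypothetical helper.

```python
def open_findings(
    sbom_packages: list[str], vuln_index: dict[str, list[str]]
) -> list[tuple[str, str]]:
    """Join gather-time SBOM packages against a VulnIndex snapshot at query time."""
    return [
        (cve_id, package_id)
        for package_id in sbom_packages
        for cve_id in vuln_index.get(package_id, [])
    ]


# The SBOM is gathered once; only the index snapshot changes between queries.
SBOM_PACKAGES = ["libxml2@2.9.10", "lodash@4.17.20"]
INDEX_BEFORE_TICK = {"libxml2@2.9.10": ["CVE-0000-0001"]}
INDEX_AFTER_TICK = {
    "libxml2@2.9.10": ["CVE-0000-0001"],
    "lodash@4.17.20": ["CVE-0000-0002"],
}

before = open_findings(SBOM_PACKAGES, INDEX_BEFORE_TICK)
after = open_findings(SBOM_PACKAGES, INDEX_AFTER_TICK)
```

The second query picks up the new finding without re-running gather, whereas a slice precomputed under Option B would still hold only the first result.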
Decision
We adopt Option C. Codewizard-sherpa introduces a new query primitive vuln.provenance and a new adapter family VulnProvenanceAdapter. The primitive returns a Pydantic discriminated-union sum type:
from typing import Annotated, Literal

from pydantic import BaseModel, Discriminator

# FilePath, PackageId, ImageDigest, LayerDigest and DistroPackage are not
# defined in this excerpt; they are assumed to come from the project's shared
# type definitions.

class AppDirect(BaseModel):
    kind: Literal["app_direct"]
    manifest_path: FilePath  # package.json, pom.xml, requirements.txt
    dep_id: PackageId  # the declared dep that pulls the vulnerable version

class AppTransitive(BaseModel):
    kind: Literal["app_transitive"]
    manifest_path: FilePath
    dep_chain: list[PackageId]  # root manifest dep → ... → vulnerable package
    chain_depth: int  # len(dep_chain) - 1

class AppVendored(BaseModel):
    kind: Literal["app_vendored"]
    source_path: FilePath  # e.g., vendor/, third_party/, shaded JAR contents
    detected_via: Literal["file_fingerprint", "shaded_jar_metadata", "go_vendor_modules_txt"]

class BaseImage(BaseModel):
    kind: Literal["base_image"]
    image_digest: ImageDigest  # the image this came in via
    layer_digest: LayerDigest  # the layer within that image
    distro_pkg: DistroPackage  # apk/dpkg/rpm name + version
    stage: Literal["build", "runtime", "both"]  # multi-stage Dockerfile awareness

class RuntimeBundled(BaseModel):
    kind: Literal["runtime_bundled"]
    runtime: Literal["jre", "node", "python", "ruby", "dotnet"]
    bundled_artifact: FilePath  # path inside the runtime distribution

class Both(BaseModel):
    kind: Literal["both"]
    app_record: AppDirect | AppTransitive | AppVendored
    base_record: BaseImage | RuntimeBundled

class Unknown(BaseModel):
    kind: Literal["unknown"]
    reason: Literal[
        "dynamic_load",  # plugin system / runtime classloader
        "sbom_missing",  # SbomProbe outcome != "ran"
        "no_adapter_for_distro",  # base layer detected but no adapter registered
        "no_adapter_for_runtime",
        "sbom_layer_attribution_absent",  # syft output lacks layerID for this package
    ]

Provenance = Annotated[
    AppDirect | AppTransitive | AppVendored | BaseImage | RuntimeBundled | Both | Unknown,
    Discriminator("kind"),
]
The adapter Protocol mirrors ADR-0032's shape:
from typing import Protocol

class VulnProvenanceAdapter(Protocol):
    """One adapter per (language, build-tool) slice OR per (distro, package-manager) slice."""

    def attribute(self, cve_id: CveId, package_id: PackageId,
                  image_ref: ImageRef | None) -> Provenance:
        ...

    def confidence(self) -> AdapterConfidence:
        """High / Degraded / Unavailable per Phase 3 ADR-0010 sum-type discipline."""
        ...
Adapters are plugin-contributed. App-layer adapters (NpmVulnProvenanceAdapter, MavenVulnProvenanceAdapter, PipVulnProvenanceAdapter, GoModVulnProvenanceAdapter) ship with their respective vulnerability-remediation--{lang}--{build} plugins. Base-image adapters (AlpineVulnProvenanceAdapter, DebianVulnProvenanceAdapter, DistrolessVulnProvenanceAdapter, RhelVulnProvenanceAdapter) ship with distroless-migration--* plugins or with a shared base-image-tooling plugin. Runtime adapters (JreBundledVulnProvenanceAdapter) ship with the JVM-relevant plugins.
When the primitive runs:
- Stage 1, Assessment (Phase 10): scores eligibility per task class. For vuln-remediation, eligibility = "at least one open CVE has provenance kind ∈ {app_direct, app_transitive, both}." For distroless-migration, eligibility = "at least one open CVE has provenance kind ∈ {base_image, both} and the base image is not already a distroless image." Unknown provenance does not make the repo eligible for any concrete task class; it routes to the universal HITL fallback plugin.
- Stage 3, Planning (Phase 8): when a specific CVE-driven workflow fires, the Planner queries vuln.provenance to confirm the task-class routing decision Stage 1 made and to compose multi-plugin coordination for the Both case (a CVE in Both may require an app-layer recipe and a base-image swap PR; the Planner sequences them per ADR-0011's recipe-first ordering).
- Stage 2, Gather (Phases 2 + 14): the primitive is not computed here. The substrate it queries (the raw syft-sbom.json artifact) is gathered here. Continuous gather (Phase 14) re-runs SbomProbe when the image-digest:<resolved> declared-input token changes (Phase 2 ADR-0004), which means base-image refreshes naturally invalidate stale provenance through the existing cache-invalidation chain, with no new infrastructure needed.
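The Stage 1 eligibility rules above translate almost directly into set membership tests. The sketch below takes the provenance kinds of a repo's open CVEs as plain strings. The function names and the boolean base_is_distroless input are hypothetical simplifications.

```python
def eligible_task_classes(open_cve_kinds: list[str], base_is_distroless: bool) -> set[str]:
    """Score task-class eligibility from the provenance kinds of open CVEs."""
    kinds = set(open_cve_kinds)
    eligible: set[str] = set()
    if kinds & {"app_direct", "app_transitive", "both"}:
        eligible.add("vuln-remediation")
    if kinds & {"base_image", "both"} and not base_is_distroless:
        eligible.add("distroless-migration")
    return eligible


def needs_hitl_fallback(open_cve_kinds: list[str]) -> bool:
    """Unknown provenance grants no eligibility; it routes to the HITL fallback."""
    return "unknown" in open_cve_kinds
```

For example, a repo whose open CVEs have kinds ["both", "unknown"] on a non-distroless base is eligible for both task classes and also needs the HITL fallback for the unattributed finding. Note that app_vendored and runtime_bundled appear in neither eligibility set as the rules are stated.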
What this ADR does NOT decide:
- The adapter-chain assembly question (given a repo with node + npm app source on a cgr.dev/chainguard/node base image, which adapters should the Bundle Builder invoke and in what order?) is deferred to Phase 7's design pipeline, the first phase that genuinely needs base-image adapters. Until then, Phase 3 calls only the single app-layer adapter (NpmVulnProvenanceAdapter) that the vulnerability-remediation--node--npm plugin contributes.
- The risk_score-flavored extensions (call-site count × test coverage × business criticality) flagged in the prior architectural conversation are explicitly not introduced here. Per the "score, don't enumerate" framing, risk scoring is a per-task-class TCCM concern, not a top-level primitive.
Tradeoffs
| Gain | Cost |
|---|---|
| The Both and Unknown provenance cases become first-class outcomes the Planner and Universal HITL fallback can act on, instead of silently mis-routing or no-op-failing. | The primitive's value is realized only at Stage 1 Assessment and beyond. Phase 3 (which lacks an Assessment stage) sees only the marginal "refuse-non-app-layer-CVEs" improvement, not the full routing payoff. |
| Query-time evaluation against gather-time SBOMs composes with the existing image-digest:<resolved> cache-invalidation token, so base-image rotations propagate through the existing pipeline without new infrastructure. | The provenance join is recomputed every time it's called, with no inter-workflow caching. For repos with hundreds of findings this can be wasteful; Phase 14 may add a vuln_provenance_cache keyed on (sbom_digest, vuln_index_digest) if the lookup volume justifies it. |
| Adapters are plugin-contributed, so per-language and per-distro provenance work composes with ADR-0031's extension-by-addition discipline. A new distroless-migration--python--pip plugin ships its own adapter; no kernel edits. | The adapter-chain assembly question (which adapters to invoke, in what order, for a repo that touches multiple layers) is genuinely complex and is deferred to Phase 7. The deferral is honest, but it means Phase 7's design has real work to do, not just an implementation. |
| Unknown(reason) is a typed variant with structured reasons, so reviewers get actionable evidence (e.g., "this CVE's package has no locations[].layerID in the SBOM") rather than "we couldn't figure it out." | Reviewers must learn the new sum-type vocabulary; the seven variants are not all equally common, and beginners may treat app_transitive and app_direct as interchangeable when they shouldn't. Plugin documentation must enumerate the operational difference per variant. |
| The primitive composes with ADR-0008's objective-signal trust score: AdapterConfidence is a SignalKind candidate (Phase 3 S6-02 @register_signal_kind). Low-confidence provenance (e.g., file-fingerprint-detected vendored deps) can be reflected in TrustScorer strict-AND. | Adapters carry an honest-confidence burden: every adapter implementation must report AdapterConfidence correctly, including for the edge cases (e.g., partial layer attribution). This is the same burden the ADR-0032 graph adapters already carry, but the surface widens. |
| Phase 2's SbomProbe doesn't change: the raw syft-sbom.json artifact is already on disk with layer attribution. The ADR amounts to "read what we already write." | The raw artifact's schema is syft's output schema, which Phase 2 deliberately marked extra="allow" (per S5-04; tool output evolution is out of our control). Adapters must be defensive against syft-version drift; a SyftSchemaAdapter helper that normalizes across syft versions may emerge under the rule of three once the third adapter ships. |
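
The possible Phase 14 cache mentioned in the table could take roughly the following shape. This is a sketch only: the ADR names the (sbom_digest, vuln_index_digest) key, while the class name ProvenanceCache, the extra per-query key parts (cve_id, package_id) and the miss counter are assumptions added for illustration.

```python
from typing import Callable


class ProvenanceCache:
    """Memoizes provenance lookups per SBOM digest and VulnIndex digest."""

    def __init__(self, compute: Callable[[str, str], str]):
        self._compute = compute
        self._store: dict[tuple[str, str, str, str], str] = {}
        self.misses = 0

    def get(self, sbom_digest: str, vuln_index_digest: str,
            cve_id: str, package_id: str) -> str:
        key = (sbom_digest, vuln_index_digest, cve_id, package_id)
        if key not in self._store:
            # Either digest changing yields a new key, so stale entries are
            # never returned; they are simply no longer looked up.
            self.misses += 1
            self._store[key] = self._compute(cve_id, package_id)
        return self._store[key]


cache = ProvenanceCache(lambda cve_id, package_id: f"provenance({cve_id}, {package_id})")
first = cache.get("sbom-1", "index-1", "CVE-0000-0001", "libxml2")
second = cache.get("sbom-1", "index-1", "CVE-0000-0001", "libxml2")  # cache hit
third = cache.get("sbom-1", "index-2", "CVE-0000-0001", "libxml2")   # index changed
```

Keying on both digests means a base-image rotation and a CVE-feed tick each force recomputation without any explicit invalidation step.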
Consequences
- Phase 3 gets a small, surgical addition, not a redesign. A single acceptance-criterion-grade refuse-mode lands: if a CVE's affected component cannot be located in the app's dep graph, the vulnerability-remediation--node--npm plugin returns Applicability.NotApplicable(reason=CVE_NOT_IN_APP_LAYER) rather than producing a wrong fix. This is implementable today by inspecting the npm dep tree alone; it does not require the full vuln.provenance adapter machinery to ship in Phase 3.
- Phase 7 (Chainguard distroless migration) becomes the first home of the full primitive. When roadmap-phase-designer runs against Phase 7, the goal-criterion set inherits "ship the vuln.provenance primitive + at least one app-layer adapter + at least one base-image adapter + the BaseImage and Both provenance variants exercised in fixtures." This is the bounded additive cross-task primitive case ADR-0039 allows: no existing plugin or stable existing behavior changes, but the new primitive is ADR-backed and becomes part of the next stable contract surface.
- Phase 10 (Stage 0 Discovery + Stage 1 Assessment) becomes the first consumer for routing. When roadmap-phase-designer runs against Phase 10, Assessment-stage scoring per task class is computed via vuln.provenance over the repo's open CVEs. This sharpens what was previously described loosely as "score eligibility per task class."
- Phase 8 (Hierarchical Planner) gets a coordination requirement for the Both case. A CVE in Both may produce two PRs from two plugins: coordinated, sequenced per ADR-0011, with both attributable to the same RemediationWorkflow. This is design input for Phase 8; it does not pre-specify the implementation.
- No Phase 2 changes are needed. The raw syft-sbom.json artifact (S5-04) already preserves per-package layer attribution. This ADR is "wire new readers to existing evidence," not "produce new evidence." A Phase 2 follow-up story is not introduced by this ADR.
- The Unknown(reason="sbom_layer_attribution_absent") variant is a forcing function. If a future syft version (or a future scanner replacing syft) drops layer attribution, the failure is visible: every repo's provenance becomes Unknown, eligibility evaporates, and the symptom surfaces in the audit log rather than as silent mis-routing. This is the same shape Phase 3 ADR-0007 establishes for honest failure.
- The risk_score extension is explicitly out of scope. Per the prior architectural conversation, "score, don't enumerate" risk weighting is a per-task-class TCCM concern; it does not ride on vuln.provenance. Future task classes that need risk scoring define their own TCCM query types; they do not extend this primitive.
- The continuous-gather staleness story is already covered. Phase 14's incremental re-gather, triggered by changes to the image-digest:<resolved> declared-input token, naturally invalidates base-image-attributed provenance when the base image rotates. CVE-feed-only changes do not invalidate gather-time SBOM artifacts (correctly, since the SBOM hasn't changed); they invalidate VulnIndex, and the next provenance query joins against the fresh index. No new staleness infrastructure is introduced.
Reversibility
Medium. Removing the primitive after Phase 7 ships would require: (a) unwinding Phase 10 Assessment-stage routing (the Stage 1 scoring per task class is structurally dependent on the primitive), (b) reverting Phase 7's plugin adapter contributions, (c) deciding what to put in the place of the Both and Unknown routing outcomes (likely a regression to Option A's implicit-dispatch-via-exclusion failure mode). The cost is bounded because the primitive composes additively with existing ADRs (0030/0031/0032), but reversing it after a multi-plugin Phase 7 ships would mean weeks of unwinding. Reversing it before Phase 7 ships (e.g., if early Phase 7 design discovers a fatal flaw) is much cheaper — the only landed artifact would be the Phase 3 refuse-mode acceptance criterion, which is trivially removable.
Evidence / sources
- ADR-0008: the objective-signal trust score model AdapterConfidence composes with.
- ADR-0010: the seven-stage pipeline whose Stage 1 (Assessment) and Stage 3 (Planning) consume this primitive.
- ADR-0011: recipe-first ordering for the Both case multi-plugin coordination.
- ADR-0028: Phase 3 (app-layer) before Phase 7 (base-image), which makes this primitive necessary at the Phase 7 boundary.
- ADR-0029: TCCMs that may reference vuln.provenance in their query bands.
- ADR-0030: the query-primitive architectural slot this primitive joins (in a new vuln.* family, parallel to dep_graph.* / import_graph.* / scip.* / test_inventory.*).
- ADR-0031: the plugin bundle that owns per-language and per-distro adapter contributions.
- ADR-0032: the adapter-contract pattern this primitive's adapters follow.
- ADR-0037: the substrate-boundary commitment that constrains how this primitive consumes its evidence (gather-time SBOM, not interactive LSP).
- Phase 2 ADR-0004: the image-digest:<resolved> cache-invalidation token that propagates base-image rotations.
- Phase 2 S5-04: the gathered syft-sbom.json raw artifact this primitive reads.
- Phase 3 ADR-0010: the discriminated-union sum-type discipline Provenance follows.
- syft JSON schema: locations[].layerID is the load-bearing field provenance attribution reads.
- External framing (paraphrased from architectural conversation 2026-05-18): "for app-layer vulns, figure out where the vulnerable library is coming from — what import vs if it's coming from a base image." This ADR is the production-level commitment that this answer is computable.