Story S6-02: TrustScorer with constructor-injected EventLog + SignalKind open registry (Gap 5 fix)
Step: Step 6 — RemediationOrchestrator, TrustScorer, two-stream EventLog, SubgraphNode Protocol, end-to-end happy path
Status: HARDENED
Effort: M
Depends on: S6-01
ADRs honored: ADR-0001 (TrustScorer.__init__(event_log) is the named Phase-5 contract — constructor-injection is mandatory), ADR-0005 (the injected log is the two-stream EventLog), ADR-0010 (TrustOutcome tagged-union, SignalKind newtype), Phase 5 ADR-0003 (Phase 5 widens via @register_signal_kind)
Validation notes: 2026-05-19
Verdict: HARDENED. The Phase-3 commitments (constructor-injection, strict-AND, open registry) were already strong; the executor needs tighter tests + extra ACs so a wrong impl cannot slip through. Changes applied:
- AC-DUP rewritten. The previous AC was self-contradictory ("idempotent for the same (name) call, but raises … if called twice from different modules"). `PluginRegistry.register` is the named precedent and it always raises on duplicate (`src/codegenie/plugins/registry.py:119-122`). The hardened AC pins "always raises; carries `.name`, `.existing`, `.duplicate`" to mirror `PluginAlreadyRegistered(name, existing, duplicate)` in `errors.py:74-78`.
- "Decorator-like helper" terminology corrected → "registration helper / function call." `register_plugin` is explicitly a function call, not a decorator (`plugins/registry.py:10-16`); the same applies here. The "decorator" framing in Phase-2 / Phase-3 prose refers to `@register_probe`, not this function-call shape.
- `failing` ordering pinned (Coverage F2 + Test-Quality F2). Notes already said "ordered as input"; now an AC + tighter parametrize test (list equality, not set equality) enforces it.
- `pytest.raises(Exception)` tightened to `pytest.raises(pydantic.ValidationError)` (Test-Quality F1); a wrong impl raising `TypeError` would have silently passed.
- `details` rejection parametrized over `list`, `None`, `bytes`, `datetime`, nested-dict, raw `object` (Coverage F3). The single-case test let a permissive impl through.
- Stateless `score()` AC + test (Coverage F1). Notes mandated this but no AC verified it. The laziest wrong impl caches `_degraded_flag` in `__init__`; the new test emits an `AdapterDegraded` between two `score()` calls and asserts the second outcome reflects it.
- Empty-signals AC (Coverage F4). Architecture pins Stage 6 to exactly 5 signals; `score([])` is a caller bug. Pin a typed `EmptySignals` raise so a wrong impl can't silently return `passed=True, confidence="high"`.
- Cross-event-type test (Coverage F5). Emitting a non-`AdapterDegraded` internal event (`PluginResolved`) with matching `workflow_id` must NOT flip confidence. The previous tests only ruled out cross-workflow leakage, not cross-event-type leakage.
- `outcome.signals` preservation test (Coverage F6). AC said "preserved verbatim"; no test asserted list order or membership.
- Hypothesis property test (Test-Quality F5). The confidence fold is a clean property: for any sequence of emitted events, `confidence == "degraded" iff any(matching AdapterDegraded)`. Pinning it as a Hypothesis test future-proofs against subtle filter bugs.
- Pure-helper discipline added (Design-Patterns F2: functional core / imperative shell, CLAUDE.md §Conventions). `_compute_strict_and(signals)` and `_has_adapter_degraded_for_workflow(events, workflow_id)` are pure; `score()` is the imperative shell. An AST-walking AC + a small purity test pin the split (matches the Phase-1 / Phase-2 probe discipline).
- Import-time registration AC (Consistency F7). Pins that `import codegenie.transforms.signal_kinds` (or transitively via `trust_scorer`) registers the 5 Phase-3 kinds. Notes-for-implementer adds the obligation that `transforms/__init__.py` imports `signal_kinds` to make the registration unconditional for any `from codegenie.transforms import …` consumer.
- `test_phase3_five_kinds_registered_at_import` strengthened (Test-Quality F3). The old test was satisfied by any prior import; the new assertion equates the module-attr `BUILD` to a registry-lookup result and walks the AST of `signal_kinds.py` to confirm 5 top-level `register_signal_kind(...)` calls, so it survives the mutation "the registrations moved into a function never called."
- Notes for the implementer extended with: (a) the 5th-registry audit anchor + kernel-extract deferral (per `plugins/registry.py:18-49`); (b) the option to use `InMemorySink` from S6-01 for faster tests; (c) the explicit `transforms/__init__.py` import obligation.
Goal, scope, and Phase-5 contract surface unchanged.
Context
TrustScorer is the strict-AND scoring kernel. Phase 3 registers 5 signal kinds (build, install, tests, lockfile_policy, cve_delta); Phase 5 widens with trace, policy (05-ADR-0003); Phase 7 widens again with baseimage, shell_presence. Each addition is a new file with @register_signal_kind("name") — no edits to TrustScorer.score. The score is strict-AND: any passed=False signal → TrustOutcome.passed=False with failing=[...kinds...].
The architecture spec's Gap 5 (../phase-arch-design.md §Gap analysis & improvements §Gap 5) called out that the three lens designs left "how the orchestrator obtains the EventLog instance the TrustScorer reads" implicit. Two options were on the table:
- Ambient state: `TrustScorer.score(signals)` walks `os.environ["CODEGENIE_WORKFLOW_ID"]` and discovers the per-workflow log on disk.
- Constructor injection: `TrustScorer(event_log)` receives the log explicitly; `score(signals)` reads `event_log.replay()` for `AdapterDegraded` markers and folds `confidence: Literal["high", "degraded"]` into the outcome.
Ambient state is the textbook anti-pattern (hidden coupling, unmockable in tests, breaks under concurrent workflows in the same process). ADR-0001 picks constructor injection explicitly (§Consequences: "TrustScorer.__init__(event_log: EventLog) (constructor-injection per Gap 5 in the architecture spec) — the scorer reads its workflow's event stream to fold AdapterDegraded events into TrustOutcome.confidence. Ambient-state alternative rejected."). ADR-0005 §Consequences reasserts: "TrustScorer reads its own workflow's internal stream for AdapterDegraded markers — this is the ambient-state alternative rejected in ADR-0001 (constructor-injected EventLog instead)."
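The difference is easiest to see with two workflows sharing one process. The sketch below is illustrative only: `FakeEventLog`, `SketchScorer`, and the one-field `AdapterDegraded` stand-in are hypothetical names, not the shipped `EventLog` / `TrustScorer` API. It assumes nothing beyond "the log exposes `workflow_id` and `replay()`".

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass(frozen=True)
class AdapterDegraded:
    """Hypothetical stand-in for the real internal-stream event."""
    workflow_id: str


@dataclass
class FakeEventLog:
    """Hypothetical in-memory log exposing only what the scorer reads."""
    workflow_id: str
    events: list[object] = field(default_factory=list)

    def emit_internal(self, event: object) -> None:
        self.events.append(event)

    def replay(self) -> list[object]:
        return list(self.events)


class SketchScorer:
    """Constructor injection: the log is an explicit, required dependency."""

    def __init__(self, event_log: FakeEventLog) -> None:
        self._event_log = event_log

    def confidence(self) -> Literal["high", "degraded"]:
        # Re-read the log on every call; nothing is cached on the instance.
        wf = self._event_log.workflow_id
        degraded = any(
            isinstance(e, AdapterDegraded) and e.workflow_id == wf
            for e in self._event_log.replay()
        )
        return "degraded" if degraded else "high"


# Two workflows in one process: each scorer only ever sees its own log.
log_a, log_b = FakeEventLog("wf-a"), FakeEventLog("wf-b")
scorer_a, scorer_b = SketchScorer(log_a), SketchScorer(log_b)

before = scorer_a.confidence()
log_a.emit_internal(AdapterDegraded(workflow_id="wf-a"))
after = scorer_a.confidence()
other = scorer_b.confidence()
```

With an environment-variable lookup, both scorers would resolve the same ambient workflow; here the isolation falls out of the constructor signature, and a test can swap in any log without patching `os.environ`.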
The open SignalKind registry (register_signal_kind("name")) is the seam Phase 5 and Phase 7 extend. Adding a new signal kind is one registration call in a new module, with no edits here. The registry mirrors the PluginRegistry / RecipeRegistry / IndexFreshnessRegistry shape already established in the codebase (CLAUDE.md §Open/Closed seams).
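A minimal sketch of the extension-by-addition shape, assuming a plain dict-backed registry. The class bodies, the `origin` strings, and the local `SignalKind` declaration are illustrative assumptions; the real module takes `SignalKind` from `codegenie.types.identifiers` and derives the error from `CodegenieError`.

```python
from typing import NewType

SignalKind = NewType("SignalKind", str)  # local stand-in for the shared newtype


class SignalKindAlreadyRegistered(Exception):
    """Raised on every duplicate; carries both call sites."""

    def __init__(self, name: SignalKind, existing: str, duplicate: str) -> None:
        super().__init__(
            f"signal kind {name!r} already registered by {existing}; "
            f"duplicate registration from {duplicate}"
        )
        self.name = name
        self.existing = existing
        self.duplicate = duplicate


class SignalKindRegistry:
    def __init__(self) -> None:
        self._kinds: dict[SignalKind, str] = {}  # kind -> origin string

    def register(self, name: str, *, origin: str) -> SignalKind:
        kind = SignalKind(name)
        if kind in self._kinds:
            raise SignalKindAlreadyRegistered(kind, self._kinds[kind], origin)
        self._kinds[kind] = origin
        return kind

    def __contains__(self, kind: object) -> bool:
        return kind in self._kinds


registry = SignalKindRegistry()
BUILD = registry.register("build", origin="phase3.signal_kinds")
# A later phase widens the registry from its own module; the scorer is untouched.
TRACE = registry.register("trace", origin="phase5.signal_kinds")

try:
    registry.register("build", origin="phase7.signal_kinds")
except SignalKindAlreadyRegistered as err:
    duplicate_error = err  # keep a reference; `err` is unbound after the block
```

The scorer only ever asks `kind in registry`, which is why a new kind needs no change to `TrustScorer.score`.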
References: where to look
- Architecture:
  - `../phase-arch-design.md §Component design C6`: `TrustScorer` public interface, `SignalKind` open registry, confidence-propagation semantics, the "this is mildly cyclical — replay-tested" note.
  - `../phase-arch-design.md §Data model` (lines ~832–844): `TrustSignal`, `TrustOutcome` Pydantic shapes (`extra="forbid"`, `frozen=True`, `details` is primitives-only).
  - `../phase-arch-design.md §Gap analysis & improvements §Gap 5`: the gap this story closes; lays out the ambient-state vs. constructor-injection tradeoff.
  - `../phase-arch-design.md §Control flow` step 8: Stage 6 collects 5 `TrustSignal`s and passes them to `TrustScorer.score(...)`.
- Phase ADRs:
  - `../ADRs/0001-ship-phase5-contract-surface-by-name.md` §Consequences row 5: constructor injection of `EventLog` is the Phase-5 contract.
  - `../ADRs/0005-two-stream-event-log-per-adr-0034.md` §Consequences: `TrustScorer` reads the internal stream for `AdapterDegraded` markers.
  - `../ADRs/0010-domain-modeling-discipline-scope-sum-type-and-newtypes.md` §Decision (3): `TrustOutcome` is a Pydantic discriminated union pattern; `SignalKind` is a `NewType`.
- Cross-phase precedent:
  - `../../05-sandbox-trust-gates/ADRs/0003-trustscorer-extension-via-signal-kind-registry.md`: Phase 5's widening proves the registry's extension-by-addition shape.
  - `../../05-sandbox-trust-gates/final-design.md §6`: `GateRunner.run` consumes `TrustScorer.score`; the signature shipped here is the signature Phase 5 wraps.
- Existing code to mirror:
  - `src/codegenie/probes/registry.py`: `@register_probe` instance-with-default-singleton shape (CLAUDE.md §Registry-dispatched coordinator).
  - `src/codegenie/indices/freshness.py` (Phase 2): `@register_index_freshness_check(IndexName)` shape; the closest existing analog.
- This phase, parallel stories:
  - S6-01: the `EventLog` this scorer reads from; `AdapterDegraded` is one of the 16 internal-stream variants.
  - S6-04: the orchestrator constructs `TrustScorer(event_log=self._event_log)` and passes the 5 signals at Stage 6.
  - S5-04: `LockfilePolicy` generates the `lockfile_policy` `TrustSignal` payload.
  - S1-01: provides the `SignalKind` newtype.
Goal
Land src/codegenie/transforms/trust_scorer.py exposing TrustScorer(event_log) with score(signals: list[TrustSignal]) -> TrustOutcome; strict-AND on passed; confidence folded from AdapterDegraded events in event_log.replay(), filtered to the workflow_id of the constructor-supplied log. Also land src/codegenie/transforms/signal_kinds.py with the register_signal_kind("name") open registry; Phase 3 registers build, install, tests, lockfile_policy, cve_delta at import time.
Acceptance criteria
- [ ] AC-1 (module surface). `src/codegenie/transforms/trust_scorer.py` exists; `from codegenie.transforms.trust_scorer import TrustScorer, TrustSignal, TrustOutcome, UnregisteredSignalKind, EmptySignals` succeeds.
- [ ] AC-2 (constructor injection mandatory, Gap 5). `TrustScorer.__init__(self, event_log: EventLog) -> None` requires the `event_log` argument: no default value, no `Optional`. Constructing `TrustScorer()` is a `TypeError`. Per ADR-0001 §Consequences row 5, ambient-state lookup is rejected; no `os.environ` / thread-local / classmethod alternative exists on the class.
- [ ] AC-3 (strict-AND on `passed`). `score(self, signals: list[TrustSignal]) -> TrustOutcome` computes `outcome.passed = all(s.passed for s in signals)` for non-empty `signals`.
- [ ] AC-4 (`failing` is order-preserving, not sorted). `outcome.failing == [s.kind for s in signals if not s.passed]`; input order is preserved, and the implementation MUST NOT sort, deduplicate, or convert the list to a `frozenset`. Pinned because future consumers (Phase 5's gate-runner; the remediation-report writer S5-05) read the order to display the first-failing signal to humans.
- [ ] AC-5 (`signals` preserved verbatim). `outcome.signals == signals`; each element is the same `TrustSignal` instance (frozen, identity-bearing via Pydantic `model_copy(deep=False)` semantics). The scorer is a fold, not a transformer: it must NOT rebuild signal objects.
- [ ] AC-6 (confidence fold filtered by `workflow_id`). `outcome.confidence == "degraded"` if any event yielded by `self._event_log.replay()` is an `AdapterDegraded` whose `workflow_id == self._event_log.workflow_id`; otherwise `"high"`. The filter is on both event type and `workflow_id`; neither alone flips confidence.
- [ ] AC-7 (`TrustSignal` shape). `TrustSignal` is a `frozen=True, extra="forbid"` Pydantic model with fields `kind: SignalKind` (the `NewType` from `codegenie.types.identifiers`, not a fresh declaration), `passed: bool`, `details: dict[str, str | int | bool | float]`. Non-primitive `details` values (list, tuple, None, bytes, datetime, nested dict, arbitrary objects) raise `pydantic.ValidationError` at construction. The S1-05 AST fence forbids `dict[str, Any]` anywhere in this module.
- [ ] AC-8 (`TrustOutcome` shape). `TrustOutcome` is a `frozen=True, extra="forbid"` Pydantic model with fields `passed: bool`, `failing: list[SignalKind]`, `signals: list[TrustSignal]`, `confidence: Literal["high", "degraded"]`. `confidence` is a closed `Literal`, not a sum type: ADR-0010's tagged-union discipline applies only where variants carry payload (Notes for implementer §5).
- [ ] AC-9 (unregistered-kind rejection). `score(...)` raises `UnregisteredSignalKind(kind)` if any `signal.kind` is not in the `SignalKind` registry at call time. This is the only validation `score` performs against signals; registry membership failure is a programming error, not a data error. `UnregisteredSignalKind` is declared in `trust_scorer.py` and carries a typed `.kind: SignalKind` attribute.
- [ ] AC-10 (empty-signals rejection). `score([])` raises `EmptySignals` (typed, no payload). The architecture pins Stage 6 to exactly 5 signals (`phase-arch-design.md §Control flow` step 8); an empty list is a caller bug. The alternative, silently returning `passed=True, failing=[], confidence="high"`, would mis-report a broken Stage-6 collection as a successful workflow.
- [ ] AC-11 (`signal_kinds` module surface). `src/codegenie/transforms/signal_kinds.py` exists; exports `SignalKindRegistry`, `SignalKindAlreadyRegistered`, the `register_signal_kind(name: str, *, registry: SignalKindRegistry | None = None) -> SignalKind` registration helper (function call, NOT a class decorator; mirrors `register_plugin`'s shape per the `plugins/registry.py:10-16` module docstring), the `signal_kind_registry: Final[SignalKindRegistry]` singleton, and the 5 Phase 3 module-level registrations in this exact order: `BUILD = register_signal_kind("build")`, `INSTALL = register_signal_kind("install")`, `TESTS = register_signal_kind("tests")`, `LOCKFILE_POLICY = register_signal_kind("lockfile_policy")`, `CVE_DELTA = register_signal_kind("cve_delta")`.
- [ ] AC-12 (import-time registration is the registration mechanism). Loading `codegenie.transforms.signal_kinds` (directly or transitively via `codegenie.transforms` / `codegenie.transforms.trust_scorer`) registers the 5 Phase-3 kinds. `transforms/__init__.py` imports `signal_kinds` at module-import time so any `from codegenie.transforms import TrustScorer` consumer observes the populated registry. A test asserts the registry contains the 5 names after a fresh `importlib.import_module("codegenie.transforms")` (run in a subprocess so prior-test side effects can't satisfy it).
- [ ] AC-13 (duplicate-name rejection, mirrors `PluginAlreadyRegistered`). `register_signal_kind(name, registry=r)` raises `SignalKindAlreadyRegistered(name, existing, duplicate)` if `name` is already in `r`. The exception carries a typed `.name: SignalKind` attribute and `.existing` / `.duplicate` string fields naming both `module.qualname` call sites (mirrors `PluginAlreadyRegistered.__init__(name, existing, duplicate)` in `src/codegenie/plugins/errors.py:74-78`). There is no idempotent path: every duplicate raises, regardless of caller. Tests that need re-registration use `signal_kind_registry.fresh()`.
- [ ] AC-14 (per-test isolation). `SignalKindRegistry.fresh()` is a classmethod returning a clean instance with zero registrations; tests construct fresh registries and pass them through `register_signal_kind(name, registry=fresh)` (matches Phase 3 ADR-0002's `PluginRegistry()` per-test instance discipline, `plugins/registry.py` lines 175-186).
- [ ] AC-15 (functional core / imperative shell). `trust_scorer.py` exposes two pure helpers, `_compute_strict_and(signals: list[TrustSignal]) -> tuple[bool, list[SignalKind]]` and `_has_adapter_degraded_for_workflow(events: Iterable[Event], workflow_id: WorkflowId) -> bool`, and the impure shell `TrustScorer.score`. The pure helpers MUST NOT touch the event log, the filesystem, or any module-level mutable state; a small AST-walk test asserts the helpers do not name `EventLog`, `replay`, `open`, `Path`, or `os` in their bodies (matches the Phase-1 / Phase-2 functional-core discipline noted in CLAUDE.md §Conventions).
- [ ] AC-16 (stateless across calls). When `scorer.score(...)` is called twice on the same `TrustScorer` instance, with an `AdapterDegraded` event emitted to the injected log between the two calls, the second outcome's `confidence` reflects the new event (i.e. flips from `"high"` to `"degraded"`). The scorer MUST NOT cache the degraded flag in `__init__` or memoize across `score()` calls. A test pins this directly.
- [ ] AC-17 (cross-event-type safety). Emitting a non-`AdapterDegraded` internal event (e.g. `PluginResolved`) with `workflow_id == self._event_log.workflow_id` does NOT flip `confidence` to `"degraded"`. The filter is on event type AND `workflow_id`, not `workflow_id` alone.
- [ ] AC-18 (cross-workflow safety). An `AdapterDegraded` event with a different `workflow_id` in the same log does NOT flip `confidence`. (Sibling of AC-17 on the `workflow_id` dimension.)
- [ ] AC-19 (strict-AND parametric, list-equality). Strict-AND across all `2^5 = 32` signal combinations is unit-tested (parametrized over `itertools.product([False, True], repeat=5)`). The test asserts `out.failing == [k for k, p in zip(kinds, combo) if not p]` (list equality, not `set(out.failing) == {...}`) so a sorted-`failing` implementation does NOT pass.
- [ ] AC-20 (no module-level mutable state). No module-level mutable state outside the registry singleton (`signal_kind_registry`): no `_cached_*`, no module-level `dict[..., ...] = {}`, no `Final[set[...]] = set()` populated lazily. Pinned by AST-walk + import-purity test.
- [ ] AC-21 (TDD). TDD red test exists, committed, green.
- [ ] AC-22 (gates clean). `ruff format`, `ruff check`, `mypy --strict` clean.
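AC-3, AC-4, and AC-19 together describe one small fold. The sketch below walks all 32 pass/fail combinations with a stand-in `Signal` tuple (hypothetical; the real type is the Pydantic `TrustSignal`) to show why list equality, not set equality, is the assertion that catches a sorting implementation.

```python
from itertools import product
from typing import NamedTuple


class Signal(NamedTuple):
    """Hypothetical stand-in for TrustSignal; only the two fields the fold reads."""
    kind: str
    passed: bool


def compute_strict_and(signals: list[Signal]) -> tuple[bool, list[str]]:
    # Strict-AND on `passed`; failing kinds are kept in caller order.
    return all(s.passed for s in signals), [s.kind for s in signals if not s.passed]


# Stage-6 order from the story; note it is NOT alphabetical.
KINDS = ["build", "install", "tests", "lockfile_policy", "cve_delta"]

results = {
    combo: compute_strict_and([Signal(k, p) for k, p in zip(KINDS, combo)])
    for combo in product([False, True], repeat=5)
}

all_fail_passed, all_fail_failing = results[(False,) * 5]
passing_combos = [combo for combo, (passed, _) in results.items() if passed]
```

Because the Stage-6 order differs from alphabetical order, an implementation that returned `sorted(failing)` would produce the right set and the wrong list for the all-fail case.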
Implementation outline
- Write `tests/unit/transforms/test_trust_scorer.py` (red); confirm `ImportError`.
- Create `src/codegenie/transforms/signal_kinds.py`:
  - `class SignalKindRegistry`:
    - `__init__(self) -> None: self._kinds: dict[SignalKind, str] = {}` (value is the `module.qualname` origin string, mirrors `PluginRegistry._origins` at `plugins/registry.py:101`).
    - `register(self, name: str, *, origin: str) -> SignalKind`: collision check then insert; raise `SignalKindAlreadyRegistered(name, existing=..., duplicate=origin)` on duplicate.
    - `__contains__(self, kind: SignalKind) -> bool`.
    - `@classmethod def fresh(cls) -> "SignalKindRegistry"`: returns a new empty instance.
  - `class SignalKindAlreadyRegistered(CodegenieError)`: typed `.name: SignalKind`, `.existing: str`, `.duplicate: str`; message names both call sites (mirrors `PluginAlreadyRegistered` in `plugins/errors.py:74-78`).
  - `signal_kind_registry: Final[SignalKindRegistry] = SignalKindRegistry()`.
  - `def register_signal_kind(name: str, *, registry: SignalKindRegistry | None = None) -> SignalKind`: function call (NOT a class decorator); resolves `origin` via `inspect.stack()[1]`'s frame info (`"<module>.<qualname>"`) so the duplicate error names both call sites without the caller having to pass `origin` explicitly. Delegates to `(registry or signal_kind_registry).register(name, origin=origin)`.
  - The 5 Phase 3 registrations as module-level calls. They execute at import time; that is the registration mechanism.
- Create `src/codegenie/transforms/trust_scorer.py`:
  - Imports: `TrustSignal`, `TrustOutcome` (Pydantic models defined in this file), `EventLog`, `AdapterDegraded` from `codegenie.plugins.events`, `signal_kind_registry` from `codegenie.transforms.signal_kinds`.
  - `class UnregisteredSignalKind(CodegenieError)`: typed `.kind: SignalKind`.
  - `class EmptySignals(CodegenieError)`: no payload.
  - Pure helpers (functional core; AC-15):
    - `def _compute_strict_and(signals: list[TrustSignal]) -> tuple[bool, list[SignalKind]]`: returns `(all(s.passed for s in signals), [s.kind for s in signals if not s.passed])`. No I/O, no log reads.
    - `def _has_adapter_degraded_for_workflow(events: Iterable[Event], workflow_id: WorkflowId) -> bool`: returns `any(isinstance(e, AdapterDegraded) and e.workflow_id == workflow_id for e in events)`. Takes an iterable of events; does NOT touch the log.
  - Imperative shell: `class TrustScorer`:
    - `__init__(self, event_log: EventLog) -> None: self._event_log = event_log`.
    - `def score(self, signals: list[TrustSignal]) -> TrustOutcome`:
      - If `not signals`: raise `EmptySignals`.
      - Validate every `signal.kind in signal_kind_registry`; raise `UnregisteredSignalKind(kind)` on miss.
      - Call `_compute_strict_and(signals)` for `passed`, `failing`.
      - Call `_has_adapter_degraded_for_workflow(self._event_log.replay(), self._event_log.workflow_id)`; `confidence = "degraded" if … else "high"`.
      - Return `TrustOutcome(passed=passed, failing=failing, signals=signals, confidence=confidence)`.
- Update `src/codegenie/transforms/__init__.py` to import `signal_kinds` (so the registrations land for any `from codegenie.transforms import …` consumer) and re-export `TrustScorer`, `TrustSignal`, `TrustOutcome`, `UnregisteredSignalKind`, `EmptySignals` per ADR-0001 §Consequences. Module-level `import codegenie.transforms.signal_kinds  # noqa: F401 -- module-level registrations` is the canonical pattern.
- Run `ruff format`, `ruff check`, `mypy --strict`, `pytest`.
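The caller-frame lookup in `register_signal_kind` is the least obvious step of the outline. The sketch below shows one way it could work with the standard-library `inspect` module; `caller_origin`, `register`, and `plugin_setup` are hypothetical names, and the sketch reports the plain function name from `FrameInfo.function` rather than a full qualname, which is an assumption about what "qualname" needs to mean for the error message.

```python
import inspect


def caller_origin(depth: int = 1) -> str:
    """Return "<module>.<function>" for the frame `depth` levels above this call.

    inspect.stack()[0] is this function's own frame, [1] is its caller, and so on.
    """
    info = inspect.stack()[depth]
    module = info.frame.f_globals.get("__name__", "<unknown>")
    return f"{module}.{info.function}"


def register(name: str) -> tuple[str, str]:
    # depth=2 skips caller_origin (0) and register (1) to reach register's caller.
    return name, caller_origin(depth=2)


def plugin_setup() -> tuple[str, str]:
    # Stands in for a module that registers a new kind.
    return register("trace")


name, origin = plugin_setup()

# For a call made at module top level, FrameInfo.function is "<module>".
_, top_level_origin = register("policy")
```

One design note, offered as a suggestion rather than a requirement of the story: selecting the target registry with `registry if registry is not None else signal_kind_registry` stays correct even if the registry class later gains `__len__`, whereas `registry or signal_kind_registry` would then silently route an empty fresh registry to the singleton.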
TDD plan: red / green / refactor
Red: write the failing test first
Test file path: tests/unit/transforms/test_trust_scorer.py.
# tests/unit/transforms/test_trust_scorer.py
import ast
import importlib
import subprocess
import sys
import textwrap
from datetime import datetime, timezone
from itertools import product
from pathlib import Path

import pytest
from hypothesis import given, strategies as st
from pydantic import ValidationError

from codegenie.plugins.events import (
    AdapterDegraded, EventLog, InMemorySink, PluginResolved,
)
from codegenie.transforms.trust_scorer import (
    EmptySignals, TrustOutcome, TrustScorer, TrustSignal, UnregisteredSignalKind,
)
from codegenie.transforms.signal_kinds import (
    BUILD, CVE_DELTA, INSTALL, LOCKFILE_POLICY, TESTS,
    SignalKindAlreadyRegistered, SignalKindRegistry,
    register_signal_kind, signal_kind_registry,
)
from codegenie.types.identifiers import EventId, PluginId, SignalKind, WorkflowId

WF = WorkflowId("01HFEEDFACE0000000000000000")
OTHER_WF = WorkflowId("01HOTHERWORKFLOW000000000000")


def _log(tmp_path: Path, wf: WorkflowId = WF) -> EventLog:
    # InMemorySink (S6-01 AC-2) keeps tests fast; tmp_path used for dir layout only.
    return EventLog(root=tmp_path, workflow_id=wf, sink=InMemorySink())


def _ad(wf: WorkflowId = WF, eid: str = "01H...01", reason: str = "parse_error") -> AdapterDegraded:
    return AdapterDegraded(
        event_id=EventId(eid), workflow_id=wf,
        timestamp=datetime.now(timezone.utc), adapter="dep_graph", reason=reason)


def _sig(kind: SignalKind, passed: bool = True) -> TrustSignal:
    return TrustSignal(kind=kind, passed=passed, details={})

# --- AC-2: constructor injection mandatory ---------------------------------
def test_constructor_requires_event_log():
    with pytest.raises(TypeError):
        TrustScorer()  # type: ignore[call-arg]


def test_no_ambient_state_alternative_on_class():
    # Defend Gap 5: no classmethod / module-level helper that resolves
    # event_log from os.environ or any thread-local.
    forbidden = {"from_env", "from_ambient", "current", "for_workflow"}
    assert not (forbidden & set(dir(TrustScorer)))


# --- AC-3 / AC-4 / AC-19: strict-AND, list-equality on `failing` -----------
def test_strict_and_all_pass(tmp_path: Path):
    scorer = TrustScorer(event_log=_log(tmp_path))
    signals = [_sig(k, True) for k in (BUILD, INSTALL, TESTS, LOCKFILE_POLICY, CVE_DELTA)]
    out = scorer.score(signals)
    assert out.passed is True
    assert out.failing == []
    assert out.confidence == "high"


@pytest.mark.parametrize("combo", list(product([False, True], repeat=5)))
def test_strict_and_2_to_5_preserves_input_order(tmp_path: Path, combo):
    scorer = TrustScorer(event_log=_log(tmp_path))
    kinds = [BUILD, INSTALL, TESTS, LOCKFILE_POLICY, CVE_DELTA]
    signals = [_sig(k, p) for k, p in zip(kinds, combo)]
    out = scorer.score(signals)
    assert out.passed == all(combo)
    # List equality (NOT set): a sorted-`failing` implementation must fail.
    assert out.failing == [k for k, p in zip(kinds, combo) if not p]


def test_failing_preserves_caller_order_not_sorted(tmp_path: Path):
    # Pin the no-sort discipline directly: reversed kind order in the input
    # produces a reversed `failing` list.
    scorer = TrustScorer(event_log=_log(tmp_path))
    rev = [CVE_DELTA, LOCKFILE_POLICY, TESTS, INSTALL, BUILD]
    signals = [_sig(k, False) for k in rev]
    out = scorer.score(signals)
    assert out.failing == rev  # NOT sorted alphabetically


# --- AC-5: signals preserved verbatim --------------------------------------
def test_outcome_signals_preserved_verbatim(tmp_path: Path):
    scorer = TrustScorer(event_log=_log(tmp_path))
    signals = [_sig(BUILD, True), _sig(INSTALL, False)]
    out = scorer.score(signals)
    assert out.signals == signals
    assert [id(s) for s in out.signals] == [id(s) for s in signals]

# --- AC-6 / AC-17 / AC-18: confidence fold -----------------------------------
def test_confidence_degrades_when_adapter_degraded_matches_workflow(tmp_path: Path):
    log = _log(tmp_path)
    log.emit_internal(_ad())
    log.flush()
    scorer = TrustScorer(event_log=log)
    out = scorer.score([_sig(BUILD)])
    assert out.confidence == "degraded"


def test_confidence_high_when_adapter_degraded_is_other_workflow(tmp_path: Path):
    log = _log(tmp_path)
    log.emit_internal(_ad(wf=OTHER_WF))
    log.flush()
    scorer = TrustScorer(event_log=log)
    out = scorer.score([_sig(BUILD)])
    assert out.confidence == "high"


def test_confidence_high_when_internal_event_is_not_adapter_degraded(tmp_path: Path):
    # AC-17: same workflow_id, but the event is PluginResolved, not
    # AdapterDegraded. Confidence must NOT flip.
    log = _log(tmp_path)
    log.emit_internal(PluginResolved(
        event_id=EventId("01H...PR"), workflow_id=WF,
        timestamp=datetime.now(timezone.utc),
        plugin_id=PluginId("vulnerability-remediation--node--npm"),
        matched_scope="vulnerability-remediation/node/npm", specificity=3))
    log.flush()
    scorer = TrustScorer(event_log=log)
    out = scorer.score([_sig(BUILD)])
    assert out.confidence == "high"


# --- AC-16: stateless across calls -----------------------------------------
def test_score_is_stateless_across_calls(tmp_path: Path):
    log = _log(tmp_path)
    scorer = TrustScorer(event_log=log)
    out1 = scorer.score([_sig(BUILD)])
    assert out1.confidence == "high"
    log.emit_internal(_ad())  # emit BETWEEN the two score() calls
    log.flush()
    out2 = scorer.score([_sig(BUILD)])
    # A wrong impl that cached the degraded flag in __init__ returns "high"
    # both times. The correct impl re-folds the log on each call.
    assert out2.confidence == "degraded"

# --- AC-9: unregistered kind --------------------------------------------------
def test_unregistered_signal_kind_rejected(tmp_path: Path):
    scorer = TrustScorer(event_log=_log(tmp_path))
    bogus = SignalKind("not_registered_anywhere")
    with pytest.raises(UnregisteredSignalKind) as excinfo:
        scorer.score([TrustSignal(kind=bogus, passed=True, details={})])
    assert excinfo.value.kind == bogus


# --- AC-10: empty signals ---------------------------------------------------
def test_empty_signals_rejected(tmp_path: Path):
    scorer = TrustScorer(event_log=_log(tmp_path))
    with pytest.raises(EmptySignals):
        scorer.score([])


# --- AC-7: details rejects non-primitive values -----------------------------
@pytest.mark.parametrize("bad_value", [
    ["x"],                       # list
    ("a", "b"),                  # tuple
    None,                        # None
    b"bytes",                    # bytes
    datetime.now(timezone.utc),  # datetime
    {"nested": "object"},        # nested dict
    object(),                    # arbitrary object
])
def test_trust_signal_details_primitives_only(bad_value):
    with pytest.raises(ValidationError):
        TrustSignal(kind=BUILD, passed=True,
                    details={"k": bad_value})  # type: ignore[dict-item]


# --- AC-13: duplicate-name rejection carries both call sites ----------------
def test_register_signal_kind_rejects_duplicate_with_origin_payload():
    fresh = SignalKindRegistry.fresh()
    register_signal_kind("custom", registry=fresh)
    with pytest.raises(SignalKindAlreadyRegistered) as excinfo:
        register_signal_kind("custom", registry=fresh)
    err = excinfo.value
    assert err.name == SignalKind("custom")
    assert err.existing   # non-empty origin string for the first registration
    assert err.duplicate  # non-empty origin string for the second registration
    assert err.existing != err.duplicate or "test_trust_scorer" in err.existing
    # Message names both for human grep
    assert "custom" in str(err)


# --- AC-14: per-test isolation via fresh() ---------------------------------
def test_fresh_registry_is_empty():
    fresh = SignalKindRegistry.fresh()
    assert SignalKind("build") not in fresh
    assert SignalKind("anything") not in fresh

# --- AC-12: import-time registration is the registration mechanism ---------
def test_phase3_five_kinds_registered_at_import():
    # Direct module-attribute assertion: BUILD === the SignalKind value the
    # registry was asked to register. A wrong impl that exports the constants
    # but skips the register() side effect would fail `BUILD in registry`.
    for name, const in [
        ("build", BUILD), ("install", INSTALL), ("tests", TESTS),
        ("lockfile_policy", LOCKFILE_POLICY), ("cve_delta", CVE_DELTA),
    ]:
        assert const == SignalKind(name)
        assert const in signal_kind_registry


def test_signal_kinds_module_has_5_top_level_register_calls():
    # AST-walk: the registrations must be MODULE-LEVEL calls (not inside a
    # function that nothing invokes). Survives the mutation "moved into init()".
    src = Path(importlib.import_module("codegenie.transforms.signal_kinds").__file__).read_text()
    tree = ast.parse(src)
    top_calls = [
        n for n in tree.body
        if isinstance(n, ast.Assign)
        and isinstance(n.value, ast.Call)
        and getattr(n.value.func, "id", None) == "register_signal_kind"
    ]
    assert len(top_calls) == 5


def test_fresh_subprocess_import_populates_default_registry(tmp_path: Path):
    # Strongest assertion of import-time registration: a *new* Python process
    # importing codegenie.transforms must observe the 5 kinds. This proves no
    # other test's side effect is masking a missing module-level call.
    script = textwrap.dedent("""
        import codegenie.transforms  # triggers transforms/__init__.py import side effects
        from codegenie.transforms.signal_kinds import signal_kind_registry
        from codegenie.types.identifiers import SignalKind
        names = ["build", "install", "tests", "lockfile_policy", "cve_delta"]
        missing = [n for n in names if SignalKind(n) not in signal_kind_registry]
        assert not missing, f"missing: {missing}"
    """)
    result = subprocess.run([sys.executable, "-c", script], capture_output=True, text=True)
    assert result.returncode == 0, result.stderr

# --- AC-15: pure helpers (functional core / imperative shell) ---------------
def test_pure_helpers_have_no_io_dependencies():
    # AST-walk: _compute_strict_and and _has_adapter_degraded_for_workflow
    # MUST NOT reference EventLog, replay, open, Path, os, or any I/O symbol.
    src = Path(importlib.import_module("codegenie.transforms.trust_scorer").__file__).read_text()
    tree = ast.parse(src)
    # "EventLog" is included so the set matches the list AC-15 names.
    forbidden = {"EventLog", "replay", "open", "Path", "os"}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name in (
            "_compute_strict_and", "_has_adapter_degraded_for_workflow",
        ):
            names = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
            attrs = {n.attr for n in ast.walk(node) if isinstance(n, ast.Attribute)}
            assert not (forbidden & (names | attrs)), \
                f"{node.name} references forbidden I/O symbol: {forbidden & (names | attrs)}"


def test_compute_strict_and_is_pure():
    # Smoke test the pure helper independently of EventLog.
    from codegenie.transforms.trust_scorer import _compute_strict_and
    signals = [_sig(BUILD, True), _sig(INSTALL, False), _sig(TESTS, False)]
    passed, failing = _compute_strict_and(signals)
    assert passed is False
    assert failing == [INSTALL, TESTS]

# --- Property test: confidence fold is exactly "any AdapterDegraded matching workflow_id" ---
@given(
    events=st.lists(
        st.tuples(
            st.sampled_from(["adapter_degraded", "plugin_resolved"]),
            st.sampled_from([WF, OTHER_WF]),
        ),
        min_size=0, max_size=20,
    ),
)
def test_confidence_property_iff_matching_adapter_degraded(
    tmp_path_factory: pytest.TempPathFactory, events
):
    # Uses the session-scoped tmp_path_factory rather than tmp_path: Hypothesis's
    # function_scoped_fixture health check rejects function-scoped fixtures
    # under @given. Each example gets its own directory and its own log.
    log = _log(tmp_path_factory.mktemp("confidence_property"))
    for i, (etype, wf) in enumerate(events):
        if etype == "adapter_degraded":
            log.emit_internal(AdapterDegraded(
                event_id=EventId(f"01H..AD{i:02}"), workflow_id=wf,
                timestamp=datetime.now(timezone.utc),
                adapter="dep_graph", reason="x"))
        else:
            log.emit_internal(PluginResolved(
                event_id=EventId(f"01H..PR{i:02}"), workflow_id=wf,
                timestamp=datetime.now(timezone.utc),
                plugin_id=PluginId("p"), matched_scope="s", specificity=1))
    log.flush()
    scorer = TrustScorer(event_log=log)
    out = scorer.score([_sig(BUILD)])
    expected = "degraded" if any(
        etype == "adapter_degraded" and wf == WF for etype, wf in events
    ) else "high"
    assert out.confidence == expected

# --- AC-20: no module-level mutable state outside the singleton -------------
def test_no_module_level_mutable_caches():
    src = Path(importlib.import_module("codegenie.transforms.trust_scorer").__file__).read_text()
    tree = ast.parse(src)
    suspicious = []
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for tgt in node.targets:
                if isinstance(tgt, ast.Name) and tgt.id.startswith("_"):
                    # private module-level binds are a smell; allow only
                    # explicit Final constants, not mutable containers.
                    if isinstance(node.value, (ast.Dict, ast.Set, ast.List)):
                        suspicious.append(tgt.id)
    assert not suspicious, f"module-level mutable state in trust_scorer.py: {suspicious}"
Run; confirm ModuleNotFoundError. Commit the red marker.
Green — make it pass
Implement minimally:
- `SignalKindRegistry` is a `dict[SignalKind, str]` under the hood (the `str` is the `module.qualname` origin used for duplicate error messages). `register(name, *, origin)` returns the typed `SignalKind` for downstream use.
- `register_signal_kind(name, *, registry=None)` introspects its caller frame via `inspect.stack()` to derive `origin` so call-sites stay terse.
- `TrustScorer.score` is the imperative shell: empty-check, kind-validate, call `_compute_strict_and`, call `_has_adapter_degraded_for_workflow(self._event_log.replay(), self._event_log.workflow_id)`, construct `TrustOutcome`. The two pure helpers carry the business logic.
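
The registry bullets above can be sketched as follows. This is a minimal illustration, not the story's implementation: the `_DEFAULT_REGISTRY` name, the exact error message, and the choice to use only the caller's module name as `origin` (rather than a full `module.qualname`) are assumptions made for the example.

```python
from __future__ import annotations

import inspect
from typing import NewType

SignalKind = NewType("SignalKind", str)


class SignalKindAlreadyRegistered(Exception):
    """Raised on every duplicate registration; carries both origins."""

    def __init__(self, name: str, existing_origin: str, new_origin: str) -> None:
        super().__init__(
            f"signal kind {name!r} already registered by {existing_origin}; "
            f"duplicate from {new_origin}"
        )
        self.name = name
        self.existing_origin = existing_origin
        self.new_origin = new_origin


class SignalKindRegistry:
    def __init__(self) -> None:
        # kind -> origin of the code that registered it
        self._origins: dict[SignalKind, str] = {}

    @classmethod
    def fresh(cls) -> "SignalKindRegistry":
        """Return an empty registry, e.g. for tests that mutate registrations."""
        return cls()

    def register(self, name: str, *, origin: str) -> SignalKind:
        kind = SignalKind(name)
        if kind in self._origins:
            # Always raise on a duplicate; never treat it as idempotent.
            raise SignalKindAlreadyRegistered(name, self._origins[kind], origin)
        self._origins[kind] = origin
        return kind

    def __contains__(self, kind: object) -> bool:
        return kind in self._origins


# Hypothetical name for the process-wide singleton (an assumption).
_DEFAULT_REGISTRY = SignalKindRegistry()


def register_signal_kind(
    name: str, *, registry: SignalKindRegistry | None = None
) -> SignalKind:
    # Frame 0 is this function; frame 1 is whoever called it.
    caller = inspect.stack()[1]
    # Assumption: the caller's module name alone serves as the origin here.
    origin = caller.frame.f_globals.get("__name__", "<unknown>")
    target = registry if registry is not None else _DEFAULT_REGISTRY
    return target.register(name, origin=origin)


# Usage: a test-local registry keeps the singleton untouched.
fresh = SignalKindRegistry.fresh()
BUILD = register_signal_kind("build", registry=fresh)
```

Passing `registry=fresh` explicitly keeps the sketch consistent with the test discipline described in this story: the default registry is only touched by import-time registrations.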
Refactor — clean up
- Module docstrings cite ADR-0001 + ADR-0005 + Gap 5 + 05-ADR-0003.
- `score` reads `event_log.replay()` once per call (O(N), where N is the workflow's internal-stream event count). For Phase 3's ~50-events-per-workflow envelope this is sub-millisecond; Phase 5+ with more retries will need re-evaluation (left to that phase's perf bench, S9-03). Do not cache the degraded flag on the scorer: AC-16 forbids it (the cache would break Phase 5's reuse-the-scorer pattern; ADR-0007).
- Add a `__repr__` to `TrustOutcome` that does NOT include `signals` (the list can be long; the repr should be one line for log readability).
- Verify the AST fence (S1-05) catches any `dict[str, Any]` in `details` or anywhere in this module.
- Verify the AST purity test (AC-15): pure helpers must not name `EventLog`, `replay`, `open`, `Path`, `os`.
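
The AC-15 test itself is not shown in this section, so the following is only a sketch of how such a purity check could work: parse the module source, find the named helper, and report any forbidden identifier that appears as a bare name or an attribute. The function name `forbidden_names_in` and the two sample sources are hypothetical.

```python
from __future__ import annotations

import ast

# The forbidden identifiers listed for pure helpers in this story.
FORBIDDEN = frozenset({"EventLog", "replay", "open", "Path", "os"})


def forbidden_names_in(source: str, function_name: str) -> set[str]:
    """Return forbidden identifiers referenced inside one top-level function."""
    tree = ast.parse(source)
    hits: set[str] = set()
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            for child in ast.walk(node):
                # Bare references such as `open(...)` or `Path(...)`.
                if isinstance(child, ast.Name) and child.id in FORBIDDEN:
                    hits.add(child.id)
                # Attribute access such as `event_log.replay()`.
                elif isinstance(child, ast.Attribute) and child.attr in FORBIDDEN:
                    hits.add(child.attr)
    return hits


PURE = '''
def _compute_strict_and(signals):
    failing = [s.kind for s in signals if not s.passed]
    return (not failing, failing)
'''

IMPURE = '''
def _leaky(event_log):
    return list(event_log.replay())
'''

print(forbidden_names_in(PURE, "_compute_strict_and"))
print(forbidden_names_in(IMPURE, "_leaky"))
```

A name-based check like this is deliberately coarse: it would also flag a local variable that happens to be called `replay`, which is arguably the right trade-off for a structural fence.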
Files to touch
| Path | Why |
|---|---|
| `src/codegenie/transforms/trust_scorer.py` | New file: `TrustScorer`, `TrustSignal`, `TrustOutcome`, `UnregisteredSignalKind` |
| `src/codegenie/transforms/signal_kinds.py` | New file: `SignalKindRegistry`, `register_signal_kind`, Phase-3 registrations (`BUILD`, `INSTALL`, `TESTS`, `LOCKFILE_POLICY`, `CVE_DELTA`) |
| `src/codegenie/transforms/__init__.py` | Re-export `TrustScorer`, `TrustSignal`, `TrustOutcome` (per ADR-0001 §Consequences) |
| `tests/unit/transforms/test_trust_scorer.py` | New file: strict-AND 2^5, confidence propagation, cross-workflow filter, registry semantics |
Out of scope
- Phase 5's widening with `trace`, `policy` signal kinds: Phase 5 (05-ADR-0003) lands the new `register_signal_kind` calls in its own module; zero edits to this story's code.
- Phase 7's widening with `baseimage`, `shell_presence`: Phase 7's distroless plugin lands them; zero edits here.
- Per-signal confidence (where one signal is degraded but others are not): Phase 3's `confidence` is whole-workflow; Phase 5+ may amend.
- `TrustSignal.details` schema validation per-kind: `details` is a free-form primitive dict; per-kind schemas are a Phase 5 amendment if needed.
- Retry decision: `TrustScorer` returns `TrustOutcome`; the orchestrator (S6-04) is the consumer; Phase 3 alone does NOT retry. Phase 5's `GateRunner` is the retry envelope (ADR-0007).
- Reading the spanning stream: confidence is folded from the internal stream only (`AdapterDegraded` is an internal-stream variant per S6-01); the spanning stream is irrelevant here.
Notes for the implementer
- Constructor injection is not negotiable (Gap 5 fix). A reviewer might suggest a "convenience" signature: `TrustScorer.score(signals, *, event_log=None)` with a default ambient lookup. Reject it. The whole point of Gap 5 is that ambient state is unmockable, hides coupling, and breaks under concurrent workflows in the same process. The constructor argument is the contract. AC-2's forbidden-names list (`from_env`, `from_ambient`, `current`, `for_workflow`) is deliberately structural: anyone adding such a classmethod opens the door back to ambient state.
- 5th registry: the kernel-extract trigger is met, but the deferral still holds. `SignalKindRegistry` is the 5th decorator-/function-call registry in the codebase (`probes/registry.py`, `indices/registry.py`, `depgraph/registry.py`, `plugins/registry.py`, this one). The audit anchor at `src/codegenie/plugins/registry.py:18-49` makes the trigger explicit: "N=5 OR a registry needing only the common surface". N=5 fires. But the four prior registries have divergent dispatch shapes (`for_task` + LRU + `sorted_for_dispatch`; `dispatch_all`; `has_strategy` query; `resolve(scope)`-with-extends-walk), and `SignalKindRegistry`'s shape is the smallest (`register` + `__contains__` + `fresh()`, no dispatch). Extracting a shared `KernelRegistry[K, V]` base would couple this minimal registry to the four heavyweight ones. Defer, following the pattern set in `indices/registry.py:26-31` and `depgraph/registry.py:30-38`. Reword the docstring's audit-anchor paragraph to bump the count to 5 and document the new deferral; the 6th registry's author then has a clean grep trail.
- InMemorySink for tests, disk for production. S6-01's hardened `EventLog` accepts `sink=InMemorySink()` for tests; this story's TDD plan uses it to avoid `tmp_path` zstd round-trips. Production code constructs `EventLog(root=resolved_cache_dir, workflow_id=…)` and gets the default `ZstdAppendingFileSink`. Do NOT export `InMemorySink` outside the test suite; `__all__` in `events.py` controls the surface.
- `transforms/__init__.py` must import `signal_kinds`. The 5 module-level registrations only execute when `signal_kinds.py` is imported. If a consumer writes `from codegenie.transforms import TrustScorer` and never reaches into `signal_kinds`, the registry would be empty and `score()` would raise `UnregisteredSignalKind` on the first BUILD signal. The fix is a one-line `import codegenie.transforms.signal_kinds  # noqa: F401` in `transforms/__init__.py`; pin it with the subprocess-import test (AC-12).
- The "mildly cyclical" note in `../phase-arch-design.md §Component design C6` refers to the scorer reading the same workflow's event log it indirectly contributed to. This is fine: `AdapterDegraded` events are written before Stage 6 (during bundle build), so the read is from a closed prefix. A test that confirms this (`AdapterDegraded` written → `flush()` → `score()` reads it) is the replay-tested guarantee.
- The 5 module-level registrations in `signal_kinds.py` execute at import time. This is the same shape as `register_plugin` (Phase 3 ADR-0002 §Decision: function call, NOT a class decorator); the import is the registration. Test discipline: tests that need to mutate the registry use `SignalKindRegistry.fresh()` and pass it explicitly to `register_signal_kind(name, registry=fresh)`.
- `SignalKindAlreadyRegistered` and `UnregisteredSignalKind` are categorically different errors: the first is a configuration error at import time (two modules registered the same name); the second is a usage error at `score` time (a caller passed a kind no module registered). Don't conflate them under a single `SignalKindError`.
- The `details: dict[str, str | int | bool | float]` constraint is primitives only (no `list[str]`, even though §C9 allows it on events). Rationale: `details` is consumed by humans reading the `remediation-report.yaml`; nested structures hurt scan-ability. If a signal genuinely needs a list, that's a sign the signal should be split into N signals.
- The `confidence: Literal["high", "degraded"]` field is not a sum type; it's a closed Literal. Two values, no payload. ADR-0010 §Decision (3)'s tagged-union pattern applies where the variants carry payload; `confidence` carries none. Don't over-engineer.
- The replay loop in `score()` walks `event_log.replay()` once per call. If a perf bench (S9-03) shows this is a bottleneck for long-running workflows, the right answer is to cache the `AdapterDegraded` count on the `EventLog` itself (an `emit_internal` side-effect); do NOT cache on the `TrustScorer`. AC-16 pins this structurally: the scorer must remain stateless across `score` calls so the orchestrator can call it multiple times safely in Phase 5.
- The `failing` list is ordered as input, not sorted (AC-4). Test discipline: prefer `out.failing == [...]` (list equality); use `set(out.failing) == {...}` only when documenting a deliberate order-agnostic case. The validator-strengthened parametrize test (AC-19) uses list equality so a sorted-`failing` implementation FAILS.
- The `score` method's signature matches Phase 5's `StrictAndGate` adapter call site exactly (`../../05-sandbox-trust-gates/final-design.md §6`). Renaming the parameter from `signals` to `inputs` or adding a kwarg breaks Phase 5; the contract snapshot test in S6-06 will catch it, but reviewers should catch it first.
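
Several of the notes above (constructor injection, the stateless replay fold, and input-ordered `failing`) can be seen together in a small sketch. Everything here is a stand-in: `Signal`, `Event`, `FakeEventLog`, and `Scorer` are hypothetical simplifications of `TrustSignal`, the internal-stream events, `EventLog`, and `TrustScorer`; the empty-check and kind validation are omitted; and a plain tuple replaces `TrustOutcome`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Literal


@dataclass(frozen=True)
class Signal:  # hypothetical stand-in for TrustSignal
    kind: str
    passed: bool


@dataclass(frozen=True)
class Event:  # hypothetical stand-in for an internal-stream event
    type: str
    workflow_id: str


class FakeEventLog:  # hypothetical stand-in exposing replay() and workflow_id
    def __init__(self, workflow_id: str) -> None:
        self.workflow_id = workflow_id
        self._events: list[Event] = []

    def emit_internal(self, event: Event) -> None:
        self._events.append(event)

    def replay(self) -> Iterable[Event]:
        # Snapshot so callers iterate a closed prefix of the stream.
        return iter(tuple(self._events))


def _compute_strict_and(signals: list[Signal]) -> tuple[bool, list[str]]:
    # Pure: failing kinds keep input order; no sorting, no de-duplication.
    failing = [s.kind for s in signals if not s.passed]
    return (not failing, failing)


def _has_adapter_degraded_for_workflow(events: Iterable[Event], workflow_id: str) -> bool:
    # Pure: degraded iff any matching event belongs to this workflow.
    return any(
        e.type == "adapter_degraded" and e.workflow_id == workflow_id
        for e in events
    )


class Scorer:
    def __init__(self, event_log: FakeEventLog) -> None:
        # The injected log is the only state; nothing is cached between calls.
        self._event_log = event_log

    def score(
        self, signals: list[Signal]
    ) -> tuple[bool, list[str], Literal["high", "degraded"]]:
        passed, failing = _compute_strict_and(signals)
        degraded = _has_adapter_degraded_for_workflow(
            self._event_log.replay(), self._event_log.workflow_id
        )
        return passed, failing, "degraded" if degraded else "high"


log = FakeEventLog("wf-1")
scorer = Scorer(log)
first = scorer.score([Signal("build", True), Signal("install", False), Signal("tests", False)])
log.emit_internal(Event("adapter_degraded", "wf-other"))  # other workflow: ignored
log.emit_internal(Event("adapter_degraded", "wf-1"))      # this workflow: degrades
second = scorer.score([Signal("tests", False), Signal("install", False)])
```

Because the scorer re-reads the log on every call, `second` reflects the event emitted after `first` was computed; a degraded flag cached on the scorer would have returned a stale "high".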