both)¶

Status: Deferred Date: 2026-05-11 Tags: llm · vendor Related: ADR-0002

Context¶

Leaf LLM nodes inside SHERPA-disciplined subgraphs call an Agents SDK to do the actual reasoning. The choice is which vendor SDK to use. The SHERPA discipline (ADR-0002) isolates this choice to leaf nodes only — the orchestration logic, state model, gates, and tool use are vendor-agnostic. The vendor concern surfaces only at the LLM API boundary.

Three positions:

Anthropic only. Claude Sonnet / Opus / Haiku. Prompt caching, citations, extended thinking, computer use, code execution all production-grade in 2026.
OpenAI only. GPT-4.x / Codex / O-series. Tool use, structured output, Assistants API.
Both, behind a thin shim. Implement a vendor-agnostic interface; route per task or per env config.

Options considered¶

Anthropic only. Reduces vendor surface area; Claude's prompt caching is well-suited to the few-shot RAG pattern in ADR-0011.
OpenAI only. OpenAI Codex's Temporal precedent (ADR-0003) suggests OpenAI tooling is production-ready for agentic workloads. Larger ecosystem of tutorials and integrations.
Vendor-agnostic shim. Lets us A/B test per task class without code changes; protects against vendor outages. Adds an internal interface to maintain.

Default until decided¶

Anthropic SDK for the initial leaf implementations. Reasoning:

Prompt caching directly benefits the RAG few-shot pattern (cache the matched solved example across retries).
Claude's extended thinking is well-suited to Stage 3 Planning (deliberate, structured).
The Anthropic SDK's tool-use shape composes cleanly with the SHERPA "tools at leaves" discipline.

The vendor-agnostic shim is the upgrade path — implement it behind the leaf interface so swapping or A/B testing later is a config change, not a refactor.

Evidence needed to resolve¶

Per-task evaluation. For each leaf node type (assessment, planning, autonomous execution, error triage, LLM judge), which vendor produces better outputs? Measured against post-merge incident rates and human-reviewer overrides.
Cost per evaluation. Token cost at the chosen model tier per leaf invocation.
Latency. P50 and P95 latency for the planned prompt+output sizes.
Reliability. Outage frequency, retry-after rate during peak hours.
Capability gaps. Does one vendor offer a feature (e.g., file-attachment, computer-use, citations) that materially improves a specific leaf?

Reversibility (of the eventual choice)¶

Low cost if the vendor-agnostic shim is in place from day one. Medium cost if leaves are coded directly against Anthropic's SDK and need migration. Recommendation: build the shim from day one, even if it's a thin layer.

Evidence / sources¶

../design.md §4.1 (leaf LLM calls inside SHERPA subgraphs)
../design.md §4.2 (SDK choice isolated to leaf nodes — composition lets us defer this)
../design.md §7 (Open questions — Agents SDK at the leaves)
Anthropic SDK 2026 documentation — prompt caching, extended thinking, tool use
OpenAI Codex precedent on Temporal (../../auto-agent-design.md §2.3)