Skip to content

ADR-0020: Leaf Agents SDK choice (Anthropic / OpenAI / both)

Status: Deferred Date: 2026-05-11 Tags: llm · vendor Related: ADR-0002

Context

Leaf LLM nodes inside SHERPA-disciplined subgraphs call an Agents SDK to do the actual reasoning. The choice is which vendor SDK to use. The SHERPA discipline (ADR-0002) isolates this choice to leaf nodes only — the orchestration logic, state model, gates, and tool use are vendor-agnostic. The vendor concern surfaces only at the LLM API boundary.

Three positions:

  • Anthropic only. Claude Sonnet / Opus / Haiku. Prompt caching, citations, extended thinking, computer use, code execution all production-grade in 2026.
  • OpenAI only. GPT-4.x / Codex / O-series. Tool use, structured output, Assistants API.
  • Both, behind a thin shim. Implement a vendor-agnostic interface; route per task or per env config.

Options considered

  • Anthropic only. Reduces vendor surface area; Claude's prompt caching is well-suited to the few-shot RAG pattern in ADR-0011.
  • OpenAI only. OpenAI Codex's Temporal precedent (ADR-0003) suggests OpenAI tooling is production-ready for agentic workloads. Larger ecosystem of tutorials and integrations.
  • Vendor-agnostic shim. Lets us A/B test per task class without code changes; protects against vendor outages. Adds an internal interface to maintain.

Default until decided

Anthropic SDK for the initial leaf implementations. Reasoning:

  • Prompt caching directly benefits the RAG few-shot pattern (cache the matched solved example across retries).
  • Claude's extended thinking is well-suited to Stage 3 Planning (deliberate, structured).
  • The Anthropic SDK's tool-use shape composes cleanly with the SHERPA "tools at leaves" discipline.

The vendor-agnostic shim is the upgrade path — implement it behind the leaf interface so swapping or A/B testing later is a config change, not a refactor.

Evidence needed to resolve

  • Per-task evaluation. For each leaf node type (assessment, planning, autonomous execution, error triage, LLM judge), which vendor produces better outputs? Measured against post-merge incident rates and human-reviewer overrides.
  • Cost per evaluation. Token cost at the chosen model tier per leaf invocation.
  • Latency. P50 and P95 latency for the planned prompt+output sizes.
  • Reliability. Outage frequency, retry-after rate during peak hours.
  • Capability gaps. Does one vendor offer a feature (e.g., file-attachment, computer-use, citations) that materially improves a specific leaf?

Reversibility (of the eventual choice)

Low cost if the vendor-agnostic shim is in place from day one. Medium cost if leaves are coded directly against Anthropic's SDK and need migration. Recommendation: build the shim from day one, even if it's a thin layer.

Evidence / sources

  • ../design.md §4.1 (leaf LLM calls inside SHERPA subgraphs)
  • ../design.md §4.2 (SDK choice isolated to leaf nodes — composition lets us defer this)
  • ../design.md §7 (Open questions — Agents SDK at the leaves)
  • Anthropic SDK 2026 documentation — prompt caching, extended thinking, tool use
  • OpenAI Codex precedent on Temporal (../../auto-agent-design.md §2.3)