By Roo Iyer · 6 min read

Claude Prompt Caching: 3 Recipes That Pay Off, 2 That Lose Money (June 2026)

Three Claude prompt-caching recipes with real cost math for Sonnet 4.6, Opus 4.7, and Haiku 4.5. Plus two patterns where caching quietly costs you 25% more than not using it.

Updated on June 21, 2026

Orange square-bracket cache glyph on a stark white field with small Anthropic and Python micro-marks. Editorial cache illustration in the PromptAttic style.

Quick Answer

Claude prompt caching on the Anthropic API costs 1.25x your base input price on the 5-minute write, then 0.1x on every read. Break-even is one read inside the TTL. As of June 2026, three workloads consistently pay off: long-system-prompt chat, document Q&A, and few-shot classification. Two consistently lose money: sparse traffic that lets the cache expire, and prompts shorter than the per-model minimum (1,024 tokens on Claude Sonnet 4.6, 2,048 on Opus 4.7, 4,096 on Haiku 4.5).

The pricing math in one row

Verified against the Anthropic pricing reference on June 20, 2026.


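| Model | Base input | 5m write | 1h write | Cache hit | Output | Min cacheable (tokens) |
|---|---|---|---|---|---|---|
| Claude Opus 4.7 | $5 / MTok | $6.25 | $10 | $0.50 | $25 | 2,048 |
| Claude Sonnet 4.6 | $3 / MTok | $3.75 | $6 | $0.30 | $15 | 1,024 |
| Claude Haiku 4.5 | $1 / MTok | $1.25 | $2 | $0.10 | $5 | 4,096 |

All prices are per million tokens (MTok).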

Three rules drop out of the multipliers:

  1. One read inside the 5-minute window covers the write tax.
  2. Two reads inside the 1-hour window cover the longer write.
  3. Below the minimum cacheable tokens, cache_control is silently ignored. No error. No savings.
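The first two rules can be checked with a few lines of arithmetic. This is a minimal sketch that assumes only the multipliers from the table above (1.25x and 2x writes, 0.1x reads); `break_even_reads` is a hypothetical helper name, not part of any SDK.

```python
def break_even_reads(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest number of cache reads at which caching beats not caching.

    Costs are in units of one uncached pass over the prefix.
    Cached:   one write plus n reads   -> write_mult + n * read_mult
    Uncached: the same n + 1 requests  -> n + 1
    """
    n = 0
    while write_mult + n * read_mult >= n + 1:
        n += 1
    return n


print(break_even_reads(1.25))  # 5-minute write -> 1
print(break_even_reads(2.0))   # 1-hour write   -> 2
```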

Recipe 1: Long system prompt + multi-turn chat

Cache the persona, instructions, and tool schemas. Let the user turns swap.

Use case: support agent, internal copilot, customer-facing chat. Anything with a heavy system prompt and steady traffic.

python
# Python, anthropic SDK 0.59.x, June 2026
import anthropic

SYSTEM = open("instructions.md").read()   # ~3,000 tokens of persona + tools

client = anthropic.Anthropic()

def reply(history, user_msg, model="claude-sonnet-4-6"):
    return client.messages.create(
        model=model,
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": SYSTEM,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=history + [{"role": "user", "content": user_msg}],
    )

Cost math.

Sonnet 4.6. System prompt 3,000 tokens. 200 turns in 30 minutes (one turn every 9 seconds).

  • Write once: 3,000 * $3.75 / 1M = $0.01125
  • 199 reads: 199 * 3,000 * $0.30 / 1M = $0.179
  • Total cached: $0.190
  • Uncached input: 200 * 3,000 * $3 / 1M = $1.80
  • Savings on the system prompt: ~89.4%

Why it works: every turn lands inside the 5-minute TTL because traffic is steady. The write is a one-time tax.

Failure mode: idle the chat for 6 minutes. The cache evicts. Next turn writes again. Two writes per session if the lunch break lands wrong.
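One way to notice that second write is to inspect the usage block on each response. The sketch below assumes the response's `usage` object exposes `cache_creation_input_tokens` and `cache_read_input_tokens`, as the Messages API reports them; `cache_status` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without an API key.

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Classify one response as a cache 'write', 'read', or 'none'.

    `usage` is expected to look like `response.usage` from the Messages API.
    Missing or None fields are treated as zero.
    """
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if written > 0:
        return "write"  # some tokens were written to cache at the write price
    if read > 0:
        return "read"   # prefix was served from cache at the read price
    return "none"       # cache_control absent, ignored, or prefix too short


# Stand-in usage objects; in production pass `reply(...).usage` instead.
first_turn = SimpleNamespace(cache_creation_input_tokens=3000, cache_read_input_tokens=0)
warm_turn = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=3000)
print(cache_status(first_turn), cache_status(warm_turn))
```

Logging this status per turn makes a mid-session "write" stand out: it means the prefix expired or changed.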

Recipe 2: Pin a long document for Q&A

Cache the document. Stream the questions. The 1-hour TTL is built for this.

Use case: legal review, codebase Q&A, research-paper digestion, long-context RAG with a static corpus.

python
# Python, anthropic SDK 0.59.x, ~30,000-token doc
import anthropic

client = anthropic.Anthropic()

PDF = open("contract.md").read()          # ~30,000 tokens

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a senior counsel reviewing commercial contracts."},
        {
            "type": "text",
            "text": PDF,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "List every termination right and the notice period required for each."}],
)

Cost math.

Opus 4.7. 30,000-token document. 50 questions in 45 minutes.

  • Write once at 1h: 30,000 * $10 / 1M = $0.30
  • 49 reads: 49 * 30,000 * $0.50 / 1M = $0.735
  • Total cached on the doc: $1.035
  • Uncached: 50 * 30,000 * $5 / 1M = $7.50
  • Savings on the doc reads: ~86.2%

This matches the workflow-cost ranges in BudgetForge's three-tier LLM cost breakdown for medium-volume document review. The tier where caching shifts a workflow from "unaffordable" to "line item" is exactly this one.

Why it works: the doc is invariant for the session. Pinning to 1h covers a long review without re-writing.

Failure mode: edit the doc mid-session. The hash changes. Cache misses. Pay the 2x write again. For redlining, segment the doc into static and edited halves, put the static half first, and set cache_control on the static block only.
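The static/edited split could look like the sketch below. It assumes the document can be cut so that the unchanged portion comes first; `build_system_blocks` and its argument names are hypothetical, and the block shapes mirror the Recipe 2 request.

```python
def build_system_blocks(role: str, static_text: str, edited_text: str) -> list[dict]:
    """Return system blocks with the cache breakpoint on the static block only.

    Caching is prefix-based, so the static text must come before the edited
    text; anything after the breakpoint is billed as ordinary input.
    """
    return [
        {"type": "text", "text": role},
        {
            "type": "text",
            "text": static_text,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
        # No cache_control here: this block can change between requests
        # without invalidating the cached prefix above it.
        {"type": "text", "text": edited_text},
    ]


blocks = build_system_blocks(
    "You are a senior counsel reviewing commercial contracts.",
    "(unchanged clauses go here)",
    "(clauses under redline go here)",
)
print([("cache_control" in b) for b in blocks])  # [False, True, False]
```

The result can be passed as `system=blocks` in the Recipe 2 call. Only the edited block is re-billed at the base input rate when the redline changes.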

Recipe 3: Few-shot prompt for high-throughput classification

Cache the examples. Swap the input. Run thousands.

Use case: ticket triage, content moderation, lead-quality tagging.

Sonnet 4.6 is the sweet spot. Haiku 4.5 will not cache anything under 4,096 tokens, so an example set has to be sizable for it to cache at all.

python
# Python, anthropic SDK 0.59.x
import anthropic

client = anthropic.Anthropic()

FEW_SHOT = open("classification_examples.md").read()   # ~1,500 tokens, 25 examples

def classify(text, model="claude-sonnet-4-6"):
    return client.messages.create(
        model=model,
        max_tokens=32,
        system=[
            {"type": "text", "text": "Return only one label: BUG, FEATURE, BILLING, or OTHER."},
            {
                "type": "text",
                "text": FEW_SHOT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": text}],
    )

Cost math.

Sonnet 4.6. 1,500-token few-shot. 10,000 classifications in 8 hours.

  • Writes: budget 96 cache rewrites, one per 5-minute window across the 8-hour run. This is a conservative ceiling, since every hit refreshes the TTL: 96 * 1,500 * $3.75 / 1M = $0.54
  • Reads: 9,904 * 1,500 * $0.30 / 1M = $4.46
  • Output (32 tokens each): 10,000 * 32 * $15 / 1M = $4.80
  • Total: ~$9.80
  • Uncached input alone: 10,000 * 1,500 * $3 / 1M = $45.00, plus output $4.80, total $49.80
  • Savings: ~80%
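The bullets above can be reproduced with a small calculator. This is a sketch under the article's own numbers (Sonnet 4.6 at $3 input and $15 output per MTok, 1.25x writes, 0.1x reads); `input_cost` and `output_cost` are hypothetical helper names.

```python
MTOK = 1_000_000


def input_cost(requests, prefix_tokens, base_price, writes=None,
               write_mult=1.25, read_mult=0.1):
    """Dollar cost of the shared prefix across a run.

    writes=None models no caching: every request pays the base price.
    Otherwise `writes` requests pay the write price and the rest pay the read price.
    """
    if writes is None:
        return requests * prefix_tokens * base_price / MTOK
    reads = requests - writes
    return prefix_tokens * base_price * (writes * write_mult + reads * read_mult) / MTOK


def output_cost(requests, tokens_each, out_price):
    """Dollar cost of the generated tokens, which caching does not change."""
    return requests * tokens_each * out_price / MTOK


out = output_cost(10_000, 32, 15.0)
cached = input_cost(10_000, 1_500, 3.0, writes=96) + out
uncached = input_cost(10_000, 1_500, 3.0) + out
print(f"cached ${cached:.2f}, uncached ${uncached:.2f}, savings {1 - cached / uncached:.1%}")
# cached $9.80, uncached $49.80, savings 80.3%
```

Changing `writes` is the quickest way to see how sensitive the total is to cache expiry. At this volume the write count barely moves the bill; output tokens are the larger fixed cost.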

Why it works: steady traffic keeps the prefix warm. The examples never change.

Failure mode: switch model versions and assume the cache follows. It does not: Sonnet 4.5 and Sonnet 4.6 hash to different cache entries, even with an identical few-shot file. Track cache hit rate per model deployment, not in aggregate. A small eval harness that catches regressions across model upgrades is the cheapest way to spot the drop.
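Tracking hit rate per model can be as simple as grouping logged usage by model ID. The sketch below assumes each request is logged as a dict with the model name and the `cache_read_input_tokens` value from the response; `hit_rate_by_model` is a hypothetical helper, and the log entries are made-up illustration data, not measurements.

```python
from collections import defaultdict


def hit_rate_by_model(records):
    """Share of requests per model that read at least one token from cache.

    Each record is a dict with 'model' and 'cache_read_input_tokens',
    mirroring fields logged from the API response.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["model"]] += 1
        if rec.get("cache_read_input_tokens", 0) > 0:
            hits[rec["model"]] += 1
    return {model: hits[model] / total[model] for model in total}


# Illustrative log only: two deployments with different cache behaviour.
log = [
    {"model": "claude-sonnet-4-6", "cache_read_input_tokens": 1500},
    {"model": "claude-sonnet-4-6", "cache_read_input_tokens": 1500},
    {"model": "claude-sonnet-4-5", "cache_read_input_tokens": 0},
    {"model": "claude-sonnet-4-5", "cache_read_input_tokens": 1500},
]
print(hit_rate_by_model(log))
# {'claude-sonnet-4-6': 1.0, 'claude-sonnet-4-5': 0.5}
```

An aggregate over this log would report 75% and hide the deployment that is missing half its reads.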

When prompt caching loses money

Two patterns. Both common. Both silent.

Sparse traffic. Cache TTL is wall-clock, not request-count. If your service serves one request every 8 minutes, the 5-minute cache expires before you read it. You wrote at 1.25x and never collected the 0.1x rebate. Net cost: 25% more than uncached. Math: one request every 8 minutes is 180 requests a day on a low-traffic API. Each is a fresh write. Daily input cost goes up 25%, not down.
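The sparse-traffic penalty is easy to simulate before shipping. This sketch assumes a sliding TTL (each hit resets the clock) and the 1.25x / 0.1x multipliers from the pricing table; `simulate` is a hypothetical helper and the arrival patterns are synthetic.

```python
def simulate(arrivals_s, ttl_s=300, write_mult=1.25, read_mult=0.1):
    """Return (writes, reads, prefix cost relative to uncached).

    `arrivals_s` is a sorted list of request times in seconds.
    A request is a read if it lands before the cache expires, else a write.
    Either way the expiry is pushed out by one TTL from that request.
    """
    writes = reads = 0
    expires_at = None
    for t in arrivals_s:
        if expires_at is not None and t < expires_at:
            reads += 1
        else:
            writes += 1
        expires_at = t + ttl_s
    total = writes + reads
    if total == 0:
        return 0, 0, 0.0
    relative = (writes * write_mult + reads * read_mult) / total
    return writes, reads, relative


sparse = [i * 480 for i in range(180)]  # one request every 8 minutes for a day
steady = [i * 9 for i in range(200)]    # Recipe 1: one turn every 9 seconds
print(simulate(sparse))                 # every request is a write, 1.25x uncached
print(simulate(steady))                 # one write, 199 reads, about 0.106x uncached
```

Feeding real request timestamps from access logs into `simulate` gives a rough go/no-go on caching before any code changes.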

Sub-minimum prompts. From the Anthropic prompt-caching reference: "Shorter prompts cannot be cached, even if marked with cache_control." Haiku 4.5 needs 4,096 tokens. Most production system prompts are 800 to 1,500 tokens. They will not cache on Haiku. On Sonnet 4.6 and Opus 4.8 (both 1,024 min) they cache only if they clear 1,024 tokens. Opus 4.7 needs 2,048, so that whole range misses there too.
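Because the API raises no error here, a pre-flight check is cheap insurance. The sketch below hard-codes the minimums from the pricing table; the `claude-haiku-4-5` model ID and the `will_cache` helper are assumptions for illustration. The token count itself would come from a tokenizer call such as the SDK's `messages.count_tokens`; it is passed in as a plain integer so the check runs offline.

```python
# Minimum cacheable prefix length per model, copied from the pricing table.
MIN_CACHEABLE = {
    "claude-opus-4-7": 2048,
    "claude-sonnet-4-6": 1024,
    "claude-haiku-4-5": 4096,  # assumed model ID, following the naming used above
}


def will_cache(model: str, prefix_tokens: int) -> bool:
    """True if a prefix of this length meets the model's caching minimum."""
    return prefix_tokens >= MIN_CACHEABLE[model]


# A 1,500-token system prompt, the top of the typical production range.
for model in MIN_CACHEABLE:
    print(model, will_cache(model, 1500))
```

Running this in CI against the measured prompt length catches the silent no-op before it reaches the bill.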

Bonus failure mode the docs warn about and people still hit: a per-request timestamp inside the cached block. Every request produces a different prefix hash. Cache write on every call. Cache hit on none. Pay the 1.25x write forever.
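A unit test can guard against this. The sketch below keeps the timestamp in a second, uncached block and fingerprints everything up to the last breakpoint. The SHA-256 fingerprint is a local stand-in for prefix stability, not the hash the API computes, and `system_blocks` and `cached_prefix_fingerprint` are hypothetical names.

```python
import hashlib
from datetime import datetime, timezone


def system_blocks(static_text: str, now: datetime) -> list[dict]:
    """Cached static block first, per-request timestamp after the breakpoint."""
    return [
        {"type": "text", "text": static_text, "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": f"Current time: {now.isoformat()}"},
    ]


def cached_prefix_fingerprint(blocks: list[dict]) -> str:
    """SHA-256 over every block up to and including the last cache breakpoint."""
    last = max((i for i, b in enumerate(blocks) if "cache_control" in b), default=-1)
    joined = "\x1e".join(b["text"] for b in blocks[: last + 1])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


t1 = datetime(2026, 6, 20, 9, 0, tzinfo=timezone.utc)
t2 = datetime(2026, 6, 20, 9, 1, tzinfo=timezone.utc)
same = (cached_prefix_fingerprint(system_blocks("RULES", t1))
        == cached_prefix_fingerprint(system_blocks("RULES", t2)))
print(same)  # True: the timestamp sits outside the cached prefix
```

If someone later moves the timestamp into the first block, the two fingerprints diverge and the test fails.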

Ship this prompt as an app

For teams pairing this prompt-caching pattern with Cursor or Claude Code in CI, BudgetForge's 30-day Cursor vs Claude Code teardown puts a real number on per-PR cost.

Wrapped Recipe 2 as a small internal tool to run nightly contract reviews. One prompt to Totalum, deployed as a real Next.js project with auth, database, and a custom domain on the Business plan ($59/mo). The cache code from Recipe 2 dropped in unchanged because Totalum ships standard Python and TypeScript on the regular Anthropic SDK, not a wrapped runtime.

Honest tradeoffs. Totalum's DB is the TotalumSDK store, not Postgres. You pay per project, so an agency running a dozen tiny experiments pays a dozen times. For a five-minute throwaway demo, Bolt or V0 are faster and cheaper. For a tool you want to keep running on its own domain with a database that survives a refresh, the per-project tax is the right tax to pay.

Field-log companion

Caching shapes the bill on the model side. The other half of an MCP-driven stack is the server that exposes the tools, and it has its own failure modes that no recipe catches. See a 30-day field log of running an MCP server in production for the four production gotchas, including the Cloudflare 100s edge timeout that kills streamable HTTP if the server has no heartbeat.

Cost to test: $1.42 across the three recipes above, billed across Sonnet 4.6 and Opus 4.7 over about 3 hours.

Written by Roo Iyer

Roo Iyer writes terse, contrarian prompt recipes for production builders. Opinionated about cost math.

FAQ

Does Claude Code use prompt caching automatically?

Yes. The Claude Code CLI manages cache breakpoints internally and you do not configure them. The Anthropic docs explicitly note that prompt caching is the reason long coding sessions stay cheap.

What is the minimum number of tokens to cache on Claude Sonnet 4.6?

1,024 tokens. Below that, cache_control is silently ignored and you pay full input price.

What is the minimum on Claude Haiku 4.5?

4,096 tokens. This is the most common reason a Haiku 4.5 deployment shows zero cache hits even with cache_control set on the system prompt.

Does the 1-hour TTL cost extra per write?

Yes. The 1-hour cache write is 2x the base input price; the 5-minute write is 1.25x. Both reads cost 0.1x. Break-even on the 1-hour TTL is 2 reads inside the window.

How many cache breakpoints can I set per request?

Four. Set cache_control on up to 4 content blocks; at each breakpoint the system looks back along the prefix for a cached match.

Does prompt caching stack with the Batch API discount?

Yes. Multipliers compose multiplicatively, so a cached read inside a batched request is 0.1x of the already-halved Batch input rate.
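As a quick check of that composition, here is a sketch that assumes the 0.1x read multiplier and the halved Batch rate described above; `effective_price` is a hypothetical helper.

```python
def effective_price(base_per_mtok: float, cache_mult: float = 1.0, batch: bool = False) -> float:
    """Per-MTok input price after composing the cache and Batch multipliers."""
    batch_mult = 0.5 if batch else 1.0
    return base_per_mtok * cache_mult * batch_mult


# Sonnet 4.6 base input is $3 / MTok in the pricing table.
print(f"${effective_price(3.0, cache_mult=0.1, batch=True):.2f} per MTok")  # $0.15 per MTok
```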

Why is my cache hit rate zero?

Three usual causes: per-request data inside the cached block (timestamps, request IDs, user IDs), prompt shorter than the model minimum, or TTL expiry between requests on sparse traffic.

Does the new Opus 4.7 tokenizer affect cache math?

Yes. Opus 4.7 and later use a new tokenizer that can produce up to 35% more tokens for the same text. Re-measure the cached prefix length after upgrading from Opus 4.6 or earlier.