<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[George Liu]]></title><description><![CDATA[Using AI to build useful web apps, dev workflows, and infrastructure tools. Developer of Centmin Mod LEMP stack.]]></description><link>https://ai.georgeliu.com</link><image><url>https://substackcdn.com/image/fetch/$s_!bxg3!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661f3c3c-59f5-4ef7-949e-491b1248413f_608x608.png</url><title>George Liu</title><link>https://ai.georgeliu.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 05 Apr 2026 20:21:22 GMT</lastBuildDate><atom:link href="https://ai.georgeliu.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[George Liu]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[georgeliuoz@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[georgeliuoz@substack.com]]></itunes:email><itunes:name><![CDATA[George Liu]]></itunes:name></itunes:owner><itunes:author><![CDATA[George Liu]]></itunes:author><googleplay:owner><![CDATA[georgeliuoz@substack.com]]></googleplay:owner><googleplay:email><![CDATA[georgeliuoz@substack.com]]></googleplay:email><googleplay:author><![CDATA[George Liu]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Running Google Gemma 4 Locally With LM Studio’s New Headless CLI & Claude Code]]></title><description><![CDATA[LM Studio 0.4.0 introduced llmster and the lms CLI. Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.]]></description><link>https://ai.georgeliu.com/p/running-google-gemma-4-locally-with</link><guid isPermaLink="false">https://ai.georgeliu.com/p/running-google-gemma-4-locally-with</guid><dc:creator><![CDATA[George Liu]]></dc:creator><pubDate>Sat, 04 Apr 2026 18:42:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9o_j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0321dea-0461-40ae-82da-babbd6fa108e_985x402.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Why run models locally?</h2><p>Cloud AI APIs are great until they are not. Rate limits, usage costs, privacy concerns, and network latency all add up. For quick tasks like code review, drafting, or testing prompts, a local model that runs entirely on your hardware has real advantages: zero API costs, no data leaving your machine, and consistent availability.</p><p>Google&#8217;s Gemma 4 is interesting for local use because of its mixture-of-experts architecture. The 26B parameter model only activates 4B parameters per forward pass, which means it runs well on hardware that could never handle a dense 26B model. On my 14&#8221; MacBook Pro M4 Pro with 48 GB of unified memory, it fits comfortably and generates at 51 tokens per second. 
In my experience, though, there are significant slowdowns when it is used within Claude Code.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!9o_j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0321dea-0461-40ae-82da-babbd6fa108e_985x402.png" alt="Google Gemma 4 26B-a4b LLM model served via LM Studio API with Claude Code alias command claude-lm"><figcaption>Google Gemma 4 26B-a4b LLM model served via LM Studio API with Claude Code alias command claude-lm</figcaption></figure><h2>The Gemma 4 model family</h2><p>Google released <a href="https://ai.google.dev/gemma/docs/core/model_card_4">Gemma 4</a> as a family of four models, not just one.
The lineup spans a wide range of hardware targets:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!-GEC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ade04ce-48fd-4ae5-aa80-a463752d7b7c_1932x638.png" alt="Google Gemma 4 LLM models"><figcaption>Google Gemma 4 LLM models</figcaption></figure><p>The &#8220;E&#8221; models (E2B, E4B) use Per-Layer Embeddings to optimize for on-device deployment and are the only variants that support audio input (speech recognition and translation). The 31B dense model is the most capable, scoring 85.2% on MMLU Pro and 89.2% on AIME 2026.</p><p><strong>Why I picked the 26B-A4B.</strong> The mixture-of-experts architecture is the key. It has 128 experts plus 1 shared expert, but only activates 8 experts (3.8B parameters) per token. A common rule of thumb estimates MoE dense-equivalent quality as roughly sqrt(total x active) parameters, which puts this model around 10B effective. In practice, it delivers inference cost comparable to a 4B dense model with quality that punches well above that weight class. On benchmarks, it scores 82.6% on MMLU Pro and 88.3% on AIME 2026, close to the dense 31B (85.2% and 89.2%) while running dramatically faster.</p><p>The chart below tells the story. It plots Elo score against total model size on a log scale for recent open-weight models with thinking enabled. The blue-highlighted region in the upper left is where you want to be: high performance, small footprint.</p><p>Gemma 4 26B-A4B (Elo ~1441) sits firmly in that zone despite its modest 25.2B total parameters. The 31B dense variant scores slightly higher (~1451) but is still remarkably compact. For context, models like Qwen 3.5 397B-A17B (~1450 Elo) and GLM-5 (~1457 Elo) need 100-600B total parameters to reach similar scores, and Kimi-K2.5 (~1457 Elo) requires over 1,000B. The 26B-A4B achieves competitive Elo with a fraction of the parameters, which translates directly into lower memory requirements and faster local inference.</p><p>This is what makes MoE models transformative for local use. You do not need a cluster or a high-end GPU rig to run a model that competes with 400B+ parameter behemoths. A laptop with 48 GB of unified memory is enough.</p><p>For local inference on a 48 GB Mac, this is the sweet spot. The dense 31B would consume more memory and generate tokens more slowly because every parameter participates in every forward pass. The E4B is lighter but noticeably less capable. The 26B-A4B gives you 256K max context, vision support (useful for analyzing screenshots and diagrams), native function/tool calling, and reasoning with configurable thinking modes, all at 51 tokens/second on my hardware.</p>
<figure><img src="https://substackcdn.com/image/fetch/$s_!T7fm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e5f4f2a-2153-4208-811c-e2b4751c497f_1622x1436.png" alt="Google Gemma 4 model performance vs size comparison"><figcaption>Google Gemma 4 model performance vs size comparison</figcaption></figure><h2>What changed in LM Studio 0.4.0</h2><p>LM Studio has been a popular desktop app for running local models for a while. Version 0.4.0 changed the architecture fundamentally by introducing <strong>llmster</strong>, the core inference engine extracted from the desktop app and packaged as a standalone server.</p><p>The practical result: you can now run LM Studio entirely from the command line using the <code>lms</code> CLI. No GUI required. This makes it usable on headless servers, in CI/CD pipelines, over SSH sessions, or just for developers who prefer staying in the terminal.</p><p>Key additions in 0.4.0:</p><ul><li><p><strong>llmster daemon</strong>: a background service that manages model loading and inference without the desktop app</p></li><li><p><code>lms</code><strong> CLI</strong>: a full command-line interface for downloading, loading, chatting, and serving models</p></li><li><p><strong>Parallel request processing</strong>: continuous batching instead of sequential queuing, so multiple requests to the same model run concurrently</p></li><li><p><strong>Stateful REST API</strong>: a new <code>/v1/chat</code> endpoint that maintains conversation history across requests</p></li><li><p><strong>MCP integration</strong>: local Model Context Protocol support with permission-key gating</p></li></ul><h2>Installation</h2><p>Install the <code>lms</code> CLI with a single command:</p><pre><code><code># Linux/Mac
curl -fsSL https://lmstudio.ai/install.sh | bash

# Windows
irm https://lmstudio.ai/install.ps1 | iex</code></code></pre><p>Then start the headless daemon:</p><pre><code><code>lms daemon up</code></code></pre><p>On macOS, update both inference runtimes:</p><pre><code><code>lms runtime update llama.cpp
lms runtime update mlx</code></code></pre><h2>Downloading Gemma 4</h2><p>With the daemon running, download Google&#8217;s Gemma 4 26B model:</p><pre><code><code>lms get google/gemma-4-26b-a4b</code></code></pre><p>The CLI shows you the variant it will download (Q4_K_M quantization by default, 17.99 GB) and asks for confirmation:</p><pre><code><code>   &#8595; To download: model google/gemma-4-26b-a4b - 64.75 KB
   &#9492;&#9472; &#8595; To download: Gemma 4 26B A4B Instruct Q4_K_M [GGUF] - 17.99 GB

About to download 17.99 GB.

? Start download?
&#10095; Yes
  No
  Change variant selection</code></code></pre><p>If you already have the model, the CLI tells you and shows the load command:</p><pre><code><code>&#10004; Start download? yes
Model already downloaded. To use, run: lms load google/gemma-4-26b-a4b</code></code></pre>
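<p>If you want a different quantization up front, the <code>@</code>-variant syntax that appears in the model&#8217;s JSON metadata later in this post can be passed to <code>lms get</code> directly (assuming the variant exists in the publisher&#8217;s catalog):</p><pre><code><code># request a specific quantization variant instead of the default
lms get google/gemma-4-26b-a4b@q4_k_m</code></code></pre>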
<h2>Checking your local model library</h2><p>List all downloaded models:</p><pre><code><code>lms ls</code></code></pre><pre><code><code>You have 10 models, taking up 118.17 GB of disk space.

LLM                                   PARAMS     ARCH             SIZE         DEVICE
gemma-3-270m-it-mlx                   270m       gemma3_text      497.80 MB    Local
google/gemma-4-26b-a4b (1 variant)    26B-A4B    gemma4           17.99 GB     Local
gpt-oss-20b-mlx                       20B        gpt_oss          22.26 GB     Local
llama-3.2-1b-instruct                 1B         Llama            712.58 MB    Local
nvidia/nemotron-3-nano (1 variant)    30B        nemotron_h       17.79 GB     Local
openai/gpt-oss-20b (1 variant)        20B        gpt-oss          12.11 GB     Local
qwen/qwen3.5-35b-a3b (1 variant)      35B-A3B    qwen35moe        22.07 GB     Local
qwen2.5-0.5b-instruct-mlx             0.5B       Qwen2            293.99 MB    Local
zai-org/glm-4.7-flash (1 variant)     30B        glm4_moe_lite    24.36 GB     Local

EMBEDDING                               PARAMS    ARCH          SIZE        DEVICE
text-embedding-nomic-embed-text-v1.5              Nomic BERT    84.11 MB    Local</code></code></pre><p>Worth noting: several of these models use mixture-of-experts architectures (Gemma 4, Qwen 3.5, GLM 4.7 Flash). MoE models punch above their weight for local inference because only a fraction of parameters activate per token.</p><h2>Running an interactive chat</h2><p>Start a chat session with stats enabled to see performance numbers:</p><pre><code><code>lms chat google/gemma-4-26b-a4b --stats</code></code></pre><pre><code><code> &#9581;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9582;
 &#9474; &#128126; lms chat                                     &#9474;
 &#9474; Type exit or Ctrl+C to quit                     &#9474;
 &#9474;                                                 &#9474;
 &#9474; Chatting with google/gemma-4-26b-a4b            &#9474;
 &#9474;                                                 &#9474;
 &#9474; Try one of the following commands:              &#9474;
 &#9474; /model - Load a model (type /model to see list) &#9474;
 &#9474; /download - Download a model                    &#9474;
 &#9474; /clear - Clear the chat history                 &#9474;
 &#9474; /help - Show help information                   &#9474;
 &#9584;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9583;</code></code></pre><p>With <code>--stats</code>, you get prediction metrics after each response:</p><pre><code><code>Prediction Stats:
  Stop Reason: eosFound
  Tokens/Second: 51.35
  Time to First Token: 1.551s
  Prompt Tokens: 39
  Predicted Tokens: 176
  Total Tokens: 215</code></code></pre><p>51 tokens/second on a 14&#8221; MacBook Pro M4 Pro (48 GB) with a 26B model is solid. Time to first token at 1.5 seconds is responsive enough for interactive use.</p><h2>Checking loaded models and memory</h2><p>See what is currently loaded:</p><pre><code><code>lms ps</code></code></pre><pre><code><code>IDENTIFIER                MODEL                     STATUS    SIZE        CONTEXT    PARALLEL    DEVICE    TTL
google/gemma-4-26b-a4b    google/gemma-4-26b-a4b    IDLE      17.99 GB    48000      2           Local     60m / 1h</code></code></pre><p>The model occupies 17.99 GB in memory with a 48K context window and supports 2 parallel requests. The TTL (time-to-live) auto-unloads the model after 1 hour of idle time, freeing memory without manual intervention.</p><p>For detailed model metadata, pipe through jq:</p><pre><code><code>lms ps --json | jq</code></code></pre><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;1d0a1ce5-dec7-4964-b32c-28481e533dd4&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">lms ps --json | jq
[
  {
    "type": "llm",
    "modelKey": "google/gemma-4-26b-a4b",
    "format": "gguf",
    "displayName": "Gemma 4 26B A4B",
    "publisher": "google",
    "path": "google/gemma-4-26b-a4b",
    "sizeBytes": 17990911801,
    "indexedModelIdentifier": "google/gemma-4-26b-a4b",
    "deviceIdentifier": null,
    "paramsString": "26B-A4B",
    "architecture": "gemma4",
    "quantization": {
      "name": "Q4_K_M",
      "bits": 4
    },
    "variants": [
      "google/gemma-4-26b-a4b@q4_k_m"
    ],
    "selectedVariant": "google/gemma-4-26b-a4b@q4_k_m",
    "identifier": "google/gemma-4-26b-a4b",
    "ttlMs": 3600000,
    "lastUsedTime": 1775316805638,
    "vision": true,
    "trainedForToolUse": true,
    "maxContextLength": 262144,
    "contextLength": 48000,
    "status": "idle",
    "queued": 0,
    "parallel": 2
  }
]</code></pre></div><p>Key fields from the JSON output:</p><ul><li><p><code>"architecture": "gemma4"</code> with <code>"quantization": {"name": "Q4_K_M", "bits": 4}</code></p></li><li><p><code>"vision": true</code> and <code>"trainedForToolUse": true</code> - Gemma 4 supports both image input and tool calling</p></li><li><p><code>"maxContextLength": 262144</code> - the model supports up to 256K context, though the default load is 48K</p></li><li><p><code>"parallel": 2</code> - two concurrent inference requests via continuous batching</p></li></ul>
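<p>If you script against this metadata, <code>jq</code> can pull out individual fields. A small sketch, assuming a single loaded model at index 0:</p><pre><code><code># extract quantization and context limits for the first loaded model
lms ps --json | jq '.[0] | {quant: .quantization.name, ctx: .contextLength, max_ctx: .maxContextLength}'</code></code></pre>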
<h2>Memory estimates by context length</h2><p>Before loading a model, you can estimate memory requirements at different context lengths using <code>--estimate-only</code>. I wrote a small script to test across the full range:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!h3N5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d6f7834-a2e2-42ee-afab-33408970edee_1952x1448.png" alt="Memory estimates by context length"></figure><p>The base model takes about 17.6 GiB regardless of context. Each doubling of context length adds roughly 3-4 GiB. At the default 48K context, you need about 21 GiB. On my 48 GB MacBook Pro, I can push to the full 256K context at 37.48 GiB and still have about 10 GB free for the OS and other apps. A 36 GB Mac could comfortably run 200K context with headroom.</p><p>The estimation command is straightforward:</p><pre><code><code>lms load google/gemma-4-26b-a4b --estimate-only --context-length 48000
</code></code></pre><pre><code><code>Model: google/gemma-4-26b-a4b
Context Length: 48,000
Estimated GPU Memory:   21.05 GiB
Estimated Total Memory: 21.05 GiB

Estimate: This model may be loaded based on your resource guardrails settings.
</code></code></pre><p>This is useful for capacity planning. If you want to run Gemma 4 alongside other applications, check the estimate at your target context length first.</p><p>Here is the full script I used to generate the table above. You can swap in any model name and context length list to profile a different model:</p><pre><code><code>#!/usr/bin/env bash

model="google/gemma-4-26b-a4b"
contexts=(4096 8000 16000 24000 32000 48000 64000 96000 128000 200000 256000)

table_contexts=()
table_gpu=()
table_total=()

for ctx in "${contexts[@]}"; do
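  # run a non-destructive estimate for this context length, capturing stdout and stderr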
  output="$(lms load "$model" --estimate-only --context-length "$ctx" 2&gt;&amp;1)"

  parsed_context="$(printf '%s\n' "$output" | awk -F': ' '/^Context Length:/ {print $2; exit}')"
  parsed_gpu="$(printf '%s\n' "$output" | awk -F': +' '/^Estimated GPU Memory:/ {print $2; exit}')"
  parsed_total="$(printf '%s\n' "$output" | awk -F': +' '/^Estimated Total Memory:/ {print $2; exit}')"

  table_contexts+=("${parsed_context:-$ctx}")
  table_gpu+=("${parsed_gpu:-N/A}")
  table_total+=("${parsed_total:-N/A}")
done

printf '| Model | Context Length | GPU Memory | Total Memory |\n'
printf '|---|---:|---:|---:|\n'
for i in "${!table_contexts[@]}"; do
  printf '| %s | %s | %s | %s |\n' \
    "$model" "${table_contexts[$i]}" "${table_gpu[$i]}" "${table_total[$i]}"
done
</code></code></pre><h2>Tuning model loading for your hardware</h2><p>The plain <code>lms load</code> and <code>lms chat</code> commands pick reasonable defaults, but you can tune several parameters to match your specific hardware and use case. Here is a practical decision framework.</p><h3>Context length: match it to your memory budget</h3><p>The memory table above is your starting point. Subtract the OS overhead (macOS typically uses 4-6 GB) from your total memory, then find the largest context length that fits.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!YeC0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7d3a77-3db7-4724-8e4f-e269cc4a0f2b_1946x726.png" alt="Context length vs memory budget"></figure><p>Load with a specific context length:</p><pre><code><code>lms load google/gemma-4-26b-a4b --context-length 128000</code></code></pre><p>If you are unsure, always run <code>--estimate-only</code> first. It accounts for flash attention and vision model overhead in its calculation.</p><h3>GPU offloading</h3><p>On Apple Silicon, the unified memory architecture means the CPU and GPU share the same memory pool, so <code>--gpu</code> mostly controls how much computation runs on the GPU versus the CPU cores. The default <code>auto</code> setting works well, but you can force full GPU offloading:</p><pre><code><code>lms load google/gemma-4-26b-a4b --gpu=1.0</code></code></pre><p>Use <code>--gpu=max</code> to offload everything possible. On discrete GPU systems (Linux/Windows with NVIDIA cards), this matters more because GPU VRAM and system RAM are separate. If your model does not fit entirely in VRAM, partial offloading (<code>--gpu=0.5</code>) splits layers between GPU and CPU, trading some speed for the ability to run larger models.</p>
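<p>Putting these flags together, a typical flow (a sketch using only the flags shown in this post) is to check the estimate first, then load fully offloaded at the same context length:</p><pre><code><code># verify the memory estimate, then load fully offloaded
lms load google/gemma-4-26b-a4b --estimate-only --context-length 128000
lms load google/gemma-4-26b-a4b --gpu=max --context-length 128000</code></code></pre>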
<h3>Parallel requests and continuous batching</h3><p>LM Studio supports concurrent inference through continuous batching, where multiple requests are dynamically combined into a single computation batch. This is useful when serving the model to multiple clients or running parallel tool calls. The feature requires the llama.cpp runtime (v2.0.0+) and is not yet available for the MLX backend.</p><p>Configure it through the GUI: open the model loader, toggle <strong>Manually choose model load parameters</strong>, select a model, then toggle <strong>Show advanced settings</strong> to set <strong>Max Concurrent Predictions</strong> (defaults to 4). There is no CLI flag for this setting; it is configured through the desktop app or per-model defaults.</p><p>Each parallel slot consumes additional memory proportional to the context length, so on memory-constrained systems, reduce the parallel count or lower the context length to compensate. With Gemma 4 on 48 GB, 2 parallel slots at 48K context is a good balance.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!yMh2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e3afa7a-2689-42a3-bd33-a899dcb7723b_2738x2248.png" alt="LM Studio Max Concurrent Predictions set to 2"><figcaption>LM Studio Max Concurrent Predictions set to 2</figcaption></figure><h3>TTL: auto-unload idle models</h3><p>The time-to-live setting automatically unloads models after a period of inactivity, freeing memory:</p><pre><code><code>lms load google/gemma-4-26b-a4b --ttl 1800</code></code></pre><p>That sets a 30-minute idle timeout (the value is in seconds). The default is 3600 seconds (1 hour). For shared server setups where multiple models might be needed, shorter TTLs help cycle between models without manual <code>lms unload</code> commands. Set the TTL to 0 or -1 to disable auto-unloading.</p>
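<p>If you want the memory back immediately rather than waiting out the TTL, unload manually (assuming <code>lms unload</code> accepts the model identifier shown by <code>lms ps</code>):</p><pre><code><code># free the ~18 GB right away instead of waiting for the idle timeout
lms unload google/gemma-4-26b-a4b</code></code></pre>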
<h3>Per-model defaults</h3><p>If you always load Gemma 4 with the same settings, save them as per-model defaults through the desktop app. Navigate to <strong>My Models</strong>, click the gear icon next to the model, and configure your preferred GPU offloading, context size, and flash attention settings. These defaults apply everywhere, including when loading via <code>lms load</code> from the CLI.</p><h3>Speculative decoding</h3><p>LM Studio supports speculative decoding for dense models, which pairs your main model with a smaller &#8220;draft&#8221; model to speed up generation. The draft model proposes tokens quickly, and the main model verifies them in batch, which is faster than generating each token independently.</p><p>However, speculative decoding is problematic for MoE models like Gemma 4 26B-A4B. During verification, the main model must load the union of all experts activated across all speculative tokens. Since different tokens route to different experts, this blows up memory bandwidth usage and can actually slow things down. Benchmarks on Mixtral showed a 39% speedup on code but a 54% slowdown on math with the same settings, meaning no single configuration works reliably. This is an active research area, with approaches like MoE-Spec (expert budgeting) and SP-MoE (expert prefetching) working to solve it, and some newer MoE architectures, like Qwen 3.5&#8217;s hybrid design, are more amenable to speculative approaches. For now, skip speculative decoding with Gemma 4 26B-A4B and rely on its already-fast MoE inference instead.</p><h3>Flash attention</h3><p>Flash attention is an optimization that reduces memory usage for the KV cache during inference, letting you fit longer context lengths in the same memory. It is available per-model in LM Studio&#8217;s settings. For Gemma 4 on Apple Silicon, enabling flash attention can reduce memory usage at higher context lengths by a meaningful margin. The <code>--estimate-only</code> flag accounts for flash attention in its calculations, so check estimates with and without it to see the difference.</p><h2>The LM Studio desktop app</h2><p>Everything above used the headless CLI, but LM Studio also ships a full macOS desktop app. The GUI is useful for visual monitoring and quick experiments before committing to a CLI workflow.</p><p>The screenshot below shows the desktop app&#8217;s server view with Gemma 4 loaded. A few things worth noting:</p><ul><li><p><strong>Server status</strong> shows &#8220;Running&#8221; with the local endpoint at <code>http://192.168.1.121:1234</code>, reachable by any device on the network</p></li><li><p><strong>Loaded Models</strong> displays the active model with live stats: 29 generations, 1,087 tokens processed, 17.99 GB in memory</p></li><li><p><strong>Supported endpoints</strong> include LM Studio API, OpenAI-compatible, and Anthropic-compatible formats, with <code>POST /v1/messages</code> for the Anthropic protocol</p></li><li><p><strong>Developer Logs</strong> stream prompt processing progress in real time, useful for watching long prompts like code analysis work through the model</p></li></ul>
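<p>Because the server is OpenAI-compatible, any OpenAI-style client can talk to it. A minimal curl sketch, assuming the default <code>localhost:1234</code> and the standard chat-completions schema:</p><pre><code><code># OpenAI-compatible chat completion against the local LM Studio server
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Summarize MoE models in one sentence."}]
  }'</code></code></pre>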
<p>The desktop app also supports Gemma 4&#8217;s <strong>vision capabilities</strong>. In the screenshot below, you can see the model analyzing an image of the Timezone Scheduler promotional graphic. It correctly identifies the title, the world map with timezone color bars, the schedule grid comparing Brisbane/New York/London, the feature badges, and the tech stack icons at the bottom. It generated 504 tokens at 54.51 tok/sec with a 3.15-second time to first token.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!L6Iq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8163e6-6c58-40d5-985f-4fc15e52f5da_3024x2868.png" alt="Google Gemma 4 vision capabilities in describing an image"><figcaption>Google Gemma 4 vision capabilities in describing an image</figcaption></figure><p>Claude Code, via the <code>claude-lm</code> alias, with Google Gemma 4 analysing my <a href="https://github.com/centminmod/timezone-scheduler">Timezone Scheduler GitHub repository</a>:</p>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fddd0c29-1f2c-4f99-8efe-de8517fb6244_1295x1133.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1133,&quot;width&quot;:1295,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115362,&quot;alt&quot;:&quot;Claude Code via claude-lm alias with LM Studio API&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://georgeliuoz.substack.com/i/193191800?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffddd0c29-1f2c-4f99-8efe-de8517fb6244_1295x1133.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Code via claude-lm alias with LM Studio API" title="Claude Code via claude-lm alias with LM Studio API" srcset="https://substackcdn.com/image/fetch/$s_!BUmO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffddd0c29-1f2c-4f99-8efe-de8517fb6244_1295x1133.png 424w, https://substackcdn.com/image/fetch/$s_!BUmO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffddd0c29-1f2c-4f99-8efe-de8517fb6244_1295x1133.png 848w, https://substackcdn.com/image/fetch/$s_!BUmO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffddd0c29-1f2c-4f99-8efe-de8517fb6244_1295x1133.png 1272w, https://substackcdn.com/image/fetch/$s_!BUmO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffddd0c29-1f2c-4f99-8efe-de8517fb6244_1295x1133.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Claude Code via claude-lm alias with LM Studio API</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!-FvH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-FvH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png 424w, https://substackcdn.com/image/fetch/$s_!-FvH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png 848w, https://substackcdn.com/image/fetch/$s_!-FvH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!-FvH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-FvH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png" width="1456" height="1069" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1069,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:877361,&quot;alt&quot;:&quot;LM Studio API server in action while using Claude Code via claude-lm alias&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://georgeliuoz.substack.com/i/193191800?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="LM Studio API server in action while using Claude Code via claude-lm alias" title="LM Studio API server in action while using Claude Code via claude-lm alias" srcset="https://substackcdn.com/image/fetch/$s_!-FvH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png 424w, https://substackcdn.com/image/fetch/$s_!-FvH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png 848w, https://substackcdn.com/image/fetch/$s_!-FvH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!-FvH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff355f4e0-0344-4140-be2d-fa6fbdac2795_1962x1440.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">LM Studio API server in action while using Claude Code via claude-lm alias</figcaption></figure></div><p>The system monitor overlay in the screenshots tells the real story of what local inference looks like on hardware. On my M4 Pro (4 E-Cores + 10 P-Cores, 20 GPU-Cores):</p><ul><li><p><strong>Memory pressure</strong>: 46.69 GB used out of 48.00 GB physical, with 38.07 GB wired memory (that is mostly the model plus context). Swap used: 27.49 GB. The system stays responsive despite near-full memory utilization</p></li><li><p><strong>GPU utilization</strong>: 90% during inference, with P-Cluster frequency at 4.50 GHz and GPU at 1.45 GHz</p></li><li><p><strong>CPU utilization</strong>: E-Core at 82.42%, P-Core at 35.96% during generation</p></li><li><p><strong>Temperature</strong>: CPU cores averaging 91 degrees C, GPU averaging 92.46 degrees C, which is within normal sustained load range for the M4 Pro</p></li><li><p><strong>Power draw</strong>: 23.56W package total (CPU 11.06W, GPU 13.32W), which is efficient for running a 26B parameter model</p></li></ul><p>This is what makes Apple Silicon compelling for local LLM work. The unified memory architecture means the CPU and GPU share the same memory pool, so there is no copying data between separate CPU RAM and GPU VRAM like on discrete GPU setups. The model loads once into unified memory and both the CPU and GPU access it directly.</p><h2>Serving models via API</h2><p>Once a model is loaded, start the local server:</p><pre><code><code>lms server start</code></code></pre><p>This exposes an OpenAI-compatible API at <code>http://localhost:1234/v1</code>. Any tool that works with OpenAI&#8217;s API format (Continue, Cursor, custom scripts) can point at your local server instead. LM Studio 0.4.0 also added an Anthropic-compatible endpoint at <code>POST /v1/messages</code>, which means tools that speak the Anthropic protocol can connect directly without an adapter. 
You can change the port with <code>lms server start --port 8080</code> if 1234 conflicts with something else.</p><p>The server also supports <strong>JIT (Just-In-Time) model loading</strong>: if a client requests a model that is not currently loaded, LM Studio can auto-load it on demand and auto-unload it after the TTL expires. This is useful for serving multiple models without keeping them all in memory.</p><p>To monitor what the server is doing in real time, stream the logs:</p><pre><code><code>lms log stream --source model --stats</code></code></pre><p>This shows each request&#8217;s input/output along with tokens/second and latency. For a machine-readable feed, add <code>--json</code>. You can also filter to just server-level events (startup, endpoint hits) with <code>--source server</code>.</p><p>Combined with the headless daemon, you can run this on a dedicated machine and serve models across your network. The server is reachable at your machine&#8217;s local IP (e.g., <code>http://192.168.1.121:1234</code>), so other devices on the same network can use it as a shared inference endpoint. If you need access control, enable <strong>Require Authentication</strong> in server settings and generate API tokens with per-token permissions, accessed via the standard <code>Authorization: Bearer $LM_API_TOKEN</code> header.</p><h2>Using Gemma 4 as a Claude Code backend</h2><p>The Anthropic-compatible endpoint opens up an interesting use case: running Claude Code against a local model instead of the Anthropic API. This means fully offline, zero-cost coding assistance with no data leaving your machine.</p><p>I set up a shell function in <code>~/.zshrc</code> called <code>claude-lm</code> that configures all the necessary environment variables and launches Claude Code pointed at the local LM Studio server:</p><pre><code><code>claude-lm() {
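    # Point Claude Code at the local LM Studio server; the auth token is a
    # placeholder since LM Studio does not require authentication by default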
    export ANTHROPIC_BASE_URL=http://localhost:1234
    export ANTHROPIC_AUTH_TOKEN=lmstudio
    export CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY="2"
    export CLAUDE_CODE_NO_FLICKER="0"
    export ANTHROPIC_MODEL="gemma-4-26b-a4b"
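    # Compact the 48K context window at 90% usage to avoid hitting the limit mid-task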
    export CLAUDE_CODE_AUTO_COMPACT_WINDOW="48000"
    export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE="90"
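    # Route every Claude Code model tier (Opus, Sonnet, Haiku) and any spawned
    # subagents to the local Gemma 4 model instead of Anthropic model names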
    export ANTHROPIC_DEFAULT_OPUS_MODEL="google/gemma-4-26b-a4b"
    export ANTHROPIC_DEFAULT_SONNET_MODEL="google/gemma-4-26b-a4b"
    export ANTHROPIC_DEFAULT_HAIKU_MODEL="google/gemma-4-26b-a4b"
    export CLAUDE_CODE_SUBAGENT_MODEL="google/gemma-4-26b-a4b"
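    # Generous timeouts: local inference is much slower than the Anthropic API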
    export API_TIMEOUT_MS="30000000"
    export BASH_DEFAULT_TIMEOUT_MS="2400000"
    export BASH_MAX_TIMEOUT_MS="2500000"
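    # Cap output at 8K tokens per response to keep local generation times reasonable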
    export CLAUDE_CODE_MAX_OUTPUT_TOKENS="8000"
    export CLAUDE_CODE_FILE_READ_MAX_OUTPUT_TOKENS="8000"
    export CLAUDE_CODE_ATTRIBUTION_HEADER="0"
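    # Turn off features that assume Anthropic API capabilities the local model lacks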
    export CLAUDE_CODE_DISABLE_1M_CONTEXT="1"
    export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING="1"
    claude "$@"
}</code></code></pre><p>What the key variables do:</p><ul><li><p><code>ANTHROPIC_BASE_URL</code> and <code>ANTHROPIC_AUTH_TOKEN</code> point Claude Code at the local LM Studio server. The token value <code>lmstudio</code> is a placeholder; LM Studio does not require authentication by default</p></li><li><p><code>ANTHROPIC_MODEL</code> and the three <code>DEFAULT_*_MODEL</code> variables force all Claude Code model selections (Opus, Sonnet, Haiku) to route through Gemma 4. Without these, Claude Code would try to call Anthropic model names that LM Studio does not recognize</p></li><li><p><code>CLAUDE_CODE_SUBAGENT_MODEL</code> ensures any subagents Claude Code spawns also use the local model</p></li><li><p><code>CLAUDE_CODE_AUTO_COMPACT_WINDOW</code> and <code>CLAUDE_AUTOCOMPACT_PCT_OVERRIDE</code> manage context window compaction. At 48K context, compaction triggers at 90% usage to avoid hitting the limit mid-task</p></li><li><p><code>API_TIMEOUT_MS</code> is set high (30 million ms / ~8.3 hours) because local inference is slower than the Anthropic API and complex tasks need time to complete</p></li><li><p><code>BASH_DEFAULT_TIMEOUT_MS</code> and <code>BASH_MAX_TIMEOUT_MS</code> extend shell command timeouts to 40-42 minutes for long-running operations</p></li><li><p><code>CLAUDE_CODE_MAX_OUTPUT_TOKENS</code> and <code>CLAUDE_CODE_FILE_READ_MAX_OUTPUT_TOKENS</code> cap output at 8K tokens per response, which keeps generation times reasonable on local hardware</p></li><li><p><code>CLAUDE_CODE_DISABLE_1M_CONTEXT</code> and <code>CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING</code> turn off features that assume Anthropic API capabilities the local model does not support</p></li></ul><p>After adding this to <code>~/.zshrc</code> and running <code>source ~/.zshrc</code>, you can start a fully local Claude Code session with:</p><pre><code><code>claude-lm</code></code></pre><p>It works like normal Claude Code but every request stays on your machine. The trade-off is speed: Gemma 4 at 51 tok/sec is noticeably slower than the Anthropic API for large code generation tasks, but for code review, small edits, and exploration it is perfectly usable.</p><h2>What I learned</h2><p><strong>MoE models are the sweet spot for local inference.</strong> Gemma 4&#8217;s 26B-A4B architecture (26B total, 4B active) delivers roughly 10B-dense-equivalent quality at 4B inference cost. Look for similar MoE models when choosing what to run locally.</p><p><strong>The headless daemon changes the workflow.</strong> Before 0.4.0, LM Studio required the desktop app to be open. Now <code>lms daemon up</code> runs in the background and you interact entirely through the CLI or API. This makes it practical for server deployments and SSH sessions.</p><p><strong>Context length is the main memory variable.</strong> The model itself takes a fixed ~17.6 GiB. Context scaling is roughly linear, so you can pick exactly the trade-off you want between context window and available memory.</p><p><code>--estimate-only</code><strong> prevents surprises.</strong> Always check memory estimates before loading a large model at an aggressive context length. It takes a second and saves you from OOM situations.</p><p><strong>The Anthropic-compatible endpoint is a game changer.</strong> Being able to point Claude Code at a local model with a shell alias means you can switch between cloud and local inference depending on the task.
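</p><p>The switch is nothing more than a different launch command for the same workflow (the one-off prompts below are hypothetical examples):</p><pre><code><code># local inference via LM Studio
claude-lm "review this diff for obvious bugs"

# cloud inference via the Anthropic API
claude "review this diff for obvious bugs"</code></code></pre><p>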
Privacy-sensitive code review, offline work, or just saving API costs on exploratory sessions all benefit.</p><h2>What did not work</h2><p>Gemma 4 does not identify itself by name in <code>lms chat</code>. When asked &#8220;what model are you?&#8221;, it responds generically as &#8220;an AI assistant.&#8221; This is a minor limitation of how LM Studio handles system prompts, not a Gemma issue. You can override this with a custom system prompt.</p><p>The default 48K context is conservative for a model that supports 256K. If you have the memory, it is worth loading with a higher context length for tasks like long document analysis or multi-file code review.</p><p>Running Claude Code with a local model is not a drop-in replacement for the Anthropic API. Complex multi-step tasks that rely on Claude&#8217;s extended thinking or very large context windows will hit limitations. The local setup works best for focused, single-file tasks where the 48K context window is sufficient.</p><p>Memory pressure on a 48 GB machine with Gemma 4 loaded is real. The system used 46.69 GB out of 48 GB with 27.49 GB of swap during the test. If you run memory-hungry applications alongside the model, expect some swap thrashing. A 64 GB or higher configuration would be more comfortable for sustained use.</p><h2>What is next</h2><p>I am testing other local models alongside Gemma 4 for different use cases: Qwen 3.5 35B for coding tasks, GLM 4.7 Flash for quick drafting, and Nemotron 3 Nano for structured extraction. A comparison post covering where each model performs best is in the pipeline.</p><p>If you want to try this setup:</p><ul><li><p>Install: <code>curl -fsSL https://lmstudio.ai/install.sh | bash</code></p></li><li><p>Start the daemon: <code>lms daemon up</code></p></li><li><p>Download Gemma 4: <code>lms get google/gemma-4-26b-a4b</code></p></li><li><p>Chat locally: <code>lms chat google/gemma-4-26b-a4b --stats</code></p></li><li><p>Connect Claude Code: add the <code>claude-lm</code> function to your <code>~/.zshrc</code>, then run <code>claude-lm</code> instead of <code>claude</code></p></li></ul><p>If you&#8217;re interested in practical AI building for web apps, developer workflows, and infrastructure, subscribe for future posts. You can also follow my shorter updates on <a href="https://www.threads.com/@george_sl_liu">Threads (@george_sl_liu)</a> and <a href="https://bsky.app/profile/georgesl.bsky.social">Bluesky (@georgesl.bsky.social)</a> or subscribe and follow along.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://buymeacoffee.com/georgeliu&quot;,&quot;text&quot;:&quot;Buy Me A Coffee&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://buymeacoffee.com/georgeliu"><span>Buy Me A Coffee</span></a></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://ai.georgeliu.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[How To Get a Second AI Opinion in Claude Code With Codex CLI and GLM]]></title><description><![CDATA[Claude Code skills to get a second opinion from Codex CLI GPT-5.4 and Z.AI GLM-5]]></description><link>https://ai.georgeliu.com/p/how-to-get-a-second-ai-opinion-in</link><guid isPermaLink="false">https://ai.georgeliu.com/p/how-to-get-a-second-ai-opinion-in</guid><dc:creator><![CDATA[George Liu]]></dc:creator><pubDate>Sat, 04 Apr 2026 18:00:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bOGT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Two skills, two external models, one unified analysis. Here is how I set up consult-codex and consult-zai to cross-check Claude Code&#8217;s work with OpenAI Codex and z.ai&#8217;s glm-4.7, glm-5, glm-5.1.</em></p><h2>The problem with single-AI code analysis</h2><p>When Claude Code is the only AI looking at your code, you get one perspective. That is usually fine for routine tasks. But for complex refactors, security-sensitive changes, or architecture decisions, a single perspective has blind spots. I found this out the hard way when building <a href="https://timezones.centminmod.com">Timezone Scheduler</a> &#8211; Claude Code missed encoding edge cases that a second AI flagged immediately.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://ai.georgeliu.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>So I used my custom Claude Code skills that let me ask a question once and get parallel answers from two different AI models, with a structured comparison at the end. No copy-pasting between tools, no switching browser tabs. Just type <code>/consult-codex</code> or <code>/consult-zai</code> and both AIs work simultaneously.
I&#8217;ve been using skills within Claude Code for the past 3 months with great success.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bOGT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bOGT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png 424w, https://substackcdn.com/image/fetch/$s_!bOGT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png 848w, https://substackcdn.com/image/fetch/$s_!bOGT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!bOGT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bOGT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png" width="1295" height="1053" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1053,&quot;width&quot;:1295,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:210445,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://georgeliuoz.substack.com/i/193058940?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bOGT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png 424w, https://substackcdn.com/image/fetch/$s_!bOGT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png 848w, https://substackcdn.com/image/fetch/$s_!bOGT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png 1272w, https://substackcdn.com/image/fetch/$s_!bOGT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70fb4e52-5c0f-41da-9b4d-abd9c2a90442_1295x1053.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Claude Code /consult-codex skill asking Claude to get a second opinion from Codex CLI GPT-5.4 LLM model</figcaption></figure></div><h2>What the skills do</h2><p><code>consult-codex</code> pairs a custom <code>code-searcher</code> subagent (from my <a href="https://github.com/centminmod/my-claude-code-setup">GitHub starter template repository</a>) with OpenAI&#8217;s Codex (GPT-5.4) running in readonly mode. <code>consult-zai</code> pairs the same code-searcher with z.ai&#8217;s glm-4.7. Both follow the same pattern:</p><ol><li><p>You ask a code question</p></li><li><p>The skill wraps your question with structured output requirements (file paths with line numbers, confidence levels, limitations)</p></li><li><p>Both AIs run in parallel &#8211; no serial waiting</p></li><li><p>You get a side-by-side comparison table showing where they agree, where they diverge, and which source provided better evidence</p></li></ol><p>The comparison output includes an agreement level indicator: High Agreement (both reached similar conclusions, higher confidence), Partial Agreement (overlapping findings with unique additions from each), or Disagreement (contradicting findings, manual verification needed).</p><p>In practice, I use <code>consult-codex</code> most often for security reviews and large commits. During the <a href="https://ai.georgeliu.com/p/why-11-ai-models-got-timezone-scheduling">Timezone Scheduler build</a>, a Day 3 API commit spanning 837 lines went through dual analysis. Codex caught request limit concerns that Claude alone did not flag, while the code-searcher agent found a caching optimization that Codex overlooked. Neither AI alone would have produced the same result.</p><h2>Prerequisites</h2><p>Before setting up these skills, you need three things installed and working.</p><h3>1. Claude Code</h3><p>Install Claude Code if you have not already. Use whichever package manager you have:</p><pre><code><code># npm
npm install -g @anthropic-ai/claude-code

# bun
bun install -g @anthropic-ai/claude-code</code></code></pre><p>Verify it works: <code>claude --version</code></p><h3>2. OpenAI Codex CLI (for consult-codex)</h3><p>Install the Codex CLI globally:</p><pre><code><code>npm install -g @openai/codex</code></code></pre><p>Authenticate with your OpenAI API key or log in to your OpenAI ChatGPT subscription:</p><pre><code><code>export OPENAI_API_KEY="your-openai-api-key"</code></code></pre><p>Add that export to your <code>~/.bashrc</code> or <code>~/.zshrc</code> so it persists. Verify it works: <code>codex --help</code></p><h3>3. Z.AI GLM Coding Plan (for consult-zai)</h3><p>Z.AI provides access to glm-4.7 through a Claude Code-compatible API. It is a subscription service starting at roughly $3/month for the Lite plan (~120 prompts per 5 hours), with Pro (~600 prompts) and Max (~2,400 prompts) tiers available.</p><p><strong>Prerequisites:</strong> Node.js 18 or newer and a Z.AI API key from <a href="https://z.ai">z.ai</a>.</p><p>The <code>zai</code> alias is a shell function that sets environment variables and launches Claude Code pointed at z.ai&#8217;s API endpoint. Add this to your <code>~/.bashrc</code> or <code>~/.zshrc</code>:</p><pre><code><code>zai() {
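    # Z.AI's Anthropic-compatible endpoint; extended timeout (3,000,000 ms = 50 min)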
    export ANTHROPIC_AUTH_TOKEN="your-zai-api-key"
    export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
    export API_TIMEOUT_MS="3000000"
    claude "$@"
}</code></code></pre><p>Reload your shell (<code>source ~/.bashrc</code>) and verify: <code>zai --version</code> should show the Claude Code version, confirming the alias works.</p><p>For Windows PowerShell:</p><pre><code><code>function zai {
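    # PowerShell equivalent: set per-session environment variables, then
    # forward all arguments to Claude Code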
    $env:ANTHROPIC_AUTH_TOKEN = "your-zai-api-key"
    $env:ANTHROPIC_BASE_URL = "https://api.z.ai/api/anthropic"
    $env:API_TIMEOUT_MS = "3000000"
    claude $args
}</code></code></pre><p>By default, Z.AI maps Claude model names to GLM models automatically: Opus and Sonnet map to glm-4.7, Haiku maps to glm-4.5-air. You do not need to configure model names manually.</p><p>If you want to override and pin to a specific model &#8211; for example to try newer releases &#8211; add the model env vars to the function:</p><pre><code><code>zai() {
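    # Same endpoint setup plus explicit GLM model pins for each Claude model tier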
    export ANTHROPIC_AUTH_TOKEN="your-zai-api-key"
    export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
    export API_TIMEOUT_MS="3000000"
    export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5"
    export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5-turbo"
    export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.5-air"
    claude "$@"
}</code></code></pre><p>Available GLM models at time of writing: glm-4.7, glm-5, glm-5-turbo, glm-5.1. Leaving the overrides out lets Z.AI update the mapping automatically as new models release, so only pin if you have a specific reason to.</p><p>For privacy: Z.AI states it does not store any content you provide or generate. All services are based in Singapore. See their <a href="https://docs.z.ai/legal-agreement/privacy-policy">privacy policy</a> for details.</p><h2>Setting up the skills</h2><p>Both skills live in your project&#8217;s <code>.claude/skills/</code> directory. You can grab them from my <a href="https://github.com/centminmod/my-claude-code-setup">Claude Code starter template</a>, which is publicly available and includes these skills plus additional agents and workflow tooling.</p><h3>Skill file structure</h3><pre><code><code>.claude/
  skills/
    consult-codex/
      SKILL.md
    consult-zai/
      SKILL.md
  agents/
    codex-cli.md
    zai-cli.md
code-searcher.md</code></code></pre><p>Each skill has a single <code>SKILL.md</code> file that tells Claude Code when and how to use it. The agents directory contains the CLI wrapper definitions that the skills invoke.</p><h3>How consult-codex works internally</h3><p>When you run <code>/consult-codex "analyze the auth middleware for security issues"</code>, the skill:</p><ol><li><p><strong>Wraps your question</strong> with structured output requirements &#8211; it asks both AIs for summaries, key findings with <code>file:line</code> citations, confidence levels, and acknowledged limitations</p></li><li><p><strong>Writes a temp file</strong> (<code>$CLAUDE_PROJECT_DIR/tmp/codex-prompt.txt</code>) to avoid shell quoting issues with complex prompts</p></li><li><p><strong>Launches both AIs in parallel</strong> using a single message with multiple tool calls:</p><ul><li><p>Codex runs via: <code>bash -i -c 'codex -p readonly exec "$(cat $CLAUDE_PROJECT_DIR/tmp/codex-prompt.txt)" --json 2&gt;&amp;1'</code></p></li><li><p>code-searcher runs as a custom Claude Code subagent (defined in <code>.claude/agents/code-searcher.md</code> in the starter template)</p></li></ul></li><li><p><strong>Parses Codex JSON output</strong> using jq recipes to extract reasoning, agent messages, and command executions from the JSONL stream</p></li><li><p><strong>Cleans up</strong> the temp file, then builds the comparison table</p></li></ol><p>The readonly permission mode for Codex is important &#8211; it can read your codebase but cannot modify files, which keeps the consultation safe.</p><h3>How consult-zai works internally</h3><p>The <code>consult-zai</code> skill follows the same architecture but calls the z.ai CLI instead:</p><pre><code><code>bash -i -c 'zai -p "$(cat $CLAUDE_PROJECT_DIR/tmp/zai-prompt.txt)" --output-format json --append-system-prompt "You are GLM 4.7 model accessed via z.ai API." 2&gt;&amp;1'</code></code></pre><p>The <code>--append-system-prompt</code> flag tells the model its identity, which helps when the comparison table attributes findings to each source.</p><h3>The comparison output</h3><p>Both skills produce a structured comparison with:</p><ul><li><p>Raw responses from each AI</p></li><li><p>A table comparing file paths found, line number specificity, code snippet quality, unique findings, and accuracy</p></li><li><p>An agreement level assessment</p></li><li><p>A synthesized summary that prioritizes findings corroborated by both agents</p></li><li><p>A recommendation on which source was more helpful for that specific query</p></li></ul><h2>What I learned</h2><p><strong>Parallel execution matters.</strong> Running both AIs simultaneously instead of sequentially cuts total wait time roughly in half. The skill architecture launches both as concurrent tool calls in a single message.</p><p><strong>Structured prompts produce comparable output.</strong> Without the enhanced prompt wrapper, each AI returns answers in different formats, making comparison difficult. Requiring <code>file:line</code> citations and confidence levels from both makes the comparison table meaningful.</p><p><strong>Agreement level is a useful signal.</strong> When both AIs agree, I have higher confidence. When they disagree, that is exactly where I should look manually.
Partial agreement &#8211; where each found something the other missed &#8211; is the most common and most valuable outcome.</p><p><strong>The temp file pattern solves a real problem.</strong> Shell quoting breaks when your prompt contains quotes, backticks, or special characters. Writing to a temp file and reading with <code>$(cat ...)</code> avoids all of that.</p><h2>What did not work</h2><p>The initial version tried to pass prompts directly as shell arguments. That broke constantly with complex code questions containing special characters. The temp file approach was the fix.</p><p>Codex CLI can be slow to respond &#8211; sometimes over a minute for large codebase scans. The skill sets a 10-minute timeout to handle this, but it means you should not expect instant results on complex queries. You may find that Claude&#8217;s response completes before the Codex CLI response arrives, in which case you can prompt Claude to take Codex CLI&#8217;s response into account.</p><p>Z.AI&#8217;s token limits on the Lite plan can be constraining for large analysis tasks. If you plan to use <code>consult-zai</code> heavily, the Pro tier is worth considering.</p><h2>What is next</h2><p>These skills are part of my <a href="https://github.com/centminmod/my-claude-code-setup">Claude Code starter template</a>, which includes the full setup: skills, agents, memory bank system, shell aliases, git worktree launchers, and more. I will cover the complete template in a dedicated post.</p><p>If you want to try these skills today, clone the starter template repo and copy the <code>.claude/skills/consult-codex/</code> and <code>.claude/skills/consult-zai/</code> directories into your own project. Make sure you also grab the agent definitions from <code>.claude/agents/</code>.</p><p>If you&#8217;re interested in practical AI building for web apps, developer workflows, and infrastructure, subscribe for future posts. You can also follow my shorter updates on <a href="https://www.threads.com/@george_sl_liu">Threads (@george_sl_liu)</a> and <a href="https://bsky.app/profile/georgesl.bsky.social">Bluesky (@georgesl.bsky.social)</a> or subscribe and follow along.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://buymeacoffee.com/georgeliu&quot;,&quot;text&quot;:&quot;Buy Me A Coffee&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://buymeacoffee.com/georgeliu"><span>Buy Me A Coffee</span></a></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://ai.georgeliu.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Building an AI Image Creator Skill for Claude Code]]></title><description><![CDATA[ai-image-creator skill for LLM models]]></description><link>https://ai.georgeliu.com/p/building-an-ai-image-creator-skill</link><guid isPermaLink="false">https://ai.georgeliu.com/p/building-an-ai-image-creator-skill</guid><dc:creator><![CDATA[George Liu]]></dc:creator><pubDate>Sat, 04 Apr 2026 17:37:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5Arj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I needed project images for web apps and websites I was developing. Logos, social media banners, product shots. The usual approach is to open ChatGPT, Google Gemini, Midjourney or DALL-E, generate something, download it, resize it manually, hope the style stays consistent across sizes. Repeat for every asset.</p><p>Instead, I built a Claude Code skill that generates images from the terminal and also via the Claude Desktop macOS app (see Claude Cowork example below). One command, any AI model, with transparent backgrounds, reference image editing, prompt engineering patterns, and composite banner generation built in. Every image on my <a href="https://timezones.centminmod.com">Timezone Scheduler</a> site was created this way without ever leaving my code editor.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://ai.georgeliu.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>This post walks through what the skill does, how it works under the hood, and why building image generation into a developer workflow changes how you think about visual assets.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Arj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Arj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png 424w, https://substackcdn.com/image/fetch/$s_!5Arj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png 848w, https://substackcdn.com/image/fetch/$s_!5Arj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png 1272w, https://substackcdn.com/image/fetch/$s_!5Arj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Arj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1289123,&quot;alt&quot;:&quot;Example Claude Code ai-image-creator Skill generated image&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://georgeliuoz.substack.com/i/193055371?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Example Claude Code ai-image-creator Skill generated image" title="Example Claude Code ai-image-creator Skill generated image" srcset="https://substackcdn.com/image/fetch/$s_!5Arj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png 424w, 
https://substackcdn.com/image/fetch/$s_!5Arj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png 848w, https://substackcdn.com/image/fetch/$s_!5Arj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png 1272w, https://substackcdn.com/image/fetch/$s_!5Arj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f7bdd2-701a-430e-8063-e6df17dc8e43_2160x1206.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example Claude Code ai-image-creator Skill generated image</figcaption></figure></div><h2>What problem this solves</h2><p>If you are a developer building web apps, you probably treat image assets as a separate workflow. You switch to a browser-based AI tool, type a prompt, download the result, maybe resize it with a separate tool, then move it into your project. Each image is a context switch.</p><p>The <a href="https://github.com/centminmod/my-claude-code-setup">ai-image-creator skill</a> eliminates that. It runs inside Claude Code, so you stay in your terminal the whole time. You describe what you want in natural language, Claude Code enhances your prompt using built-in patterns, picks the right model, and generates the image directly into your project directory. 
If you need multiple sizes for social media or ad banners, a separate composite mode handles that from a single config file.</p><p>The skill supports five AI image models through OpenRouter&#8217;s API, all proxied through Cloudflare AI Gateway for monitoring and cost control:</p><ul><li><p><a href="https://openrouter.ai/google/gemini-3.1-flash-image-preview">Gemini 3.1 Flash Image Preview</a> (Google Nano Banana 2)</p></li><li><p><a href="https://openrouter.ai/black-forest-labs/flux.2-max">FLUX.2 Max</a></p></li><li><p><a href="https://openrouter.ai/sourceful/riverflow-v2-pro">Riverflow v2 Pro</a></p></li><li><p><a href="https://openrouter.ai/bytedance-seed/seedream-4.5">Seedream 4.5</a></p></li><li><p><a href="https://openrouter.ai/openai/gpt-5-image">GPT-5 Image</a></p></li></ul><figure><img src="https://substackcdn.com/image/fetch/$s_!aXyq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2ce2091-3215-4957-980d-0bb7f11261cf_2070x940.png" alt=""></figure><p>Example image created for my <a href="https://centminmod.com/nginx">Centmin Mod site&#8217;s Nginx page</a>. I asked Claude Code to read the local copy of each of my sites&#8217; pages and generate an image that accurately depicts the content of each page.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!-ME_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ddf1fe4-0277-49c1-be59-adfa875b4590_960x536.png" alt=""></figure>
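<p>For orientation, the <code>-m</code> flag in the commands later in this post selects one of these models by shorthand. Here is a rough sketch of that mapping as a Python dict; the model IDs come from the list above, but apart from <code>gemini</code> (used in the examples below) the shorthand names are my guesses rather than the script&#8217;s exact flags:</p><pre><code># Hypothetical shorthand-to-model mapping; only "gemini" is confirmed
# by the examples in this post.
MODEL_MAP = {
    "gemini":    "google/gemini-3.1-flash-image-preview",  # Google Nano Banana 2
    "flux":      "black-forest-labs/flux.2-max",
    "riverflow": "sourceful/riverflow-v2-pro",
    "seedream":  "bytedance-seed/seedream-4.5",
    "gpt5":      "openai/gpt-5-image",
}
</code></pre>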
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>How SKILL.md uses progressive disclosure</h2><p>Before getting into what the skill does, it is worth explaining how it is structured, because the design pattern is reusable for any complex Claude Code skill.</p><p>The <code>SKILL.md</code> file is the entry point. It uses <strong>progressive disclosure</strong>: only the instructions immediately needed for any given task are loaded into context. The bulk of the knowledge lives in separate reference files that are only read when a specific category is detected.</p><p>Here is how it works in practice. SKILL.md contains a routing table:</p><blockquote><p>If the request mentions &#8220;product shot&#8221; or &#8220;product photo&#8221; &#8594; read <code>prompt-core.md</code> + <code>prompt-categories.md</code> section <code>product_hero</code></p></blockquote><p>When you ask for a product shot, Claude Code reads only those two files. If you ask for a social media graphic, it reads a different pair. If you ask for a simple image with a basic prompt, it skips the reference files entirely and generates directly.</p><p>The alternative would be loading all ~25,000 words of prompt engineering reference into every single request. That would consume a large portion of the context window even for trivial requests like &#8220;generate a blue circle.&#8221; Progressive disclosure keeps the skill lightweight for simple tasks and deep for complex ones.</p><p>The three reference files behind the routing:</p><ul><li><p><code>prompt-core.md</code> &#8211; foundational rules: narrative prompting, camera/lens specs, lighting setups, text rendering rules, model recommendations</p></li><li><p><code>prompt-categories.md</code> &#8211; 11 category formulas with templates and complete example prompts</p></li><li><p><code>prompt-platforms.md</code> &#8211; social media ratios, IAB ad sizes, web dimensions, POD specs</p></li></ul><p>A fourth file, <code>composite-reference.md</code>, is only loaded when composite banner mode is detected. The analyze mode has its own <code>analyze-reference.md</code>. The setup guide is in <code>setup-guide.md</code>. 
<p>This pattern keeps skill files maintainable too. When I update the product_hero formula, I edit one section of one file. When I add a new model, I update one table in SKILL.md. No sprawling monolith to edit.</p><h2>How it works</h2><p>The skill is a Python script (~1,300 lines) that runs through <code>uv</code> (a fast Python runner). The core workflow is three steps:</p><ol><li><p><strong>Write or enhance a prompt.</strong> For simple requests, pass an inline prompt. For anything complex, write to a temp file. The skill includes prompt engineering reference files covering 11 categories (product hero shots, lifestyle photos, social media graphics, marketing banners, icons, illustrations, food photography, architecture, infographics, print-on-demand designs, and image analysis). Each category has a formula template with camera specs, lighting setups, and composition guidance.</p></li><li><p><strong>Generate the image.</strong> The script calls OpenRouter&#8217;s API (or Google AI Studio directly), routing through Cloudflare AI Gateway when configured. You control aspect ratio, resolution, model selection, and output path.</p></li><li><p><strong>Post-process if needed.</strong> Resize, convert formats, or generate transparent backgrounds using FFmpeg and ImageMagick pipelines.</p></li></ol><p>A typical command looks like this:</p><pre><code><code>uv run python generate-image.py \
  -o "assets/hero-banner.png" \
  -m gemini \
  -a "16:9" \
  -s "2K" \
  -p "A flat-design globe icon with timezone band lines in blue and teal"</code></code></pre><p>This will generate a image at with an accompanying markdown file which contains the actual prompt the ai-image-creator skill used</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;8c6818fd-12e6-4532-8b8e-3aaeb3b2583f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">ls -lAh
total 4696
-rw-r--r--@ 1 george  staff   2.3M  5 Apr 02:12 hero-banner.png
-rw-r--r--@ 1 george  staff   249B  5 Apr 02:12 hero-banner.prompt.md</code></pre></div><figure><img src="https://substackcdn.com/image/fetch/$s_!Wtgt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c660832-9242-4ac6-a8e2-716c3171fbbf_640x357.png" alt=""></figure><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:&quot;1602d703-ec3f-4bdf-82e3-7754de874246&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown"># Prompt

- **Model:** google/gemini-3.1-flash-image-preview
- **Provider:** openrouter (gateway)
- **Aspect ratio:** 16:9
- **Image size:** 2K
- **Elapsed:** 38.5s

## Prompt Text

A flat-design globe icon with timezone band lines in blue and teal</code></pre></div><h2>The prompt engineering layer</h2><p>This is where most of the value lives. The difference between a mediocre AI image and a professional one is almost entirely in the prompt. The skill includes three reference files totalling ~25,000 words of prompt patterns I refined through hundreds of generations.</p><p><strong>Core principle: narrative over keywords.</strong> &#8220;A warm afternoon in a sunlit cafe with a steaming latte on a marble table&#8221; produces dramatically better results than &#8220;latte, cafe, warm, marble, afternoon, sunlit.&#8221; The reference files teach Claude Code to compose prompts as scene direction, not tag lists.</p><p><strong>Camera language triggers photorealism.</strong> Including real camera specs (Sony A7R IV, 85mm macro at f/2.8) pushes models toward photorealistic output. The reference files map camera/lens/aperture combinations to specific use cases: 85mm macro for product shots, 100mm macro for food, 24mm tilt-shift for architecture, 35mm for lifestyle scenes.</p><p><strong>Quality modifiers are wasted tokens.</strong> Words like &#8220;4K&#8221;, &#8220;ultra HD&#8221;, &#8220;masterpiece&#8221; are ignored by modern image models. The reference files focus on describing what you actually want instead of asking for generic quality.</p><p><strong>Text rendering has specific rules.</strong> AI models can render text in images but need precise instructions. Always wrap text in quotation marks. Describe font style (&#8220;bold condensed sans-serif&#8221;), not font names (&#8220;Bebas Neue&#8221;). Specify placement, case, and how text integrates with the design. Following these rules gets 95%+ text accuracy.</p><p>Each of the 11 categories has a formula template. Here is the product hero pattern:</p><blockquote><p>Create a [lighting] product photograph of [detailed product description] on/against [surface/background]. Shot with [camera] [lens] at [aperture]. [Lighting description]. [Composition notes]. [Mood/atmosphere].</p></blockquote><p>Claude Code detects the category from your request and loads the matching pattern automatically. If you ask for &#8220;a product shot of headphones,&#8221; it reads the product_hero template and composes a detailed prompt with camera specs, lighting direction, surface material, and atmosphere. You still control what you want. 
The skill just makes sure the prompt is structured for the best possible output.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!Ogk2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e63d34-80a9-4641-a0c2-843207c85437_640x349.png" alt="Example Claude Code ai-image-creator Skill generated image"><figcaption class="image-caption">Example Claude Code ai-image-creator Skill generated image: A warm afternoon in a sunlit cafe with a steaming latte on a marble table</figcaption></figure>
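<p>To illustrate how a formula template turns into a finished prompt, here is a toy sketch in Python. The skill composes this in prose via the reference files rather than through a function like this; the default camera and lighting values are my own placeholders:</p><pre><code># Toy illustration of filling the product hero formula from the blockquote above.
def product_hero_prompt(product, surface,
                        camera="Sony A7R IV", lens="85mm macro", aperture="f/2.8"):
    return (
        f"Create a softly lit product photograph of {product} on {surface}. "
        f"Shot with {camera} {lens} at {aperture}. "
        "Single key light from the upper left with a soft white fill. "
        "Centered composition with generous negative space. Calm, premium mood."
    )

print(product_hero_prompt("matte-black wireless headphones", "brushed concrete"))
</code></pre>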
<h2>Transparent backgrounds without Photoshop</h2><p>One feature I use constantly is transparent mode (<code>-t</code>). It generates images with transparent backgrounds using a three-step pipeline:</p><ol><li><p>The prompt is augmented to place the subject on a solid green (#00FF00) screen</p></li><li><p>FFmpeg removes the green background and cleans up green fringe from edges</p></li><li><p>ImageMagick auto-crops the transparent padding</p></li></ol><p>This gives you clean PNGs with alpha channels for icons, logos, mascots, sprites, and any asset that needs to sit on different backgrounds. No manual masking, no Photoshop.</p><pre><code><code>uv run python generate-image.py \
  -o "mascot.png" -t \
  -p "A friendly robot mascot character in flat illustration style"</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ejij!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ejij!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png 424w, https://substackcdn.com/image/fetch/$s_!Ejij!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png 848w, https://substackcdn.com/image/fetch/$s_!Ejij!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png 1272w, https://substackcdn.com/image/fetch/$s_!Ejij!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ejij!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png" width="486" height="624" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:486,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:119585,&quot;alt&quot;:&quot;A friendly robot mascot character in flat illustration style&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://georgeliuoz.substack.com/i/193055371?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A friendly robot mascot character in flat illustration style" title="A friendly robot mascot character in flat illustration style" srcset="https://substackcdn.com/image/fetch/$s_!Ejij!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png 424w, https://substackcdn.com/image/fetch/$s_!Ejij!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png 848w, https://substackcdn.com/image/fetch/$s_!Ejij!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb541ac38-8d64-4b53-a270-a1f94f9ebed2_486x624.png 1272w, 
<h2>Reference image editing</h2><p>Multimodal models (Gemini and GPT-5) accept reference images alongside text prompts. This enables editing, style transfer, and guided generation:</p><pre><code><code># Edit an existing image
uv run python generate-image.py \
  -o "edited.png" -r "photo.png" \
  -p "Make the background white"

# Style transfer with multiple references
uv run python generate-image.py \
  -o "combined.png" -r "style.png" -r "content.png" \
  -p "Apply the style of the first image to the second"
</code></code></pre><p>When you provide a reference, the prompt should describe what to change, not the entire scene. The model sees the original and needs edits, not a full new description. This is useful for iterating on designs: generate a base image, then refine it through successive edits without starting over.</p><h2>Composite banners: consistent branding at every size</h2><p>The second script in the skill (<code>composite-banners.py</code>) solves a different problem. When you have an existing logo and need it rendered consistently across 15 different banner sizes (social media covers, IAB ad formats, email headers, OG images), AI generation is the wrong tool. Each generation would produce slightly different styling.</p><p>Composite mode uses ImageMagick for offline compositing. No API calls, no network required. You define your brand in a JSON config (logo path, brand name, tagline, colors, font preferences) and specify which banner sizes you need. One command generates all of them with pixel-perfect consistency.</p><pre><code><code># Initialize a config
uv run python composite-banners.py --init

# Generate all banners
uv run python composite-banners.py \
  -c banner-config.json -o ./banners/</code></code></pre><p>The config supports horizontal, horizontal-compact, and centered layouts. It includes presets for IAB standard ad sizes (leaderboard, billboard, medium rectangle, skyscraper), social media covers (Twitter, LinkedIn, Facebook, YouTube), and web assets (hero banners, OG images, email headers, favicons).</p><p>The most powerful workflow combines both scripts. Use <code>generate-image.py</code> to AI-create a hero background or textured pattern, then use <code>composite-banners.py</code> to overlay your logo and text onto that background at all standard sizes. Creative AI visuals plus pixel-perfect logo consistency.</p><h2>Image analysis</h2><p>The skill also works in reverse. Instead of generating images, you can analyze existing ones using multimodal AI vision:</p><pre><code><code>uv run python generate-image.py \
  --analyze -r "assets/cafe-latte.png" \
  -p "Describe the subject, style, colors, and composition"</code></code></pre><p>This returns a JSON response with the model&#8217;s analysis. Useful for generating alt text, comparing before/after versions, or understanding what a competitor&#8217;s visual assets are doing well. You can analyze multiple images together for comparison.</p><p>Claude Code Opus 4.6 response:</p><blockquote><p><strong>Subject:</strong> A first-person POV inside a cozy Parisian cafe &#8212; steaming latte with latte art on a marble table, two books (including &#8220;The Hidden Life of Trees&#8221; with tortoise-shell glasses resting on top), a hand holding an open book, and a large window overlooking a bustling street with trees and passersby.</p><p><strong>Style:</strong> Digital photograph with a filmic quality &#8212; visible grain, shallow depth of field, candid/documentary feel.</p><p><strong>Colors:</strong> Warm, natural palette &#8212; rich browns from wood and the cup, greens from potted plants and outdoor trees, stone-beige masonry, warm gold/yellow from overhead lighting, faded whites from furniture.</p><p><strong>Composition:</strong> Two-part split (indoor/outdoor) separated by the window frame. Shallow depth of field keeps the foreground sharp (hand, books, cup) while the background softens. Window panes and wood trim act as leading lines toward the center. Asymmetrical but balanced table arrangement.</p></blockquote><p>An example of using ai-image-creator skill &#8212;analyze to deconstruct an image into JSON structured data with Google Nano Banana 2 and compare it to Claude Opus 4.6&#8217;s own native visual image support. Google Nano Banana 2 picked up more details than Clause Opus 4.6 with my Claude Cowork desktop chat session.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fqbf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fqbf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png 424w, https://substackcdn.com/image/fetch/$s_!fqbf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png 848w, https://substackcdn.com/image/fetch/$s_!fqbf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png 1272w, https://substackcdn.com/image/fetch/$s_!fqbf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fqbf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png" width="1456" height="1676" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1676,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:250932,&quot;alt&quot;:&quot;ai-image-creator skill &#8212;analyze to deconstruct an image into JSON structured data with Google Nano Banana 2&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ai.georgeliu.com/i/193055371?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="ai-image-creator skill &#8212;analyze to deconstruct an image into JSON structured data with Google Nano Banana 2" title="ai-image-creator skill &#8212;analyze to deconstruct an image into JSON structured data with Google Nano Banana 2" srcset="https://substackcdn.com/image/fetch/$s_!fqbf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png 424w, https://substackcdn.com/image/fetch/$s_!fqbf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png 848w, https://substackcdn.com/image/fetch/$s_!fqbf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png 1272w, https://substackcdn.com/image/fetch/$s_!fqbf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa792481-0f40-4fac-9cfc-14464e4a0039_2226x2562.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">ai-image-creator skill 
&#8212;analyze to deconstruct an image into JSON structured data with Google Nano Banana 2</figcaption></figure></div><p>Now that Claude understands the reference images style and composition, I can assign this style a name <strong>Substack Comic Style</strong> - for this Substack&#8217;s creative images that I create for each of my Substack article posts. You&#8217;ll need to subscribe to my Substack to find out what article these generated images belong to &#128521;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0mg9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f628b-8914-476a-aaa8-ca34ac7f8c99_4344x2248.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0mg9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f628b-8914-476a-aaa8-ca34ac7f8c99_4344x2248.png 424w, https://substackcdn.com/image/fetch/$s_!0mg9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f628b-8914-476a-aaa8-ca34ac7f8c99_4344x2248.png 848w, https://substackcdn.com/image/fetch/$s_!0mg9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f628b-8914-476a-aaa8-ca34ac7f8c99_4344x2248.png 1272w, https://substackcdn.com/image/fetch/$s_!0mg9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f628b-8914-476a-aaa8-ca34ac7f8c99_4344x2248.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0mg9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f628b-8914-476a-aaa8-ca34ac7f8c99_4344x2248.png" width="1456" height="753" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/682f628b-8914-476a-aaa8-ca34ac7f8c99_4344x2248.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2180841,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ai.georgeliu.com/i/193055371?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f628b-8914-476a-aaa8-ca34ac7f8c99_4344x2248.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0mg9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f628b-8914-476a-aaa8-ca34ac7f8c99_4344x2248.png 424w, https://substackcdn.com/image/fetch/$s_!0mg9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682f628b-8914-476a-aaa8-ca34ac7f8c99_4344x2248.png 848w, 
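<p>Stepping back to the analyze output itself: the four fields in the Opus 4.6 response above map directly onto the kind of structured JSON that <code>--analyze</code> produces. A purely illustrative sketch of that shape (the skill&#8217;s actual schema may differ):</p><pre><code># Illustrative shape only; keys mirror the response above, values abridged.
analysis = {
    "subject": "first-person POV in a Parisian cafe: latte, books, window",
    "style": "digital photograph, filmic grain, shallow depth of field",
    "colors": ["rich browns", "plant greens", "stone beige", "warm gold"],
    "composition": "indoor/outdoor split; sharp foreground, soft background",
}
</code></pre>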
<h2>Cost tracking</h2><p>Every generation is logged to a local <code>.ai-image-creator/costs.json</code> file in your project directory. Running <code>--costs</code> shows per-model breakdown: generation count, total tokens, elapsed time, and recent entries. No API keys or credentials are ever stored, just usage metadata.</p><p>This matters when you are generating dozens of images for a project. You can see exactly which models you are using, how much each generation costs, and where your budget is going.</p>
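<p>To make the shape of that log concrete, here is a rough sketch of appending one entry. The field names are illustrative, mirroring what <code>--costs</code> reports (model, tokens, elapsed time), not the skill&#8217;s actual schema; the token count is a made-up value:</p><pre><code>import json
from pathlib import Path

# Hypothetical costs.json entry; field names and values are illustrative.
log = Path(".ai-image-creator/costs.json")
entries = json.loads(log.read_text()) if log.exists() else []
entries.append({
    "model": "google/gemini-3.1-flash-image-preview",
    "total_tokens": 1290,        # assumption, not a value from this post
    "elapsed_seconds": 38.5,     # matches the hero-banner example above
    "output": "assets/hero-banner.png",
})
log.parent.mkdir(exist_ok=True)
log.write_text(json.dumps(entries, indent=2))
</code></pre>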
<h2>The Cloudflare AI Gateway layer</h2><p>All API calls can be routed through Cloudflare AI Gateway using BYOK (Bring Your Own Key). Your API keys are stored securely in Cloudflare&#8217;s dashboard, never sent in request headers from your machine. The gateway provides unified logging across all models and providers, cost monitoring at the account level, rate limiting to prevent accidental overspend, and automatic fallback to direct API calls if the gateway is unavailable.</p><p>This is the same gateway pattern I use for other AI integrations. Once you have it set up, every AI API call in every project goes through the same monitoring and cost control layer.</p><p>One practical note on credentials: the script supports a <code>.env</code> file at <code>scripts/.env</code> inside the skill folder. This matters for desktop AI tools like Cowork, which run in sandboxed shells that do not inherit your <code>~/.zshrc</code> exports. Populate the <code>.env</code> with your gateway variables, zip the skill folder, and install it in Cowork &#8211; the script loads credentials automatically using only stdlib (no extra dependencies) and never overwrites keys already set in the environment. Running from a regular terminal with shell exports works the same as before.</p><h2>Using the skill in other AI coding tools</h2><p>The skill format follows the open Agent Skills standard, which means it works in more than just Claude Code. The same <code>ai-image-creator/</code> directory can be transplanted into Codex CLI, Gemini CLI, Kilo Code, and OpenCode with minimal changes. Each tool has its own discovery path:</p><ul><li><p><strong>Codex CLI</strong> &#8211; Copy the skill folder to <code>~/.codex/skills/</code> for global availability, or to <code>.agents/skills/</code> at the project root for repo-scoped use.</p></li><li><p><strong>Gemini CLI</strong> &#8211; Place the skill inside <code>~/.gemini/extensions/&lt;extension-name&gt;/skills/</code> for a user-level extension, or in <code>.agents/skills/</code> in your workspace.</p></li><li><p><strong>Kilo Code</strong> &#8211; Drop it into <code>.kilocode/skills/</code> at the project root, or add the path via <code>skills.paths</code> in your <code>kilo.jsonc</code> config for global availability across projects.</p></li><li><p><strong>OpenCode</strong> &#8211; Global skills go in <code>~/.config/opencode/skills/</code>. Project-scoped skills go in <code>.opencode/skills/</code> and take higher priority.</p></li></ul><p>The <code>SKILL.md</code> frontmatter (<code>name</code>, <code>description</code>, <code>allowed-tools</code>, <code>compatibility</code>) is read by all of these tools to discover and describe the skill. The <code>generate-image.py</code> script itself is tool-agnostic: it runs via <code>uv run python</code>, which works the same regardless of which AI coding harness invokes it.</p><p>One thing to check across tools: the <code>${CLAUDE_SKILL_DIR}</code> variable used in <code>SKILL.md</code> to reference script paths. Claude Code expands this automatically. Other tools may not. If a tool does not expand <code>${CLAUDE_SKILL_DIR}</code>, replace it with an absolute path or a <code>$(dirname "$0")</code>-style resolution in the script invocations.</p><h2>What I learned building this</h2><p><strong>Prompt structure matters more than model choice.</strong> Switching from a keyword-style prompt to a narrative prompt with camera specs improved output quality more than switching between models. The reference files encode this knowledge so you do not have to rediscover it for each generation.</p><p><strong>Transparent backgrounds are harder than they look.</strong> My first attempt used simple color replacement, which left green fringe artifacts around edges. The FFmpeg chroma key pipeline with edge cleanup handles this properly, but it took several iterations to get clean results on complex shapes like hair or translucent materials.</p><p><strong>Composite mode saved the most time.</strong> Generating 15 banner sizes individually with AI would take 15 API calls, 15 different results, and manual checking for brand consistency. Composite mode does it in one local operation with zero API cost and guaranteed consistency.</p><p><strong>Cost tracking changes behavior.</strong> Once you can see per-generation costs, you start making smarter choices about which model to use for which task. Gemini is cheap and versatile for most web assets.
FLUX.2 is better for artistic work but costs more. Having the data makes the trade-off explicit.</p><h2>What did not work</h2><p><strong>Film stock prompts are hit or miss.</strong> The reference files include film stock aesthetics (&#8220;Shot on Kodak Portra 400&#8221; for warm skin tones, &#8220;Shot on CineStill 800T&#8221; for cinematic tungsten tones). Some models respond well to these, others ignore them entirely. I keep them in the reference because when they work, the aesthetic shorthand saves a lot of descriptive text.</p><p><strong>Text rendering still fails on long strings.</strong> Even with the seven-rule system (quotation marks, font style not name, explicit case, short text), anything beyond 5-6 words becomes unreliable depending on the model. Google Gemini 3.1 Image Flash / Google Nano Banana 2 is much better at text rendering, though.</p><p><strong>Google AI Studio&#8217;s free tier does not support image generation for Google Gemini 3.1 Image Flash / Google Nano Banana 2.</strong> The quota is zero for image generation unless you enable billing on the linked Google Cloud project. I documented this in the setup guide after debugging a cryptic <code>429 RESOURCE_EXHAUSTED</code> error. OpenRouter&#8217;s pay-as-you-go model is simpler for most users.</p><h2>How to use it</h2><p>The skill is open source, bundled in my <a href="https://github.com/centminmod/my-claude-code-setup">Claude Code starter template repository</a> at <code>.claude/skills/ai-image-creator/</code>. To use it in your own Claude Code projects:</p><ol><li><p>Copy the <code>ai-image-creator/</code> folder to your project&#8217;s <code>.claude/skills/</code> directory</p></li><li><p>Add permission entries to <code>.claude/settings.local.json</code></p></li><li><p>Set your API keys. There are two ways, depending on how you run the skill. <strong>Terminal / Claude Code CLI:</strong> Export variables in your shell profile (<code>~/.zshrc</code> or <code>~/.bashrc</code>):</p></li></ol><pre><code><code>   export AI_IMG_CREATOR_CF_ACCOUNT_ID="your-account-id"
   export AI_IMG_CREATOR_CF_GATEWAY_ID="your-gateway-name"
   export AI_IMG_CREATOR_CF_TOKEN="your-gateway-token"
   # Or for direct OpenRouter access without a gateway:
   export AI_IMG_CREATOR_OPENROUTER_KEY="sk-or-..."</code></code></pre><p><strong>Claude Cowork (desktop app):</strong> Cowork runs in a sandboxed shell that does not inherit your shell profile. Create a <code>.env</code> file at <code>scripts/.env</code> inside the skill folder:</p><pre><code><code>   AI_IMG_CREATOR_CF_ACCOUNT_ID=your-account-id
   AI_IMG_CREATOR_CF_GATEWAY_ID=your-gateway-name
   AI_IMG_CREATOR_CF_TOKEN=your-gateway-token</code></code></pre><p>Then zip the entire <code>ai-image-creator/</code> folder (with the populated <code>.env</code> inside) and install it as a skill in Cowork. The script loads the <code>.env</code> automatically on startup. Shell exports take precedence over the <code>.env</code> if both are present.</p>
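<p>The loading behavior described above is simple enough to sketch. Here is a minimal stdlib version, assuming plain <code>KEY=VALUE</code> lines; this illustrates the described behavior, it is not the script&#8217;s literal code:</p><pre><code>import os
from pathlib import Path

# Load scripts/.env if present, but never overwrite variables that are
# already exported in the shell environment (shell exports win).
env_file = Path(__file__).resolve().parent / ".env"
if env_file.exists():
    for raw in env_file.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
</code></pre>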
srcset="https://substackcdn.com/image/fetch/$s_!t99u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc74ba4-b886-4196-9495-e9c324cf176b_2942x2194.png 424w, https://substackcdn.com/image/fetch/$s_!t99u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc74ba4-b886-4196-9495-e9c324cf176b_2942x2194.png 848w, https://substackcdn.com/image/fetch/$s_!t99u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc74ba4-b886-4196-9495-e9c324cf176b_2942x2194.png 1272w, https://substackcdn.com/image/fetch/$s_!t99u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dc74ba4-b886-4196-9495-e9c324cf176b_2942x2194.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Claude Cowork MacOS app creating image using ai-image-creator skill</figcaption></figure></div><p>Claude Desktop MacOS app uploaded ai-image-creator skill:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uoW-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc221ea3d-cd91-4f1f-8eea-aea175bc60f4_2734x1524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uoW-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc221ea3d-cd91-4f1f-8eea-aea175bc60f4_2734x1524.png 424w, https://substackcdn.com/image/fetch/$s_!uoW-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc221ea3d-cd91-4f1f-8eea-aea175bc60f4_2734x1524.png 848w, 
<h2>What&#8217;s next</h2><p>I am working on expanding model support as new image generation models land on OpenRouter. I am also exploring batch generation workflows where the skill generates all visual assets for a web project (favicon, OG image, social banners, app icons) from a single brand brief.</p><p>If you are building with Claude Code and want to see how custom skills extend what it can do, this is a concrete example.</p><div><hr></div><p>If you&#8217;re interested in practical AI building for web apps, developer workflows, and infrastructure, subscribe for future posts. You can also follow my shorter updates on <a href="https://www.threads.com/@george_sl_liu">Threads (@george_sl_liu)</a> and <a href="https://bsky.app/profile/georgesl.bsky.social">Bluesky (@georgesl.bsky.social)</a>.</p><p><a class="button primary" href="https://buymeacoffee.com/georgeliu"><span>Buy Me A Coffee</span></a></p>]]></content:encoded></item><item><title><![CDATA[Why 11 AI Models Got Timezone Scheduling Wrong (and What I Built Instead)]]></title><description><![CDATA[If you have ever scheduled a meeting across three continents, you know the math.]]></description><link>https://ai.georgeliu.com/p/why-11-ai-models-got-timezone-scheduling</link><guid isPermaLink="false">https://ai.georgeliu.com/p/why-11-ai-models-got-timezone-scheduling</guid><dc:creator><![CDATA[George Liu]]></dc:creator><pubDate>Sat, 04 Apr 2026 15:49:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!130Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b9256f6-4ab1-49fd-a51e-676edeb4c4e5_2030x2624.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you have ever scheduled a meeting across three continents, you know the math.
You check one timezone converter, then another, then you do the offset arithmetic in your head, and somewhere in the process you forget that New York has already switched to daylight saving but London has not yet.</p><p>I do consulting work across Brisbane, the US, and Europe, and finding a time that works for everyone can be a challenge. So I built a tool with Claude Code (Anthropic&#8217;s AI coding agent that writes and runs code from natural language instructions) to handle it, with both a graphical web interface and command line / API support. Then, out of curiosity, I gave 11 AI models the same timezone scheduling problem my tool solves. The best model scored 86 out of 100. My tool scored 100.</p><p>This is not about AI being bad at timezone math. It is about recognizing when a purpose-built tool outperforms general-purpose AI, and what builders can learn from that gap.</p><p>Timezones Scheduler web app interface at <a href="https://timezones.centminmod.com/">https://timezones.centminmod.com</a>:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!130Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b9256f6-4ab1-49fd-a51e-676edeb4c4e5_2030x2624.png" alt="Timezones Scheduler web app interface"></figure>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b9256f6-4ab1-49fd-a51e-676edeb4c4e5_2030x2624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1882,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99063,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://georgeliuoz.substack.com/i/193042702?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b9256f6-4ab1-49fd-a51e-676edeb4c4e5_2030x2624.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!130Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b9256f6-4ab1-49fd-a51e-676edeb4c4e5_2030x2624.png 424w, https://substackcdn.com/image/fetch/$s_!130Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b9256f6-4ab1-49fd-a51e-676edeb4c4e5_2030x2624.png 848w, https://substackcdn.com/image/fetch/$s_!130Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b9256f6-4ab1-49fd-a51e-676edeb4c4e5_2030x2624.png 1272w, https://substackcdn.com/image/fetch/$s_!130Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b9256f6-4ab1-49fd-a51e-676edeb4c4e5_2030x2624.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Same prompt, 11 models, one rubric</h2><p>I gave every model the same prompt for March 23, 2026:</p><blockquote><p>&#8220;Find the most optimal meeting time that works across Brisbane, New York, and London, ideally falling within business hours for all three cities, or as close to business hours as possible.&#8221;</p></blockquote><p>Here is the constraint that makes this hard: Brisbane is 
<p>Here is the constraint that makes this hard: Brisbane is 14 hours ahead of New York. When it is 9 AM Monday in New York, it is 11 PM Monday in Brisbane. Getting all three cities into standard business hours simultaneously is mathematically impossible. The best any solution can do is land two cities in core business hours and one outside, which I score as 6 out of a possible 9 points (3 + 3 + 0 on the per-city scale described below).</p><p>I scored every response across five categories totalling 100 points: time optimization (25), DST accuracy (20), timezone math consistency (15), analysis quality (20), and practical value (20).</p><p><strong>Only 2 out of 11 models found the optimal time slot.</strong></p><figure><img src="https://substackcdn.com/image/fetch/$s_!8rdu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F095de54f-3fb9-4cae-943e-09389fe32734_405x630.png" alt=""></figure>
<p>Full benchmark with every model&#8217;s raw response: <a href="https://github.com/centminmod/timezone-scheduler">github.com/centminmod/timezone-scheduler</a></p><h2>Why AI struggles with this</h2><p>Three patterns stood out across the 11 models I tested:</p><p><strong>Models contradict their own analysis.</strong> MiniMax M2.7 recommended 5:00 AM New York, then labeled it &#8220;too early&#8221; in the same response. Gemini 3.1 Pro found the optimal slot in its ranked list, then recommended a different, worse time in its summary paragraph. When models reason through trade-offs, they sometimes lose track of their own conclusions. You have probably seen this yourself: an AI gives you a great answer, then undermines it two paragraphs later.</p><p><strong>Models do not check every option.</strong> My tool evaluates all 24 hours across every timezone combination. Models pick a &#8220;reasonable&#8221; slot based on reasoning, which usually means they settle for a good-enough answer instead of finding the best one. Only 2 out of 11 models found the optimal slot.</p><p><strong>Daylight saving transitions are a trap.</strong> March 23, 2026 falls six days before London switches from GMT to British Summer Time. One model, ZAI GLM-5, assumed London was already in BST. That one-hour error cascaded through every time it calculated, dropping its score to 53. If you have ever shown up an hour early or late to a meeting after a clock change, you know the feeling. AI models make the same mistake.</p><h2>What the tool does differently</h2><p>The Timezone Scheduler takes a computational approach. It evaluates every hour of the day across all participant timezones.
Business hours (9 AM to 5 PM) score 3 points per participant. Extended hours (7-9 AM or 5-9 PM) score 1 point. Off hours score zero.</p><p>When multiple slots tie on business-hours score, a comfort tie-breaker kicks in. The idea is simple: 3 AM is the worst possible meeting time for anyone. The tool measures how far the worst-off participant is from 3 AM on a circular 24-hour scale; the farther from 3 AM, the better. This is why the tool picks 11 PM Brisbane over midnight Brisbane. Both score 6/9, but 11 PM is farther from &#8220;miserable o&#8217;clock&#8221; and therefore a more civil hour.</p><p>The tool evaluates every possible slot deterministically. It does not reason about trade-offs and drift toward a plausible answer. It scores them all and returns the best one.</p>
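<p>To make that concrete, here is a minimal TypeScript sketch of the scoring scheme as described above: the 3/1/0 business-hours scale plus the circular comfort tie-breaker. It is an illustration of the logic rather than the production Worker code, and the helper names are mine:</p><pre><code>// Sketch of the scoring scheme described above (illustrative, not the
// production code). Each participant's local hour is an integer 0-23.

// 3 points inside 9 AM-5 PM, 1 point in the 7-9 AM / 5-9 PM shoulders, else 0.
function hourScore(h: number): number {
  if (h >= 9 &amp;&amp; h &lt; 17) return 3;
  if ((h >= 7 &amp;&amp; h &lt; 9) || (h >= 17 &amp;&amp; h &lt; 21)) return 1;
  return 0;
}

function totalScore(localHours: number[]): number {
  return localHours.reduce((sum, h) => sum + hourScore(h), 0);
}

// Comfort tie-breaker: circular distance from 3 AM for the worst-off
// participant. Larger is better (farther from "miserable o'clock").
function comfort(localHours: number[]): number {
  return Math.min(
    ...localHours.map((h) => {
      const d = Math.abs(h - 3);
      return Math.min(d, 24 - d); // wrap around the 24-hour clock
    })
  );
}

// 11 PM Brisbane [23, 9, 13] vs midnight Brisbane [0, 10, 14]:
// both total 6/9, but comfort is 4 vs 3, so 11 PM wins the tie.
console.log(totalScore([23, 9, 13]), comfort([23, 9, 13])); // 6 4
console.log(totalScore([0, 10, 14]), comfort([0, 10, 14])); // 6 3
</code></pre>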
<figure><img src="https://substackcdn.com/image/fetch/$s_!hZV1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65af2d10-1b63-435d-82f3-d4a60b7b3f7d_2012x2378.png" alt=""></figure><figure><img src="https://substackcdn.com/image/fetch/$s_!GhC6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71c212e-9e54-43e0-a293-649579105aa7_2036x2664.png" alt=""></figure><h2>How it is built</h2><p>If you are not a developer, here is the short version: the tool runs entirely in the cloud on Cloudflare&#8217;s edge network, loads fast anywhere in the world, and uses the browser&#8217;s own built-in timezone data instead of relying on third-party libraries. The technical details follow for those who want them.</p>
<p>The architecture is deliberately simple. A Cloudflare Worker (a serverless function that runs at the network edge, close to users) handles all API logic in TypeScript. The frontend is vanilla JavaScript with no framework, styled with Tailwind CSS v4, and bundled by Vite. No React, no Vue, no Angular. For a tool this focused, a framework would add complexity without adding value.</p><pre><code><code>Browser (Vanilla JS)              Cloudflare Worker (TypeScript)
--------------------              --------------------------------
app.js      (orchestrator)        GET      /api/timezone-data
timezone-data.js (Intl utils)     GET      /api/search
map.js      (SVG world map)       GET|POST /schedule/api  (schedule|suggest|search)
scheduler.js (meeting grid)       GET      /schedule/api/schema  (OpenAPI 3.1)
styles.css  (Tailwind v4)         GET|POST /schedule/api/mcp  (MCP server)
</code></code></pre><p>Zero external timezone libraries. The browser&#8217;s native <code>Intl.DateTimeFormat</code> API handles everything: offset calculations, daylight saving transitions, locale-aware formatting. This matters because the timezone data comes from the same IANA database your operating system uses, so it stays up to date without shipping or maintaining a separate library.</p>
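<p>As a minimal sketch of what that looks like (my own illustrative helper, not the app&#8217;s actual code), here is a participant&#8217;s local hour for any IANA timezone using nothing but <code>Intl.DateTimeFormat</code>:</p><pre><code>// A city's local hour from the browser's built-in IANA timezone data
// via Intl.DateTimeFormat -- no external timezone library involved.
function localHour(date: Date, timeZone: string): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "numeric",
    hourCycle: "h23", // 0-23, avoids AM/PM handling
  }).formatToParts(date);
  return Number(parts.find((p) => p.type === "hour")!.value);
}

// DST falls out for free: March 23, 2026 13:00 UTC lands after New York's
// switch to EDT but before London's switch to BST.
const t = new Date(Date.UTC(2026, 2, 23, 13, 0));
console.log(localHour(t, "Australia/Brisbane")); // 23 (11 PM, UTC+10)
console.log(localHour(t, "America/New_York"));   // 9  (EDT, UTC-4)
console.log(localHour(t, "Europe/London"));      // 13 (still GMT)
</code></pre>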
<p>The meeting scheduler displays a 24-hour comparison grid, color-coded so you can see at a glance which hours are business hours, which are early or late but workable, and which are off hours. You can share a meeting link that preserves the full state (cities, date, time), generate a QR code, or export directly to Google Calendar, Outlook, or any calendar app. It also works offline as a PWA (Progressive Web App, meaning you can install it on your phone like a native app). Deployment is on Cloudflare Workers with automatic deploys through GitHub Actions.</p><h2>Seven days, 117 commits</h2><p>The Timezone Scheduler went from zero to production in seven days. The git history tells the real story of AI-assisted development: not a clean arc from idea to launch, but a messy, iterative process where features, fixes, and rethinks pile up fast.</p><p><strong>Day 1 (March 21): From nothing to live, 19 commits.</strong> By end of day, the tool was deployed on a custom domain with a working world map, city search, 24-hour scheduler grid, light/dark theme, and responsive layout. A functional timezone tool, built and shipped in one day.</p><p><strong>Day 2 (March 22): Making it shareable, 11 commits.</strong> I realized I needed to let people send each other meeting links, so shareable URLs landed. Then scope grew: support for comparing up to 6 cities at once, with an interface that lets you add and remove cities as tags. An accessibility audit made the tool usable with screen readers.</p><p><strong>Day 3 (March 23): Teaching AI to use the tool, 15 commits.</strong> This was the pivotal day. One commit added 837 lines of code that turned the scheduler into an API that AI assistants can call directly (more on this in &#8220;The deeper angle&#8221; below). The same day, the comfort tie-breaker landed after the benchmark revealed the tool was picking midnight over 11 PM. I also spent several commits refining the instructions that teach AI models how to call the API correctly on their first try.</p><p><strong>Day 4 (March 24): Locking it down, 40 commits.</strong> The single busiest day. I added code quality tools (automated formatting, pre-commit checks), a CI/CD pipeline (code gets tested and deployed automatically on every change), and automated dependency updates. Then I ran two rounds of security audits (OWASP methodology), finding and fixing injection vulnerabilities, rate limit bypasses, and cross-site scripting risks. I also built a demo video pipeline using Remotion (a React-based video framework) with custom Claude Code skills to guide the animations and sequencing.</p><p><strong>Days 5-7 (March 25-27): Real-world bugs, 30 commits.</strong> Three more security audit rounds, each catching issues the previous round missed. Offline support so the tool works without an internet connection. An <code>ai-image-creator</code> skill for generating all project images through AI. An automated test suite of 38 tests against the live site. And two bugs that only surfaced when real users tried the tool: a share link that lost its data on load, and a single-character typo that showed wrong times for anyone outside my timezone.</p><p>Every one of those 117 commits was a conversation with Claude Code.</p><h2>How Claude Code built it</h2><p>The whole project was built through back-and-forth sessions with Claude Code. I described what I wanted in plain English, Claude Code wrote the code, and we iterated together. I used a &#8220;memory bank&#8221; system (a set of project files that Claude Code reads at the start of each session) to keep context persistent, so it remembered previous decisions instead of starting fresh every time.</p><p>The irony is not lost on me. The tool that scored 100/100 against 11 AI models was itself written by an AI. But Claude Code was not reasoning about timezone math the way those models were. It was writing deterministic code to evaluate every possibility. That is a fundamentally different task. When I asked Claude Code to implement the scoring algorithm, it produced code that loops through 24 hours, calculates each participant&#8217;s local time using the <code>Intl</code> API, and sums the scores. No reasoning shortcuts, no &#8220;this seems reasonable.&#8221; Just exhaustive evaluation.</p><p>The Day 3 API commit is the clearest example of what this workflow looks like at full speed. That single session produced the scheduling API with three actions (schedule, suggest, search), an OpenAPI spec (a machine-readable description of the API), an MCP server (Model Context Protocol, a standard that lets AI tools connect to external services), and an <code>llms.txt</code> file (plain-text instructions that teach AI models how to use the API). All 837 lines. Before implementation, I used my <code>consult-codex</code> skill to send Claude Code&#8217;s plan to OpenAI Codex CLI (running GPT-5.4) for a second opinion. The dual-AI review caught things neither would have caught alone. The resulting security hardening included encoding fixes for non-English city names, caching that cut redundant timezone calculations from ~3,500 to ~50, request size limits, and daylight saving gap detection.</p><p>Three parts of my Claude Code setup made this kind of velocity possible:</p><ul><li><p><strong>Claude Code auto memory</strong> &#8211; Claude Code has a built-in memory system that accumulates knowledge across sessions automatically. As I worked, it saved notes for itself: build commands, debugging insights, architecture patterns, code style preferences. On top of that, my CLAUDE.md memory bank files give it explicit project instructions. Between the two, each session starts with full contextual awareness of what has been built and decided so far.</p></li><li><p><strong>Context7 MCP</strong> &#8211; Connects Claude Code to live library documentation. Instead of guessing how a browser API works based on its training data, Claude Code reads the current official docs.</p></li><li><p><strong>Cloudflare Docs MCP</strong> &#8211; Does the same for Cloudflare&#8217;s platform, pulling deployment rules and API behavior from Cloudflare&#8217;s own documentation rather than inferring.</p></li></ul><p>Beyond these integrations, I built project-specific Claude Code skills that extended what Claude Code could do on its own.
Each was built with Claude Code and stored in the repo&#8217;s <code>.claude/skills/</code> directory, so any future session picks them up automatically:</p><ul><li><p><code>consult-codex</code> &#8211; Sends Claude Code&#8217;s plan to a second AI (OpenAI&#8217;s Codex) for a second opinion before implementation starts. Where Codex flagged something differently, I had a real trade-off to evaluate rather than a single AI perspective to accept or reject. Read about my <a href="https://ai.georgeliu.com/p/how-to-get-a-second-ai-opinion-in">Claude Code /consult-codex and /consult-zai skills</a>.</p></li><li><p><code>ai-image-creator</code> &#8211; Generates web app images and site graphics through multiple AI models (Gemini, FLUX.2, SeedDream, GPT-5 Image) via OpenRouter&#8217;s API, proxied through Cloudflare AI Gateway for monitoring and cost control. Supports transparent backgrounds, reference image editing, and per-project cost tracking. Every image on the site was generated this way. Read about my <a href="https://ai.georgeliu.com/p/building-an-ai-image-creator-skill">Claude Code ai-image-creator skill</a>.</p></li><li><p><code>create-demo-video</code> and <code>remotion-best-practices</code> &#8211; Produce polished MP4 demo videos using Remotion, a React-based video framework. Automated Playwright screen captures are composed into desktop and mobile formats with transitions and animated cursors. You can see the results at <a href="https://timezones.centminmod.com/demo-videos">timezones.centminmod.com/demo-videos</a>.</p></li><li><p><code>timezone-tests</code> &#8211; Runs 29 CLI tests plus 9 Chrome DevTools visual tests against the live site, covering every API endpoint, security header, share link hydration, and PWA asset. This is the skill that caught the share link and sign error bugs described below.</p></li><li><p><strong>Claude Code starter template</strong> &#8211; The <code>consult-codex</code> skill and the memory bank system are part of my reusable Claude Code starter template, which I have shared publicly at <a href="https://github.com/centminmod/my-claude-code-setup">github.com/centminmod/my-claude-code-setup</a>. I will cover that setup in detail in a future post.</p></li></ul><h2>What did not work</h2><p><strong>The tie-breaker gap.</strong> The first version scored 98/100, not 100. When two slots tied on business hours, the tool picked midnight Brisbane over 11 PM Brisbane. Both score 6/9, but midnight is objectively worse. Two models (Gemini 3.1 Pro and Grok 4.1 Fast) had independently chosen 11 PM. I added a comfort scoring layer that measures how far the worst-off participant is from 3 AM on a circular scale. The tool improved because I benchmarked it against real alternatives and found a case it was not handling well.</p><p><strong>The invisible sign error.</strong> Six days after launch, I found that meeting times displayed wrong when your browser timezone differed from the selected reference city. The fix was a single character: <code>-</code> to <code>+</code> in the offset correction. The bug was invisible during all my testing because my browser was set to Brisbane and I was testing with Brisbane as the reference. A one-character mistake that only surfaced when a real user in a different timezone tried it.</p><p><strong>Share link loading order.</strong> When you open a shared meeting link, the tool needs to restore the saved cities and the saved time. The bug: the code restored the time first, then added the cities. But adding a city resets the page, which wiped the time that had just been restored. The fix was doing it in the right order: cities first, then time. Obvious in retrospect, but it took tracing through the exact sequence of page updates to find.</p>
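<p>A stripped-down sketch of that fix, with hypothetical stand-ins for the app&#8217;s real functions (only the ordering logic matches what actually changed):</p><pre><code>// Hypothetical stand-ins for the app's real functions; the names are
// illustrative. Adding a city re-renders the grid, which resets the time.
let selectedTime: string | null = null;
const cityList: string[] = [];

function addCity(city: string): void {
  cityList.push(city);
  selectedTime = null; // re-rendering the grid resets the selected time
}

function setSelectedTime(time: string): void {
  selectedTime = time;
}

// The fix: restore cities first, then the time, so the last grid
// reset cannot wipe the restored time.
function hydrateFromShareLink(state: { cities: string[]; time: string }): void {
  for (const city of state.cities) addCity(city);
  setSelectedTime(state.time);
}

hydrateFromShareLink({ cities: ["Brisbane", "New York", "London"], time: "23:00" });
console.log(selectedTime); // "23:00" -- survives because it was restored last
</code></pre>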
<p><strong>Five rounds of security audits.</strong> I used Claude Code to run structured security audits (following OWASP methodology, the industry-standard checklist for web application security) across Days 4-7. Each round found issues the previous round missed. Round 1 caught ways an attacker could inject malicious input. Round 2 found a way to bypass rate limits on the API. Round 3 caught a timing issue in how the server tracked requests. Round 4 found that the Round 3 fix had accidentally broken how share links handle minutes. Round 5 hardened error messages and filtered special characters in calendar exports. The pattern was clear: each fix could introduce a new edge case, and no single audit pass was sufficient.</p><p>None of these bugs were caught during the initial Claude Code build sessions. They surfaced through adversarial testing, real-world usage in different timezones, and deliberate security audits. AI-assisted development is fast, but the testing and hardening phase is where quality actually gets built in.</p><h2>The deeper angle: building tools that AI can call</h2><p>Here is what makes this more than a scheduling story. The Timezone Scheduler is not just for humans. The API is designed so AI assistants can use it too:</p><ul><li><p><code>/llms.txt</code> &#8211; A plain-text file that teaches AI models how to call the API on their first attempt. It front-loads parameter names (<code>from</code>/<code>to</code>, not <code>source</code>/<code>destination</code>), a GET URL template with placeholders, a city reference table with 33 entries, and GET vs POST syntax differences. I refined this across four commits after testing how models actually called the API. The biggest improvement was putting parameter names and the URL template above the examples, because models were copying example city names instead of substituting user input.</p></li><li><p><code>/schedule/api/schema</code> &#8211; An OpenAPI 3.1 spec that any tool-using AI can consume programmatically. It describes the three API actions (schedule, suggest, search), request/response shapes, and scoring fields. Tools like Copilot and Gemini CLI can auto-generate client calls from this spec without reading prose documentation.</p></li><li><p><code>/schedule/api/mcp</code> &#8211; A real MCP (Model Context Protocol) streamable-HTTP server implementing JSON-RPC 2.0 with zero external dependencies. It supports <code>initialize</code>, <code>tools/list</code>, <code>tools/call</code>, and <code>ping</code>. Claude Code, Codex, Gemini CLI, OpenCode, and Kilo CLI can all register it as a tool server. The <code>llms.txt</code> file includes ready-to-paste MCP config blocks for each of these AI coding tools. A minimal client sketch follows this list.</p></li></ul>
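<p>To show what that MCP surface looks like from a client&#8217;s side, here is a minimal JSON-RPC 2.0 sketch in TypeScript. The method names are standard MCP; the rest is simplified for illustration (a real client completes the <code>initialize</code> handshake per the MCP spec and discovers tool names from <code>tools/list</code> rather than assuming them):</p><pre><code>// Illustrative MCP client: plain JSON-RPC 2.0 over streamable HTTP.
const MCP_URL = "https://timezones.centminmod.com/schedule/api/mcp";

async function rpc(method: string, params: object, id: number) {
  const res = await fetch(MCP_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Streamable HTTP servers may answer as JSON or as an event stream.
      "Accept": "application/json, text/event-stream",
    },
    body: JSON.stringify({ jsonrpc: "2.0", id, method, params }),
  });
  return res.json();
}

async function main() {
  // Handshake first (fields per the MCP spec; simplified here).
  await rpc("initialize", {
    protocolVersion: "2025-03-26",
    capabilities: {},
    clientInfo: { name: "sketch-client", version: "0.0.1" },
  }, 1);

  // Discover the server's tools; tools/call reuses the same envelope
  // with { name, arguments } as params.
  const listed = await rpc("tools/list", {}, 2);
  console.log(listed.result.tools.map((t: { name: string }) => t.name));
}

main().catch(console.error);
</code></pre>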
<p>When a user asks any of these tools &#8220;find the best meeting time for Brisbane, New York, and London,&#8221; the AI handles the natural language understanding and the Timezone Scheduler handles the precision. The AI gets 100/100 because it calls the API instead of doing the computation itself.</p><p>This pattern applies beyond timezone scheduling. Tax calculations where rules vary by jurisdiction. Unit conversions where precision matters. Compliance checks where the criteria are enumerable. Any domain where &#8220;good enough reasoning&#8221; is not actually good enough is a candidate for a purpose-built tool that AI can call.</p><h2>What I learned</h2><ol><li><p><strong>Benchmark before you trust.</strong> AI models sound confident even when they pick suboptimal answers. Test them on tasks with verifiable correct answers. You might be surprised how often &#8220;good enough&#8221; reasoning misses the best solution.</p></li><li><p><strong>Build tools for the precision layer.</strong> If your domain has clear scoring criteria and the solution space is enumerable, a dedicated algorithm will outperform general-purpose AI consistently. Keep the tool focused and let AI handle everything around it.</p></li><li><p><strong>Make your tools AI-consumable.</strong> The biggest leverage is not choosing between AI and tools. It is building tools that AI can call. Add llms.txt, OpenAPI, or MCP support and your tool becomes part of every AI workflow that touches your domain.</p></li></ol><h2>What&#8217;s next</h2><p>I am planning a follow-up that goes deeper on how the Timezone Scheduler was actually built: the architecture decisions, the Claude Code workflow, and the specific parts where the back-and-forth with AI made a real difference. If that is useful to you, subscribe so you do not miss it.</p><p>The tool is free at <a href="https://timezones.centminmod.com">timezones.centminmod.com</a>. The full benchmark is at <a href="https://github.com/centminmod/timezone-scheduler">github.com/centminmod/timezone-scheduler</a>.</p><div><hr></div><p>If you&#8217;re interested in practical AI building for web apps, developer workflows, and infrastructure, subscribe for future posts. You can also follow my shorter updates on <a href="https://www.threads.com/@george_sl_liu">Threads (@george_sl_liu)</a> and <a href="https://bsky.app/profile/georgesl.bsky.social">Bluesky (@georgesl.bsky.social)</a>.</p><p><a class="button primary" href="https://buymeacoffee.com/georgeliu"><span>Buy Me A Coffee</span></a></p>]]></content:encoded></item><item><title><![CDATA[Why I Started This Substack]]></title><description><![CDATA[I&#8217;ve spent the last 16 years building and maintaining Centmin Mod, a LEMP stack tooling ecosystem for Linux web server environments.]]></description><link>https://ai.georgeliu.com/p/why-i-started-this-substack</link><guid isPermaLink="false">https://ai.georgeliu.com/p/why-i-started-this-substack</guid><dc:creator><![CDATA[George Liu]]></dc:creator><pubDate>Sat, 04 Apr 2026 15:44:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-uCH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6915073d-1f2f-4380-b7a1-52ef2addf5c8_640x640.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve spent the last 16 years building and maintaining <a href="https://centminmod.com/">Centmin Mod</a>, a LEMP stack tooling ecosystem for Linux web server environments. I have consulted for some of the largest online community forum sites running vBulletin and XenForo, optimising their servers for scalability and performance and squeezing out every ounce of performance from them.
I am obsessed with server performance and pagespeed optimisation, and after using Cloudflare for over a decade I became a Cloudflare Community MVP in 2018, a title I still hold, with extensive experience across Cloudflare&#8217;s Free, Pro, Business, and Enterprise plans. Before that, my obsession was PC overclocking and PC hardware reviews. The thread through all of it: always looking for better ways to improve existing solutions.</p><p>That is why I started this Substack.</p><p>I want to document how I&#8217;m using AI to build useful web apps, improve developer workflows, and explore better tools and systems. Here I&#8217;ll share what I&#8217;m building, the workflows and tools I use, and an honest look at what works and what doesn&#8217;t.</p><p>I want to show more than finished results. I want to share the process too.</p><p>What I tried. What changed. What worked. What didn&#8217;t.</p><p>Less hype, more implementation.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!-uCH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6915073d-1f2f-4380-b7a1-52ef2addf5c8_640x640.png" alt=""></figure>
srcset="https://substackcdn.com/image/fetch/$s_!-uCH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6915073d-1f2f-4380-b7a1-52ef2addf5c8_640x640.png 424w, https://substackcdn.com/image/fetch/$s_!-uCH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6915073d-1f2f-4380-b7a1-52ef2addf5c8_640x640.png 848w, https://substackcdn.com/image/fetch/$s_!-uCH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6915073d-1f2f-4380-b7a1-52ef2addf5c8_640x640.png 1272w, https://substackcdn.com/image/fetch/$s_!-uCH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6915073d-1f2f-4380-b7a1-52ef2addf5c8_640x640.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I started my AI journey like most folks with the launch of OpenAI ChatGPT website - almost immediate jumping on ChatGPT Plus $20/month paid subscription when it came out. Then it was Claude Pro $20/month and Gemini AI Pro $20/month subscriptions. </p><p>Most of it was for coding first. I have a lot of ideas and devops tooling that was on the back burner and AI chat helped out. AI helped fill in knowledge gaps I had and gave me deeper insights as I asked more questions. I would always get each respectively LLM model response to evaluate each others&#8217; responses to improve the final product. Unfortunately, running into token context limits back then for both input tokens and output tokens. LLM models around then were below 128K tokens with 4-8K max output tokens. This multi-AI consultation process would eventually lead me to creating my <a href="https://ai.georgeliu.com/p/how-to-get-a-second-ai-opinion-in">Claude Code /consult-codex</a> and <a href="https://ai.georgeliu.com/p/how-to-get-a-second-ai-opinion-in">/consult-zai</a> Skills.</p><p>I was still using my trusty Sublime Text editor I tried Visual Studio Code before, but never really took to it. 
Each time I tried Visual Studio Code, I eventually returned to Sublime Text because I was used to its key mappings and shortcuts.</p><p>It was probably not until January-February 2024 that I decided to try Cline in Visual Studio Code. I paired Cline with the OpenRouter API, burning up tokens. If I recall correctly, the models I used most on OpenRouter at that time were the Google Gemini models. This was a game changer: I was no longer stuck with the context limits of the web chat AI interfaces, and I could give the models far more context.</p><p>From March 2024 to June 2025, I was on the Claude Pro $20/month plan. Up until this point, I was only using Claude within the web chat interface. Claude Code had launched in February 2025, but I still had not used it yet.</p><p>One of my earlier AI-coded projects using Cline and Gemini 2.0 models was my OpenRouter API Python script, <a href="https://github.com/centminmod/or-cli">or-cli.py</a>, back in February 2025, when I discovered that the wonderful OpenRouter AI API provided a lot of free LLM model usage. I took heavy advantage of that free Google Gemini usage and dabbled for the first time with Cloudflare AI Gateway.</p><p>That changed in March 2025 when I decided to use Claude Code within Visual Studio Code. This would be the game-changing moment, as I decided to focus and go all in on learning Claude Code. My Claude Code usage grew, and by July 2025 I had upgraded to the Claude Max $100/month plan, which I have been on ever since (~10 months).</p><p>It was around this time I created my <a href="https://github.com/centminmod/my-claude-code-setup">Claude Code starter template GitHub repository</a> to document and share my Claude Code journey with others. It includes a custom CLAUDE.md memory bank system modeled after Cline&#8217;s memory bank system, which I had gotten used to. Then I added my Claude Code skills, agents/commands, and knowledge to the repo, which has grown to nearly 2.2K GitHub stars and consistently averages 1,500 to 3,000 visitors per week!</p><p>I have since gotten used to Visual Studio Code and no longer use Sublime Text. This also meant I got to play with other Visual Studio Code AI extensions, including Kilo Code, OpenAI Codex CLI, Google Gemini CLI, and OpenCode. Kilo Code is clearly one of my top favourites behind Claude Code and Codex CLI.</p><p>As an open source maintainer of my GPLv3-licensed Centmin Mod LEMP stack, I also get free access to the <a href="https://github.com/features/copilot/plans">GitHub Copilot Pro plan</a>, which normally costs $10/month.</p><p>So right now my AI spend includes:</p><ul><li><p>Claude AI Max US$100/month + 10% Australian GST</p></li><li><p>OpenAI ChatGPT Plus US$20/month + 10% Australian GST</p></li><li><p>Google Gemini AI Pro US$20/month + 10% Australian GST</p></li><li><p>GitHub Copilot Pro US$10/month + 10% Australian GST, currently free</p></li><li><p><a href="https://z.ai/subscribe?ic=WWB8IFLROM">ZAI GLM Coding plan</a> US$129/year (64% discount) + 10% Australian GST</p></li>
<li><p>t3.chat US$8/month + 10% Australian GST</p></li></ul><p>Currently, my AI coding setup has just migrated from multiple iTerm2/tmux terminals plus Visual Studio Code to <a href="https://www.warp.dev/">warp.dev</a>, used purely as a multi-window/tab terminal without the AI bells and whistles. I also have Claude and multi-AI CLI VS Code dev containers using Debian Docker images, documented at <a href="https://claude-devcontainers.centminmod.com/">https://claude-devcontainers.centminmod.com</a>.</p><p>Just a few days ago I revisited voice dictation as well; I am currently using <a href="https://wisprflow.ai/r?GEORGE45291">Wispr Flow</a>. Folks can get a <a href="https://wisprflow.ai/r?GEORGE45291">free month of Wispr Flow Pro</a> using my <a href="https://wisprflow.ai/r?GEORGE45291">invitation link</a>.</p><p>I hope this gives readers a bit of background on my AI journey to date.</p><p>The next post is a concrete example: I tested <a href="https://ai.georgeliu.com/p/why-11-ai-models-got-timezone-scheduling">11 AI models on timezone scheduling</a>, none of them got it right, so I built something that did.</p><p>If you&#8217;re interested in practical AI building for web apps, developer workflows, and infrastructure, subscribe for future posts. You can also follow my shorter updates on <a href="https://www.threads.com/@george_sl_liu">Threads (@george_sl_liu)</a> and <a href="https://bsky.app/profile/georgesl.bsky.social">Bluesky (@georgesl.bsky.social)</a>.</p><p><a class="button primary" href="https://buymeacoffee.com/georgeliu"><span>Buy Me A Coffee</span></a></p>]]></content:encoded></item></channel></rss>