Claude Plans + Local AI Executes — Planner-Executor Pattern for Enterprises

The question Thai enterprise IT teams keep asking is: "Claude is expensive, Local AI is not capable enough — is there a way to use both together?" There is, and it's called the Planner-Executor Pattern (sometimes called orchestrator-worker) — let Claude (cloud, very capable but token cost is real) handle thinking, planning, and breaking tasks down, then hand off to a Local AI (Qwen, DeepSeek, Llama running on an office server) to do the actual implementation. This article unpacks how the pattern works, where the win comes from, what hardware you actually need, and the trade-offs to understand before deploying.

Quick summary — what Planner-Executor is / isn't

✓ It is: a design pattern that splits "thinking" (Claude in the cloud) from "doing" (Local AI). Claude writes a detailed plan; Local AI runs the implementation with its own tool use.

✗ It isn't: not fine-tuning Claude on your own servers, not replacing Claude with Local AI entirely, and not every Local AI can play this role — the executor model has to be strong enough at instruction following and tool use.

How the Pattern Works — Four Real Stages

The heart of this pattern is the two-role split: Claude plays the "senior engineer" who reads requirements, designs, and writes the step-by-step plan; Local AI plays the "operations engineer" who follows the plan without having to invent the design. The flow looks like this:

Stage	Who	Work	Output
1. Intake	Claude	Takes the request, asks clarifying questions if needed	Clear requirement
2. Plan	Claude	Decomposes into sub-tasks, names files to change, lists edge cases	Structured plan (JSON / markdown)
3. Execute	Local AI	Reads files, edits code, runs tests against the plan via its own tool use	Code diff + execution log
4. Verify	Claude	Reviews diff and log against the plan, retries if mismatched	Approve or send back to executor

Stages 1, 2, and 4 use Claude — but they burn far fewer tokens than stage 3, because thinking and reviewing take fewer characters than actually implementing. The most token-heavy stage, stage 3, moves to Local AI — and your cloud API bill drops materially.

A worked example

Say the task is "refactor the checkInventory function to support multi-warehouse." Claude reads the relevant code (say 8 files), designs a plan that names what to edit in each file and in what order, then sends a ~2,000-token plan to Local AI. Local AI does the real editing — reading and writing files (which can run into hundreds of thousands of tokens) — and returns the diff. Claude reads the diff (~5,000 tokens) and approves or asks for revisions. Net: Claude burned ~10,000 tokens instead of the ~150,000 it would have used doing the whole thing alone.

Why Split — Three Options Compared

Organisations that want to put AI to work have three main paths: cloud-only, local-only, or hybrid planner-executor. Each has different strengths and weaknesses.

Dimension	Cloud-only (Claude only)	Local-only (Qwen / DeepSeek only)	Hybrid (Planner-Executor)
Plan quality	Highest	Mid — smaller models miss edge cases	Highest (Claude plans)
Token cost	High	Near zero (electricity + hardware)	Much lower than cloud-only (plan + verify only)
Code leaves the org?	Yes — every request	Never	Only short plan + diff, never the full source
Latency	Good (Anthropic infra)	Fast for short inference, slow with long context	Cloud + local round-trip (slower than either)
Hardware	None	Beefy GPU (≥24GB VRAM)	Beefy GPU
Setup complexity	Very easy (API key)	Medium (Ollama / vLLM)	Most complex — orchestrator + two backends

Hybrid is clearly not the simplest path — it's the path that cuts cost and limits data egress at the cost of setup complexity. If your data sensitivity is low or you haven't deployed AI broadly yet, cloud-only is probably the better deal. But if AI is part of daily work for a dev team of 10+, and your code is core IP, this is the direction many enterprises are heading.

What Hardware You Actually Need — by Model Size

The factor that makes this pattern "actually playable" is that your Local AI has to be strong enough — especially at instruction following and tool use. Without that, no matter how clean Claude's plan is, the executor will fall over mid-way. The models that genuinely work as executors for coding/agent work in 2026 are:

Model	Params	VRAM (4-bit quant)	Fit
Qwen2.5-Coder-7B	7B	~6GB	Small tasks like autocomplete — instruction following isn't strong enough to be an executor
Qwen2.5-Coder-14B	14B	~10GB	Single-file refactors — the floor for usable executor work
Qwen2.5-Coder-32B	32B	~22GB	Multi-file work + tool use — the sweet spot for this pattern
DeepSeek-V3	671B (MoE, 37B active)	~380GB (FP8)	Enterprise workloads — requires a multi-GPU cluster
Llama 3.3 70B	70B	~42GB	General-purpose executor — instruction following is very strong

Hardware decides which model you can run — not the other way around — because GPU is the dominant cost. Practical pairings in Thai enterprises:

Hardware	VRAM	Models it runs	Suitable for
RTX 4090	24GB	Qwen2.5-Coder-32B (tight)	Dev team of 1–3
RTX 5090	32GB	Qwen2.5-Coder-32B (comfortable) / Llama 70B (heavy quant)	Dev team of 3–5
A6000	48GB	Llama 70B (4-bit) comfortably	Team of 5–10, on-prem inference server
Mac Studio M3 Ultra 256GB	~192GB unified	DeepSeek-V3 (4-bit)	R&D, low concurrency
H100 / B200 cluster	80–192GB ×N	DeepSeek-V3 FP8, multi-tenant	Large org shared infra

Most Thai enterprises piloting this pattern start with an RTX 5090 or A6000, because they're cost-effective and they fit Qwen2.5-Coder-32B — the executor model that "actually works" for general coding work. For details on deploying models on your own server see Ollama Self-Host — Security Concerns to Watch For and DeepSeek Self-Host — Deploying on Enterprise Hardware.

Tools That Implement This Pattern Today

As of May 2026, several frameworks support a planner-executor architecture out of the box — you don't have to write the orchestrator from scratch:

Tool	Role	Suitable for
Claude Agent SDK	Framework to build agents that use Claude as planner and delegate to sub-agents on different models	Dev teams that want to control the flow themselves
Ollama	Runs the Local AI model and exposes an OpenAI-compatible API for the orchestrator to call	Any setup that uses Local AI
vLLM	Serving framework for Local AI with much higher throughput than Ollama	Production multi-user serving
MCP (Model Context Protocol)	A common protocol both Claude and Local AI use to talk to the same tools and data sources	Setups that want plug-and-play across layers
LangGraph	Graph-based orchestrator that makes the planner → executor → verifier flow explicit	Teams with complex workflows that need resumable runs

A practical starting stack for Thai enterprises: Claude Agent SDK (planner) → calls Ollama hosting Qwen2.5-Coder-32B on an RTX 5090 (executor). Every tool call the executor needs goes through the same MCP server the planner uses — so both layers see the same context. For more on Ollama see What is Ollama — Local AI on Your Own Machine.

⚠️ Trade-offs and Risks Before You Deploy

This pattern is not a silver bullet — it solves cost and privacy, but creates four new problems that real teams hit in practice.

1. Plan-Execute mismatch

Claude writes a perfect plan, but Local AI reads it incompletely, follows it halfway, then improvises. The root cause is usually that the executor model isn't strong enough. The fix: verify every batch and retry whenever the diff doesn't match the plan.

2. Round-trip latency

What took 10 seconds in cloud-only might take 30–60 seconds in hybrid because you're waiting on two network legs plus Local AI inference. It's a poor fit for real-time UX, but an excellent fit for batch and background jobs.

3. Context drift between the two models

Claude has a huge context window (200K–1M), but most Local AI models sit at ~32K–128K. If the plan is long and the executor also needs to read code, important context can be truncated — exactly the kind of problem Claude Dreaming — When AI Starts to Dream describes for memory consolidation.

4. Debugging gets twice as hard

When the output is wrong, you have to figure out whether the plan was wrong (Claude) or the execution was wrong (Local). You need logging across both layers and a replay mechanism — otherwise the two layers just blame each other and you can't reach root cause.

Risk	Mitigation	Who owns it
Plan-Execute mismatch	Claude always verifies the final step; set a retry budget	Dev / Orchestrator
Round-trip latency	Use for batch / background work, not real-time UX	Architect
Context drift	Keep plans short, split sub-tasks small, summarise files	Planner (Claude)
Debugging complexity	Centralised logging, replay infra, full-stack observability	DevOps

Use Cases in Thai Enterprises — Where to Start

If your team has never built an AI agent before, don't start this pattern in production right away. Begin with tasks that are low impact and easy to verify. Three good starting use cases:

A. Code review and refactor (low-risk)

Claude reads the PR and writes comments on what should change → Local AI applies the changes as a draft commit → a human reviews before merge. The worst case is a discarded draft commit, not a broken deploy.

B. Data migration scripts

Claude analyses old/new schema and writes the migration plan → Local AI generates the SQL/Python script per the plan → dry-run it in staging. A batch job, so latency isn't a problem.

C. Documentation generation

Claude lays out the outline (by reading the codebase) → Local AI writes the detail for each section → Claude reviews the whole. There's almost no downside even if the result isn't perfect.

After those three are working, you can move on to higher-risk tasks: test-case generation, module refactors, or implementing small features from spec. The idea of AI working as a team of distinct roles is close to what AI Agent Team — When AI Works as a Team describes for multi-agent setups.

How Saeree ERP Looks at This Pattern

Saeree ERP is building an AI Assistant specifically for ERP work — and internally, our own team uses a hybrid pattern close to what this article describes for development and maintenance. Many of our customers are organisations where code and database schema cannot leave the premises (government agencies, state enterprises, manufacturers with sensitive IP). Letting Claude plan while Local AI implements under in-house supervision is a way to respect data residency without giving up the quality of thinking.

For customers running Saeree ERP on-premise, we see this pattern as the roadmap for opening up AI features in the live system in the next phase — because it directly addresses the "we want AI, but data can't leave" problem.

Summary — The Upside and the Downside of Hybrid

Upside	Downside / risks
Cloud API cost drops a lot (only plan + verify go to the cloud)	Hardware + orchestrator investment up front
Most code and data stay in-house — data residency intact	Local AI must be capable — anything under 14B usually isn't enough
Use each model exactly where it's strongest	Total latency is 2–3× cloud-only
Scales with executors — not tied to vendor quotas	Debugging gets harder — you trace planner and executor both
Swap executor models in the future without touching the planner	Needs SRE/DevOps who can run a GPU server

"Using the smartest AI where thinking matters, and the cheapest AI where doing matters — that's the pattern that will define enterprise AI economics for the next two to three years."

A Question to Sit With

Of every request your team sends to a cloud AI today — what percentage is actually "follow these steps" work that doesn't need Claude-grade reasoning? And if you moved just that slice to Local AI, what would your monthly token cost look like? That number is what you should evaluate before committing to hardware.

If your organisation is evaluating AI architecture for use with core internal systems — ERP, HR, or internal automation — book a consultation with the Saeree ERP team to help design a deployment that fits your data policy and scale.

References

Anthropic — Claude Agent SDK Overview
Anthropic — Building Effective Agents (orchestrator-workers pattern)
Ollama — Qwen2.5-Coder model card
DeepSeek — DeepSeek-V3 technical report
vLLM — vLLM serving documentation
Model Context Protocol — MCP specification

Claude Plans + Local AI Executes — Planner-Executor Pattern

How the Pattern Works — Four Real Stages

Why Split — Three Options Compared

What Hardware You Actually Need — by Model Size

Tools That Implement This Pattern Today