💻AI

I Replaced Claude with a Local Model for Daily Coding. Here's What Actually Happened.

Michael Sintim-Koree · June 2026

The Ask HN thread on this topic keeps coming back because the answer people want ('yes, local models are good enough, you can ditch the API') keeps bumping into the answer people actually get, which is more complicated. After spending several months running local models as a primary coding assistant, a real opinion on this is possible.

The short version: for most day-to-day work, yes. For specific tasks, no, and the 'no' cases aren't random. They cluster in predictable ways.

What I'm actually running

The hardware is a workstation with an RTX 4090 (24GB VRAM) and 64GB of system RAM. I use Ollama as the model server and Continue as the IDE extension in VS Code. Continue's OpenAI-compatible endpoint points at localhost:11434 with a placeholder key, so the model switch is transparent to the IDE.

After testing most of the viable options, I settled on two: Qwen2.5-Coder-32B-Instruct at Q4_K_M quantization for general coding work, and DeepSeek-Coder-V2-Lite for anything latency-sensitive where near-instant completions matter. The 32B Qwen model spills partially into system RAM at Q4 quantization (about 18GB into VRAM, the rest into RAM), and inference is slower than a pure VRAM-resident model, but still fast enough to not be annoying. Benchmarks for similar hardware suggest roughly 20–25 tokens per second on this setup, which is fine for chat-style interactions.

What local models handle well

The use cases where reaching for Claude is no longer necessary:

Boilerplate generation: CRUD endpoints, schema migrations, test scaffolding. The model knows these patterns cold and produces correct output quickly.
Code review passes on a function or class before a PR. 'What am I missing here, what are the edge cases' — local models are reliable for this on code up to a few hundred lines.
Explaining unfamiliar codebases or third-party library internals. Pasting in source code and asking questions works well; the model reasons over it fine.
Regex and SQL. Both are well-represented in training data and the problems are bounded. Local coding models hit parity with frontier models here, in practice.
Anything involving proprietary code you'd rather not send to a cloud endpoint. Local wins automatically; no quality comparison needed.

That list covers probably 70–80% of typical daily coding assistant usage. For those tasks, Qwen2.5-Coder-32B is genuinely good. Not 'good for a local model.' Just good.

Where local models still fall short

The clearest gap is multi-file reasoning. When the task is understanding how a change in one module propagates through several others, or doing architecture-level refactoring across a service boundary, local models at 32B degrade noticeably compared to Claude Sonnet or GPT-4o. The context window isn't the issue; the quality of reasoning over that context is. Claude Sonnet keeps more coherent track of cross-file dependencies than anything available locally at this weight class. This shows up most clearly on TypeScript projects with complex type hierarchies and on Python codebases with heavy dependency injection.

Novel API integration from sparse docs is the second failure mode. If a library is well-represented in training data (FastAPI, SQLAlchemy, React, standard AWS SDK patterns), local models are fine. For integrating something niche, a less-documented SaaS API, or an SDK that shipped after the model's training cutoff, quality drops fast. Local models hallucinate plausible-but-wrong method signatures more often in these cases. Claude handles this better, partly through stronger underlying reasoning and partly through better generalization from documentation patterns.

The third category is subtle runtime behavior. Stack traces and error messages: fine. Concurrency issues, memory behavior, race conditions where the problem requires holding a lot of implicit context about the execution model: that's where a frontier model earns its API cost. The step from 'here's the symptom' to 'here's why this specific code path produces that symptom under this runtime condition' is harder for local models, and the quality difference is real enough to be worth the API call.

Model selection matters more than parameter count

A lot of 'local models aren't good enough' conclusions come from people running the wrong model for the task. Llama 3.1 8B is fast and capable for general text tasks, but it isn't purpose-built for code the way specialist models are. Mistral 7B is a good general base. Neither is what you want if the task is generating correct Python with proper error handling and test coverage.

Qwen2.5-Coder and DeepSeek-Coder-V2 are both specifically trained on code, and it shows. The difference between them and a general-purpose model on coding tasks is larger than the difference between general-purpose models of different sizes. For coding-specific work, Qwen2.5-Coder-7B punches above its weight class; it beats Llama 3.1 70B on HumanEval despite being a fraction of the size.

Test on your actual inputs, not benchmarks. Leaderboard scores on HumanEval or MBPP tell you something, but they don't tell you how a model performs on your specific codebase, your frameworks, your style of problem. Running the same set of real tasks you do regularly against a few models before committing: that 30-minute evaluation will be more useful than any benchmark table.

The workflow I actually use

I don't run local-only. The setup is tiered: local model handles the default case, cloud API is a deliberate escalation for specific situations.

Continue's model switching makes this practical. I have Qwen2.5-Coder-32B set as the default and Claude Sonnet available as an alternate profile. Switching takes two clicks. I use the local model by default and switch to Claude when the task involves multi-file reasoning at scale, when integrating a new API with sparse documentation, or when the local model has already failed to crack a debugging problem in one or two attempts. The escalation is deliberate, not reflexive.

That pattern has cut my cloud API spend by roughly 80% compared to using Claude as the default. The tasks that genuinely need frontier model quality are a minority of total requests, even though they're disproportionately the harder and more interesting ones.

The privacy case is stronger than the cost case

Most of the conversation around local coding models focuses on quality and cost. The privacy argument gets less attention, and that framing seems backwards.

When you paste code into Claude, that code (function signatures, variable names, business logic, database schema) gets processed on Anthropic's infrastructure. API access carries a no-training policy, meaning inputs and outputs are not used to train models. That's still code leaving your machine. For personal projects, it's easy not to care much. For work projects involving proprietary systems, customer data structures, or anything under NDA, the answer is simple: it doesn't go to a cloud endpoint, full stop. Local models make that the default rather than something requiring discipline and exception management every time. Good security habits are easier to maintain when the secure path is also the path of least resistance.

What hardware you actually need

The 4090 workstation isn't the only viable path. Apple Silicon M-series machines are a serious option. An M3 Max or M4 Max with 48GB or 64GB of unified memory runs Qwen2.5-Coder-32B entirely in-memory with no RAM spill, and the bandwidth characteristics of unified memory mean inference is faster than you'd expect compared to discrete GPU setups at the same parameter count. Several people in the HN thread are running exactly this on MacBook Pros.

The floor for useful local coding assistance is lower than people assume. A machine with a 16GB GPU (an RTX 4080 or 4060 Ti 16GB) can run Qwen2.5-Coder-14B at Q4 quantization comfortably. That's a real coding model producing real results. The 7B variants run on almost anything, fitting in under 10GB of VRAM at Q4_K_M.

Quality scales with hardware, but 'genuinely useful' starts well below a high-end rig. The conversation tends to anchor on high-end setups because those are the people writing the posts. Don't let that set your expectations for the minimum viable configuration.

The honest answer to the HN question

Yes, local models can replace Claude and GPT for daily coding, with a clear-eyed understanding of where they can't. The majority of routine coding assistance tasks fall within what local models handle well today. The tasks that still benefit from frontier model quality are real but narrower than the general conversation suggests.

'Local instead of cloud' is the wrong frame. Build a setup where local is the default, understand the specific cases where frontier model quality is worth the API cost and privacy tradeoff, and switch deliberately rather than defaulting to cloud for everything because it's the path of least resistance.

The multi-file reasoning gap that currently pushes me to Claude is the thing worth watching most closely. Open weights coding models have improved faster than expected over the past year, and it wouldn't be surprising if the next generation of 32B models closes it enough to change the calculus. Qwen2.5-Coder turned out to be better than most people anticipated when it launched, so any predictions about the next generation are worth holding loosely.

If you've hit the multi-file reasoning ceiling with a local model and found a workflow that helps (better context management, a different model, something in the Continue config), I'd genuinely like to know. That's the gap I haven't solved cleanly yet.