Local AI Needs to Be the Norm
Michael Sintim-Koree · May 2026
Every time someone pastes a contract clause into ChatGPT, or drops a client email into Claude to draft a response, or uses a browser extension that silently sends page content to an inference endpoint — data leaves the building. Most people doing it have no idea where it goes, how long it's retained, whether it ends up in training data, or what the vendor's actual data handling policy says in the fine print.
The cloud AI defaults were set by convenience and early adoption, not by deliberate security decisions. That's worth fixing.
The data problem nobody talks about plainly
When you send a prompt to a hosted LLM, you're making an API call to someone else's infrastructure. The model runs there. The input — your prompt, your context, your attached documents — is processed on their hardware, logged by their systems, and subject to their retention and training policies. Some vendors are explicit about this. Many are not.
For consumer use, that tradeoff is probably fine. For business use, it often isn't. Privileged legal communications, unreleased financial data, patient records, source code, customer PII: all of it is being fed into cloud inference endpoints by people who haven't read the terms of service and wouldn't know what to look for if they did. In practice, this happens at organizations that would be horrified if they understood what was leaving their environment.
The argument that enterprise tiers have better data handling is partially true. Microsoft's Azure OpenAI Service doesn't use your prompts for training by default. OpenAI's API has had a similar policy since March 2023: inputs sent through the API aren't used for training unless you opt in. But even the cleaner enterprise agreements involve data in transit, data at rest on third-party infrastructure, and a dependency on your vendor's security posture rather than your own. That's a different risk profile than running locally, and most organizations haven't sat down and actually compared the two.
What local AI actually looks like now
The infrastructure required to run capable models locally has changed dramatically in the last two years. This is no longer a research-only concern.
Ollama is the clearest entry point. It's a local model server that runs on macOS, Linux, and Windows, manages model downloads, and exposes an OpenAI-compatible API on localhost. Pull Llama 3, Mistral, Phi-3, or Gemma 2 and they're running on your machine in minutes. No API key. No network call. No data leaving the host.
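To make "no data leaving the host" concrete, here's a minimal sketch against Ollama's local HTTP API, assuming the daemon is running on its default port and a model like Llama 3 has already been pulled; the prompt text is just a placeholder:

```python
import requests

# Assumes the Ollama daemon is running on its default port (11434) and
# that a model has already been pulled, e.g. `ollama pull llama3`.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize the key obligations in this clause: ...",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["response"])
```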
The hardware requirements are real but no longer extreme. A modern laptop with a dedicated GPU handles 7B and 13B parameter models well, with inference fast enough for practical use. Apple Silicon machines — M2 Pro and later — run quantized 7B models through unified memory efficiently enough that most users won't notice latency. For organizations with workstations or servers, a single consumer GPU like an RTX 4090 with 24GB VRAM can run 70B models in quantized form using partial CPU offloading — a Q4 quantized 70B model requires roughly 38–43GB, so layers spill to system RAM, but the approach is viable for many workloads. That capability on consumer hardware would have been unthinkable a few years ago.
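Those memory figures fall out of simple arithmetic: parameter count times bits per weight, plus runtime overhead. A rough sketch, treating the bits-per-weight and overhead numbers as approximations that vary by quantization scheme and runtime:

```python
def approx_model_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.10) -> float:
    """Rough footprint: parameters * bits per weight / 8, plus ~10% for
    KV cache and runtime overhead. Both factors are approximations."""
    return params_billion * bits_per_weight / 8 * overhead

# Q4_K_M quantization lands around 4.5-4.8 bits per weight in practice.
print(f"70B at ~4.5 bpw: {approx_model_gb(70, 4.5):.0f} GB")  # ~43 GB
print(f"7B  at ~4.5 bpw: {approx_model_gb(7, 4.5):.1f} GB")   # ~4.3 GB
```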
Open weights models have also closed the gap considerably. Llama 3.1 70B at Q4 quantization is genuinely good for general-purpose tasks: summarization, drafting, code review, classification. It won't match current frontier models on every benchmark. For a large class of day-to-day business tasks, the difference is small enough that it doesn't matter.
Where local runs the table
Legal, healthcare, financial services — any context where the document content itself is sensitive. Running a local model to summarize case files, extract structured data from reports, or assist with document review means the content never leaves the perimeter. The compliance story is cleaner. The audit trail is entirely internal. This is the argument that sells itself to legal and compliance teams once they understand what the alternative actually involves.
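As a sketch of what that looks like in practice, here's structured extraction over a file that never leaves the machine, using Ollama's JSON output mode; the model name, the file, and the field names are all illustrative rather than a prescription:

```python
import json
import requests

# Hypothetical extraction pass over a local file; nothing here touches the
# network beyond localhost.
document = open("incident_report.txt").read()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": (
            "Extract the date, the parties involved, and a one-sentence "
            "summary from the report below. Respond as JSON with keys "
            "'date', 'parties', 'summary'.\n\n" + document
        ),
        "format": "json",  # ask Ollama to constrain the output to valid JSON
        "stream": False,
    },
)
record = json.loads(resp.json()["response"])
print(record["summary"])
```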
The DoD classified AI deployments getting attention right now are the extreme version of a problem that exists at smaller scales across regulated industries. Critical infrastructure operators, certain financial institutions, defense contractors — all of them have environments where calling out to a commercial API isn't just inadvisable, it's not permitted. Local model deployment is the only viable path. The Pentagon is sorting through this with seven vendors and classified accreditation cycles; most organizations can just run Ollama on a server.
Inference over a network has latency floors that don't exist locally. For interactive applications — real-time coding assistants, voice interfaces, anything where the response loop matters — local inference wins on latency once the hardware is adequate. The OpenAI API is fast, but it's never faster than localhost.
At low usage, cloud API pricing is cheap enough to ignore. At volume, it compounds. A team running thousands of API calls daily will spend real money on it by end of quarter. A local model on a $2,000 GPU runs those same workloads for the amortized cost of electricity. The crossover point is lower than most people expect, especially for organizations doing high-volume document processing or internal tooling at scale.
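A back-of-the-envelope version of that math, with illustrative numbers rather than anyone's actual pricing; plug in your own volumes and rates:

```python
# Back-of-the-envelope crossover with illustrative numbers, not quotes.
api_cost_per_million_tokens = 5.00   # USD, blended input/output (assumed)
tokens_per_day = 10_000_000          # e.g. bulk document processing (assumed)
gpu_capex = 2_000                    # consumer GPU, amortized over 24 months
power_per_month = 40                 # rough electricity estimate, USD

monthly_api = tokens_per_day * 30 / 1_000_000 * api_cost_per_million_tokens
monthly_local = gpu_capex / 24 + power_per_month

print(f"cloud API: ${monthly_api:,.0f}/month")   # $1,500
print(f"local:     ${monthly_local:,.0f}/month")  # ~$123
```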
The actual limitations
Local AI is the right default for many workloads. It isn't the right answer for all of them.
Frontier model capabilities are still ahead of what you can run locally on practical hardware. Complex reasoning tasks, very long context windows, multimodal inputs: the open weights ecosystem is catching up, but catching up is not caught up. If the task genuinely needs frontier capability, cloud is still the answer.
Setup and maintenance are also real costs. Ollama is straightforward, but putting a local model behind an internal API, managing model updates, integrating it with existing tooling, and supporting it for a team — that's IT work that doesn't disappear because the model itself is free. Cloud APIs abstract all of that. For organizations without the capacity to manage it, that abstraction has real value.
Hardware is a capital expense. The economics favor local at volume, but only if you're actually using the hardware enough to amortize it. An organization running occasional inference workloads probably shouldn't buy a server for this. One running it continuously probably should.
A decision filter that actually works
The default right now is: use the cloud API because it's easy. The better approach is for organizations to invert that. Start with local and escalate to cloud only when there's a clear reason.
A simple filter:
- Does the input contain sensitive, regulated, or confidential data? Local first, no debate.
- Is this a high-volume workload where API costs will add up? Run the math; local often wins.
- Does the task require frontier model capability that open weights genuinely can't match? Cloud is justified, but be specific about why rather than defaulting to it.
- Is this a one-off personal task with no sensitive content? Cloud is fine.
Most enterprise AI use cases fall into the first two categories. Most organizations are treating them like the last one.
The tooling worth knowing
Ollama is the server layer. For interfaces: Open WebUI gives you a ChatGPT-style browser UI on top of any Ollama instance — runs in Docker, takes ten minutes to set up, and gives non-technical users a clean interface without exposing the API directly. For developers, the OpenAI-compatible API endpoint means existing code that calls OpenAI's API can be redirected to a local Ollama instance by changing the base URL and setting a placeholder API key (the field is required by the SDK but ignored by Ollama). The migration path for internal tooling is shorter than most teams expect.
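A minimal sketch of that redirection with the OpenAI Python SDK, assuming a local Ollama instance on its default port and a model already pulled:

```python
from openai import OpenAI

# Same OpenAI SDK, pointed at a local Ollama instance. The api_key value is
# a placeholder: the SDK requires the field, Ollama ignores it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1",  # any model already pulled locally
    messages=[{"role": "user", "content": "Draft a two-line status update."}],
)
print(resp.choices[0].message.content)
```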
LM Studio is a solid alternative for desktop use across macOS, Windows, and Linux, with a GUI for model management and a built-in inference server. For Python-heavy workflows, llama.cpp bindings let you embed inference directly in an application without a separate server process.
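A sketch of the embedded approach with the llama-cpp-python bindings; the GGUF path is a placeholder for whatever quantized model file you've downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# In-process inference: no separate server, no localhost port. The model
# path is a placeholder for whatever GGUF file you've downloaded.
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```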
On the model side: Llama 3.1 8B and 70B cover most general tasks, Phi-3 Mini works well on resource-constrained hardware, and Gemma 2 9B is a strong option worth testing. For code specifically, Qwen2.5-Coder and DeepSeek-Coder-V2 are both worth running against your actual workload before settling on anything — benchmarks don't always tell you what you need to know, and teams often pick the wrong model because they trusted a leaderboard number over a 30-minute test on real inputs.
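The 30-minute test can be as simple as a loop: same real inputs, each candidate model, wall-clock time and raw output side by side. A sketch against a local Ollama instance, with the model tags and input files as placeholders:

```python
import time
import requests

# The model tags and input files are placeholders; models must already be
# pulled into the local Ollama instance.
candidates = ["qwen2.5-coder:7b", "deepseek-coder-v2:16b"]
cases = ["review_case_1.txt", "review_case_2.txt"]

for model in candidates:
    for path in cases:
        prompt = open(path).read()
        start = time.time()
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
        )
        elapsed = time.time() - start
        print(f"--- {model} on {path} ({elapsed:.1f}s) ---")
        print(r.json()["response"][:500])  # eyeball the first 500 characters
```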
The real shift required
Running local AI isn't technically hard anymore. The harder part is organizational: accepting that the convenient default carries real risk, and building the habit of asking where inference actually runs before deploying anything that touches sensitive data.
Cloud AI vendors have done a good job making their products feel seamless and trustworthy. Some of them are. That still doesn't make a cloud inference endpoint the right answer for data that shouldn't leave your environment. The technology to keep it local exists, it works, and the capability gap narrows every few months. At some point the justification for defaulting to cloud stops being capability and starts being inertia.
The organizations that sort this out now will have cleaner compliance postures and AI infrastructure they actually control. The ones that don't will keep discovering, after the fact, what their tools were sending out.
If you're trying to make the case for local inference internally and hitting the 'but cloud is easier' wall — that's the specific conversation I'd want to hear about. The technical argument is usually the easy part.