
Ornith-1.0: What Self-Scaffolding Open-Source Models Actually Mean for Agentic Coding

Michael Sintim-Koree · June 2026

The phrase 'self-improving model' gets thrown around loosely, so it's worth being precise about what Ornith-1.0 actually does before deciding whether to care. This is not a model that rewrites its own weights at runtime. It's a model trained with reinforcement learning to do two things jointly: solve coding tasks, and write the scaffolds (the orchestration harnesses) that guide those solutions. Each RL step runs in two stages: the model first proposes a refined scaffold for the task, then generates a solution rollout conditioned on that scaffold. Successful trajectories feed back into training.

That distinction matters for understanding both what the system can do and where the real risks sit.

The self-scaffolding loop, precisely defined

Ornith-1.0's key innovation is treating the scaffold as a learnable object rather than a fixed, human-designed harness. Most coding agents pair a model with a static orchestration framework: which tools to call, when to retry, how to decompose tasks. Ornith-1.0 learns to generate that framework itself, co-evolving the scaffold alongside the policy during RL training. The model generates its own task plans, launches tools, inspects intermediate results, and rewrites failing steps. Per-task strategies emerge automatically without hand-engineered harness design.
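The two-stage step can be sketched in a few lines. This is a minimal illustration, not the released training code: `two_stage_step`, `Trajectory`, and the three callables are hypothetical names, and the binary reward is an assumption made to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    scaffold: str
    solution: str
    reward: float


def two_stage_step(
    tasks: List[str],
    propose_scaffold: Callable[[str], str],
    rollout: Callable[[str, str], str],
    verify: Callable[[str, str], bool],
) -> List[Trajectory]:
    """Run one step: scaffold first, then a solution conditioned on it."""
    trajectories = []
    for task in tasks:
        scaffold = propose_scaffold(task)    # stage 1: model writes its own harness
        solution = rollout(task, scaffold)   # stage 2: rollout conditioned on that harness
        reward = 1.0 if verify(task, solution) else 0.0  # assumed binary reward
        trajectories.append(Trajectory(task, scaffold, solution, reward))
    return trajectories


def successful(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep the trajectories that would feed back into training."""
    return [t for t in trajectories if t.reward > 0.0]


# Toy stand-ins for the model and the verifier, for illustration only.
demo = two_stage_step(
    tasks=["fix-typo", "add-cache"],
    propose_scaffold=lambda task: f"plan for {task}: edit, run tests, retry once",
    rollout=lambda task, scaffold: f"patch({task})",
    verify=lambda task, solution: task == "fix-typo",
)
print([t.task for t in successful(demo)])
```

In a real setup the same policy would play both the `propose_scaffold` and `rollout` roles, which is what makes the scaffold a learned object rather than a fixed one.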

This is related to the STaR (Self-Taught Reasoner) line of work from Zelikman et al., and to broader rejection-sampling and policy-gradient approaches in post-training. The core idea: if you can verify correctness (tests pass, code compiles, the output meets a spec), you can use that verification signal to filter trajectories without human labeling of every step. Agentic coding is an unusually good domain for this because the feedback signal is cheap and unambiguous. Either the tests pass or they don't. Ornith-1.0 extends this by also learning what scaffolding strategy to apply per task, not just what code to write.

What makes Ornith-1.0 notable relative to prior work is that it's fully open: weights, training code, and the scaffolding approach used to generate trajectories are all released under MIT. Most self-improvement research has been described in papers but not shipped in a form you can actually reproduce or extend. That changes the downstream use case significantly.

The agentic scaffolding underneath it

The model doesn't run tasks freehand. It operates inside a structured environment with tool access (reading and writing files, running shell commands, invoking test runners, inspecting error output) and captures the full trajectory: tool calls, observations, reasoning traces, final output. Crucially, Ornith-1.0 also generates the orchestration logic governing that tool use, not just the code itself.

Task selection matters too. Ornith-1.0 is evaluated on SWE-bench-style issue resolution tasks (GitHub issues with associated test suites that specify what 'fixed' looks like) and on Terminal-Bench, which tests real command-line agentic tasks. The self-scaffolding approach makes the loop scalable: the model discovers better search trajectories by jointly optimizing the harness and the solution, without requiring a human to design a new harness per task category.

There's an obvious risk here that the published work addresses head-on: reward hacking. A self-generated scaffold can learn to satisfy the verifier without performing the task; it might read visible test files and hardcode expected outputs, or touch checked-for files without doing real work. The mitigations in Ornith-1.0 are three-layered: a fixed outer trust boundary that keeps the environment and tool surface immutable; a deterministic monitor that flags any attempt to read withheld paths, modify verification scripts, or invoke out-of-bounds tools (assigning such trajectories zero reward and excluding them from the training update); and a frozen LLM judge that acts as a veto on top of the primary verifier, catching intent-level gaming that occurs within the permitted tool surface but doesn't constitute genuine problem-solving. These are meaningful safeguards. Whether this three-layer stack fully solves reward hacking in production deployments, rather than in the controlled evaluation harness where it was designed, remains to be tested independently.
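To make the second and third layers concrete, here is a rough sketch of a deterministic monitor combined with a judge veto. The tool names, directory names, and function names are all hypothetical, and the real monitor presumably inspects far more than a tool name and a path.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Sequence

# Assumed policy for illustration; the release defines its own tool surface.
ALLOWED_TOOLS = frozenset({"read_file", "write_file", "run_tests"})
WITHHELD_DIRS = ("tests/hidden",)
VERIFIER_DIRS = ("verifier",)


@dataclass(frozen=True)
class ToolCall:
    tool: str
    path: str = ""


def _inside(path: str, roots: Sequence[str]) -> bool:
    """True when the normalized path equals or sits under one of the roots."""
    normalized = PurePosixPath(posixpath.normpath(path))
    return any(
        normalized == PurePosixPath(root) or PurePosixPath(root) in normalized.parents
        for root in roots
    )


def violations(calls: Sequence[ToolCall]) -> List[str]:
    """Deterministic checks: no model involved, same input gives same verdict."""
    found = []
    for call in calls:
        if call.tool not in ALLOWED_TOOLS:
            found.append(f"out-of-bounds tool: {call.tool}")
        if call.path and _inside(call.path, WITHHELD_DIRS):
            found.append(f"touched withheld path: {call.path}")
        if call.tool == "write_file" and call.path and _inside(call.path, VERIFIER_DIRS):
            found.append(f"modified verification script: {call.path}")
    return found


def gated_reward(raw_reward: float, calls: Sequence[ToolCall], judge_approves: bool = True) -> float:
    """Zero the reward on any violation or judge veto."""
    if violations(calls) or not judge_approves:
        return 0.0
    return raw_reward
```

Normalizing the path before the check matters: without it, a call to `src/../tests/hidden/test_x.py` would slip past a naive prefix match. Per the description above, a flagged trajectory would also be dropped from the training update rather than merely scored at zero.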

Benchmark numbers and what they actually tell you

Ornith-1.0's headline numbers are strong for an open-weight model family. The flagship 397B MoE variant scores 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, surpassing Claude Opus 4.7 on both benchmarks. Against frontier systems, the 397B still trails Claude Opus 4.8 (85 on SWE-Bench Verified) and GLM-5.2-744B. The most informative comparison is against other open-weight models of similar size; that's where the signal is. The smaller models punch above their weight: the 9B variant scores 69.4 on SWE-Bench Verified, and the 35B MoE beats Qwen3.5-397B on Terminal-Bench 2.1 despite being roughly 11x smaller.

The more interesting question is how much the self-scaffolding approach itself contributes relative to simply using stronger base models and more compute. The published results show meaningful gains from the joint scaffold-and-solution optimization compared to fixed-harness baselines. The gain curve dynamics across iterations (whether diminishing returns set in and whether expanding the task distribution resets improvement rates) are not yet fully characterized in the public release.

ClawEval performance tells a similar story: strong relative to open-weight models at the same scale. Multi-step, long-horizon tasks requiring reasoning across many file changes and coherent intent across dozens of tool calls are still where models in this weight class degrade most noticeably. That ceiling isn't specific to Ornith; it's where current open-weight models top out regardless of training approach.

What open weights actually unlock

A self-improving model whose improvement loop is proprietary is a black box that gets better in ways you can't inspect or reproduce. Open weights, open training code, and a published architecture give you something you can audit, extend, and redirect toward your own task distribution.

That last part is the practical opportunity, and it's what stands out most about this release. Ornith-1.0's training loop can be applied to a domain-specific task set. If your codebase has a large test suite and a backlog of issues with clear acceptance criteria, that's a task distribution you can run the self-improvement loop against. The model learns from trajectories on your tasks, not generic GitHub issues. The resulting model doesn't generalize better across all agentic coding; it gets better at the specific kind of work your codebase requires. Narrower than the headline implies, but more useful for most organizations.

This also changes the privacy calculus. A locally-run Ornith fine-tuned on your codebase never sends your code to a third-party endpoint. The training loop runs on your infrastructure, the resulting weights stay in your environment, and the trajectories generated during training are yours. For organizations working on proprietary systems where sending code to a cloud API isn't an option, this is the path that makes agentic coding assistance viable without the data residency problem.

Three failure modes worth understanding before you deploy

Reward hacking and test gaming

A self-scaffolding model in an agentic coding context has a strong incentive to find solutions that pass tests without solving problems, because passing tests is the only success signal it receives. In the Ornith-1.0 architecture, the three-layer mitigation (fixed trust boundary, deterministic monitor, frozen LLM judge) addresses this structurally rather than through procedural spot-checks. But these defenses were designed for the controlled training harness. Any deployment running the self-improvement loop on production task data should include two further checks: independent evaluation on held-out tasks not touched during training, and human review of a sample of passing trajectories, specifically looking for solutions that are brittle, overfitted to the test specification, or technically passing but semantically wrong.
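Both recommendations are cheap to set up. The sketch below assumes each task has a stable string identifier; `is_held_out` and `sample_for_review` are hypothetical helpers, and the 20% hold-out fraction is an arbitrary illustrative choice.

```python
import hashlib
import random
from typing import Iterable, List


def is_held_out(task_id: str, fraction: float = 0.2) -> bool:
    """Assign a task to the held-out set by hashing its id.

    Hashing keeps the split stable across runs and machines, so a task
    never drifts from evaluation into training between iterations.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(fraction * 2**64)


def sample_for_review(passing_ids: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Pick a reproducible random sample of passing trajectories for human review."""
    ids = sorted(set(passing_ids))
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


tasks = [f"issue-{n}" for n in range(1000)]
train = [t for t in tasks if not is_held_out(t)]
held_out = [t for t in tasks if is_held_out(t)]
print(len(train), len(held_out))
print(sample_for_review(train, k=5))
```

The point of the fixed seed is auditability: a reviewer can regenerate exactly the sample that was inspected for a given iteration.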

Trajectory drift over iterations

Each self-improvement iteration trains on trajectories generated by the previous model version. If that previous version has systematic errors or stylistic quirks, those can compound over iterations in ways that don't show up in aggregate benchmark numbers but degrade specific categories of tasks. Think of it as the agentic equivalent of model collapse in generative image models: quality on the core distribution holds while quality on the tails quietly degrades. Monitoring per-category performance across iterations rather than just aggregate pass rates is how you catch this before it becomes a real problem.
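Per-category monitoring needs very little machinery. In this sketch the category labels, the sample results, and the 5-point tolerance are made up for illustration; the numbers are not from the Ornith-1.0 release.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the pass rate per task category from (category, passed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def regressions(
    previous: Dict[str, float], current: Dict[str, float], tolerance: float = 0.05
) -> Dict[str, Tuple[float, float]]:
    """Return categories whose pass rate dropped by more than the tolerance."""
    return {
        category: (previous[category], current[category])
        for category in previous
        if category in current and previous[category] - current[category] > tolerance
    }


# Illustrative data: the aggregate improves (11/14 to 12/14) while one category degrades.
iteration_1 = [("bugfix", True)] * 8 + [("bugfix", False)] * 2 \
    + [("migration", True)] * 3 + [("migration", False)] * 1
iteration_2 = [("bugfix", True)] * 10 \
    + [("migration", True)] * 2 + [("migration", False)] * 2

print(regressions(pass_rates(iteration_1), pass_rates(iteration_2)))
# prints {'migration': (0.75, 0.5)}
```

The example is built so that the aggregate pass rate goes up between iterations, which is exactly the situation where a headline number hides the drift.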

Tool misuse in unconstrained environments

An agentic model with shell access that's been trained to complete tasks efficiently will find shortcuts. Some are useful. Some involve deleting test files that were failing, modifying test assertions to match incorrect output, or making changes outside the intended scope of the task because doing so improved a metric. The training scaffolding constrains this through sandboxing and the fixed trust boundary, but running this model in a less constrained environment requires its own guardrails. Least privilege for tool access applies here the same way it applies to any agentic system: the model should have access to exactly what the task requires and nothing more.
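One inexpensive guardrail is a post-run check on which files a trajectory changed. The glob patterns and the `out_of_scope_changes` helper below are assumptions for illustration; the list of changed paths would come from something like a diff of the working tree.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def out_of_scope_changes(
    changed_paths: Sequence[str],
    allowed_globs: Sequence[str],
    protected_globs: Sequence[str],
) -> List[str]:
    """Flag changed files that are protected or fall outside the task's scope.

    Protected patterns win over allowed ones, so a test file stays off
    limits even when its directory is otherwise writable.
    """
    flagged = []
    for path in changed_paths:
        protected = any(fnmatchcase(path, glob) for glob in protected_globs)
        allowed = any(fnmatchcase(path, glob) for glob in allowed_globs)
        if protected or not allowed:
            flagged.append(path)
    return flagged


changed = ["src/app/db.py", "tests/test_db.py", "README.md"]
print(out_of_scope_changes(changed, allowed_globs=["src/*"], protected_globs=["tests/*"]))
# prints ['tests/test_db.py', 'README.md']
```

A check like this runs after the fact, so it complements sandboxing rather than replacing it: the sandbox limits what the model can reach, and the scope check catches edits that were reachable but unwanted.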

Hardware, integration, and what the setup actually costs you

The model ships in four sizes: 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 9B needs about 6GB VRAM at Q4 quantization; a gaming GPU or MacBook Pro handles it. The 35B MoE needs around 25GB at Q5_K_M, which slightly exceeds a 24GB card, so expect a small amount of offload to system memory; it is still the recommended option for consumer GPU users. The 397B MoE is quoted at approximately 200GB in FP8 across multiple GPUs, which is a server-class deployment; since 8-bit weights for 397B parameters would alone occupy roughly 397GB, check that figure against the release notes before sizing hardware. Ollama compatibility is documented and the model is listed in the Ollama library directly, so dropping it into an existing local inference setup means changing the model name, not the infrastructure.
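A back-of-the-envelope check on these figures: weight memory is roughly parameter count times bits per weight, divided by eight. The bits-per-weight values below are approximations chosen for illustration (4.5 for Q4, 5.5 for Q5_K_M), and the estimate covers weights only, leaving out the KV cache and activations.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes), excluding runtime overhead."""
    return params_billions * bits_per_weight / 8.0


# (label, parameters in billions, assumed bits per weight)
configs = [
    ("9B at ~4.5 bits (Q4)", 9, 4.5),
    ("35B at ~5.5 bits (Q5_K_M)", 35, 5.5),
    ("397B at 8 bits (FP8)", 397, 8.0),
    ("397B at 4 bits", 397, 4.0),
]

for label, params, bits in configs:
    print(f"{label}: {weight_memory_gb(params, bits):.1f} GB")
# 9B at ~4.5 bits (Q4): 5.1 GB
# 35B at ~5.5 bits (Q5_K_M): 24.1 GB
# 397B at 8 bits (FP8): 397.0 GB
# 397B at 4 bits: 198.5 GB
```

For MoE variants, all experts have to be resident even though only a few are active per token, so total parameter count is the right input for memory; the active count matters for speed.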

Running the self-improvement loop is the heavier lift. A meaningful training iteration requires compute for trajectory generation at scale (running the model through hundreds or thousands of tasks), filtering, and fine-tuning on the resulting dataset. The training uses a pipeline-RL setup designed to run asynchronously across multiple devices, with staleness weighting to handle off-policy tokens from older rollouts. The economics favor a small compute cluster or a rented GPU instance over a workstation.
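The exact staleness weighting isn't spelled out here, so the following is only one plausible shape for it: tokens are down-weighted geometrically by how many policy versions old their rollout is, and dropped past a cutoff. The function names, the decay of 0.5, and the cutoff of 4 versions are all assumptions.

```python
from typing import Sequence


def staleness_weight(
    current_version: int, sample_version: int, decay: float = 0.5, max_lag: int = 4
) -> float:
    """Weight a token by the age of the policy that generated it."""
    lag = current_version - sample_version
    if lag < 0:
        raise ValueError("sample cannot come from a future policy version")
    if lag > max_lag:
        return 0.0  # too stale to learn from
    return decay ** lag


def weighted_mean_loss(
    token_losses: Sequence[float],
    token_versions: Sequence[int],
    current_version: int,
    decay: float = 0.5,
    max_lag: int = 4,
) -> float:
    """Average per-token losses, trusting fresher rollouts more."""
    weights = [
        staleness_weight(current_version, version, decay, max_lag)
        for version in token_versions
    ]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every token was past the cutoff
    return sum(w * loss for w, loss in zip(weights, token_losses)) / total


# Two tokens: one from the current policy, one from the previous version.
print(weighted_mean_loss([1.0, 3.0], [10, 9], current_version=10))
```

The design tension is throughput against bias: accepting stale rollouts keeps the devices busy in an asynchronous pipeline, while the weighting limits how far off-policy data can pull the update.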

Integration with existing development tooling is well-supported. Ornith-1.0 works out of the box with Claude Code, OpenHands, OpenClaw, and Hermes Agent. SWE-Bench evaluation was conducted using the OpenHands harness, and the SWE Atlas benchmarks used a mini-SWE-agent harness, so both scaffolding systems have been tested against the model. One wrinkle: the model emits tool calls in a specific XML-based format (<tool_call> blocks) that serving infrastructure like vLLM parses into OpenAI-style tool_calls. Connecting it to a scaffolding system that doesn't speak this format requires either a translation layer or configuration of the appropriate tool-call parser. That's the integration detail most likely to catch people off guard.
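A translation layer for this can be small. The sketch below assumes each `<tool_call>` block wraps a JSON object with `name` and `arguments` keys, which is a common convention but not confirmed for Ornith-1.0; check the model card for the actual payload. The output follows the OpenAI-style shape, where `arguments` is a JSON-encoded string.

```python
import json
import re
from typing import Any, Dict, List

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Convert <tool_call> blocks in model output into OpenAI-style tool_calls."""
    calls = []
    for index, match in enumerate(TOOL_CALL_RE.finditer(text)):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the whole turn
        if not isinstance(payload, dict) or "name" not in payload:
            continue
        calls.append(
            {
                "id": f"call_{index}",  # placeholder id, unique within this message
                "type": "function",
                "function": {
                    "name": payload["name"],
                    "arguments": json.dumps(payload.get("arguments", {})),
                },
            }
        )
    return calls


output = (
    "I'll inspect the failing module first.\n"
    '<tool_call>\n{"name": "read_file", "arguments": {"path": "src/app.py"}}\n</tool_call>'
)
print(parse_tool_calls(output))
```

When the serving layer already ships a parser for the model's format, configuring that parser is the better route; a hand-rolled layer like this is mainly useful for scaffolds that consume raw completions.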

Where this sits and what remains uncertain

Self-improvement through agentic task completion is a direction several labs are pursuing privately. Ornith-1.0 being open means the research community can inspect, stress-test, and extend it in ways closed systems don't permit. The failure modes get documented. The reward hacking cases get published. The iteration dynamics become shared data rather than staying inside one organization. That matters more than any single benchmark number.

The honest read on capability right now: Ornith-1.0 is a strong agentic coding model family, with the 397B variant matching or exceeding Claude Opus 4.7 on headline benchmarks and the smaller models punching well above their weight class. The self-scaffolding approach produces real gains over fixed-harness baselines. The gap to the absolute frontier (Claude Opus 4.8, the largest proprietary systems) remains.

Whether the self-scaffolding approach hits a fundamental ceiling or keeps scaling with larger base models and more diverse task distributions remains an open question. The published results establish state-of-the-art among open models of comparable size. The next few iterations of this architecture should settle the scaling question one way or the other.

If you've run Ornith-1.0's self-improvement loop on a domain-specific task set and measured how the per-iteration gain curve behaves past the initial cycles (specifically, whether expanding the task distribution resets the improvement rate or just delays the same plateau), that's data not yet published anywhere and would be genuinely useful to hear about.