Software Architect, Accessibility & Sustainability evangelist, Human.
12 June 2026
Wow, I haven’t written a post here in 6 years… I think it’s time I start again. 😊
I was using Claude, ChatGPT, Cursor, and others… All of them at once. Each of them has its pros and cons, each of them is best at something, and they all have some gaps that another one fills beautifully. However, the bills just started piling up. And it’s not like I could just cancel one of them because I kept hitting the limits of the max plans on all of them. At some point, it became clear that paying “rent” for AI was unsustainable long-term. So an investment was made to become somewhat independent.
When I was thinking of starting this journey, I had a great discussion (as always) with Joost de Valk, and as a result of that discussion, the initial (substantial) funding for this journey was from my employer, emilia.capital. I then continued by expanding the build with more compute over the course of months - and I fear it’s not done yet 😅
I started with 2x R9700 GPUs. Over the course of the following months I managed to get 2 more R9700 GPUs (one every couple of months) and a used NVIDIA RTX3090.
Of course they can’t all connect to a motherboard with just 2 PCIe ports, so the 4 AMD GPUs are now connected to a bifurcation card (R34A), and the RTX3090 is connected to an M.2 slot via an M.2-to-OcuLink-to-PCIe adapter.
Future expansion: I can still connect 2 more GPUs: One on the remaining M.2 slot, and another one via USB4. I already have the adapters for these, just not the GPUs yet - looking on Vinted daily for opportunities.
The reason I went with AMD and not NVIDIA is twofold:
Would it be better to go for NVIDIA GPUs instead of AMD? Well… I would definitely have a faster system, but I would pay for it many times over in the electric bill, and performance is not the only metric that matters. There are lots of ethical concerns in the middle too.
Pros: Lots of VRAM.
Cons: vLLM is basically pointless on this setup. PCIe speed has dropped significantly due to the bifurcation and all the adapters, which means that tensor-splitting a model is painful. My router (explained below) can run vLLM/safetensors backends but due to the way my hardware evolved I don’t bother (not to mention that ROCm made running vLLM a PITA). Instead, I’m using llama.cpp with Vulkan (not Cuda for the NVIDIA + ROCm for the AMDs, that was painfully slow).
This is where things get a little more interesting. Had to build a couple of projects in order to accommodate my personal workflow.
This is a plugin for OpenCode which does the following:
The plugin adds multiple agents and sub-agents. Some of them are public-facing, others are internal only.
The way I usually work is this:
scout agent for discussion, brainstorming and planning. Tried to make it think a bit like I do when planning, so it’s a multi-step process. Kinda crude, but it works.orchestrator agent which handles the actual implementation and coding. The orchestrator calls multiple sub-agents as needed, and also implements a best-of-N flow when needed.These are the main agents, the others are mostly internal:
a team of specialists (architect, coder, reviewer, tester, critic, researcher, design-reviewer, quality-controller) that the orchestrator delegates to. The one I’d single out is the quality-controller. After any coding task, the orchestrator has to run it - it bundles the reviewer, the design-reviewer and a completion checklist and returns either APPROVED or REJECTED. The orchestrator can’t skip it, and after 3 rejections it stops and asks me what to do.
Memory usually fails in coding harnesses because the agent needs to judge when to call memories, when to save a memory etc. It’s a fragile thing.
The memory-steward does something slightly different: It has a small model running at all times (currently it’s a qwen3.5-4b model but that’s easy to change from the markdown file of the steward). The key idea is that the main coding model never has to decide anything about memory - the steward handles both sides, retrieving and saving.
What basically happens is this: When I send a message to OpenCode, it doesn’t go directly to OpenCode… It goes to the memory-steward service first. The steward rewrites my message into a tight search query, looks up relevant memories (via a memory MCP - currently OpenMemory, but it’s easy to switch that), and injects whatever it finds into that turn’s context before the main model ever sees the message. This happens on every single turn.
The rewrite step is more important than it sounds. Memory lookup is an embedding search, and embedding search is surprisingly sensitive to phrasing - two messages with the exact same intent but worded differently will land in different places. If I search with my raw, rambling message, I get worse hits than if I search with a clean 3-8 word query. So the steward’s first job is to strip my message down to what I’m actually looking for, then search.
Then, after the exchange is done, the steward looks at it again and makes a separate judgement call on whether anything in it is worth saving. If so, it writes it back to the memory MCP. So retrieval runs constantly in the background, and saving is a deliberate, judged decision - and neither one depends on the main model remembering to ask.
Local models that can run on my hardware don’t have 1M context, so I need to account for that - as well as preferences persistence on a per-project basis.
The scratchpad is a simple skill that does exactly what you’d expect: It’s a scratchpad, allowing the LLM to write to a file things we’ve discussed so that they stay pinned. It’s also used for internal notes of the LLM itself. Scratchpad notes are appended to the system prompt so the model can always see them.
When OpenCode trims the conversation to stay under the context limit, the scratchpad gets re-injected, so the things I pinned survive the compaction even though the rest of the conversation got summarized away. That’s the whole point - it’s the bit of context that doesn’t get thrown out.
This is just a handoff document that allows me to switch between agents and persist long sessions.
Think of it like the plan documents that Claude creates. Same concept, different implementation.
This is just another skill file. When I make a plan with the scout agent, it writes the handoff document. When I then switch to the orchestrator agent, it reads that file and starts the implementation.
Too many models, too many configurations, it’s hard to keep track of everything. I needed to be able to easily load multiple models in VRAM, switch between them easily, be able to edit the configs from a web-UI and so on. This project does something similar to what llama-router does, but more tailored to my own flows. It is a fully transparent proxy, and basically instead of pointing my OpenCode config to llama.cpp directly, I point it to this project instead.
qwen3.6-27b for example), but occasionally they fall into infinite thinking loops or infinitely looping tool calls. The inference-router handles both, with two separate guards. Thinking loops are caught by a streaming detector that watches the output for repeating text (a tandem-repeat detector running on the token stream); when it trips, it replays the partial output and injects a corrective prompt, then retries. Looping tool calls are caught by a separate cross-turn guard that watches for a repeating tool_call → result → tool_call pattern across messages. The injected nudge isn’t subtle, and it escalates if the model keeps looping - the first interrupt asks it to step back and identify the real next action, and by the third it’s down to “emit only the next required action, no thinking, no commentary”.--tensor-split to greedily fill across the GPUs, so I don’t have to hand-tune placement every time.nvidia-smi, and even Intel Arc if I had one plugged in (I have one but it’s on a different dedicated assistant device - so I know it works). The same physical card can be addressed as either a CUDA device, a ROCm device, a Battlemage device, or a Vulkan device depending on what I want a model to use. Some models run better on ROCm in AMD GPUs, others (most) run better on Vulkan, CUDA makes the GPU fans go nuts while Vulkan is more quiet and so on… So I can define the backend on a per-model basis.Before any of the above existed, there was this. It’s worth mentioning because it’s where the whole journey started, and because the idea behind it is still useful on its own if you’re not ready to go fully local yet.
Back when I was still paying for Claude, ChatGPT and the rest, the thing that seemed insane was paying frontier-model prices for grunt work. Generating boilerplate, applying a review’s feedback, writing tests, reshaping a JSON config, or simple repetitive tasks… none of that needs a frontier model. Nobody needs a frontier model for simple stuff.
So the idea was simple: keep the expensive model as the senior engineer who thinks, decides and verifies - and delegate the mechanical work to local models, at zero marginal cost.
In practice it’s an MCP server. You register it with Claude Code (or any MCP-compatible agent) and it exposes the local models as a set of tools: generate_code, edit_code, improve_code, simplify_code, generate_tests, review_code, plan_task, transform_json. Claude stops writing code itself and instead orchestrates: it plans the task, hands the actual typing to a local coder model, sends the result to a local reviewer model, feeds the review back for a fix, and only then verifies the result itself. The frontier model goes from writing every line to making decisions about lines other models wrote - which is a lot fewer tokens, and a much smaller bill.
It worked well enough that the obvious next question was: if the local models are doing all the real work anyway, do I even need the frontier model in the loop? That question is what led to ergon.studio and the fully-local setup. So orchestrator was a stepping stone - but a genuinely useful one, and if you’ve got some local compute lying idle and a Claude bill you’d like to shrink, it’s a low-effort way to do exactly that.
IMPORTANT NOTES:
CLAUDE.md/AGENTS.md etc files to force them to delegate tasks.Well… Yes and no.
Frontier models are genuinely amazing. An open-source model with a few billion parameters cannot compare to a frontier model that is (probably) trillions of params!! However, the harness is sometimes more important than the model itself. Think of it this way: You have a brilliant person on one hand, and an average one on the other hand. You can tell to the brilliant person “build this”, and they will. They will make lots of assumptions along the way, lots of things that you probably won’t like and won’t catch. The end result is usually great and they work FAST! And then you have the average person. The one that needs hand-holding. That’s what the harness does… it’s the handholding part. That’s what the scout agent for example does in ergon.studio when I want to make a plan! It breaks the process down into steps:
These are steps that I specifically defined, in order to make the average person produce code that is on par with a frontier model.
Does it work? Yes, it does. However, you need to adjust your workflow significantly. With a frontier model you can build projects with one-shot prompts. You give them a goal and they start working towards that goal. With local models you can’t do that efficiently. I mean sure you can, but not if you want genuinely good results. So my current workflow is not “one-shot” builds. I am the senior engineer, the LLM is the junior. I know what I want to accomplish, I have the architecture in my head, the LLM helps write the code one step at a time. I don’t make a grand plan, I make smaller plans, one at a time. The grand plan lives in my head, the LLM gets parts of it. So I have for example 3 terminals open:
So yes, local models can actually accomplish what a frontier model does - just not using the same mental model or process.
Keep in mind that local models get released quite often… this is what makes this really exciting for the future!
The limitations I faced and all the things I needed to do to get around them are because of the current state of OSS LLMs. It was different 6 months ago, and it will be different 6 months from now. Friction keeps reducing, and you can think of OSS LLMs as the frontier models from 6-12 months ago. We were doing lots of things last year with Claude, right? Since then Claude has moved on… But we can get similar intelligence to last year’s Claude with Qwen3.6-27b and others today - or whatever you can run locally. Want to use Claude Opus 4.8? Sorry, can’t do unless you have the budget for half a TB VRAM. Then you could probably run HUGE local LLMs that are really frontier and on par with Opus.
That’s all for this post, hope it helps someone. When I started this journey most of this stuff did not exist, nowadays you can probably find multiple similar alternatives to what I did - but regardless, it was worth documenting these.
This post was written by a human.