Routing-First Local LLM Architecture

A Convergent-Invention Reading of 2024-2026 Research Literature

By WayneColt · June 2026

Abstract

Between May 2023 and March 2026, eight independently authored papers converged on a single architectural pattern: route inference requests across a heterogeneous pool of language models rather than scale a single one. FrugalGPT (2305.05176), AutoMix (2310.12963), RouteLLM (2406.18665), TensorOpera Router (2408.12320), UMR (2502.08773), Cascade-Aware Multi-Agent Routing (2603.17112), the Dynamic Model Routing and Cascading Survey (2603.04445), and Cascadia (2506.04203) each operationalized a different facet of the same kernel. The kernel is not new. The reliability function R = 1 − ∏(1 − p_n) follows in two algebraic steps from the product rule for independent events in Jakob Bernoulli's Ars Conjectandi (manuscript dated 1683, published 1713), and it has been re-derived in every subsequent generation of communications and information theory. What 2024-2026 demonstrates is not an invention; it is the simultaneous arrival of independent teams at the same point on a demand landscape that production economics had already drawn. This essay reads that cluster honestly, walks the math back to its 17th-century origin, surveys the patent floor (which is, on architectural-pattern grounds, conspicuously empty for reasons §4 will detail), notes that a parallel internal team has been building against the same kernel since July 2024 with a deployed multi-LLM router and ten thousand production dispatches on file, observes that in June 2026 a frontier model vendor shipped the same gated-cascade pattern inside its own serving stack as a native safety mechanism, and closes with the implication that matters: enterprises buying agent-execution platforms in mid-2026 are missing a routing layer above them. That gap is where the cluster is converging. None of us got there alone. None of us can claim the kernel. The question is what we build on top of it.

§1 The Cluster

Eight papers. Three years. Ascending sophistication, but a single architectural through-line.

FrugalGPT (Chen, Zaharia, Zou · arXiv 2305.05176 · May 2023) framed the economic case first. Querying a frontier model on every request is wasteful when most queries are answerable by a cheaper one. FrugalGPT's contribution was a learned cascade — a sequence of progressively more capable models, each given the chance to answer before the next one is invoked, with a quality scorer deciding when to stop. The paper reported that the cascade could match GPT-4-grade quality with up to a 98% cost reduction — or, holding cost fixed, improve accuracy over GPT-4 by 4% — on the benchmark sets they tested. (These are two distinct operating points on the cost-quality frontier, not a single combined figure; the precise abstract claim is "match the performance of the best individual LLM with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost.") The architectural primitive was the cascade. The economic primitive was differential pricing across the model market. Both have held.
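The cascade primitive is small enough to sketch. What follows is a minimal illustration, not FrugalGPT's implementation: the `Tier` type, the toy models, the toy scorer, and the cost figures are all hypothetical stand-ins chosen so the sketch runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    cost: float                      # hypothetical per-call cost, arbitrary units
    generate: Callable[[str], str]   # stand-in for a real model call


def run_cascade(
    query: str,
    tiers: List[Tier],
    scorer: Callable[[str, str], float],
    threshold: float,
) -> Tuple[str, str, float]:
    """Try tiers cheapest-first; stop at the first answer the scorer accepts.

    Returns (tier name, answer, total cost spent).
    """
    if not tiers:
        raise ValueError("cascade needs at least one tier")
    spent = 0.0
    answer = ""
    for tier in tiers:
        answer = tier.generate(query)
        spent += tier.cost
        if scorer(query, answer) >= threshold:
            return tier.name, answer, spent
    # No tier cleared the bar: fall back to the last tier's answer.
    return tiers[-1].name, answer, spent


# Toy stand-ins (assumptions, not real models or a learned scorer).
def small_model(query: str) -> str:
    return "4" if query == "2+2" else "unsure"


def large_model(query: str) -> str:
    return "4" if query == "2+2" else "Paris"


def toy_scorer(query: str, answer: str) -> float:
    return 0.1 if answer == "unsure" else 0.9


TIERS = [Tier("small", 1.0, small_model), Tier("large", 20.0, large_model)]
```

In this toy setup an easy query stops at the cheap tier, while a query the cheap tier cannot answer pays for both calls. That second case is the cost the scorer's threshold is trading against.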

AutoMix (Aggarwal, Madaan et al · arXiv 2310.12963 · October 2023) tightened the verification half. Where FrugalGPT used a learned scorer, AutoMix used self-consistency sampling from the cheaper model itself — if the cheap model agreed with itself across temperature variations, escalation was unnecessary. The contribution was making the gating decision endogenous rather than learned-and-frozen. This is the first paper in the cluster where the cascade gate adapts at inference time without retraining.
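An agreement-based gate of the kind described above can be sketched in a few lines. This is an illustration under stated assumptions rather than AutoMix's actual procedure: the function name, the string normalization, and the `min_agreement` parameter are hypothetical, and the samples are assumed to have been drawn from the cheap model already.

```python
from collections import Counter
from typing import List, Tuple


def agreement_gate(samples: List[str], min_agreement: float) -> Tuple[str, float, bool]:
    """Return (majority answer, agreement fraction, escalate?).

    `samples` are answers drawn from the same cheap model, for example at
    different temperatures. If the most common answer accounts for less than
    `min_agreement` of the samples, the gate asks for escalation.
    """
    if not samples:
        raise ValueError("need at least one sample")
    # Light normalization so trivial formatting differences still count as agreement.
    normalized = [s.strip().lower() for s in samples]
    answer, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(normalized)
    return answer, agreement, agreement < min_agreement
```

The gate needs no separate scorer and no retraining; its only tunable is the agreement bar, which is why this style of gating adapts at inference time.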

RouteLLM (Ong et al · arXiv 2406.18665 · June 2024) introduced preference-data-trained routers. Rather than a hand-tuned threshold or a self-consistency probe, RouteLLM trained a small classifier on Chatbot Arena win-rate data to decide between a strong and weak model per query. The abstract reports cost reductions "by over 2 times in certain cases—without compromising the quality of responses"; the paper body and its associated release reported routing that preserved roughly 95% of GPT-4 quality at around a quarter of the cost on standard benchmarks. The architectural primitive: the router is itself a learned model, and its training signal is human preference data on the very models it routes between.
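Once a learned router emits a per-query score, the routing decision itself reduces to a threshold, and the threshold can be chosen to hit a target share of strong-model calls. The sketch below assumes the scores already exist; it is not RouteLLM's API, and `calibrate_threshold`, `route`, and `strong_fraction` are hypothetical names introduced for illustration.

```python
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], strong_fraction: float) -> float:
    """Pick a threshold so about `strong_fraction` of calibration queries
    score at or above it (ties can push the realized share slightly higher)."""
    if not scores:
        raise ValueError("need calibration scores")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must lie in [0, 1]")
    ranked = sorted(scores, reverse=True)
    k = round(strong_fraction * len(ranked))
    if k == 0:
        return float("inf")  # route nothing to the strong model
    return ranked[k - 1]


def route(score: float, threshold: float) -> str:
    """Send a query to the strong model only when its score clears the threshold."""
    return "strong" if score >= threshold else "weak"
```

The design choice worth noting is that the budget, not the score, is the operator's control: the same router can serve different cost targets by recalibrating one number.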


TensorOpera Router (Stripelis et al · arXiv 2408.12320 · August 2024) packaged the pattern as a production system. TO-Router added domain-aware routing — different routers for different task families — and reported up to 40% efficiency gains and up to 30% cost reduction over standalone expert models while maintaining or improving quality. This is the first paper in the cluster from a vendor (TensorOpera) rather than a research lab, and it marks the transition from "FrugalGPT was a clever paper" to "this is how you actually serve LLMs in 2024."

UMR (Universal Model Routing · arXiv 2502.08773 · February 2025) generalized further: a single routing policy (the paper's UniRoute approach) that handles routing across a pool of models — including models unseen at training time — without per-task router retraining. The contribution was the unification: one policy, many task surfaces, dynamic membership.

In early 2026 two further papers appeared within weeks of each other, both engaging with multi-agent and cascade architectures. Di Gioia published Cascade-Aware Multi-Agent Routing: Spatio-Temporal Sidecars and Geometry-Switching (arXiv 2603.17112), explicitly framing routing as a layer above agent execution: a scheduler that models how failure propagates differently through tree-like versus cyclic execution graphs and selects routing geometry accordingly. Around the same window, the Dynamic Model Routing and Cascading Survey (Moslem and Kelleher · arXiv 2603.04445) systematized the field, organizing the literature into six routing paradigms (difficulty-aware routing, human-preference alignment, clustering-based methods, reinforcement-learning approaches, uncertainty quantification, and cascading systems).

Cascadia (arXiv 2506.04203 · June 2025) operationalizes the cascade idea with serving-level co-optimization — joint scheduling of request routing and cascade deployment, handling heterogeneous resource demands and workloads. Cascadia is not a routing paper per se; it is a paper that takes routing and cascading as a settled architectural assumption and asks how to make the serving substrate fast.

The pattern is unmistakable. Eight papers. Three years. One architectural primitive — the heterogeneous routed cascade — operationalized at progressively deeper layers of the stack. None of these papers cite each other in a way that suggests collusion or coordination. The convergence is real and independent.

And it did not stop at the application layer. On June 9, 2026, Anthropic released Claude Fable 5 — its first generally available model of a new capability tier — built with a classifier layer that, on flagged topics, silently hands the response to a lower model (Claude Opus 4.8) mid-session rather than answering directly. On the API this surfaces as stop_reason: "refusal" returned on an HTTP 200 success, reporting which classifier declined; the vendor's own system card reported that on one agentic benchmark (Terminal-Bench) roughly 20.9% of trials fell back to the secondary model mid-trajectory. Read architecturally, this is exactly the cluster's pattern — a gated cascade across heterogeneous models, with a learned gate deciding per-request which model answers — except that it has been productized inside a frontier model's serving stack, as a native safety mechanism, rather than assembled above the model by an integrator. The convergence has now reached the model vendors themselves: routing is no longer only the layer an application builds above the model; a vendor has shipped it as a layer inside the product. This is offered here in the essay's register — not as anyone's victory, but as one more independent arrival. It is a ninth-plus instantiation of the same kernel, from the team with arguably the deepest view of the model market, reached on its own terms and for its own reasons. Convergent invention does not get more legible than the substrate's largest participants independently rebuilding the structure the substrate selects for.

§2 The Math

The reliability function for an N-stage system in which any stage's success constitutes overall success, where pₙ is the probability that stage n succeeds and stage outcomes are independent, is:

R = 1 − ∏ₙ₌₁ᴺ (1 − pₙ)

This is textbook reliability theory. It is also textbook redundant-systems theory, textbook channel coding, textbook population genetics, and textbook ensemble forecasting. It is not new in any sense relevant to a 2026 patent claim.
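The two algebraic steps are short: under independence the probability that every stage fails is the product of the per-stage failure probabilities, and overall success is the complement of that product. A small numeric sketch makes the effect concrete. The function name and the example probabilities are illustrative assumptions, and the calculation inherits the formula's own assumptions (independent stages, and a success at any stage counts as overall success).

```python
from math import prod
from typing import Sequence


def parallel_reliability(stage_success: Sequence[float]) -> float:
    """R = 1 - prod(1 - p_n) for independent stages."""
    for p in stage_success:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each p_n must lie in [0, 1]")
    # Probability that every stage fails, then take the complement.
    all_fail = prod(1.0 - p for p in stage_success)
    return 1.0 - all_fail
```

For three stages with success probabilities 0.7, 0.8, and 0.9, the all-fail probability is 0.3 × 0.2 × 0.1 = 0.006, so R is about 0.994: higher than any single stage, which is the whole appeal of the structure.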

The earliest derivation in print is Jakob Bernoulli's Ars Conjectandi (published posthumously 1713; manuscript dated 1683), which gave the multiplicative form of independent-event probability and from which the parallel-redundancy formula follows in two algebraic steps. R. A. Fisher's 1930 Genetical Theory of Natural Selection used it implicitly when deriving the survival probability of a beneficial mutation across independent lineages. Claude Shannon's 1948 A Mathematical Theory of Communication re-derived equivalent forms in the channel-coding case (multiple independent error-correcting passes; the channel succeeds if any pass succeeds). Manfred Eigen's 1971 work on quasi-species and error catastrophe used the same kernel to model the survival of replicating molecular populations under mutation pressure.

When eight 2024-2026 papers converge on this kernel, they are not inventing it. They are finding it — discovering that the production constraint they face (model heterogeneity, differential cost, differential latency, differential reliability) is best addressed by a structure mathematics has known about for three centuries.

This matters for the question of priority. Architectural priority on the application of a centuries-old mathematical kernel to a new substrate is real — someone was first to wire FrugalGPT, someone was first to deploy a learned router in production, someone was first to think of cascading prompts across two LLMs in a side project. But priority on the kernel belongs to no one alive. The kernel is found, not invented.

The honest framing is convergent invention: many independent teams, applying the same kernel to the same emerging substrate (the 2023-2026 LLM market), arriving at structurally similar architectures. This is the dominant pattern in technology history. Calculus had two independent inventors. The telephone had a contested priority. The integrated circuit had two simultaneous patents (Kilby and Noyce). When the substrate is ready and the kernel is available, multiple teams converge.

§3 The Convergent Operationalization

Each cluster member operationalized a distinct facet of the kernel. A brief reading of each follows, paired against the architecture a parallel internal team has been deploying since 2024:

FrugalGPT (2305.05176) operationalized the cost-quality cascade. Cheap model first, escalate on confidence threshold. The internal team's deployment of the same primitive used keyword-and-context classification to choose the entry tier, then escalated on a stuck-level signal. Same kernel, different gating mechanism.

AutoMix (2310.12963) operationalized self-verification gating. The cheap model votes against itself before escalating. Internal deployment uses a degeneracy-detection probe that runs in parallel with inference; both approaches make gating endogenous to the cheap model rather than learning a separate scorer.

RouteLLM (2406.18665) operationalized preference-trained routing. Train a classifier on preference data, route per-query. The internal team trained a quadrant classifier (math, spatial, code, action) on a domain-typed corpus, a structurally similar move with a different label space. RouteLLM optimizes the cost-quality trade-off between one strong and one weak model; the quadrant approach optimizes for specialist-routing across heterogeneous task families.

TensorOpera Router (2408.12320) operationalized domain-aware production routing. Different routers for different task families, productionized. The internal team's deployment runs eleven specialist workers behind a single front router, with task families dispatched to specialist endpoints. Different vocabulary, same architecture.

UMR (2502.08773) operationalized universal routing policy. One policy across many tasks, including models unseen at training time. The internal team's front router is structurally a UMR — a single routing surface that handles classification, dispatch, fallback, and budget enforcement across all task types.

Cascade-Aware Multi-Agent Routing (2603.17112, Di Gioia) operationalized routing-above-agent-execution. The router selects not just which model but which agent pipeline to invoke, and it reasons explicitly about how failure propagates through the execution graph. This is the layer that the internal team's front router has occupied since late 2024 — routing across both LLM endpoints and agent-style sub-pipelines (orchestrator, vision agents, financial query servers, retrieval pipelines). Di Gioia's framing ("the router is the layer above the agent, and it must model failure-propagation geometry") is the cleanest articulation in the cluster of what has been deployed-but-unpublished elsewhere.

Dynamic Model Routing and Cascading Survey (2603.04445) operationalized taxonomy. The field is now mature enough to survey. The survey organizes the literature into six routing paradigms; it does not propose new architecture but consolidates the cluster.

Cascadia (2506.04203) operationalized serving-substrate optimization. Joint optimization of request routing and cascade deployment, handling heterogeneous resource demands and workloads. This is the lowest layer in the cluster — it assumes routing-and-cascading and asks how to make the substrate fast. The internal team's deployment also runs slot-managed inference servers with cache transfer across specialist endpoints for similar reasons.

The honest reading: each paper instantiates a different facet of the same kernel against a different production constraint. The cluster looks like a coordinated discovery not because the teams coordinated, but because the substrate (model market, production economics, latency-cost tradeoffs) selected for the same architecture from many independent starting points.

§4 The Empty Patent Floor

The granted-patent floor for routing architectures in the LLM space is, at the level of architectural pattern, conspicuously empty. There are filings. There are grants. They cluster in CPC class G06N3/0455 (transformer architectures and adjacent neural-network dispatch methods) and the surrounding G06N3/04 sub-classes. What I cannot find, after the kind of casual prior-art read a non-attorney can perform on public databases, is a granted claim that would block, on architectural-pattern grounds, the eight papers in §1, the deployed systems they describe, or any reasonable derivative thereof. The floor is dense with implementation-specific claims and effectively empty of architectural-pattern claims.

This is not an accident of timing. It is the consequence of a structural pattern that long predates the LLM market: algorithms become commodities through commercial iteration, not through patent enforcement. Three case studies make the pattern visible.

Case 1: Psychometric routing as the canonical algorithm-to-commodity arc. In 2013, Michal Kosinski, David Stillwell, and Thore Graepel at the Cambridge Psychometrics Centre published evidence that the OCEAN five-factor personality model could be predicted with high reliability from public Facebook Likes (PNAS 110:5802-5805, Private traits and attributes are predictable from digital records of human behavior). The academic contribution was a measurement claim: digital trace data is a sufficient substrate for psychometric inference. The commercial iteration that followed is the rest of the story. The kernel was weaponized at scale during the 2016 cycle. Sandra Matz, Michal Kosinski, Gideon Nave, and David Stillwell's subsequent peer-reviewed work (PNAS 2017, 114:12714-12719) demonstrated that psychographically matched advertising creative produced materially higher click-through and conversion rates than demographically matched creative on equivalent budgets, across field experiments reaching over 3.5 million individuals. Hauser, Urban, and Liberali's Website Morphing program at MIT productionized the same kernel a layer further in: dynamic UI/UX adaptation to inferred cognitive style from clickstream signal.

The pattern: the algorithm itself was academic and openly published. The patent floor around psychometric inference from social data is — on architectural-pattern grounds — also conspicuously thin. What was enforceable, and what compounded, was implementation at scale: the data pipelines, the identity-resolution graphs, the experiment infrastructure, the creative-generation systems, the buyer relationships. The algorithm became free precisely because it had to, in order for the commercial substrate around it to grow. Patents on the pattern would have foreclosed the substrate. So the pattern was left in the commons, and value was extracted from the co-specialized assets that the pattern enabled.

Case 2: Arcball, quaternions, and the open-standard adoption pathway. In 1992, Ken Shoemake published ARCBALL: A User Interface for Specifying Three-Dimensional Orientation Using a Mouse (Graphics Interface '92, pp. 151-156) — the quaternion-based trackball-rotation algorithm that became, within roughly a decade, the universal interaction primitive for any 3D viewer. Three.js uses it. Every consumer CAD package uses it. Every architectural exploded-view tool uses it. The "exploded view" itself — the disassembled-view metaphor that lets a viewer see a complex object's internal architecture by separating its layers — became standard visualization vocabulary across CAD, technical illustration, and engineering documentation.

Shoemake never enforced a patent floor on Arcball. The quaternion math itself dates to Hamilton, 1843. The algorithm spread through the community by being correct, openly described, and useful. It is now infrastructure, in exactly the sense that hash tables and B-trees are infrastructure: nobody licenses them, everybody uses them, and the value accrues to the products that compose them well. This is the second pattern — the open-standard adoption pathway — and it is what happens when a technical primitive is too useful and too discoverable to be successfully gated.

Case 3: Omniscient marketing as routing-architecture-by-another-name. Andrew Pole's pregnancy-prediction work at Target — popularized in Charles Duhigg's 2012 New York Times Magazine piece — documented a Bayesian model that combined a small set of product-category signals into a posterior probability of intent, then routed expectant customers toward category-specific outreach. The architectural primitive was: classify the user-state, select the next-best-action, dispatch the message, observe the outcome, update the policy. The model itself was Bayesian; the substrate was loyalty-card sequence data.

By 2026, this kernel has migrated into every credible enterprise martech stack along three axes that need not be name-checked individually to be recognized: real-time identity-graph platforms (cross-device, post-cookie deterministic-plus-probabilistic resolvers); clean-room infrastructures that enable shared inference without raw-data sharing; and sequence-transformer recommender stacks (the SASRec / BERT4Rec lineage and its descendants, surveyed in Kang & McAuley, ICDM 2018, arXiv 1808.09781 and Sun et al., CIKM 2019, arXiv 1904.06690) that do for next-action prediction what the GPT family does for token prediction. The architectural primitive across all three layers is routing: classify the user-state, select the next-best-action, dispatch the message, observe the outcome, update the policy. It is the cluster of §1 with a marketing skin on it. It is the same kernel.
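The five-step loop (classify, select, dispatch, observe, update) can be written down compactly. The sketch below uses an epsilon-greedy bandit per user-state as the policy; that choice is mine, made for brevity, and is not a claim about how Target's model or any martech stack works. The state labels, the `category_hits` signal, and the class name are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Sequence


def classify(signals: Dict[str, int]) -> str:
    """Step 1: map raw signals to a coarse user-state (toy rule, assumed)."""
    return "high_intent" if signals.get("category_hits", 0) >= 3 else "low_intent"


class NextBestActionRouter:
    """Steps 2 and 5: select an action per state, then update from outcomes."""

    def __init__(self, actions: Sequence[str], epsilon: float = 0.1, seed: int = 0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)     # (state, action) -> number of trials
        self.values = defaultdict(float)   # (state, action) -> running mean reward

    def select(self, state: str) -> str:
        # Explore with probability epsilon, otherwise exploit the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[(state, a)])

    def update(self, state: str, action: str, reward: float) -> None:
        # Incremental mean: fold the observed outcome into the running estimate.
        key = (state, action)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]
```

Steps 3 and 4 (dispatch and observe) are whatever the surrounding system does with the selected action; the caller feeds the observed outcome back through `update`. Swap "action" for "model endpoint" and "reward" for "answer accepted" and the same loop is a model router.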

What ties the three cases together: in each one, the architectural pattern itself was either openly published, openly re-derived, or trivially reverse-engineered from the commercial outcomes. None of the three pattern-classes was successfully gated by IP enforcement. What was — and is — defensible is the layer below the pattern: the implementation-specific primitives (a particular gating algorithm, a particular cache-warming strategy, a particular identity-resolution graph) plus the complementary assets (telemetry, calibration, distribution, substrate control) that David Teece's 1986 framework names directly and that §6 will return to.

The patent floor for routing-cluster architectures is empty of architectural-pattern claims because architectural-pattern claims on found mathematics tend to fail novelty review, and because the pattern's commercial logic requires it to remain in the commons. Routing-cluster systems are reliability-theory applications. Reliability theory is centuries old. Patents on the kernel are unsustainable. Patents on a particular gating mechanism, a particular router-training procedure, a particular cache-coherence protocol — those can stand. They have stood. They cluster in G06N3/0455 and G06N3/0464, and they are appropriately narrow.

This is the structural reason the cluster of §1 was able to publish openly without IP-collision risk. It is also the structural reason the parallel internal team's open-source release (wayneColt/multi-llm-router-demo, MIT, April 2026) does not need to claim the architectural pattern in order to derive value from having implemented it.

The defensible value is not the pattern. It is the substrate around the pattern — the operational telemetry, the calibration, the distribution surface, the complementary assets. Patent enforcement is the wrong frame for routing-architecture priority. Co-specialized complementary assets — the stuff §6 will catalogue — are the right one.

§5 A Parallel Internal Record

I am writing this essay as WayneColt, but I have access to a parallel internal team's records and I am going to use them, because the question of when this architecture was first deployed is empirical and decidable.

The earliest dated artifact in the parallel internal team's archive is a multi-LLM router prototype with a verbatim file timestamp of 2024-07-23. The prototype routed text queries across four model endpoints (two cloud, two local) using a keyword-and-context classifier as the dispatch primitive. By September 2024 it had been wired into a production messaging surface. By November 2024 it was running continuous workloads. By April 2026 a sanitized, MIT-licensed implementation of the same primitive was published as wayneColt/multi-llm-router-demo, with a HISTORICAL NOTE in the README preserving the July 2024 priority date.

Paired against the cluster timeline:

Date	External	Internal
May 2023	FrugalGPT (2305.05176)	—
Oct 2023	AutoMix (2310.12963)	—
Jun 2024	RouteLLM (2406.18665)	—
Jul 2024	—	Multi-LLM router prototype (4-model, classifier-dispatched)
Aug 2024	TensorOpera Router (2408.12320)	Production messaging integration
Nov 2024	—	Continuous-workload deployment
Feb 2025	UMR (2502.08773)	Quadrant classifier (math/spatial/code/action) deployed
Jun 2025	Cascadia (2506.04203)	(predates internal slot-managed cache-transfer work)
Mar 2026	Di Gioia (2603.17112) + Survey (2603.04445)	Multi-vendor cascade with stuck-level escalation deployed (10K+ dispatches)
Apr 2026	—	Sanitized open-source release: wayneColt/multi-llm-router-demo (MIT)
Jun 2026	Claude Fable 5 ships classifier→fallback cascade inside the model serving stack	Slot-managed cache transfer across specialist endpoints deployed

The pattern is what convergent invention looks like when you put the timelines side by side. The internal team is sometimes earlier, sometimes later, never first-of-everything, never last-of-everything. The internal team is one of many teams converging on the same architecture. That is the point. None of us got there alone.

What the internal team has that most of the eight papers do not is production-volume operational data. Ten thousand-plus production dispatches on a multi-vendor cascade with stuck-level escalation. Twenty-four-plus continuous overnight inference sessions on heterogeneous workloads. A multi-node deployment across four physical machines (two consumer GPUs, one workstation-class GPU, one accelerator-class TPU stack, one ARM SBC with a dedicated NPU). The papers in §1 are mostly benchmark-validated; the internal deployment is workload-validated. Both kinds of validation matter. The papers establish that the architecture is right. The deployment establishes that the architecture is buildable on commodity hardware by a small team.

A June 2026 internal audit sharpened the record in a way worth reporting as a practitioner finding. The audit found the same pass-or-escalate cascade kernel instantiated not at one layer of the internal stack but at ten distinct layers, each independently designed and each isomorphic to the others: inference-cost routing (cheap model first, escalate on a stuck signal); prompt classification (categorize the request, route to the matching handler, escalate the unrecognized); multi-tier validation ladders (each tier admits or escalates a candidate claim); approval and permission cascades (auto-approve the routine, escalate the consequential to a human); dispatch-layer tier taxonomies (autonomous → notify → ratify → in-person, each tier admitting what it can and passing up what it cannot); hardware accelerator fallback chains (try the fast accelerator, fall back to the next device on failure); training-signal escalation (cheap supervised pass first, escalate hard cases to a stronger teacher); and several smaller instances besides. All of them are the same chain of pass/escalate stages. All of them are governed by R = 1 − ∏(1 − p_i): overall success is the complement of every stage independently failing. The kernel was not designed once and reused; it was re-derived locally ten times by people solving ten different problems, which is the convergent-invention thesis playing out inside a single stack.

The audit also surfaced the inverse failure mode, and this is the part offered as a finding rather than a restatement. A cascade is corrupted not only when a stage overstates its p_i (claiming it can handle a request it cannot, and thereby suppressing an escalation the system needed) but equally when a stage falsely sets its own p_i to zero in order to force an escalation it did not need. Over-escalation is the dual of over-confidence. A validation tier that reflexively kicks every claim upstream "to be safe" degrades the cascade as surely as a tier that waves through claims it should have caught: in the first case the stage contributes nothing to R = 1 − ∏(1 − p_i) and the cascade pays full cost for a stage that adds no information; in the second it skips stages that would have caught the failure. The load-bearing discipline in any deployed cascade is therefore honest per-stage probability estimation: each stage reporting its true competence on the request in front of it, neither inflating it to look capable nor deflating it to offload responsibility. To this team's knowledge, the over-escalation failure mode is under-discussed in the cluster's papers, which concentrate on the over-confidence side (calibrating gates so cheap models do not answer beyond their competence). The symmetric risk, gates that escalate beyond necessity, is just as real in production and just as costly, and it is offered here as a practitioner contribution to the cluster's shared understanding.
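The two failure modes can be put side by side in a toy two-stage model. This is my own illustrative model, not the audit's methodology: it assumes independent stages, a cheap stage followed by one stronger stage, and a gate characterized by two hypothetical error rates, `false_accept` (over-confidence: a wrong cheap answer is kept) and `false_escalate` (over-escalation: a right cheap answer is sent upstream anyway).

```python
from typing import Tuple


def two_stage_cascade(
    p_cheap: float,
    p_strong: float,
    cost_cheap: float,
    cost_strong: float,
    false_accept: float = 0.0,
    false_escalate: float = 0.0,
) -> Tuple[float, float]:
    """Return (success probability, expected cost per request)."""
    for value in (p_cheap, p_strong, false_accept, false_escalate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Cheap answer is right: kept unless the gate needlessly escalates it.
    right_kept = p_cheap * (1.0 - false_escalate)
    right_escalated = p_cheap * false_escalate
    # Cheap answer is wrong: escalated unless the gate wrongly accepts it.
    wrong_escalated = (1.0 - p_cheap) * (1.0 - false_accept)
    escalation_rate = right_escalated + wrong_escalated
    # Escalated requests succeed only if the strong stage gets them right.
    success = right_kept + escalation_rate * p_strong
    cost = cost_cheap + escalation_rate * cost_strong
    return success, cost
```

With an honest gate the success probability reduces to 1 − (1 − p_cheap)(1 − p_strong), the kernel itself. Taking p_cheap = 0.7, p_strong = 0.9, and costs of 1 and 20 as assumed figures: the honest gate gives roughly 0.97 success at cost 7; a fully over-confident gate gives 0.70 at cost 1; a fully over-escalating gate gives 0.90 at cost 21. In this toy setting over-escalation is both less reliable and three times as expensive as honesty.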

There is a second strand of the same internal record worth mentioning, because it is evidence of an adjacent kernel being instantiated in parallel: the team has, for several months, been running an agent whose entire learning loop is organized around what a recent independent synthesis termed win-independent signal, the loose family of intrinsic-motivation, structural-novelty, and self-progress signals that allow an agent to learn in environments where extrinsic reward is sparse, delayed, or absent. Four sub-families recur in that literature: curiosity (Pathak et al., Curiosity-Driven Exploration by Self-Supervised Prediction, ICML 2017, arXiv 1705.05363; Burda et al., Exploration by Random Network Distillation, ICLR 2019, arXiv 1810.12894), saliency (the Itti-Koch bottom-up attention lineage, plus zero-cost neural-network-pruning saliency proxies such as SNIP, Lee et al., ICLR 2019, arXiv 1810.02340), progress (Oudeyer and Kaplan's Learning Progress framework, Frontiers in Neurorobotics, 2007), and graph-structural zero-cost proxies (SynFlow, Tanaka et al., NeurIPS 2020, arXiv 2006.05467). Friston's free-energy minimization (Friston, The free-energy principle: a unified brain theory?, Nature Reviews Neuroscience 2010) sits in the same family at the foundational layer; Langton's edge-of-chaos λ-classification (Langton, Computation at the edge of chaos, Physica D 1990) sits at the regime-classification layer.

The internal team's implementation pillars map cleanly onto the four sub-families: a value model V(s) supplies the progress signal, a perceptual module supplies the structural-novelty signal, a saliency bridge (Itti-Koch-derived) supplies the bottom-up attention signal, and a preference-learning loop supplies the intrinsic-preference signal. Four pillars, four sub-families, one architecture. The team did not coordinate with any of the cited authors. The convergence is structural, in exactly the sense §3 catalogues for the routing cluster. The same pattern repeats: when the substrate is ready (sparse-reward agent learning) and the kernel is available (intrinsic motivation as a unifying frame), independent teams arrive at structurally similar architectures.

I am citing the internal record here not to advance a priority claim — convergent invention forecloses priority claims by construction — but to make the convergent-invention thesis concrete. The same pattern that produced the eight-paper routing cluster is producing, in parallel, an intrinsic-motivation cluster, and it is reproducing within a single stack ten times over. There are at least nine independent teams (the eight paper authorships plus the parallel internal team) on the routing architecture by mid-2026, and the June 2026 vendor-internal cascade described in §1 makes that count higher still. The actual number, across both clusters and across vendor and integrator layers, is almost certainly higher than anyone is counting.

§6 The Teece Moat

David Teece's 1986 paper Profiting from Technological Innovation (Research Policy 15:6, pp. 285-305) is the canonical economics treatment of the question that follows from §4 and §5: if architectural priority is real but unpatentable, what defends the value of having gotten there first?

Teece's answer is co-specialized complementary assets. The innovator does not extract value from the innovation itself (which is imitable) but from the surrounding assets that the innovation requires to be valuable — assets that take time to acquire, that are specific to the innovation, and that the imitator cannot easily replicate.

In Teece's 1986 case studies the complementary assets were manufacturing capacity, distribution channels, brand, and after-sales service infrastructure. In the 2026 routing-architecture case, the complementary assets are different but structurally analogous:

Operational telemetry: Ten thousand-plus production dispatches across heterogeneous workloads is a dataset. The data is specific to the architecture. A new entrant cannot acquire it without running the architecture for an equivalent period.
Cost-routing calibration: The router decisions that minimize cost-quality regret depend on per-vendor pricing, per-model latency distributions, per-task quality patterns. These calibrations are co-specialized — they are valuable only in combination with the routing architecture, and they take continuous workload to acquire.
Operational discipline around budget and failure modes: It takes production exposure to learn which cloud-vendor failures cascade into which user-visible symptoms, which gating thresholds produce which cost regressions, which stuck-level escalations actually recover stuck queries, and (per §5) which stages over-escalate and quietly burn cost without adding information.
Substrate control: A multi-node ecosystem with consumer-GPU, workstation-GPU, NPU, and TPU surfaces is itself a co-specialized asset. The routing architecture is more valuable to a team that owns the substrate than to a team that rents inference capacity from a single cloud vendor.
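
To make the telemetry and calibration items above concrete, here is a minimal Python sketch of how logged dispatches could be folded into a per-task, per-model calibration table and then used to pick a route. Everything in it is an assumption for illustration: the `Dispatch` schema, the model names, the cost units, and the definition of regret as "extra cost paid relative to the cheapest model that meets the quality floor" are hypothetical and are not taken from any of the cluster papers or from the internal stack.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dispatch:
    """One logged routing decision (hypothetical telemetry schema)."""
    task_type: str
    model: str
    cost: float     # spend for this dispatch, arbitrary units
    success: bool   # did the answer pass the quality gate?


def calibrate(log):
    """Aggregate telemetry into {(task_type, model): (success_rate, mean_cost)}."""
    acc = defaultdict(lambda: [0, 0, 0.0])  # [count, successes, total_cost]
    for d in log:
        cell = acc[(d.task_type, d.model)]
        cell[0] += 1
        cell[1] += int(d.success)
        cell[2] += d.cost
    return {key: (s / n, c / n) for key, (n, s, c) in acc.items()}


def cheapest_adequate(table, task_type, min_success):
    """Return the cheapest model whose observed success rate meets the floor."""
    candidates = [
        (mean_cost, model)
        for (t, model), (rate, mean_cost) in table.items()
        if t == task_type and rate >= min_success
    ]
    return min(candidates)[1] if candidates else None


def cost_regret(table, task_type, chosen, min_success):
    """Extra mean cost of `chosen` versus the cheapest adequate model.

    This definition of regret is an assumption made for the sketch.
    """
    best = cheapest_adequate(table, task_type, min_success)
    if best is None:
        return None
    return table[(task_type, chosen)][1] - table[(task_type, best)][1]


# Toy log: a cheap model that passes 3 of 4 times, a costly one that always passes.
log = (
    [Dispatch("summarize", "small", 1.0, True)] * 3
    + [Dispatch("summarize", "small", 1.0, False)]
    + [Dispatch("summarize", "large", 8.0, True)] * 4
)
table = calibrate(log)
print(cheapest_adequate(table, "summarize", min_success=0.9))
print(cost_regret(table, "summarize", "large", min_success=0.7))
```

The point of the sketch is the dependency, not the arithmetic: the table cannot be filled in without a dispatch log, and the log cannot exist without having run the router against real workload.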

The Teece moat is real. It is also honest. It does not rest on a priority claim that the convergent-invention frame forecloses. It rests on the observation that running the architecture for years before it became famous produces calibrations, telemetry, and operational discipline that are difficult to replicate quickly. The convergent-invention paper is the stronger paper because it is defensible against challenge. A priority paper would be vulnerable to the eight cluster members' counter-claims; a Teece-moat paper depends only on what was actually built and run.

This essay's frame is therefore explicitly Teece-not-priority. The parallel internal team is not first. The team is one of many. What the team has is the co-specialized complementary asset stack that follows from running the architecture longer, on more substrate, against more workload, than most of the cluster.

§7 Implications for Enterprise

Industry reporting through April 2026 has documented an emerging enterprise pattern: large ERP vendors are constructing tollgates in front of agent-platform integrations against their systems of record. The reporting frame is security: enterprise customers cannot accept agent platforms that have direct write access to ERP data without a tollgate that audits, rate-limits, and constrains the agent's behavior at the API surface.

This is the ERP-shaped instantiation of a question every enterprise will face by end of 2026: what sits between the agent and the system of record?

The cluster in §1 has been arriving at the answer for three years. Di Gioia's Cascade-Aware Multi-Agent Routing (2603.17112) is the cleanest articulation: the router is the layer above the agent. An enterprise that deploys an agent platform without a routing-and-policy layer above it is deploying an agent platform without a tollgate. That is exactly the gap the ERP tollgates are filling. And as of June 2026, the convergence is no longer confined to the integration layer: when a frontier vendor ships a classifier that re-routes a request to a different model mid-session as a safety mechanism, the vendor is conceding the same architectural point from inside the model — that a gate above the answer, deciding which engine should produce it, is load-bearing infrastructure, not an optional wrapper.

The architectural answer is not new (see §1, §2). The market-readiness is. Enterprises buying agent platforms in mid-2026 are discovering that they need a routing layer above the agent — a layer that classifies the request, decides which agent (or which model, or which deterministic pipeline) should handle it, enforces policy, audits the dispatch, and constrains the cost-and-blast-radius envelope. The ERP tollgate is one instantiation. The vendor-internal classifier cascade is another. There will be more. The cluster has been laying the architectural foundation for all of them.
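
The five responsibilities named above (classify, decide, enforce policy, audit, constrain the cost envelope) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the design of any vendor tollgate or cluster paper: the `Route` and `Router` classes, the keyword classifier, the handler names, and the cost figures are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Route:
    """A dispatch target: an agent, a model, or a deterministic pipeline."""
    name: str
    handler: Callable[[str], str]
    est_cost: float   # estimated cost per dispatch, arbitrary units
    may_write: bool   # can this target modify a system of record?


@dataclass
class Router:
    classify: Callable[[str], str]          # request -> label
    routes: Dict[str, Route]                # label -> target
    budget: float                           # remaining cost envelope
    allow_writes: bool = False              # policy switch (hypothetical)
    audit_log: List[Tuple[str, Optional[str], str]] = field(default_factory=list)

    def dispatch(self, request: str) -> str:
        label = self.classify(request)
        route = self.routes.get(label)
        if route is None:
            decision = "rejected:no_route"
        elif route.may_write and not self.allow_writes:
            decision = "rejected:policy"
        elif route.est_cost > self.budget:
            decision = "rejected:budget"
        else:
            decision = "dispatched"
        # Every decision is audited, including the rejections.
        self.audit_log.append((label, route.name if route else None, decision))
        if decision != "dispatched":
            return decision
        self.budget -= route.est_cost
        return route.handler(request)


def toy_classify(request: str) -> str:
    """Keyword stand-in for a real classifier (assumption for the sketch)."""
    return "erp_write" if "update" in request.lower() else "lookup"


router = Router(
    classify=toy_classify,
    routes={
        "lookup": Route("small-model", lambda r: "answer", 1.0, False),
        "erp_write": Route("erp-agent", lambda r: "written", 5.0, True),
    },
    budget=1.5,
)
print(router.dispatch("What is the invoice total?"))  # within policy and budget
print(router.dispatch("Update invoice 7"))            # blocked by write policy
print(router.dispatch("Who approved it?"))            # blocked by remaining budget
```

The design choice worth noticing is that the agent never sees a request the router has not already classified, checked, and logged; the rejection paths are as much a part of the layer as the dispatch path.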

For the convergent-invention community — the eight cluster members, the frontier vendor that just shipped the pattern inside its own stack, and the parallel teams who arrived at the same architecture independently — there is a shared defensive interest in this moment. Routing-as-the-layer-above-agent-execution is the architectural answer to a security question that enterprises are asking right now. The framing matters. If it is framed as "agent platform X is dangerous, so we need a tollgate," the conversation goes one way. If it is framed as "agent execution is the lower layer; routing-and-policy is the upper layer; you need both," the conversation goes a different way. The second framing is honest, defensible, architecturally correct, and convergent. It is the framing the cluster's papers have been earning for three years.

For practitioners building or buying in this space, the actionable takeaways are:

The architecture is not new. The cluster is real and dates to 2023. Prior art is dense, public, and structurally consistent — and as of June 2026 it includes a frontier vendor shipping the pattern natively inside its serving stack.
The patent floor is not empty of implementation claims, but it appears (subject to review by qualified patent counsel) to be empty of architectural-pattern claims. That follows from two things: the kernel is centuries-old reliability theory, and routing-cluster algorithms tend to become commodities through commercial iteration rather than IP enforcement (see §4, the OCEAN/Arcball/Omniscient-Marketing case studies).
Operational telemetry from running the architecture under real workload is the durable asset. Teece-moat reasoning applies. And honest per-stage probability estimation — calibrating gates against both over-confidence and over-escalation (§5) — is part of that operational discipline.
The enterprise opportunity is the routing layer above agent execution, not a competing agent platform. The cluster has been pointing at this for three years, and the vendors are now pointing at it too.
None of us got here alone. Convergent invention is the honest frame, and it is also the strategically stronger frame. Priority claims are vulnerable. Convergent-invention claims are defensible because they are true.
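
The takeaway about honest per-stage probability estimation can be checked numerically against the reliability function from the abstract, R = 1 − ∏(1 − p_n). The Python sketch below is a rough illustration only. It assumes stage outcomes are independent and that the gate detects a correct answer perfectly, and the stage probabilities are invented for the example rather than measured from any system.

```python
from math import prod


def cascade_reliability(stage_probs):
    """R = 1 - prod(1 - p_n): chance that at least one stage answers correctly.

    Assumes stage outcomes are independent (an assumption of this sketch).
    """
    return 1.0 - prod(1.0 - p for p in stage_probs)


def expected_stages_run(stage_probs):
    """Expected number of stages invoked when a stage runs only if every
    earlier stage failed and the gate detects success perfectly."""
    reach, total = 1.0, 0.0
    for p in stage_probs:
        total += reach        # probability this stage is reached
        reach *= 1.0 - p      # probability it also fails
    return total


claimed = [0.9, 0.9, 0.9]     # optimistic per-stage estimates (invented)
measured = [0.6, 0.7, 0.8]    # what telemetry might show instead (invented)

print(round(cascade_reliability(claimed), 3))
print(round(cascade_reliability(measured), 3))
print(round(expected_stages_run(measured), 2))
```

With these invented numbers, the optimistic estimates promise a reliability of about 0.999 while the measured ones give about 0.976, and the measured cascade invokes about 1.52 stages per request on average. The gap between the two reliability figures is what over-confident gates hide; a gate that escalates when an earlier stage was already right would push the stage count higher without raising R.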

This essay was developed using a multi-frontier-model consensus drafting process with WayneColt friction-layer review. The drafting framework employed multiple commercial frontier models in parallel, followed by reconciliation passes; specific model rosters and consensus metrics are available on request. Final editorial responsibility rests with the named author.