Building the predictive physics of multi-agent AI

The science of agentic scaling.

Teams of AI agents are deployed on intuition. ILLUME turns that guesswork into law: when more agents help, when they fail, how they scale, and what they cost in energy.


1. Run a 5-agent pilot
2. Fit the scaling law
3. Extrapolate to 30

A full 30-agent sweep is expensive. We start with a cheap pilot.

2.1–3.4×

More tokens burned by debate for equal or worse accuracy

85.5%

Peak rate at which agents abandon correct answers to conform

N≤5→30

A 5-agent pilot predicts a 30-agent team

R²>0.99

Scaling-law fit across 38 model × task × condition cells

The problem

Multi-agent AI is engineered by guesswork.

Compound AI systems orchestrate swarms of language models — debating, critiquing, voting — with no theory for when collaboration helps. Three structural failures hide inside production pipelines today.

Non-monotonic scaling

Adding agents can produce synergy, saturation, or collapse depending on topology and error correlation. Accuracy often peaks early, then degrades.

Hidden energy cost

Collaboration multiplies inference calls. Token consumption can rise 14× for trivial or negative gains — a thermodynamic inefficiency no one accounts for.

No native metrics

Reported gains often reflect variance reduction, not real reasoning. Without agent-native measures we can't tell intelligence from ensembling noise.
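One way to make the variance-reduction point concrete is to compute what a plain majority vote achieves with no interaction at all. The sketch below is an illustration, not ILLUME's metric: it assumes binary answers, fully independent agents, and a shared per-agent accuracy `p_correct`, and the function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_agents, p_correct):
    """Probability that a simple majority of independent agents is correct.

    Assumptions (illustrative only): binary answers, independent agents,
    identical per-agent accuracy, odd team size so ties cannot occur.
    """
    if n_agents < 1 or n_agents % 2 == 0:
        raise ValueError("use an odd number of agents so ties cannot occur")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability")
    votes_needed = n_agents // 2 + 1
    # Sum the binomial probabilities of every outcome with a correct majority.
    return sum(
        comb(n_agents, k) * p_correct**k * (1.0 - p_correct) ** (n_agents - k)
        for k in range(votes_needed, n_agents + 1)
    )


for n in (1, 3, 5, 9):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions accuracy rises with team size even though no agent reasons any better. A collaborative system that does not beat this baseline at the same number of model calls has shown ensembling, not added reasoning.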

What we build

From heuristic engineering to a predictive physics of interaction.

We model an agent team as a thermodynamic system where intelligence is a function of information velocity and energy expenditure. Four pillars turn that into deployable tools.

Governing laws

Formal models of agentic teamwork on graph and hypergraph dynamics — capturing instant memory cloning and hallucination cascades that human-team theory can't.

Scaling regimes

Mapping the phase transitions — synergy → saturation → collapse — where sycophantic drift and coordination overhead consume the marginal agent.

Agent-native metrics

The Artificial Collective Intelligence (ACI) factor — grounded in information theory, not human IQ — to compare systems under compute parity.

Thermodynamic limits

The Energy–Utility Pareto frontier: the minimum Joules per bit of collective reasoning gain, measured lifecycle-wide with eCAL.

The scaling law

One equation that classifies any agent team.

We measure effective team size — how many of your N agents actually contribute independent evidence. A two-parameter fit from a tiny pilot predicts large-team behavior and whether adding agents will ever pay off.

R(N) = N_eff/N = 1 / (1 + c(N−1)N^−β)

Two interpretable parameters: c sets the efficiency floor; β controls how fast added agents stop contributing. Fitted on N ≤ 5, the law extrapolates to N = 30 with under 12% error.
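The pilot-then-extrapolate workflow can be sketched in a few lines. The example below is a minimal illustration, not the authors' fitting procedure. It uses a log-linearization of the law, log((1/R − 1)/(N − 1)) = log c − β log N, and ordinary least squares. The pilot values are synthetic, generated from the law itself with hypothetical parameters c = 0.4 and β = 0.5, and every function name is an assumption.

```python
import math


def efficiency(n, c, beta):
    """Per-agent efficiency R(N) = N_eff / N = 1 / (1 + c (N - 1) N^-beta)."""
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def fit_scaling_law(team_sizes, efficiencies):
    """Estimate (c, beta) by least squares on the log-linearized law.

    log((1/R - 1) / (N - 1)) = log(c) - beta * log(N)
    Needs N >= 2 and 0 < R < 1 so that both logarithms are defined.
    """
    xs, ys = [], []
    for n, r in zip(team_sizes, efficiencies):
        if n < 2 or not 0.0 < r < 1.0:
            raise ValueError("need N >= 2 and 0 < R < 1 for the log transform")
        xs.append(math.log(n))
        ys.append(math.log((1.0 / r - 1.0) / (n - 1)))
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("need at least two distinct team sizes")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (c, beta)


# Synthetic, noise-free pilot at N = 2..5 (hypothetical c = 0.4, beta = 0.5).
pilot_sizes = [2, 3, 4, 5]
pilot_eff = [efficiency(n, 0.4, 0.5) for n in pilot_sizes]

c_hat, beta_hat = fit_scaling_law(pilot_sizes, pilot_eff)
n_eff_30 = 30 * efficiency(30, c_hat, beta_hat)
print(f"c = {c_hat:.3f}, beta = {beta_hat:.3f}, predicted N_eff(30) = {n_eff_30:.2f}")
```

With noisy measurements the log transform reweights the errors, so a nonlinear least-squares fit directly on R(N) may be preferable; the linearized version is shown here only because it needs no dependencies.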

β = 0: Hard ceiling. More agents add nothing.

0 < β < 1: Sublinear. Diminishing but real gains.

β ≥ 1: Linear. Every agent still counts.
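The three regimes follow from how N_eff = N · R(N) behaves as N grows: at β = 0 it approaches 1/c, for 0 < β < 1 it grows roughly like N^β / c, and for β ≥ 1 it stays proportional to N. A small sketch of that classification follows. The tolerance used to treat a fitted β as zero is an assumption, as are the function names and the example value c = 0.5.

```python
def effective_size(n, c, beta):
    """N_eff = N * R(N), with R(N) = 1 / (1 + c (N - 1) N^-beta)."""
    return n / (1.0 + c * (n - 1) * n ** (-beta))


def classify_regime(beta, zero_tol=1e-6):
    """Map a fitted beta to one of the three regimes named in the text.

    zero_tol is a hypothetical threshold: a fitted beta is never exactly 0.
    """
    if beta < -zero_tol:
        raise ValueError("the law as stated covers beta >= 0 only")
    if beta <= zero_tol:
        return "hard ceiling"
    if beta < 1.0:
        return "sublinear"
    return "linear"


c = 0.5  # hypothetical value, for illustration only
for beta in (0.0, 0.5, 1.0):
    sizes = [round(effective_size(n, c, beta), 2) for n in (5, 30, 1000)]
    print(f"beta = {beta}: {classify_regime(beta)}, N_eff at N = 5, 30, 1000 -> {sizes}")
```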

The same form describes debate, self-correction, noise placebos and even classical human group studies — just different points in (c, β).

Peer-reviewed research

The work behind the science.

Grounded in controlled, open-weight experiments across multiple model families and reasoning benchmarks — not anecdotes.

CAIS '26 ★ Industry Spotlight

The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate

A controlled study of N=10 homogeneous debate teams across three open-weight models on GSM-Hard and MMLU-Hard, decomposing debate failure into three mechanistic pathways.

Sycophantic conformity: agents adopt the majority answer up to 85.5% of the time
Contextual fragility: correct reasoning destabilized at rates up to 70%
Debate costs 2.1–3.4× more tokens for equal or lower accuracy

Selected as an Industry Spotlight paper — invited to present at the AI Engineer World's Fair (San Francisco, June 2026), in front of 6,000+ practicing engineers.

arXiv:2606.02646

The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

A two-parameter scaling law that classifies any agent configuration into hard-ceiling, sublinear, or linear regimes — validated across 38 model × task × condition cells.

A single functional form fits every condition at R² > 0.99
An N ≤ 5 pilot predicts the N = 30 structural ceiling
Only architectural diversity escapes the ceiling — more talk does not
Energy scales O(N) while accuracy stays sublinear — a Ringelmann Energy Ratio of ~0.06 at N = 10 (≈23× the compute for marginal returns)

From the ILLUME research programme

IEEE JSAC 2026 ⚡ Energy backbone

eCAL: the energy cost of the AI lifecycle

Before you can price collective reasoning in Joules, you have to measure it. eCAL is the first metric to add up the full energy of an AI system — data collection, preprocessing, training, evaluation and inference — end to end, in Joules per bit. It is the measurement backbone beneath ILLUME's Energy–Utility frontier.

Chou, Hribar, Hanžel, Mohorčič, Fortuna · SensorLab, Jožef Stefan Institute

Joules / bit: one lifecycle-wide unit, across 7 OSI layers

2.73×: lower energy per bit at 10k vs 100 inferences (amortization)

dev → inference: two regimes, development-dominated vs inference-dominated
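The amortization effect can be illustrated with a deliberately simplified model: a one-off development energy spread over every inference served, plus a fixed energy per inference. This is a sketch of the idea only, not eCAL's definition. All names and numbers below are hypothetical and are not taken from the paper.

```python
def energy_per_bit(dev_joules, joules_per_inference, n_inferences, bits_per_inference):
    """Lifecycle energy divided by the total bits produced (simplified model)."""
    if n_inferences < 1 or bits_per_inference <= 0:
        raise ValueError("need at least one inference and a positive bit count")
    total_joules = dev_joules + joules_per_inference * n_inferences
    total_bits = bits_per_inference * n_inferences
    return total_joules / total_bits


def dominant_regime(dev_joules, joules_per_inference, n_inferences):
    """Report which term contributes more energy at this usage level."""
    if dev_joules > joules_per_inference * n_inferences:
        return "development-dominated"
    return "inference-dominated"


# Hypothetical values chosen only to show the shape of the effect.
DEV_J, INF_J, BITS = 1000.0, 5.0, 1.0

low_use = energy_per_bit(DEV_J, INF_J, 100, BITS)
high_use = energy_per_bit(DEV_J, INF_J, 10_000, BITS)
print(f"100 inferences:    {low_use:.2f} J/bit ({dominant_regime(DEV_J, INF_J, 100)})")
print(f"10,000 inferences: {high_use:.2f} J/bit ({dominant_regime(DEV_J, INF_J, 10_000)})")
print(f"reduction factor:  {low_use / high_use:.2f}x")
```

In this model the crossover between the two regimes sits at n = dev_joules / joules_per_inference; beyond it, energy per bit flattens toward the per-inference cost and further amortization yields little.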

Why ILLUME

Scaling laws gave training a shared unit. We give it to inference.

Predict before you spend

Estimate large-team behavior from a five-agent pilot — and know if more agents will ever pay off before you provision the compute.

Compare under compute parity

Agent-native metrics benchmark architectures fairly, separating real reasoning from variance reduction and lucky ensembling.

Optimize for Joules

Place every workflow on the Energy–Utility frontier and design collectives that are scalable, comparable, and sustainable.

Get in touch

Let's make agentic AI predictable.

Whether you deploy multi-agent systems, fund the science, or want to collaborate on the physics of collective intelligence — we'd like to hear from you.
