Teams of AI agents are deployed on intuition — sometimes they outperform a single model, sometimes they collapse into expensive groupthink. ILLUME turns that guesswork into law: when more agents help, when they fail, how they scale, and what they cost in energy.
Compound AI systems orchestrate swarms of language models — debating, critiquing, voting — with no theory for when collaboration helps. Three structural failures hide inside production pipelines today.
Adding agents can produce synergy, saturation, or collapse depending on topology and error correlation. Accuracy often peaks early, then degrades.
Collaboration multiplies inference calls. Token consumption can rise 14× for trivial or negative gains — a thermodynamic inefficiency no one accounts for.
Reported gains often reflect variance reduction, not real reasoning. Without agent-native measures we can't tell intelligence from ensembling noise.
We model an agent team as a thermodynamic system where intelligence is a function of information velocity and energy expenditure. Four pillars turn that into deployable tools.
Formal models of agentic teamwork on graph and hypergraph dynamics — capturing instant memory cloning and hallucination cascades that human-team theory can't.
Mapping the phase transitions — synergy → saturation → collapse — where sycophantic drift and coordination overhead consume the marginal agent.
The Artificial Collective Intelligence (ACI) factor — grounded in information theory, not human IQ — to compare systems under compute parity.
The Energy–Utility Pareto frontier: the minimum joules required per bit of collective reasoning gain.
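To make the frontier idea concrete, here is a minimal sketch of how a set of measured workflows could be reduced to its energy–utility Pareto frontier. It is an illustration under stated assumptions, not ILLUME's tooling: the `Config` record, the `pareto_frontier` helper, and every number below are hypothetical, and "gain in bits" is simply taken as a given measurement.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """One measured workflow (all fields are hypothetical measurements)."""
    name: str
    joules: float     # energy spent per task
    gain_bits: float  # reasoning gain over a single-agent baseline


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep a config only if it gains strictly more than every config
    that costs the same or less energy."""
    # Cheapest first; on equal energy, the higher gain comes first.
    ordered = sorted(configs, key=lambda c: (c.joules, -c.gain_bits))
    frontier: List[Config] = []
    best_gain = float("-inf")
    for c in ordered:
        if c.gain_bits > best_gain:
            frontier.append(c)
            best_gain = c.gain_bits
    return frontier


# Made-up numbers, for illustration only.
candidates = [
    Config("single", 100.0, 0.00),
    Config("debate3", 320.0, 0.40),
    Config("vote5", 500.0, 0.42),
    Config("debate5", 540.0, 0.45),
    Config("debate10", 1400.0, 0.30),
]

for c in pareto_frontier(candidates):
    print(f"{c.name}: {c.joules:.0f} J, {c.gain_bits:.2f} bits")
```

In this toy data the ten-agent debate falls off the frontier: it costs more energy than every other option and gains less than three of them, which is the kind of configuration the frontier is meant to expose.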
We measure effective team size — how many of your N agents actually contribute independent evidence. A two-parameter fit from a tiny pilot predicts large-team behavior and whether adding agents will ever pay off.
Two interpretable parameters: c sets the efficiency floor; β controls how fast added agents stop contributing. Fitted on teams of N ≤ 5, the law extrapolates to N = 30 with under 12% error.
The same form describes debate, self-correction, noise placebos, and even classical human group studies; each is just a different point in (c, β) space.
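For intuition about why effective team size can stall well below N, the sketch below uses the classical effective-sample-size formula for equally correlated estimates, N_eff = N / (1 + (N − 1)ρ). This is a textbook stand-in, not the (c, β) law described above, whose functional form is not given here; the function name and the correlation value ρ = 0.3 are assumptions chosen for illustration.

```python
def effective_team_size(n_agents: int, rho: float) -> float:
    """Classical effective sample size for n_agents estimates that share
    a common pairwise correlation rho (0 = independent, 1 = identical)."""
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return n_agents / (1.0 + (n_agents - 1) * rho)


# Illustrative correlation only; not a measured value.
rho = 0.3
for n in (1, 5, 10, 30):
    print(n, round(effective_team_size(n, rho), 2))
```

Under this simple model the effective size can never exceed 1/ρ, however many agents are added, so a team whose errors are correlated at ρ = 0.3 tops out near 3.3 independent contributors. That is one elementary way a hard ceiling can arise.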
Grounded in controlled, open-weight experiments across multiple model families and reasoning benchmarks — not anecdotes.
A controlled study of N=10 homogeneous debate teams across three open-weight models on GSM-Hard and MMLU-Hard, decomposing debate failure into three mechanistic pathways.
A two-parameter scaling law that classifies any agent configuration into hard-ceiling, sublinear, or linear regimes — validated across 38 model × task × condition cells.
Estimate large-team behavior from a five-agent pilot, and know whether more agents will ever pay off before you provision the compute.
Agent-native metrics benchmark architectures fairly, separating real reasoning from variance reduction and lucky ensembling.
Place every workflow on the Energy–Utility frontier and design collectives that are scalable, comparable, and sustainable.
Whether you deploy multi-agent systems, fund the science, or want to collaborate on the physics of collective intelligence — we'd like to hear from you.