Cross-Component Interference Undermines LLM Agent Stacking

Researchers studying LLM agent scaffolding have found that stacking more components does not reliably improve performance, a phenomenon they term cross-component interference. In controlled experiments across 32 component subsets on HotpotQA and GSM8K using Llama-3.1-8B and 70B, a single-tool agent outperformed a fully equipped system by 32 percent on HotpotQA, while a three-component subset beat the all-inclusive configuration by 79 percent on GSM8K.

What Cross-Component Interference Is

The dominant assumption in LLM agent construction holds that layering scaffolding components, including planning modules, tool use, memory, self-reflection, and retrieval, produces progressively more capable systems. The paper challenges that assumption directly by defining cross-component interference (CCI) as the measurable degradation that occurs when components interact destructively rather than additively [1].

CCI is not a marginal edge case in the researchers’ framing. It is a systematic property of multi-component agent architectures that emerges from the combinatorial interactions among scaffolding layers. When two or more components are active simultaneously, their combined effect on task performance can be worse than either component operating alone, a result that contradicts the additive logic underlying most agent design defaults [1].

Experimental Design and Benchmarks

To measure CCI rigorously, the researchers constructed a full factorial experiment covering all 2^5 = 32 subsets of five scaffolding components: planning, tool use, memory, self-reflection, and retrieval. Every possible combination was tested, for a reported total of 96 distinct experimental conditions [1].
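
To make the size of the design concrete, the following minimal sketch enumerates every subset of the five components named above. The identifier names are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations

# Component names as listed in the article; the identifiers are illustrative.
COMPONENTS = ("planning", "tool_use", "memory", "self_reflection", "retrieval")

def all_subsets(components):
    """Return every subset of `components`, from the empty set to the full set."""
    subsets = []
    for k in range(len(components) + 1):
        for combo in combinations(components, k):
            subsets.append(frozenset(combo))
    return subsets

SUBSETS = all_subsets(COMPONENTS)
print(len(SUBSETS))  # 32
```

The empty subset corresponds to a bare model with no scaffolding, and the full subset corresponds to the All-In configuration.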

The benchmarks selected were HotpotQA, a multi-hop question-answering task that stresses reasoning and retrieval, and GSM8K, a grade-school mathematics benchmark that tests multi-step arithmetic reasoning. Both Llama-3.1-8B and Llama-3.1-70B were used to assess whether findings held across model scales. Each condition was run with up to 10 random seeds so that differences between configurations could be tested for statistical significance [1].

Key Performance Findings

The All-In configuration, meaning the system with all five components active, was consistently suboptimal across both benchmarks. On HotpotQA, a single-tool agent achieved an F1 score of 0.233 compared to 0.177 for All-In, a 32 percent relative improvement that reached statistical significance at p = 0.023. On GSM8K, a three-component subset scored 0.43 against All-In’s 0.24, a 79 percent relative improvement significant at p = 0.010 [1].
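
The percentages quoted above are relative gains over the All-In score, not differences in absolute points. A short check using only the four scores reported in this section (the function name is a hypothetical helper):

```python
def relative_improvement(best, baseline):
    """Fractional gain of `best` over `baseline`."""
    return (best - baseline) / baseline

hotpot = relative_improvement(0.233, 0.177)  # single-tool agent vs All-In, HotpotQA F1
gsm = relative_improvement(0.43, 0.24)       # three-component subset vs All-In, GSM8K

print(f"{hotpot:.1%}")  # 31.6%
print(f"{gsm:.1%}")     # 79.2%
```

Both values round to the 32 percent and 79 percent figures cited in the text.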

The optimal number of active components, k, varied by task, ranging from k = 1 to k = 4 depending on the benchmark and model scale. Scale sensitivity was a notable finding: at the 70B parameter level, some component combinations that degraded performance at 8B instead produced gains. Even so, the All-In configuration still trailed the best-performing subset at 70B, indicating that the problem does not simply dissolve with larger models [1].

The findings replicated across model families. Experiments using Qwen2.5 produced consistent results, and the interference pattern held when prompts were paraphrased, ruling out sensitivity to specific prompt wording as a confounding explanation [1].

Analytical Methods and Submodularity Violations

The researchers applied several analytical techniques to characterize the interference structure. A main-effects regression fitted to the experimental data produced an R-squared of 0.916, an adjusted R-squared of 0.899, and a leave-one-out cross-validation score of 0.872, indicating a strong but not complete fit and suggesting that interaction terms carry meaningful variance [1].
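
A main-effects fit of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' code: the scores come from a hypothetical toy function containing one pairwise clash, and the fit uses the fact that a full two-level factorial with +/-1 coding is orthogonal, so each ordinary-least-squares main effect reduces to a simple average. Because the main-effects model cannot represent the pairwise term, R-squared falls short of 1.

```python
from itertools import product

P = 5  # order: planning, tool_use, memory, self_reflection, retrieval

def toy_score(x):
    # Hypothetical scores, not from the paper: tool_use (x[1]) and retrieval (x[4])
    # help, memory (x[2]) hurts slightly, and tool_use clashes with
    # self_reflection (x[3]).
    return 0.20 + 0.10 * x[1] + 0.05 * x[4] - 0.03 * x[2] - 0.08 * x[1] * x[3]

designs = list(product((0, 1), repeat=P))   # all 32 on/off settings
y = [toy_score(x) for x in designs]
n = len(y)
y_mean = sum(y) / n

# Recode 0/1 as -1/+1. The coded columns are mutually orthogonal and sum to
# zero, so the OLS intercept is the mean of y and each slope is mean(c_j * y).
coded = [[2 * v - 1 for v in x] for x in designs]
beta = [sum(c[j] * yi for c, yi in zip(coded, y)) / n for j in range(P)]

fitted = [y_mean + sum(b * cj for b, cj in zip(beta, c)) for c in coded]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
sst = sum((yi - y_mean) ** 2 for yi in y)                # total variation
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - P - 1)            # penalize the 5 predictors

print(round(r2, 3), round(adj_r2, 3))
```

The gap between R-squared and 1 in this toy case is entirely due to the interaction term, which is the same reading the article gives to the reported fit.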

Exact Shapley values were computed to attribute performance contributions to individual components across all coalition structures. The more consequential finding came from a submodularity analysis: 183 out of 325 tested component pairs or groupings violated the submodularity property, a rate of 56.3 percent [1].
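
Both analyses can be illustrated on a small value function. The sketch below is not the paper's code: the value function is hypothetical, and the way tests are counted here (one test per unordered component pair and per background subset, 80 in total for five components) is an assumption that need not match the 325 tests reported above.

```python
from itertools import combinations
from math import factorial

PLAYERS = ("planning", "tool_use", "memory", "self_reflection", "retrieval")

def value(subset):
    """Hypothetical score for a component subset (illustrative numbers only)."""
    s = set(subset)
    score = 0.20
    score += 0.10 * ("tool_use" in s) + 0.05 * ("retrieval" in s)
    score -= 0.03 * ("memory" in s)
    if {"tool_use", "self_reflection"} <= s:
        score -= 0.08   # destructive pair
    if {"tool_use", "self_reflection", "retrieval"} <= s:
        score += 0.12   # three-way synergy
    return score

def subsets_of(items):
    """Yield every subset of `items` as a frozenset."""
    for k in range(len(items) + 1):
        yield from (frozenset(c) for c in combinations(items, k))

def shapley(players, v):
    """Exact Shapley values: weighted average marginal gain over all coalitions."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for s in subsets_of(others):
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (v(s | {i}) - v(s))
        phi[i] = total
    return phi

def submodularity_violations(players, v, tol=1e-12):
    """Count tests where adding j to the background raises i's marginal gain."""
    violations = tests = 0
    for i, j in combinations(players, 2):
        rest = [p for p in players if p not in (i, j)]
        for s in subsets_of(rest):
            gain_small = v(s | {i}) - v(s)            # gain of i without j
            gain_large = v(s | {i, j}) - v(s | {j})   # gain of i with j present
            tests += 1
            violations += gain_large > gain_small + tol
    return violations, tests

phi = shapley(PLAYERS, value)
violations, tests = submodularity_violations(PLAYERS, value)
print(phi)
print(violations, "violations in", tests, "tests")
```

By construction, the Shapley values sum to the difference between the All-In score and the empty-set score. The violations in this toy case come from the three-way synergy, which makes a component worth more once its partners are already active.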

Submodularity violations matter in practice because greedy component selection, which adds the component with the highest marginal value at each step, is guaranteed to find a near-optimal subset only when the value function is monotone and submodular, that is, when a component's marginal gain never grows as more components become active. With 56.3 percent of tests violating that property, the guarantee does not apply, so practitioners who build agent systems by sequentially adding the next best-performing component have no formal assurance of approaching an optimal configuration [1].
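
A toy case shows how greedy addition can stall when the value function is not submodular. The scores below are invented for illustration and do not come from the paper; the point is only that a strong single component can block the search from reaching a pair that works better together.

```python
# Hypothetical scores for every subset of three components (not from the paper).
SCORES = {
    frozenset(): 0.10,
    frozenset({"tool_use"}): 0.30,
    frozenset({"memory"}): 0.20,
    frozenset({"retrieval"}): 0.20,
    frozenset({"tool_use", "memory"}): 0.25,
    frozenset({"tool_use", "retrieval"}): 0.25,
    frozenset({"memory", "retrieval"}): 0.60,   # synergy greedy never reaches
    frozenset({"tool_use", "memory", "retrieval"}): 0.35,
}
COMPONENTS = ("tool_use", "memory", "retrieval")

def greedy_select(components, scores):
    """Add the best next component until no addition improves the score."""
    chosen = frozenset()
    while True:
        candidates = [chosen | {c} for c in components if c not in chosen]
        if not candidates:
            return chosen
        best = max(candidates, key=scores.__getitem__)
        if scores[best] <= scores[chosen]:
            return chosen
        chosen = best

def exhaustive_select(scores):
    """Return the highest-scoring subset by checking all of them."""
    return max(scores, key=scores.__getitem__)

greedy = greedy_select(COMPONENTS, SCORES)
best = exhaustive_select(SCORES)
print(sorted(greedy), SCORES[greedy])  # ['tool_use'] 0.3
print(sorted(best), SCORES[best])      # ['memory', 'retrieval'] 0.6
```

With five components an exhaustive search covers only 32 subsets, so checking all of them is feasible where the evaluation budget allows.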

The Three-Body Synergy Finding

Among the interference patterns identified, the researchers also detected one notable positive interaction. The combination of Tool Use, Self-Reflection, and Retrieval showed a three-body synergy with an interaction coefficient of +0.175 and a 95 percent confidence interval of [+0.003, +0.351] [1].
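
One common way to isolate a three-way term from factorial data is inclusion-exclusion over the eight on/off settings of the triple, which cancels every main effect and pairwise effect. The sketch below uses that approach on a hypothetical value function. It is an assumption that this resembles the paper's estimator, which is not described in this article and evidently also produced a confidence interval.

```python
from itertools import combinations

def three_way_interaction(v, triple, base=frozenset()):
    """Inclusion-exclusion interaction of `triple`, measured on top of `base`."""
    total = 0.0
    for k in range(4):
        for t in combinations(triple, k):
            sign = (-1) ** (3 - k)   # + for the full triple and for singles
            total += sign * v(base | frozenset(t))
    return total

def toy_value(s):
    # Hypothetical scores, not from the paper: main effects, one pair penalty,
    # and a bonus of 0.12 that appears only when all three are active.
    score = 0.20 + 0.10 * ("tool_use" in s) + 0.05 * ("retrieval" in s)
    if {"tool_use", "self_reflection"} <= s:
        score -= 0.08
    if {"tool_use", "self_reflection", "retrieval"} <= s:
        score += 0.12
    return score

TRIPLE = ("tool_use", "self_reflection", "retrieval")
coef = three_way_interaction(toy_value, TRIPLE)
print(round(coef, 3))  # 0.12
```

The recovered value equals the bonus built into the toy function, while a purely additive value function would yield zero.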

The authors explicitly flag this result as exploratory. The confidence interval is wide, with the lower bound barely clearing zero, and the finding was not the primary hypothesis under investigation. The caution is appropriate given the multiple comparisons involved in a full factorial design across 32 subsets, and the paper does not recommend treating this specific triplet as a reliable design prescription [1].

Implications for Agent System Design

The paper’s central recommendation is that maximally equipped agent defaults should be replaced by task-specific subset selection guided by interaction-aware analysis. The evidence across both benchmarks, both model scales, and multiple model families indicates that the common practice of enabling all available scaffolding components is more likely to introduce destructive interference than to maximize performance [1].

The submodularity violation rate makes the problem harder to solve with simple heuristics. Practitioners cannot rely on greedy addition or greedy removal strategies to reliably find good subsets. Instead, the researchers point toward systematic evaluation of component interactions, using tools such as Shapley value decomposition and factorial experimental designs, as the more defensible path to identifying configurations that work for a specific task and model combination [1].

FAQ

Q. Does increasing model size from 8B to 70B eliminate cross-component interference? No. While the 70B scale changes which specific component combinations produce gains versus losses, the All-In configuration still underperforms the best-performing subset at 70B. Some combinations that hurt at 8B help at 70B, but scale does not remove the underlying problem [1].

Q. Are these findings specific to Llama models, or do they generalize? The researchers replicated the CCI pattern using Qwen2.5 in addition to Llama-3.1-8B and 70B. The interference also persisted when prompts were paraphrased, suggesting the findings are not artifacts of a particular model family or prompt formulation [1].

Q. Why is greedy component selection unreliable given these results? Greedy selection carries a near-optimality guarantee only when the performance function is monotone and submodular. The study found that 56.3 percent of component groupings violated submodularity, meaning the marginal value of adding a component can grow rather than shrink depending on which other components are already active, so the step that looks best now may not lead toward the best subset [1].

Q. What is the practical minimum experiment needed to identify a good component subset? The paper used a full factorial design across all 32 subsets of five components with up to 10 seeds per condition. The authors recommend interaction-aware analysis, including Shapley value computation, rather than one-at-a-time ablations, which would miss the combinatorial interference effects [1].

Q. Should the three-body synergy among Tool Use, Self-Reflection, and Retrieval be used as a design rule? The authors explicitly label this finding exploratory. The 95 percent confidence interval for the interaction coefficient barely excludes zero at the lower bound, and the result emerged from a multi-comparison factorial design. It should not be treated as a reliable prescription without further targeted validation [1].

Key takeaways

Adding more scaffolding components to an LLM agent does not reliably improve performance. All-In configurations were consistently outperformed by smaller subsets on both HotpotQA and GSM8K [1].
A single-tool agent beat the fully equipped system by 32 percent on HotpotQA (F1 0.233 vs 0.177, p=0.023), and a three-component subset beat All-In by 79 percent on GSM8K (0.43 vs 0.24, p=0.010) [1].
Submodularity violations occurred in 56.3 percent of tested component groupings, making greedy component selection strategies formally unreliable for finding near-optimal agent configurations [1].
CCI replicated on Qwen2.5 and was robust to prompt paraphrasing, indicating the phenomenon is not specific to one model family or prompt wording [1].
The researchers recommend replacing maximally equipped defaults with task-specific subset selection using interaction-aware analytical methods such as Shapley value decomposition and full factorial evaluation [1].