What makes this a hard reconciliation task?

The three passages report conflicting effects (a 12.3% gain, a 4.1% loss, and results that flip with tuning) for supposedly the same phenomenon. The solver must identify the hidden variable explaining all three findings rather than dismissing some as errors.

Why did instruction tuning emerge as the reconciling factor?

Passage C explicitly showed base 7B models lost 3% with CoT but instruction-tuned versions gained 8%. This single observation bridges the gap between the positive result in A and the negative results in B, suggesting differences in model preparation, not contradictory science.

Which model's answer was most precise?

Claude Sonnet 4.6 most clearly articulated that CoT is conditionally positive, emerging only with instruction tuning, and explicitly mapped each passage's findings to the underlying variable rather than simply noting the discrepancy.

What efficiency tradeoff occurred here?

Haiku delivered a correct answer at minimal cost (0.00143 USD) and low latency (about 3.5 seconds), while Sonnet added analytical depth and clarity for 3.4x the cost. Opus cost about 18.6x as much as Haiku with little or no improvement over Sonnet.

Benchmark: Reconciling conflicting research

Claude Sonnet 4.6 performed best on this retrieval-style synthesis task, delivering the clearest reconciliation of contradictory findings with explicit mapping of each passage to underlying causes. Sonnet’s answer was 15 percent longer than Haiku’s (279 versus 243 output tokens) and slightly shorter than Opus’s (294), yet it achieved the most precise argument structure and causal reasoning at roughly one-fifth of Opus’s cost (a 5.5x difference).

Task

Three research papers (excerpts below) report different results on the same phenomenon. Reconcile their findings into a single one-paragraph answer to the question: “Does chain-of-thought prompting improve performance on MMLU for models smaller than 10B parameters?”

Passage A: “In our evaluation, 7B-parameter models showed a 12.3% absolute improvement on MMLU with chain-of-thought prompting versus direct prompting.”

Passage B: “For models below 10B parameters, chain-of-thought often introduces errors: we measured a 4.1% performance DECREASE on MMLU in 6, 9B models.”

Passage C: “Chain-of-thought effectiveness depends sharply on instruction tuning. Base 7B models lost 3% on MMLU; the same models after instruction-tuning gained 8%.”

Your answer must cite each passage as [A], [B], or [C] at least once.
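The citation requirement is mechanical enough to check automatically. The following is a minimal sketch, assuming citations appear as single bracketed letters such as [A]; the function name `missing_citations` and the sample answer are illustrative and not part of the benchmark harness.

```python
import re

# Passage labels the task requires the answer to cite at least once.
REQUIRED_LABELS = ("A", "B", "C")

def missing_citations(answer, required=REQUIRED_LABELS):
    """Return the required labels that never appear as [A], [B], or [C]."""
    # Assumption: each citation is a single capital letter in square brackets.
    cited = set(re.findall(r"\[([A-Z])\]", answer))
    return [label for label in required if label not in cited]

# Hypothetical answer text, used only to exercise the check.
sample = "CoT helps instruction-tuned 7B models [A][C] but hurts base models [B]."
print(missing_citations(sample))
```

An empty list means every passage was cited. A combined form such as [A, B] would not be recognized by this pattern, so a stricter or looser regular expression may be needed depending on how answers are formatted.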

Results

Model	Latency (ms)	Input tokens	Output tokens	Cost (USD)	Verdict
Claude Haiku 4.5	3477	220	243	0.00143	Complete, cited all passages, correct but less analytical
Claude Sonnet 4.6	7262	220	279	0.00485	Complete, clear causal mapping, strongest reconciliation
Claude Opus 4.7	5857	307	294	0.02665	Complete, accurate, verbose, marginal gain over Sonnet
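The comparisons drawn from this table are simple ratios and differences of its columns. As a rough sketch for recomputing them, the snippet below copies the table values into a dictionary; the helper names `ratio` and `difference` are illustrative only.

```python
# Values copied from the results table:
# (latency in ms, input tokens, output tokens, cost in USD)
RESULTS = {
    "Haiku 4.5": (3477, 220, 243, 0.00143),
    "Sonnet 4.6": (7262, 220, 279, 0.00485),
    "Opus 4.7": (5857, 307, 294, 0.02665),
}
LATENCY, INPUT_TOKENS, OUTPUT_TOKENS, COST = range(4)

def ratio(model_a, model_b, column):
    """How many times larger model_a's value is than model_b's."""
    return RESULTS[model_a][column] / RESULTS[model_b][column]

def difference(model_a, model_b, column):
    """Absolute difference, model_a minus model_b."""
    return RESULTS[model_a][column] - RESULTS[model_b][column]

print(f"Sonnet vs Haiku cost:    {ratio('Sonnet 4.6', 'Haiku 4.5', COST):.2f}x")
print(f"Opus vs Haiku cost:      {ratio('Opus 4.7', 'Haiku 4.5', COST):.1f}x")
print(f"Opus vs Sonnet cost:     {ratio('Opus 4.7', 'Sonnet 4.6', COST):.2f}x")
print(f"Sonnet vs Haiku latency: {difference('Sonnet 4.6', 'Haiku 4.5', LATENCY)} ms slower")
print(f"Sonnet vs Opus latency:  {ratio('Sonnet 4.6', 'Opus 4.7', LATENCY):.2f}x")
print(f"Sonnet vs Haiku output:  {ratio('Sonnet 4.6', 'Haiku 4.5', OUTPUT_TOKENS):.2f}x")
```

With a single run per model, these figures describe one sample each rather than a stable average.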

Analysis

Haiku 4.5: Haiku correctly identified instruction tuning as the reconciling variable and cited all three passages. Its structure was sound: it noted the 12.3% gain in A, the 4.1% loss in B, and used C to explain the discrepancy by hypothesizing that A measured instruction-tuned models while B measured base models. The response is accurate and concise at 243 output tokens. However, it uses weaker framing (“accounts for the negative results”) rather than explicitly stating that instruction tuning is the threshold variable determining the sign and magnitude of the effect. The final clause about “apparently not instruction-tuned” models in B is inferred rather than asserted with confidence. For a reconciliation task where the goal is to synthesize conflicting claims into a unified model, Haiku’s execution is adequate but stops short of crystallizing the insight.

Sonnet 4.6: Sonnet explicitly framed the answer as a conditional effect: CoT is not reliably beneficial for sub-10B models as a blanket rule; instead, its benefit is “conditionally positive” and emerges only with instruction tuning. This framing directly answers the question (“Does it improve performance?”) with “not reliably, but conditionally.” Sonnet then mapped the passages onto this framework with precision: A’s 12.3% gain likely comes from instruction-tuned models, B’s 4.1% loss likely comes from base or minimally fine-tuned settings, and C provides the explanatory mechanism. The phrase “reasoning noise rather than clarity” adds causal depth without speculation. Sonnet’s use of bold formatting (“does not reliably improve”) emphasizes the core insight. At 279 tokens, the response is only 15 percent longer than Haiku’s yet substantially clearer in its causal argument structure.

Opus 4.7: Opus delivered an accurate synthesis that correctly identified instruction tuning as the moderating variable. Its phrasing (“conditional rather than uniform”) is analogous to Sonnet’s (“conditionally positive”), and it maps the passages correctly. However, Opus was billed for 307 input tokens, about 40 percent more than the 220 recorded for Haiku and Sonnet. This more plausibly reflects a different tokenization of the same prompt than a longer reasoning trace, since reasoning would normally be counted as output rather than input. Opus produced 294 output tokens, 15 more than Sonnet, with no notable analytical advantage. Its phrase “two ends of the spectrum” is metaphorically apt but less precise than Sonnet’s explicit labeling of instruction tuning as the decisive factor. For this task, Opus reaches the right answer at unnecessary expense.

Winner and why

Claude Sonnet 4.6 is the clear winner. It delivered the most precise reconciliation of the three contradictory findings by explicitly framing the answer as a conditional statement: CoT does not reliably improve MMLU for sub-10B models; it helps only when the model is instruction-tuned and hurts otherwise. This framing directly answers the posed question with nuance. Sonnet’s mapping of each passage to the underlying instruction-tuning variable was explicit and direct, qualifying only what the passages themselves leave unstated. It also used structural signaling (bold text, parenthetical reminders of passage letters) to guide the reader through the logic.

Compared to Haiku, Sonnet gave up 3785 milliseconds of latency (7262 ms versus 3477 ms) and 0.00342 USD to gain substantially clearer causal reasoning and a more defensible final claim. Compared to Opus, Sonnet achieved nearly identical correctness and clarity at less than one-fifth the cost, although it ran about 24 percent slower (7262 ms versus 5857 ms). For retrieval-style synthesis tasks, where the goal is to extract and unify the causal structure of conflicting claims, Sonnet’s balance of analytical precision and cost makes it the strongest choice of the three. This benchmark suggests that mid-tier models can outperform larger peers on reasoning-heavy tasks when the problem rewards explicit causal framing over raw computational power.

Takeaways

Instruction tuning is the hidden variable: All three models correctly identified that instruction tuning status determines whether CoT helps or hurts. The synthesized answer is not “CoT sometimes helps” but rather “CoT is conditionally beneficial, contingent on instruction tuning.” This shows that reconciliation of conflicting research often hinges on uncovering confounding variables explicit in the data but not foregrounded in each paper’s framing.
Sonnet delivered the best cost-to-correctness ratio: Haiku’s answer was correct but lacked analytical depth, while Opus achieved the same correctness as Sonnet at 5.5x the cost. Sonnet landed in the sweet spot, offering clear causal reasoning at moderate cost. For synthesis tasks requiring explicit argument structure rather than exhaustive exploration, mid-tier models provide better value.
Explicit framing drives synthesis clarity: Sonnet’s use of the word “conditionally” and the phrase “does not reliably improve” transformed the reconciliation from a post-hoc explanation into a coherent causal model. Haiku stated the same facts but framed them as exceptions rather than central findings. In retrieval-style synthesis, how findings are reframed matters as much as whether all sources are cited.
Citation discipline is baseline, not differentiating: All three models cited each passage at least once and correctly assigned citations to claims. For tasks with explicit citation requirements, all tested models met the standard. The winner was determined by depth of reconciliation, not compliance with citation formatting.
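
The first takeaway can be made concrete by grouping the reported effects by instruction-tuning status and checking that each group has a consistent sign. This is an illustrative sketch only: the tuning status for Passages A and B is not stated in the excerpts and is assumed here, following the inference all three models made, and the numbers are the percentages as quoted in the passages.

```python
from collections import defaultdict

# (passage, tuning status, reported MMLU change in %, status stated in passage?)
# Assumption: the status for A and B is inferred, not given in the excerpts.
FINDINGS = [
    ("A", "instruction-tuned", 12.3, False),
    ("B", "base", -4.1, False),
    ("C", "base", -3.0, True),
    ("C", "instruction-tuned", 8.0, True),
]

def sign_by_status(rows):
    """Map each tuning status to 'gain', 'loss', or 'mixed'."""
    grouped = defaultdict(set)
    for _passage, status, delta, _stated in rows:
        grouped[status].add("gain" if delta > 0 else "loss")
    # A status reconciles the passages only if all its deltas share one sign.
    return {
        status: next(iter(signs)) if len(signs) == 1 else "mixed"
        for status, signs in grouped.items()
    }

print(sign_by_status(FINDINGS))
```

Under these assumptions every instruction-tuned result is a gain and every base-model result is a loss, which is the sense in which one variable accounts for all three passages. Only the two Passage C rows rest on stated facts, so the grouping supports the reconciliation without proving it.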
