Correlation Can Be Good: A Literature Review

Wisdom of Crowds Research — Process Matters, Not Just Correlation

A widespread view holds that correlated errors are the enemy of collective intelligence. This review examines the empirical and theoretical record and argues the framing is too simple. What matters is not how much correlation exists among estimates, but what process generates it.

1. Classic Wisdom of Crowds — the Baseline

The modern scientific study of collective estimation begins with a brief note in Nature by Francis Galton. At a 1906 country fair in Plymouth, England, roughly 800 visitors paid sixpence each to guess the dressed weight of a slaughtered ox. Galton analysed the 787 legible entries and reported that the median estimate was 1,207 pounds — within nine pounds of the actual weight of 1,198 pounds (Galton, 1907). The crowd, Galton concluded, performed as well as any expert, despite almost none of the contestants being professional butchers.

This demonstration encapsulates the core phenomenon: when independent estimates are aggregated, individual errors — some positive, some negative — partially cancel, leaving the aggregate closer to the truth than any single contributor. The conditions that favour this cancellation were synthesised by Surowiecki (2004), who identified four requirements: (1) diversity of opinion — each person holds some private information; (2) independence — opinions are formed without undue influence from others; (3) decentralisation — contributors draw on local or specialist knowledge; (4) aggregation — a mechanism exists to combine private judgments into a collective answer.

Baseline empirical result: Across a wide range of estimation tasks — weight, area, temperature, historical dates, probability forecasts — the simple mean or median of independent estimates reliably outperforms the average individual contributor (Galton, 1907; Surowiecki, 2004; Mannes et al., 2014). The crowd advantage grows with crowd size, though with diminishing returns, and it is largest when individual errors are large and uncorrelated.
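The cancellation mechanism is easy to reproduce in simulation. The sketch below uses illustrative parameters, not Galton's data: 800 unbiased but noisy guessers, whose individual errors are large while the crowd mean's error is small.

```python
import random
import statistics

random.seed(0)

TRUTH = 1198   # true dressed weight of the ox, in pounds (Galton, 1907)
N = 800        # crowd size, roughly the number of fairgoers

# Each contestant is unbiased but noisy (the 150 lb spread is illustrative).
estimates = [TRUTH + random.gauss(0, 150) for _ in range(N)]

crowd_error = abs(statistics.mean(estimates) - TRUTH)
mean_individual_error = statistics.mean(abs(x - TRUTH) for x in estimates)

print(f"crowd-mean error:      {crowd_error:.1f} lb")
print(f"mean individual error: {mean_individual_error:.1f} lb")
```

With independent, unbiased errors the crowd-mean error shrinks roughly with the square root of crowd size, while the typical individual error does not shrink at all.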

2. The Formal Basis: Diversity and Error Cancellation

The mechanism underlying wisdom of crowds can be stated precisely. Let x̄ be the crowd mean, T the true value, and xi each individual estimate. The squared error of the crowd mean decomposes as:

Crowd MSE = Mean individual MSE − Variance of estimates
i.e., (x̄ − T)² = (1/n) Σ (xi − T)² − (1/n) Σ (xi − x̄)²

This identity — sometimes called the diversity prediction theorem (Page, 2007) — shows that crowd error never exceeds average individual error, and falls below it by exactly the variance of the estimates. More variance means more error cancellation and a wiser crowd. The identity holds exactly for any set of estimates; it is an algebraic fact, not an assumption.
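Because the identity is algebraic, it can be checked with any numbers at all; a minimal verification:

```python
estimates = [1050, 1180, 1210, 1250, 1400]   # arbitrary guesses
T = 1198                                      # true value

n = len(estimates)
x_bar = sum(estimates) / n

crowd_mse = (x_bar - T) ** 2
mean_individual_mse = sum((x - T) ** 2 for x in estimates) / n
variance = sum((x - x_bar) ** 2 for x in estimates) / n   # population variance

# Diversity prediction theorem: exact for any set of estimates.
assert abs(crowd_mse - (mean_individual_mse - variance)) < 1e-9
print(f"crowd MSE {crowd_mse:.1f} = "
      f"mean individual MSE {mean_individual_mse:.1f} - variance {variance:.1f}")
```

Swapping in any other estimates or any other T leaves the assertion true, which is the sense in which the theorem is an identity rather than an empirical claim.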

The implication for correlation is immediate. The identity always holds, but correlated errors change the size of its terms: when people err in the same direction, their estimates cluster together, so the variance term shrinks relative to the mean individual error and less cancellation is available. In the limiting case of a fully shared, systematic error — everyone guessing too high by the same amount — the shared component contributes nothing to the variance term and does not cancel. The crowd mean inherits the bias in full.

In practice, Hogarth (1978) and others observed that people who share common backgrounds, receive common training, or use common information sources tend to make similar errors. These shared errors do not cancel on aggregation. The question is how serious this problem is in practice, and whether it can be remedied.
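A small simulation makes the asymmetry concrete (the bias and noise magnitudes are illustrative assumptions): idiosyncratic noise cancels in the mean, while a shared bias survives aggregation untouched.

```python
import random
import statistics

random.seed(1)
TRUTH = 100.0
N = 1000
BIAS = 20.0   # shared systematic error (illustrative)

independent = [TRUTH + random.gauss(0, 30) for _ in range(N)]
shared_bias = [TRUTH + BIAS + random.gauss(0, 30) for _ in range(N)]

err_independent = abs(statistics.mean(independent) - TRUTH)
err_biased = abs(statistics.mean(shared_bias) - TRUTH)

print(f"crowd error, idiosyncratic noise only: {err_independent:.2f}")
print(f"crowd error, shared bias added:        {err_biased:.2f}")
# The second crowd mean sits near TRUTH + BIAS: the shared component
# does not cancel, however large the crowd.
```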

3. The Correlation Critique — Why People Worry

The most influential empirical study of correlation-induced crowd failure was conducted by Lorenz, Rauhut, Schweitzer, and Helbing (2011) at ETH Zürich. In a controlled laboratory experiment (N = 144), participants answered factual estimation questions across five successive rounds. A control group received no feedback between rounds; treatment groups were shown the average estimate of the group or the full distribution of all responses.

The results were striking and negative. Even mild exposure to others' estimates caused three damaging effects: a social influence effect (estimates converged, destroying the diversity needed for error cancellation, without a matching gain in accuracy); a range reduction effect (the narrowing range of estimates became more likely to exclude the truth entirely, misleading any external observer relying on the spread); and a confidence effect (participants became more confident in their estimates despite becoming no more accurate).

The Lorenz et al. (2011) finding: Social influence reduced diversity without improving accuracy, producing a crowd that was more correlated, more confident, and no wiser. The authors concluded that "even mild social influence can undermine the wisdom of crowd effect."

This finding resonated widely and shaped the prevailing view that independence is a strict requirement for crowd wisdom. Many practitioner reviews concluded that polling, deliberation, and social media all risk contaminating crowd estimates by introducing correlation. The Lorenz result is correct as stated — it accurately describes what happens when social information takes a particular form. The problem is that this particular form is not the only form.

4. The Counterargument — When Correlation Is Benign or Beneficial

The key shortcoming of the "correlation is bad" framing is that it conflates two very different causal stories. Estimates can become correlated because (a) people are anchored by uninformative social information that moves everyone toward a shared but wrong position — the mechanism in Lorenz et al. — or because (b) people update toward a more accurate peer whose information is genuinely useful. Case (a) produces harmful correlation; case (b) produces correlation that improves or at worst does not worsen collective accuracy. The level of correlation does not determine which case obtains. The process that generates the correlation does.
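The two causal stories can be contrasted directly in simulation. In the sketch below (parameters illustrative), both protocols shrink the spread of estimates by exactly the same factor, so they are indistinguishable by any correlation or variance measure; yet one converges on an arbitrary wrong anchor and the other on the most accurate peer. Note that the accurate peer is identified here using the truth itself, a deliberate simplification: in real systems, calibration must arise from the process, as the studies below examine.

```python
import random
import statistics

random.seed(2)
TRUTH = 100.0
N = 200

initial = [TRUTH + random.gauss(0, 30) for _ in range(N)]

def crowd_error(xs):
    return abs(statistics.mean(xs) - TRUTH)

# (a) Uninformative anchoring: everyone moves halfway toward a salient
#     but wrong focal value.
ANCHOR = 160.0
anchored = [0.5 * x + 0.5 * ANCHOR for x in initial]

# (b) Calibrated updating: everyone moves halfway toward the most
#     accurate peer (found here via the truth -- a simplification).
best_peer = min(initial, key=lambda x: abs(x - TRUTH))
calibrated = [0.5 * x + 0.5 * best_peer for x in initial]

print(f"initial crowd error:   {crowd_error(initial):.2f}")
print(f"after anchoring (a):   {crowd_error(anchored):.2f}")
print(f"after calibration (b): {crowd_error(calibrated):.2f}")
# Both updates halve the spread of estimates, yet (a) drags the crowd
# toward the anchor while (b) keeps it near the truth.
```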

4.1 Calibrated influence

If the most accurate individuals exert the most influence in a group — formally, if the correlation between influence weight and pre-influence accuracy is positive — then social updating tends to pull everyone toward the truth. The resulting correlation among estimates is therefore a symptom of group learning rather than group error propagation. Becker, Brackbill, and Centola (2017) show this empirically: in decentralised social networks, where influence is distributed across many nodes, this calibration property tends to hold naturally because the most accurate individuals provide information to many others while receiving relatively little themselves.

4.2 Structured deliberation

Navajas et al. (2018) tested a deliberation paradigm with a live audience of 5,180 people. Participants first answered general-knowledge questions independently, then deliberated in small groups of five, reached a single consensus estimate, and finally provided revised individual estimates. Averaging the consensus decisions from just four groups of five was substantially more accurate than the wisdom-of-crowds average of the full original sample of 5,180 independent estimates.

The consensus estimates were by construction perfectly correlated within each group — every member of a group converged on the same number. Yet the resulting crowd-of-groups was more accurate. The reason is that deliberation tended to expose and eliminate outlier errors, and the winning consensus in each small group tended to be near the truth. The correlation was beneficial because the process generating it was calibrated: bad estimates were argued down rather than averaged in unchanged.
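A stylised model reproduces the qualitative pattern. The assumptions are ours, not Navajas et al.'s: estimates are log-normally distributed (heavily right-skewed, as is typical of open-ended numeric guesses), so the mean of even a very large independent crowd is biased upward; deliberation is modelled as each group arguing its outliers down and settling near the group median.

```python
import random
import statistics

random.seed(3)
TRUTH = 100.0
SIGMA = 0.8    # log-scale noise: heavy right skew (illustrative)
REPS = 300     # replications, to average away sampling noise

def estimate():
    # Log-normal estimates: median near TRUTH, mean biased high.
    return TRUTH * random.lognormvariate(0, SIGMA)

# Benchmark: the plain mean of a large independent crowd.
crowd = [estimate() for _ in range(5000)]
crowd_mean_error = abs(statistics.mean(crowd) - TRUTH)

# Deliberation model (an assumption): each group of five settles near
# its median; we average the consensus values of just four groups.
group_errors = []
for _ in range(REPS):
    consensuses = [statistics.median(estimate() for _ in range(5))
                   for _ in range(4)]
    group_errors.append(abs(statistics.mean(consensuses) - TRUTH))
mean_group_error = statistics.mean(group_errors)

print(f"mean of 5,000 independent estimates, error: {crowd_mean_error:.1f}")
print(f"mean of 4 deliberating groups, error:       {mean_group_error:.1f}")
```

Under these assumptions the four small groups beat the large independent crowd because outlier removal attacks the skew-induced bias directly, which no amount of independent averaging can do.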

4.3 Sorted social information

Jayles et al. (2021) tested whether the format of social information matters. In a large online experiment (N ≈ 1,000), participants estimated quantities and were then shown previous participants' estimates in one of several formats: sorted (ascending order), random order, or an aggregate (geometric mean only). Sorted information improved individual accuracy more reliably than random-order information. The sorted condition introduced systematic directional correlation among updaters — everyone shifted in the same direction — yet this correlation was on average beneficial because sorting made the range of informed opinion legible and allowed participants to place their own estimate in context.

4.4 The mathematical condition

The general condition under which a correlation-inducing process improves collective accuracy follows from the diversity prediction theorem. After social influence, both the mean individual error and the variance of estimates change. Crowd accuracy improves if and only if mean individual error decreases by more than the variance decreases — i.e., the signal component of updating exceeds the diversity cost. This condition holds when calibration (the correlation between influence weight and accuracy) is positive and larger in magnitude than herding (the correlation between influence weight and being close to the group mean). Neither condition is captured by looking at the correlation among estimates alone.

The process question: Two social influence protocols can produce identically correlated final estimates. One may improve collective accuracy and one may worsen it. The difference lies in whether influence flowed from accurate to inaccurate individuals (calibrated) or from average to outlier individuals (herding). Measuring correlation alone cannot distinguish these cases.

5. Joshua Becker and Collaborators — A Research Programme

Joshua Becker (UCL School of Management, formerly Northwestern University) and collaborators have built a systematic empirical and theoretical programme around the question of when social influence helps or harms collective accuracy in numeric estimation tasks.

5.1 Network structure and crowd accuracy (Becker, Brackbill & Centola, 2017)

Becker, Brackbill, and Centola (2017) conducted controlled web-based experiments in which participants estimated factual quantities across two rounds, with network-mediated social information between rounds (N ≈ 1,200; 5 questions per session; multiple replications). The study varied whether networks were decentralised (every node receives information from many peers) or centralised (most information flows through a single hub node).

The mechanism identified was as follows: in decentralised networks, accurate individuals tend to have higher effective centrality (they influence more peers than they are influenced by), creating a positive correlation between influence weight and initial accuracy. The resulting convergence of estimates is therefore toward a more accurate position, and the resulting correlation among estimates is a sign of learning rather than error amplification.

Citation: Becker, J., Brackbill, D., & Centola, D. (2017). Network dynamics of social influence in the wisdom of crowds. Proceedings of the National Academy of Sciences, 114, E5070–E5076. doi:10.1073/pnas.1615978114

5.2 Partisan crowds (Becker, Porter & Centola, 2019)

Becker, Porter, and Centola (2019) extended the analysis to politically charged factual questions, testing whether politically homogeneous networks amplify partisan biases as predicted by "group polarisation" theories. In two web-based experiments, participants answered factual questions before and after observing the estimates of peers in a politically homogeneous social network.

Against the predictions of polarisation theory, social information exchange in homogeneous partisan networks increased accuracy and reduced polarisation. Even when all group members shared a partisan prior, exposure to peer estimates moved beliefs toward the truth. The paper argues that calibrated updating — people recognising that peers who are somewhat closer to the truth provide useful signal — outweighed echo-chamber reinforcement in these conditions.

Citation: Becker, J., Porter, E., & Centola, D. (2019). The wisdom of partisan crowds. Proceedings of the National Academy of Sciences, 116, 10717–10722. doi:10.1073/pnas.1817195116

5.3 Individual vs. crowd accuracy after communication (Pilgrim & Becker, 2024)

The most theoretically complete paper in this programme is Pilgrim and Becker (2024), available at arXiv:2407.00199. The paper derives a mathematical decomposition of how social communication changes both individual and group accuracy simultaneously, and validates the decomposition against six previously published datasets.

Core mathematical result (Pilgrim & Becker, 2024): When opinions converge after social influence, the change in mean individual squared error is always better (more negative) than the change in crowd MSE, by exactly the reduction in variance:

ΔIndividual MSE = ΔCrowd MSE + ΔVariance

Since variance almost always decreases when people communicate (opinions converge), individuals reliably improve their accuracy even when the crowd average gets worse.
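The identity can be verified on a toy update rule. In the sketch below (the herding rule is an illustrative assumption), everyone moves 70% of the way toward a group mean that a salient wrong anchor has nudged off-truth: individuals improve sharply, the crowd worsens, and the three changes satisfy the identity exactly.

```python
import random
import statistics

random.seed(4)
TRUTH = 100.0
N = 2000

def mean_individual_mse(xs):
    return sum((x - TRUTH) ** 2 for x in xs) / len(xs)

def crowd_mse(xs):
    return (statistics.mean(xs) - TRUTH) ** 2

def variance(xs):
    m = statistics.mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

before = [TRUTH + random.gauss(0, 30) for _ in range(N)]

# Toy herding rule (an assumption): everyone moves 70% of the way
# toward the group mean, shifted 10 units off-truth by an anchor.
m = statistics.mean(before)
after = [0.3 * x + 0.7 * (m + 10.0) for x in before]

d_ind = mean_individual_mse(after) - mean_individual_mse(before)
d_crowd = crowd_mse(after) - crowd_mse(before)
d_var = variance(after) - variance(before)

# The three changes are linked by an exact identity.
assert abs(d_ind - (d_crowd + d_var)) < 1e-6
print(f"change in mean individual MSE: {d_ind:9.1f}  (individuals improved)")
print(f"change in crowd MSE:           {d_crowd:9.1f}  (crowd got worse)")
print(f"change in variance:            {d_var:9.1f}  (opinions converged)")
```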

The paper further decomposes the crowd-level accuracy change into distinct components and derives testable predictions for each.

These predictions are validated against six datasets: Gurcay et al. (2015), Navajas et al. (2018), Jayles et al. (2021), Becker et al. (2017), and two further studies. Across all datasets, the proportion of group-question trials in which an individual improves their accuracy after social influence substantially exceeds 50%, even in datasets where the crowd average worsens. This empirical regularity follows directly from the mathematical decomposition: variance reduction is near-universal when people communicate, and variance reduction mechanically produces individual improvement.

The implication for the correlation debate is direct. Correlation is not the right diagnostic variable. Variance, calibration, herding, and centralisation are the diagnostic quantities. Correlation is a consequence of these processes, not their cause. Two datasets with identical post-influence correlations can show entirely opposite effects on collective accuracy depending on whether calibration or herding dominated the updating process.

Citation: Pilgrim, C., & Becker, J. (2024). Communication reliably improves individual but not group accuracy in numeric estimates. arXiv:2407.00199. arxiv.org/abs/2407.00199

5.4 Estimates are not decisions (Pilgrim & Becker, 2024 — CI Conference)

A separate but related line of work addresses what happens when a crowd's estimates are converted into a collective decision by majority vote. Even if the crowd's average estimate correctly identifies the best option, a plurality vote can select a different option. The "vote–average decoupling" arises because individual rankings of options can disagree with the ranking implied by the averages, and majority vote aggregates rankings rather than magnitudes.

Pilgrim and Becker (2024, CI abstract) show analytically that at empirically calibrated parameters (signal-to-noise ratio S ≈ 0.5, three options, groups of ten), this decoupling occurs approximately 20% of the time. As options become harder to distinguish (smaller S), the failure rate approaches 50%. The result holds even with perfectly independent estimates — the failure is not a product of correlation but of the aggregation method itself.
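A minimal simulation illustrates the decoupling. The parameterisation is an assumption of ours — three options, groups of ten, a quality gap equal to half the estimate noise as a stand-in for S ≈ 0.5 — so the exact disagreement rate need not match the paper's analytic 20%; the point is that vote and average disagree at a substantial rate even with fully independent estimates.

```python
import random

random.seed(5)

QUALITY = [1.0, 0.5, 0.0]   # true qualities of three options (gap = 0.5)
NOISE = 1.0                 # per-estimate noise sd, so gap/NOISE = 0.5
GROUP = 10
TRIALS = 10_000

decoupled = 0
for _ in range(TRIALS):
    # Each member independently estimates every option's quality.
    est = [[q + random.gauss(0, NOISE) for q in QUALITY] for _ in range(GROUP)]

    # Averaging: pick the option with the highest mean estimate.
    means = [sum(member[k] for member in est) / GROUP for k in range(3)]
    avg_choice = max(range(3), key=lambda k: means[k])

    # Voting: each member backs their personally top-ranked option;
    # plurality wins (ties broken by option index, for simplicity).
    votes = [0, 0, 0]
    for member in est:
        votes[max(range(3), key=lambda k: member[k])] += 1
    vote_choice = max(range(3), key=lambda k: votes[k])

    decoupled += (vote_choice != avg_choice)

print(f"vote and average disagree in {decoupled / TRIALS:.1%} of trials")
```

Because voting discards the magnitude of each member's preference, near-ties in individual rankings flip votes without moving averages, which is the source of the decoupling.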

This finding is relevant to the correlation argument because it shows that the choice of aggregation method (average vs. vote) matters as much as the level of correlation among estimates. A wise crowd can make an unwise choice simply because the wrong aggregation method was applied.

Citation: Pilgrim, C., & Becker, J. (2024). Estimates are not decisions: How wise crowds can make unwise choices. Collective Intelligence Conference 2024.

6. Related Empirical Work

6.1 Gurcay, Mellers & Baron (2015)

Gurcay, Mellers, and Baron (2015) studied the effect of social influence on estimation accuracy in a two-round paradigm. Approximately 200 participants answered 16 factual questions before and after receiving social information, in both bounded (percentage) and unbounded (open-ended number) formats.

Social interaction improved accuracy on average. Improvement was larger when participants updated toward accurate peers and smaller when they anchored on the group average indiscriminately. The paper introduced an early distinction between informative and uninformative social influence that prefigures the calibration framework later formalised by Becker et al. (2017) and Pilgrim & Becker (2024). The Gurcay et al. (2015) dataset is one of the six empirically validated in Pilgrim & Becker (2024).

Citation: Gürçay, B., Mellers, B.A., & Baron, J. (2015). The power of social influence on estimation accuracy. Journal of Behavioral Decision Making, 28, 250–261. doi:10.1002/bdm.1843

6.2 Navajas, Niella, Garbulsky, Bahrami & Sigman (2018)

Navajas et al. (2018) tested a structured deliberation protocol at a live science festival in Argentina with 5,180 participants. After independent first-round estimates on six general-knowledge questions, participants formed random groups of five, deliberated briefly, and produced a single consensus estimate per group. They then provided revised individual estimates.

Averaging the consensus decisions of just four groups of five was substantially more accurate than the wisdom-of-crowds average of the original 5,180 independent estimates. This result demonstrates that small structured interactions — which necessarily produce correlated estimates within each group — can dramatically outperform fully independent aggregation. The within-group correlation was a product of quality filtering: deliberation eliminated extreme errors, and consensus decisions were drawn toward the better-informed group members.

Citation: Navajas, J., Niella, T., Garbulsky, G., Bahrami, B., & Sigman, M. (2018). Aggregated knowledge from a small number of debates outperforms the wisdom of large crowds. Nature Human Behaviour, 2, 126–132. doi:10.1038/s41562-017-0273-4

6.3 Jayles et al. (2021)

Jayles et al. (2021) investigated whether the format in which social information is presented affects updating accuracy. In a large online experiment (N ≈ 1,000) spanning 10 estimation questions, participants received previous participants' estimates in one of three formats: sorted ascending order, random order, or an aggregate (geometric mean only).

Sorted social information improved individual accuracy more reliably than random-order information, and the aggregate-only condition outperformed both. The sorted condition introduced systematic directional correlation among updaters, yet this correlation was on average beneficial because sorted presentation enabled participants to locate the range of informed opinion and place their own prior estimate in context. The Jayles et al. (2021) dataset is also validated in Pilgrim & Becker (2024).

Citation: Jayles, B. et al. (2021). Impact of sharing full versus averaged social information on social influence and estimation accuracy. Journal of the Royal Society Interface, 18, 20210325. PMID:34314654

6.4 Lorenz, Rauhut, Schweitzer & Helbing (2011) — contextualised

As discussed in Section 3, Lorenz et al. (2011) provide the most widely cited empirical demonstration of harmful social influence. A methodological note important for interpreting this result: participants received either the group mean or the full distribution of all current estimates — neither of which is calibrated to individual accuracy. There was no mechanism by which accurate individuals could exert greater influence than inaccurate ones. The design therefore tests the specific case of uncalibrated, homogenising social influence, not social communication in general. Later work (Becker et al., 2017; Pilgrim & Becker, 2024) shows that adding calibration to the communication protocol reverses the Lorenz et al. finding.

Citation: Lorenz, J., Rauhut, H., Schweitzer, F., & Helbing, D. (2011). How social influence can undermine the wisdom of crowd effect. Proceedings of the National Academy of Sciences, 108, 9020–9025. doi:10.1073/pnas.1008636108

6.5 Summary table

Study | N | Social information format | Crowd accuracy | Individual accuracy
Galton (1907) | 787 | None (independent) | Median within 0.8% of truth | —
Lorenz et al. (2011) | 144 | Group mean or full distribution (uncalibrated) | Worsened (diversity destroyed) | No improvement
Gurcay et al. (2015) | ~200 | Social interaction | Improved on average | Improved on average
Becker et al. (2017) | ~1,200 | Peer estimates via network (calibrated) | Improved (decentralised networks) | Improved
Navajas et al. (2018) | 5,180 | Small-group deliberation to consensus | Strong improvement (4 groups of 5 beat 5,180 independents) | Revised estimates improved
Jayles et al. (2021) | ~1,000 | Sorted / random / aggregate | Improved (especially aggregate > sorted) | Improved
Becker et al. (2019) | — | Homogeneous partisan peer network | Improved; polarisation reduced | Improved
Pilgrim & Becker (2024) | 6 datasets | Meta-analysis + decomposition | Mixed; depends on calibration vs. herding | Reliably improved (near-universal)

7. Open Questions and Conclusion

7.1 When does calibration hold in practice?

The theoretical case for beneficial correlation rests on the calibration assumption: accurate individuals must exert disproportionate influence. In structured experiments with decentralised networks this holds empirically (Becker et al., 2017). But in real-world social media, influence is more often a function of social status, posting frequency, or algorithmic amplification than of epistemic accuracy. The conditions under which naturally occurring social systems achieve calibrated influence remain an open empirical question.

7.2 Individual vs. crowd accuracy trade-off

Pilgrim and Becker (2024) document a systematic split: individuals nearly always improve after social influence, but the crowd average does not always improve. This raises the practical question of which outcome organisations should target. If a firm uses the crowd mean to make a decision, crowd accuracy is the relevant metric; if it relies on individual judgments post-discussion, individual accuracy matters. The appropriate benchmark depends on the aggregation method used downstream. The finding that voting and averaging can diverge even for a perfectly accurate crowd (Pilgrim & Becker, 2024, CI abstract) underscores that aggregation method and estimation accuracy are two separate design choices.

7.3 The role of task type

Most empirical work in this tradition uses numeric estimation tasks because they permit clean measurement of error via mean absolute error or mean squared error. Whether the calibration-vs-herding framework generalises to qualitative decisions, probability forecasting tournaments, or creative tasks remains to be established. The formal decomposition of Pilgrim & Becker (2024) applies to any task where error is measurable on a numeric scale, but the empirical validation is currently limited to factual estimation.

7.4 Conclusion

The central argument of this literature: The received view that "correlation is bad for crowds" is not wrong, but it is incomplete. It correctly identifies one mechanism — uncalibrated homogenisation, as demonstrated by Lorenz et al. (2011) — that destroys diversity without improving accuracy. It misses a second mechanism — calibrated social learning, as demonstrated by Becker et al. (2017) and formalised by Pilgrim & Becker (2024) — in which communication improves both individual and often collective accuracy even as estimates become more correlated. The distinguishing variable is the process generating the correlation, not the correlation itself. What matters is whether more accurate individuals exert more influence.

The diversity prediction theorem (Page, 2007) is not challenged by any of the empirical evidence reviewed here; it is an algebraic identity. What the new work shows is that social influence can simultaneously reduce variance (weakening cancellation) and reduce mean individual error (reducing the need for cancellation), and that the second effect reliably dominates the first at the individual level across the six datasets studied. The crowd-level outcome depends additionally on whether calibration or herding dominates the updating process, which is a function of network structure and information format rather than of communication per se.

Practitioners designing collective intelligence systems should therefore ask not "how do I keep people independent?" but "how do I ensure that accurate people have more influence?" The former is often impossible and sometimes counterproductive; the latter is a tractable design goal with clear empirical guidance.
