Should Economics Attempt to Replicate the Natural Sciences?

– February 5, 2018

In my last post on E.C. Harwood’s vision of a standardized methodological approach to the study of economics using the scientific method, I (as well as Harwood) assumed that the field of economics should emulate the natural sciences. A phenomenon known as the Replication Crisis, however, is challenging the seemingly unquestionable authority and accuracy that a published study in the natural sciences is expected to have, inviting the question as to whether economists should aspire to the standards of natural scientists or to something better.

The Replication Crisis began in 2015 when a group of 270 contributing authors undertook what they called the Reproducibility Project and attempted to replicate 100 studies that had been published in three of the top psychology journals: Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology. The study found that of the 97 publications that claimed to have statistically significant results, only 35 were replicable (36 percent). That is an alarming figure. Since then, the crisis has moved beyond psychology as scientists across disciplines have taken a critical look at their own methodologies. Though the rates at which studies have failed to replicate are most startling in the fields of psychology (specifically social psychology), biology, and medicine, no field remains untouched.

The reasons for these surprising statistics at a time when the science bureaucracy pumps out research studies at an awesome rate are rather unsurprising. One cause is a scientist’s desire to discover something new and extraordinary. Although this may come in the form of individual ambition, it is propagated far more dramatically by research institutions. Scholarships, job positions, tenure, funding, honors, and publishing deals are all awarded with a bias toward scientists or studies that have purported to find the most astonishing connections. Thus, a study that claims to have found a connection between two variables is far more likely to receive a reward than a study that found no connection, even if multiple studies found no connection. Most hypotheses that are tested turn out to be false, and thus, a majority (or at least a significant number) of publications in scientific journals should report null results. Yet almost every published study reports the presence of an association.

Hard data aid researchers in their quest for discovery. Despite the authority that a claim backed by large data sets wields, researchers with the best intentions can manipulate data to yield their most favored outcome. If one were to study the relationship between the political party in power and the state of the economy, one could ethically obtain an outcome that supports any desired claims, simply by including or excluding certain data points. As an interactive data set published by FiveThirtyEight.com shows, contradictory conclusions with p-values of 0.05 or less (the standard for statistical significance generally required for publishing) can be achieved. Outcomes can easily be deduced; answers require a bit more rigor.
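The point about including or excluding data can be illustrated with a rough simulation. The sketch below is not the FiveThirtyEight data set; it is a hypothetical setup in which every "study" looks at 20 different slices of pure noise, so there is no real effect to find. The function names, the 20 slices, the 30 points per slice, and the simple z-test with known unit variance are all assumptions chosen for illustration.

```python
import math
import random

def two_sided_p(sample):
    """Two-sided z-test p-value for 'the mean is zero', assuming unit variance."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def share_with_a_significant_result(n_studies=1000, n_subsets=20,
                                    n_points=30, seed=0):
    """Fraction of noise-only studies where at least one slice reaches p <= 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Each slice is independent noise: no true relationship exists.
        p_values = [
            two_sided_p([rng.gauss(0, 1) for _ in range(n_points)])
            for _ in range(n_subsets)
        ]
        if min(p_values) <= 0.05:
            hits += 1
    return hits / n_studies

share = share_with_a_significant_result()
print(f"Studies with at least one 'significant' slice: {share:.0%}")
```

Under these assumptions, theory predicts that about 1 - 0.95^20, or roughly 64 percent, of noise-only studies will contain at least one slice that clears the 0.05 bar. A researcher free to choose which slice to report would therefore find something publishable most of the time.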

This kind of leeway (or error) in data, and research more generally, is best understood with an application of Bayesian statistics. Say you are confronted with a field containing 101 rocks, one of which contains a diamond. You then create a device that, with 99 percent accuracy, determines whether a rock contains a diamond. You would think that after surveying some rocks and coming upon one that sounds the alarm, you have found with 99 percent assurance the rock that contains the diamond. In fact, there is only about a 50 percent chance that this rock is indeed the rock containing the diamond. This is because your device will, on average, sound twice in a field of 101 rocks: once for the correct rock and once for a false positive among the other 100. Consider, then, a study of human genes, where one mutation among 20,000 is the cause of a specific disease. To even achieve the same 50 percent confidence level in identifying the correct gene would require an almost inconceivably accurate method. What is the standard of accuracy required for publishing scientific studies? A p-value of 0.05, which is loosely analogous to 95 percent accuracy. In other words, there need be only about a 16 percent chance that your device has identified the correct rock, since a 95 percent accurate device sounds roughly five false alarms for every true one.
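The rock example can be checked with a few lines of arithmetic. The sketch below applies Bayes' rule under one simplifying assumption that the example leaves implicit: "accuracy" means the device is right at the same rate for diamond rocks and for ordinary rocks. The function name is hypothetical.

```python
def posterior_given_alarm(n_items, accuracy):
    """Chance that an item which triggers the alarm is the one true target.

    Assumes exactly one target among n_items, and that the detector is
    correct with probability `accuracy` on targets and non-targets alike.
    """
    prior = 1 / n_items
    true_alarm = accuracy * prior                # target present, alarm sounds
    false_alarm = (1 - accuracy) * (1 - prior)   # no target, alarm sounds anyway
    return true_alarm / (true_alarm + false_alarm)

rocks_99 = posterior_given_alarm(101, 0.99)
rocks_95 = posterior_given_alarm(101, 0.95)
genes_99 = posterior_given_alarm(20_000, 0.99)

print(f"101 rocks, 99% accurate device:    {rocks_99:.3f}")
print(f"101 rocks, 95% accurate device:    {rocks_95:.3f}")
print(f"20,000 genes, 99% accurate method: {genes_99:.4f}")
```

With 101 rocks, a 99 percent accurate device gives a posterior of about 0.497, and a 95 percent accurate device gives about 0.160. For one gene among 20,000, the same 99 percent accuracy yields a posterior below 1 percent; under this model, reaching even odds would take an accuracy of 19,999 in 20,000, or 99.995 percent.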

Though the scale at which studies fail to replicate is shocking, that scientific endeavors come up with false conclusions should not come as a surprise to anyone. Harwood advocated reaching conclusions known as warranted assertions: descriptions of reality subject to modification or dismissal upon later inquiries, which would, in theory, allow for a self-corrective process. There are, unfortunately, institutional problems hindering the process of reviewing and revising.

The first is simply an imprecise reviewing process. In a test run on the reviewers of The British Medical Journal, 221 reviewers were given a study that had been deliberately modified to include eight different major errors in study design, methodology, data analysis, and data interpretation. On average, the reviewers found only two of the mistakes, and only 30 percent of the reviewers recommended rejecting the intentionally flawed study.

A more concerning second problem, however, is an institutional bias toward conservatism. Established older scientists who happen to have built their careers on findings that, for whatever reason, turn out to be false amass power, authority, and influence in their fields. They are, then, highly unlikely to support newer discoveries that disprove their theories. In turn, entire departments, organizations, and careers built on and dedicated to false discoveries use their reviewing apparatus to suppress young scientists pointing out the flaws in theories. As a result, the scientific method’s self-corrective potential is obstructed. As Max Planck famously wrote in a passage published posthumously in 1950, “A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.”

Harwood was aware of the influence that bias has upon the integrity of scientific research. Accordingly, in 1933 he set up the American Institute for Economic Research as a nonprofit that was independent of the influence of funding. He pointed out, “Although most people apparently assume that scientific considerations ‘automatically’ outweigh political considerations … history plainly suggests the opposite.… Seemingly, the more extreme the politics, the more extreme the distortions of scientific procedure.” He went on to say, “The maintenance of a climate favorable to genuinely independent inquiry … is by no means assured.” Economics has not been spared: replication studies in experimental economics have returned a failure-to-replicate rate of 40 percent. The burden thus rests on economics as an institution to avoid the traps that its academic neighbors have fallen prey to by promoting truly independent inquiry, holding high and unwavering methodological standards, and committing itself to an unapologetic pursuit of accurate knowledge.

Henrik Palmer

Henrik Palmer grew up in Tyringham and graduated from Lenox Memorial High School in 2016. After initially intending to get a degree in engineering, he is now majoring in the College of Social Studies, an interdisciplinary major covering history, economics, government, and philosophy, and he plans to double major in physics. Henrik attends Wesleyan University in Connecticut.