Why you are here: you are interested in bimonthly longform essays, think pieces, and curated news and information summaries related to public interest technologies, ethical tech, responsible innovation, responsible tech, responsible AI, trust and safety, digital citizenship, and tech for good.
Note: If someone forwarded this newsletter to you, please subscribe:
The Reproducibility Problem in Science
While it is undeniable that science provides many benefits to society, science is not perfect, and it is exactly because we place so much value and trust in scientific results that we must be careful with how we do science and how we assess its results. For example, one current problem across many different scientific domains is something called the reproducibility crisis. Essentially, the problem is that many published scientific results (results used, for example, to justify new drugs, new products, new medical procedures, new tenured faculty members, new grants, and new companies) cannot be reproduced, meaning an independent scientific team cannot replicate the results and claims of the original study.
As a concrete example, in a study trying to reproduce the findings in 100 experiments published in top-tier psychology journals, researchers were not able to reproduce most of the results (Bohannon 2015). Similarly, in a Nature survey of 1,576 researchers from many different science disciplines,
more than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments (Baker 2016).
Yet reproducibility is widely held as a pillar of science: “Replication is the criterion for truth, for proof of causation, in science” (Staddon 2017, 27).
To add to the above, and especially following Pseudodragon Newsletter #1, “Hello, World!”, where I emphasized the desire to help steer machine learning and artificial intelligence technologies in a more humanistic, ethical, and responsible direction, in this newsletter I want to dive into the reproducibility problems that arise when machine learning is used in science.
Note that “machine learning in science” is just one of the many domains in which machine learning and artificial intelligence are being used today. First, researchers, often in computer science and computer engineering, work on creating the machine learning and artificial intelligence algorithms themselves. Typically this work is theoretical, with many “real-world” constraints and concerns ignored or abstracted away. Other researchers, often with interdisciplinary strengths, then provide critical assessments of when and how these algorithms are used in science, in products, or in infrastructures, from the perspectives of fairness, justice, transparency, bias, and other unintended consequences. Still other researchers combine algorithms to win ML/AI modeling contests and challenges, such as those sponsored by kaggle,1 drivendata,2 AIcrowd,3 and zindi.4 And engineers apply those algorithms to create products and processes. Each of these domains has reproducibility problems in how it develops and uses machine learning; here we focus only on machine learning in science, that is, on using machine learning results as evidence for the creation of new scientific knowledge.
The Pressure and Increasing Use of Machine Learning in Science
For a variety of reasons, the use of machine learning and artificial intelligence in science has been increasing. As just one recent example, the Department of Energy awarded $3 million for the use of AI and machine learning to process large datasets and to achieve better outcomes in situations where there is little data.5 The cynical researcher’s joke is that you just update and resubmit your past grant applications, substituting “machine learning” for “statistics.” But because science research and the faculty tenure process are heavily dependent on grant funding and research publications, machine learning and AI methods will be used whether or not those approaches are appropriate or the best fit for the research being conducted. Another issue is that because there is an overabundance of machine learning packages, libraries, and examples just a search engine click away, scientists may be copying, pasting, and using code and algorithms without fully understanding the code, how that code may interact badly with their data, or how to interpret the results. Finally, researchers are under tremendous pressure to publish results as conference and journal papers (“publish or perish”): given a choice between submitting early results for publication and taking extra time to perform the due diligence needed to confirm those results, professional incentives favor the former. Thus, not only has the use of machine learning methods in science been increasing, but systemic factors exist that may exacerbate the reproducibility problem.
The Reproducibility Crisis in Machine Learning-Based Science: Data Leakage
Sayash Kapoor and Arvind Narayanan, in their recent paper “Leakage and the Reproducibility Crisis in ML-based Science” (Kapoor and Narayanan 2022), look at just one issue that can cause reproducibility problems: data leakage. But to understand data leakage, let me first sketch at a high level how machine learning models are trained and assessed. In machine learning, we “train” models using a set of data known as the “training dataset.” Each model comes with a multitude of parameters that can be adjusted to produce different results, and the purpose of the “training” phase is to “dial in” those parameters so the model performs as well as it can on the training dataset. Think of the training dataset as the collection of resources a student can study when preparing for an exam.
Once the model has been trained, its performance can be assessed by evaluating the model on a “test dataset.” This evaluation using the test dataset is like the final exam for the student. Without getting into too many details, just know that one key rule in machine learning is that you are not supposed to mix the training and test datasets. For example, if you include data from your test dataset in your training dataset, then you are effectively giving the student the questions and answers that will be on the final exam ahead of time.
Is it cheating when the teacher gives the student the questions and answers on the final exam to study ahead of time?
The point is that mixing the training and test datasets can make the resulting model’s test scores look better than they really are.
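If you would like to see this rule in code, here is a minimal sketch in Python using scikit-learn and a made-up dataset (my own illustration with hypothetical data, not anything from a particular study): the model is fit on the training rows only, and then graded once on the held-out test rows.

```python
# A minimal sketch of the train/test discipline described above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical dataset standing in for any scientific dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 25% of the rows as the "final exam"; the model never sees them while "studying."
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # training ("studying") uses only the training set

print("Test accuracy:", model.score(X_test, y_test))  # the honest exam score
```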
And now we are ready for data leakage, which Kapoor and Narayanan describe like this:
Data leakage is a spurious relationship between the independent variables and the target variable that arises as an artifact of the data collection, sampling, or pre-processing strategy. Since the spurious relationship won’t be present in the distribution about which scientific claims are made, leakage usually leads to inflated estimates of model performance.
Because the model is evaluated as better than it actually is, other researchers will not be able to reproduce those inflated results; the scientific claims are not reproducible.
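To make that concrete, here is a minimal, hypothetical sketch (again my own illustration in Python with scikit-learn, not an example taken from the paper) of one common form of leakage: selecting features using the whole dataset, test rows included, before splitting. The data here are pure noise, so an honest pipeline should score near 50%, while the leaky pipeline typically reports something that looks much better.

```python
# A sketch of how leakage can inflate performance estimates, using synthetic noise data.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10000))  # 100 samples, 10,000 pure-noise features
y = rng.integers(0, 2, size=100)   # labels with no real relationship to X

# LEAKY: the feature selector sees every row (and label), including the future test rows.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
Xtr, Xte, ytr, yte = train_test_split(X_leaky, y, random_state=0)
leaky_score = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

# LEAK-FREE: split first, then fit the feature selector on the training rows only.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
selector = SelectKBest(f_classif, k=20).fit(Xtr, ytr)
clean_score = LogisticRegression(max_iter=1000).fit(
    selector.transform(Xtr), ytr
).score(selector.transform(Xte), yte)

print("With leakage:   ", leaky_score)  # typically well above chance, despite pure noise
print("Without leakage:", clean_score)  # typically near chance (about 0.5)
```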
You may think that scientists would be smart enough not to make a data leakage mistake like that. But recall the discussion above about the pressures to use machine learning methods even without fully understanding them, as well as the pressures to publish “good” results as quickly and as often as possible.
To assess the problem of reproducibility in machine learning-based science, and data leakage in particular, Kapoor and Narayanan surveyed research across 17 different scientific fields, including medicine, bioinformatics, neuroimaging, nutrition, software engineering, toxicology, histopathology, IT operations, neuropsychiatry, genomics, and computer security. They found data leakage errors, as well as other modeling and procedural errors, in all of the fields, affecting 329 published research papers in total.
As a case study, they evaluated the results for a subset of papers and found that “the main findings of each of these papers are invalid due to various forms of data leakage” (Kapoor and Narayanan 2022, 3). In those papers, the original authors had concluded that their machine learning models significantly outperformed plain old statistical models (logistic regression, in case you were wondering). After correcting for the data leakage, Kapoor and Narayanan found that the machine learning models in fact do not perform significantly better than those “old” statistical models.
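As a rough illustration of the kind of baseline check this finding suggests (my own sketch on synthetic data, not the paper’s code or datasets), one simple habit is to compare any complex model against plain logistic regression on the same leakage-free split before claiming that the complex model wins:

```python
# A sketch of a simple-baseline comparison on a leakage-free train/test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in dataset; a real study would use its own data here.
X, y = make_classification(n_samples=2000, n_features=30, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=1)

baseline = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
fancy = RandomForestClassifier(n_estimators=200, random_state=1).fit(Xtr, ytr)

print("Logistic regression baseline:", baseline.score(Xte, yte))
print("Random forest:               ", fancy.score(Xte, yte))
```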
How to Make Science Better
Hopefully this newsletter has highlighted the problem of reproducibility in science, especially regarding the use of machine learning and artificial intelligence algorithms. Interdisciplinary researchers are working on solutions to address this and similar problems, but since these problems are systemic across so many fields in science, long-term efforts will be necessary.
And again, recall that Kapoor and Narayanan focused on the reproducibility problem just in science; there are issues in engineering as well. While the engineering side of the house is more my focus, I am also interested in the use of ML/AI in science. In fact, if this newsletter’s discussion of the use of, and problems with, machine learning in science interests you, you might like to know that there is an upcoming workshop on exactly that topic: the “Reproducibility Workshop” at Princeton (available online), July 28, 2022, 10AM–4PM ET.6 I hope to attend as many sessions as I can and look forward to reporting back in a future newsletter if I hear anything particularly interesting.
Yours,
Kendall
You’re on the free list for The Pseudodragon Newsletter. For the full experience, including access to The TechnoSlipstream Podcast transcripts, podcast episode early access, and other writings available only to supporters, join the community on our Patreon page:
Share
Why not forward this newsletter to a friend? Thanks!
Feedback?
If you are a subscriber just reply to this email.
About
Just joining us? Or maybe you’ve forgotten why you signed up? I’m Kendall Giles, a writer, researcher, and drinker of much coffee. Currently I work at Virginia Tech in the Department of Electrical and Computer Engineering in the College of Engineering in Falls Church, Virginia. I also teach in the Master of Information Technology Program, teach in the ECE Master of Engineering Program, and am a PhD student in the Department of Science, Technology, and Society in the College of Liberal Arts and Human Sciences. I research, write, and speak at the intersection of science, technology, and society, including through the TechnoSlipstream podcast and the Pseudodragon Newsletter.
Contact
Bibliography
Baker, Monya. 2016. “Reproducibility Crisis.” Nature 533 (26): 353–66.
Bohannon, John. 2015. “Many Psychology Papers Fail Replication Test.” American Association for the Advancement of Science.
Kapoor, Sayash, and Arvind Narayanan. 2022. “Leakage and the Reproducibility Crisis in ML-based Science.” arXiv. https://doi.org/10.48550/ARXIV.2207.07048.
Staddon, John. 2017. Scientific Method: How Science Works, Fails to Work, and Pretends to Work. Routledge.