

Spotting Exaggerated Claims for AI Superiority Over Experts

A cognitive primer for putting machine learning assertions into perspective.

Key points

  • Many artificial intelligence developers are prone to dismiss human expertise or misunderstand it.
  • Artificial intelligence/machine learning proponents tend to overstate the strengths of machines over humans.
  • AI/ML projects can be useful for augmenting human expertise.

This post aims to provide readers with some ways of evaluating unwarranted claims about machine learning (ML). A friend of mine recently sent me an article that described yet another ML triumph.

As I began reading the article, my first reaction was a feeling of deflation: the AI community is continuing to make giant strides, and my skepticism places me on the wrong side of history.

However, as I got into the article, I rebounded. I now think the article is a wonderful example of ML over-reach and can teach us about dissecting the claims of ML developers.

Background: Machine Learning for Treating Cardiac Problems in the Emergency Department (ED)

Mullainathan and Obermeyer (2022) used ML to study how physicians diagnose cardiac problems (e.g., blockage in the coronary arteries) that might lead to heart attacks.

The researchers sampled 246,265 emergency visits at a large, top-ranked hospital, tracking the tests given, the resulting treatments, and the subsequent health outcomes. The researchers then trained an ML/algorithmic model to predict the outcome of the testing.

The ML model used only the information available at the time of the testing decision. The algorithmic model computed the probability that a patient would have a heart attack within 30 days of being examined. It was fed a large amount of data from the electronic health record, including patient demographics, diagnoses, procedures, laboratory results, and vital signs.
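To make the setup concrete, here is a minimal sketch (in Python, and not the authors' code) of this kind of risk model: a classifier that estimates the probability of a heart attack within 30 days using only features available at the moment of the testing decision. The feature names, the synthetic data, and the choice of model below are all illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a 30-day cardiac risk model (illustrative, not the study's code).
# All feature names and the synthetic data are assumptions for demonstration only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
# Stand-ins for EHR fields: demographics, prior diagnoses, labs, vital signs.
X = np.column_stack([
    rng.integers(18, 95, n),          # age
    rng.integers(0, 2, n),            # prior cardiac diagnosis (0/1)
    rng.normal(0.01, 0.05, n),        # troponin-like lab value
    rng.normal(120, 20, n),           # systolic blood pressure
])
# Synthetic outcome: 1 = heart attack within 30 days of the ED visit.
risk = 0.02 + 0.1 * (X[:, 0] - 18) / 77 + 0.1 * X[:, 1] + 2.0 * np.clip(X[:, 2], 0, None)
y = rng.binomial(1, np.clip(risk, 0, 1))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
p_heart_attack_30d = model.predict_proba(X_test)[:, 1]   # predicted 30-day risk
print("AUC:", round(roc_auc_score(y_test, p_heart_attack_30d), 3))
```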

The ML model eventually grew to 16,405 different variables. The researchers found that the ED physicians ordered too many tests for patients with a low chance of having a heart attack but ordered too few tests for patients with a high likelihood of having a heart attack in the next 30 days. Thus, the physicians were inefficient at both ends, wasting time, effort, and money on unnecessary testing for the low-risk patients and risking the lives of the high-risk patients by failing to test them adequately. The physicians were making systematic errors in judgment, demonstrating biases such as availability and representativeness.

That’s pretty intimidating. How can you argue with an ML/algorithmic model containing 16,405 variables? How can you disagree with a data set of 246,265 emergency room visits?

Tips for Assessing AI/ML Demonstrations

I have no background in economics, or in the heuristics and biases paradigm, or in medicine. However, I do have some experience with naturalistic decision-making and expertise. So here are the lessons I want to share about how to assess projects like this.

First, pay attention to a learning confound. The ML model was designed to learn from the data, but the physicians never got any chance to learn. This doesn’t seem like a fair comparison. ED physicians actually don’t get much feedback on the patients they see during a shift. Test results may not arrive until the next shift. Many patients aren’t diagnosed during their stay in the ED. They are released or admitted to the hospital, and the ED physicians don’t learn what happened unless they make the effort to investigate afterwards, which is unlikely. Physicians certainly don’t get a 30-day follow-up.

Second, don’t take “biases” at face value. In this study, the ED physicians were using the checklists, training, and common knowledge they had, but they weren’t using the 16,405 variables that the ML algorithm had, variables the physicians were never shown. Of course the ED physicians were using available and representative information. These heuristics aren’t biases; they are generally useful. Without them, the physicians would be helpless.

Third, look out for smuggled expertise. The ML algorithm used information from the electronic health record, and I suspect that a lot of this information reflected judgment calls based on the expertise of the staff. The ML algorithm was building on the expertise of ED physicians and staff, not substituting for it.

Fourth, don’t be cowed by big-data types of analysis. The ML algorithm grew to 16,405 variables. However, in predicting risk, the empirical optimum was 224 variables, which is still a lot. Yet performance plateaued at only about 20 variables, which is not that many. Twenty variables seems well within the range for training ED physicians to be more accurate.
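For readers who want to see what such a plateau looks like in practice, here is an illustrative sketch (again Python, and again not the authors' analysis): it refits a model on only its top-k most important features and reports how little accuracy changes once k passes a small number. The dataset is synthetic, with only a handful of informative features, purely to demonstrate the pattern.

```python
# Illustrative sketch of a feature-count plateau (not from the paper).
# A synthetic dataset with 100 features, of which only a few carry real signal.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=4000, n_features=100, n_informative=15,
                           n_redundant=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Rank features by importance using a model fit on all 100 of them.
full = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
ranked = np.argsort(full.feature_importances_)[::-1]   # most important first

# Refit on the top-k features and watch the accuracy level off.
for k in (5, 10, 20, 50, 100):
    cols = ranked[:k]
    m = GradientBoostingClassifier(random_state=0).fit(X_tr[:, cols], y_tr)
    auc = roc_auc_score(y_te, m.predict_proba(X_te[:, cols])[:, 1])
    print(f"top {k:3d} features: AUC = {auc:.3f}")
```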

Making Use of ML

There is much to admire in the article, including the authors' ability to gather and analyze complex data. I was particularly encouraged by the last two paragraphs, in which the authors note that one cannot simply assume the algorithm is correct when its results differ from the human predictions, especially when the algorithm is trained on data produced by those same humans, which stacks the deck against them (my third point above).

The last two sentences of the article note the value of using the algorithmic material to help train ED physicians, and I strongly agree. This is how ML efforts can be harnessed—by providing materials that can be used to train and support the physicians.

Unfortunately, these positive suggestions come at the very end, in contrast to the abstract and the rest of the article, which emphasize how the ML algorithms reveal physician inefficiencies and mistakes. My impression in reading the article (and I may be over-reacting) is that the authors believe physician judgment should be subordinated to ML algorithms (the title, after all, is “Diagnosing Physician Error”), revealing a skepticism about human expertise. Why the preoccupation with the ways the ED physicians fall short, without any attempt to document the physicians' strengths, the degree to which they outperform random judgments, or examples of skilled diagnosis that go beyond the standard guidelines?

The Mullainathan and Obermeyer article provides a set of dimensions for assessing ML/algorithm projects in general: Do they allow for learning in the human comparison condition? How good are the decision makers, not just how bad are they? How can we improve the performance of the decision makers instead of concluding that they should be replaced by the algorithm?

By increasing our own sophistication, we can better understand the claims the AI researchers are making and the issues they are ignoring.

References

Mullainathan, S., & Obermeyer, Z. (2022). Diagnosing physician error: A machine learning approach to low-value health care. Quarterly Journal of Economics, 137, 679-727.
