Underspecification is a known issue in statistics, where observed effects can have many possible causes. D’Amour, who has a background in causal reasoning, wanted to know why his own machine-learning models often failed in practice. He wondered if underspecification might be the problem here too. D’Amour soon realized that many of his colleagues were noticing the same problem in their own models. “It’s actually a phenomenon that happens all over the place,” he says.
D’Amour’s initial investigation snowballed, and dozens of Google researchers ended up looking at a range of different AI applications, from image recognition to natural language processing (NLP) to disease prediction. They found that underspecification was to blame for poor performance in all of them. The problem lies in the way that machine-learning models are trained and tested, and there’s no easy fix.
The paper is a “wrecking ball,” says Brandon Rohrer, a machine-learning engineer at iRobot, who previously worked at Facebook and Microsoft and was not involved in the work.
Same but different
To understand exactly what’s going on, we need to back up a bit. Roughly put, building a machine-learning model involves training it on a large number of examples and then testing it on a bunch of similar examples that it has not yet seen. When the model passes the test, you’re done.
What the Google researchers point out is that this bar is too low. The training process can produce many different models that all pass the test but—and this is the crucial part—these models will differ in small, arbitrary ways, depending on things like the random values given to the nodes in a neural network before training starts, the way training data is selected or represented, the number of training runs, and so on. These small, often random, differences are typically overlooked if they don’t affect how a model does on the test. But it turns out they can lead to huge variation in performance in the real world.
In other words, the process used to build most machine-learning models today cannot tell which models will work in the real world and which ones won’t.
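A toy sketch can make this concrete. It is not the researchers' code, and everything in it is an assumption made for illustration: a two-weight linear model, a made-up data set in which two input features always carry the same value, and plain gradient descent as the training process. Because the data cannot say which feature matters, the split between the two weights is decided by the random starting values, yet every run scores essentially perfectly on the held-out test.

```python
import random

def make_data(xs):
    # Both features carry the same value, so the data cannot say which one matters.
    return [((x, x), x) for x in xs]

def predict(w, features):
    return w[0] * features[0] + w[1] * features[1]

def mse(w, data):
    return sum((predict(w, f) - y) ** 2 for f, y in data) / len(data)

def train(seed, data, lr=0.1, steps=500):
    """Gradient descent on squared error; returns (initial weights, final weights)."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]  # random starting weights
    init = tuple(w)
    for _ in range(steps):
        g0 = g1 = 0.0
        for (f0, f1), y in data:
            err = predict(w, (f0, f1)) - y
            g0 += 2 * err * f0 / len(data)
            g1 += 2 * err * f1 / len(data)
        w[0] -= lr * g0
        w[1] -= lr * g1
    return init, tuple(w)

train_data = make_data([i / 10 for i in range(-10, 11)])
test_data = make_data([i / 10 + 0.05 for i in range(-10, 10)])  # unseen but similar

# Same data, same procedure; only the random seed changes.
models = {seed: train(seed, train_data)[1] for seed in (0, 1, 2)}
for seed, w in models.items():
    print(seed, [round(v, 3) for v in w], "test MSE:", round(mse(w, test_data), 6))
```

In this sketch the two weights always end up summing to about 1, which is all the test can check. How that total is divided between them is left over from the random starting point, and the test has no way to see it.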
This is not the same as data shift, where training fails to produce a good model because the training data does not match real-world examples. Underspecification means something different: even if a training process can produce a good model, it could still spit out a bad one because the process has no way to tell the difference. Neither would we.
The researchers looked at the impact of underspecification on a number of different applications. In each case they used the same training processes to produce multiple machine-learning models and then ran those models through stress tests designed to highlight specific differences in their performance.
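The shape of such a stress test can be sketched in a few lines. This is a hypothetical illustration, not the procedure used in the paper: the three "models" are hand-written weight pairs assumed to have come out of the same training pipeline, the true answer is assumed to depend only on the first feature, and the two stress tests (zeroing or sign-flipping the second feature) are invented for the example. The point is only the pattern: score every model on the standard test, score it again on each stress test, and compare the spread across models.

```python
# Hypothetical models: weight pairs (w0, w1) giving predictions w0*a + w1*b.
# Assumed to come from the same training pipeline; each pair sums to 1.
models = {"run_a": (0.5, 0.5), "run_b": (0.9, 0.1), "run_c": (-0.2, 1.2)}

def predict(w, a, b):
    return w[0] * a + w[1] * b

def mse(w, rows):
    # Each row is (feature a, feature b, target).
    return sum((predict(w, a, b) - y) ** 2 for a, b, y in rows) / len(rows)

xs = [i / 4 for i in range(-4, 5)]
standard_test = [(x, x, x) for x in xs]  # the two features agree, as in training
stress_tests = {
    "second feature zeroed": [(x, 0.0, x) for x in xs],
    "second feature sign-flipped": [(x, -x, x) for x in xs],
}

def spread(scores):
    # Gap between the best and worst model on one test.
    return max(scores) - min(scores)

report = {"standard": [mse(w, standard_test) for w in models.values()]}
for name, rows in stress_tests.items():
    report[name] = [mse(w, rows) for w in models.values()]

for name, scores in report.items():
    print(f"{name}: spread across models = {spread(scores):.3f}")
```

On the standard test the three models are indistinguishable. Under the stress tests their errors pull far apart, which is the kind of gap such tests are meant to expose.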