The truth is that today’s neural networks are fantastic interpolators but terrible extrapolators. They are powerful pattern matchers with the ability to contort themselves to fit almost any dataset, but their fitting is blind to the mechanisms that generate the data in the first place.
Neural networks do not yet engage, as humans do, in a creative search for explanatory theories that account for why the data is as it is. They also certainly don’t then go forth, as humans ought to, and steadily strive to falsify every last one of those creative theories until a single one emerges triumphant as the best explanation for the data observed.
A human scientist (say, James Clerk Maxwell) makes predictions about things (like electromagnetism) by building, through an iterated process of conjecture and falsification (i.e., science), a deductive framework (Maxwell’s equations) that generalizes to future situations.
Deep neural networks, on the other hand, take a different approach to modeling reality. They stitch together thousands of linear functions, laboriously shifting each one slightly for each training example, into a kind of high-dimensional quilt — a manifold — that fits the training set. In doing so, they cannot help but make predictions by inductively pattern-matching onto what they’ve seen happen before. They mirror (rather than explain, as humans do) the chaos and complexity of the phenomena they observe.
Neural networks arrive at their predictions by induction, not deduction.
This helps explain why today’s deep neural networks require so much data to learn anything useful: They are inductive interpolators and, as such, they require a large number of points between which they can do their interpolating.
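A toy sketch makes the contrast concrete. This is not a neural network, just a stand-in for one: `np.interp` stitches linear pieces between seen data points, exactly the kind of inductive quilt described above. It tracks a nonlinear function almost perfectly between its training points, then fails badly the moment it is asked about a region it has never seen (the function names and ranges here are illustrative choices, not anything from the original text):

```python
import numpy as np

# A nonlinear "world" the model only ever sees through samples.
def world(x):
    return np.exp(x)

# Dense training samples on [0, 2]: the "seen" region.
xs = np.linspace(0.0, 2.0, 50)
ys = world(xs)

# A piecewise-linear model stitched from the training points,
# standing in for an inductive interpolator.
def model(x):
    return np.interp(x, xs, ys)

# Interpolation: a query between training points is nearly exact.
inside = 1.234
err_in = abs(model(inside) - world(inside)) / world(inside)

# Extrapolation: a query outside the seen range fails badly —
# np.interp simply extends the last seen value flatly.
outside = 5.0
err_out = abs(model(outside) - world(outside)) / world(outside)

print(f"relative error inside the seen range:  {err_in:.4f}")
print(f"relative error outside the seen range: {err_out:.4f}")
```

Inside the sampled interval the relative error is a fraction of a percent; outside it, the prediction is off by nearly the entire true value. More training points shrink the first number but do nothing for the second — which is the essay’s point about the long tail.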
But this doesn’t yet explain why the long tail, in particular, is so problematic. Why is it that our progress toward solving hard A.I. problems slows down as soon as we enter the realm of edge cases? The answer has to do with the reach of explanations.
One of the most remarkable things about human science, to quote David Deutsch, is “the contrast between the enormous reach and power of our best theories and the precarious, local means by which we create them.” He says, “No human has ever been at the surface of a star, let alone visited the core where the transmutation happens and the energy is produced. Yet we see those cold dots in our sky and know that we are looking at the white-hot surfaces of distant nuclear furnaces.” Human thinking somehow has the power to tell apart what is fundamental from what is merely incidental and thus can generalize to the long tail without us having to experience it directly.
To be fair, the inductive model at the heart of a deep neural network also has some power of generalization, but there’s a catch. The key inductivist assumption is that the future will resemble the past and the unseen will resemble the seen. And true, sometimes it does. More often, however, the reality is otherwise. The world is radically nonlinear. One does not simply model it with a quilt of stitched-together linear functions, for, as it turns out, the future and the unseen are often unrecognizably different from the past and the seen. As a case in point, think of the inner workings at the center of a distant star whose light began its long journey towards Earth thousands of years ago. There is nothing that even remotely resembles the environment of that star here on Earth.
Without an explanation for why a pattern that so reliably holds in the common case should continue to hold in cases that are less common (the edge cases), a deep neural network’s strides become just as blind as they are confident when they venture into the long tail. The darkness of that realm, for a neural network, can be illuminated only by direct experience with it — real training examples drawn from the long tail itself that can help mold the network’s linear predilections into better fitting nonlinear ones. But, of course, those training examples are, by definition, outliers and the hardest to come across. Thus, after all of the low-hanging fruit is picked and the long tail is all that is left, the marginal cost of collecting new useful data points begins to increase.
Think of what it takes to train a large neural network today. As our applications for A.I. become more ambitious and our networks grow deeper and wider, the whole enterprise becomes primarily about tracking down — in the most unlikely of places — data that offers differentiated signal. This is why it’s a problem of complex coordination. Combing through enough of the long tail inescapably calls for the marshaling of an army that scours the world for the right bits of rarefied, useful data.
Our neural networks today are architected and trained via the top-down, hierarchical efforts of a group of people who invariably work for the same company. At X, for instance, my team and I were on the hook for everything from scoping the ambitions of the project and specifying the architecture of our networks to tuning their parameters, building our robots from the ground up, and babysitting them as they collected petabytes of data.
And let’s not forget, our hivemind was merely a proof of concept. It took four robots training simultaneously on different doors for hours on end — learning from trial and error — to achieve a 95% success rate on all variations of just those four doors.
Imagine what it would have taken to get it to truly work across all doors under all conditions, let alone venture beyond the realm of just opening doors. For all of its scale, even Google has struggled to mobilize enough resources to cover the long tail for things like self-driving cars.
The robot hivemind we built at X was cool — or so thought the eight-year-old boy inside me. The great irony, however, is that a hivemind is supposed to emerge bottom up as a unified intelligence that integrates the countless impulses of each of its far-flung agents. The “hivemind” that I worked on wasn’t like that at all. Every aspect of it — every line of code written, every resource deployed — was controlled by us. It was a centralized system pretending to be a decentralized one — a somewhat “communist” simulacrum of a hivemind.
But what if such intelligence could actually emerge bottom up? What if it could be born, not from the efforts of just one company, but from the aggregate knowledge of countless people working independently from far afield to contribute diverse signal to the collective?