The perpetual perceptual problem

Justin Dulay

December 2nd, 2021

Introduction

In a conversation the other day, I arrived at the following idea: perception is an exceedingly complex process that has been 'pre-trained' over millions of years of evolution. We can brute force our way through many problems, but we already find that the limits of computing don't approach what biological systems achieve.

Machine learning has seen unprecedented gains on knowledge-based tasks in recent years but still fails to generalize on perceptual tasks in many cases.
Human and animal perceptual intelligence has evolved and 'pre-trained' for millions of years - a big head start on our artificial systems.
Therefore, we cannot build a generally intelligent perceptual system without some special means.

Deep Learning Surpasses Human Intelligence in Some Tasks

Despite the hyped superiority, and indeed the real promise, of large-scale deep learning systems, fundamental obstacles to generalized artificial intelligence remain. Consider a task that deep learning excels at: chess. In 2017, DeepMind introduced AlphaZero, a chess engine that went on to beat Stockfish, the standard chess engine of the time, with a record of +155 -6. Since even the greatest human grandmasters can no longer compete with conventional engines, AlphaZero would be expected to win against them nearly every time.

Now consider a far more intuitive task. After a chess match in which you begrudgingly accept defeat, you walk outside for a breath of fresh air. You notice some birds flying from tree to tree, and you observe cars moving along the busy street beyond the entryway. You perceive accurate color representations of every object in your field of view, recognize the danger posed by moving vehicles, and understand that their movements are controlled by human drivers. Your supportive friend steps outside with you and asks how you're doing. You opt for the socially appropriate "I'm good," despite the emotion surrounding your loss, because you perceive the need to adhere to social norms so as not to disrupt the delicate relationship between your relative truth and the objective reality surrounding your daily existence.

That's a lot to unpack for around five seconds of interaction. Recognizing that cars have drivers, perceiving the constant color of a bird as it flies between sun and shade, and choosing a socially appropriate response to your friend despite your impulse: together these illustrate how complex it is to process multimodal perceptual inputs from your senses. Doing so means encoding those signals into neurological information, understanding the nuances of a latent set of information, and acting on this knowledge. This process occurs thousands of times per day, every day, for your entire life.

Perception Poses a Different Type of Challenge

In the late 1980s, Hans Moravec, a robotics professor at Carnegie Mellon University, observed that reasoning is a comparatively simple feat for artificial intelligence, while perceptual tasks remain remarkably hard. Moravec's Paradox proposes that A.I. as it stood then (and this component of the argument still holds today) fails to take in a myriad of complex inputs and turn them into contextually appropriate outputs. Deep learning interpolates well between data points in low dimensions; but given the high dimensionality of complex perceptual inputs, deep learning systems must extrapolate, which often fails given their limited breadth and depth of understanding of nuances that remain indescribable to even the most conscientious human observer.
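The claim that high-dimensional inputs force a model to extrapolate can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: points are drawn uniformly from a unit cube, and "interpolation" is approximated by asking whether a new point falls inside the per-coordinate range (the bounding box) of the training points, which is a more generous test than lying between them in any stricter sense. The function name `fraction_inside_box` and its parameters are made up for this illustration.

```python
import random


def fraction_inside_box(dim, n_train=100, n_test=2000, seed=0):
    """Estimate how often a fresh point lies inside the bounding box
    spanned by n_train random training points in `dim` dimensions."""
    rng = random.Random(seed)

    # Training set: points drawn uniformly from the unit cube [0, 1]^dim.
    train = [[rng.random() for _ in range(dim)] for _ in range(n_train)]

    # Per-coordinate minimum and maximum seen during "training".
    lows = [min(point[k] for point in train) for k in range(dim)]
    highs = [max(point[k] for point in train) for k in range(dim)]

    inside = 0
    for _ in range(n_test):
        query = [rng.random() for _ in range(dim)]
        # The query counts as "interpolation" only if every coordinate
        # stays within the range covered by the training data.
        if all(lo <= x <= hi for lo, x, hi in zip(lows, query, highs)):
            inside += 1
    return inside / n_test


if __name__ == "__main__":
    for dim in (1, 10, 100, 1000):
        print(dim, fraction_inside_box(dim))
```

With the training set size held fixed, the fraction of new points that land inside the covered region shrinks toward zero as the dimension grows, so almost every new input requires extrapolation along at least one coordinate.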

This brings us to a key observation that the paradox touches on, which has two facets: (1) humans and cognitively mature animals have spent millions of years evolving and modifying their perceptual capacities, and (2) humans spend their entire lives developing and enhancing their perceptual abilities.

Humans are perceptual agents with many minute abilities to perceive the world. We see such abilities emerge again and again among animals throughout evolution. Simple animals more than 500 million years ago developed eyes embedded in their exoskeletons. Today, we see roughly 800 hexagonal facets on each eye of a fruit fly, and as many as 30,000 on each eye of a dragonfly. While these animals, past or present, cannot perceive rich color information the way humans or other primates can, they can still perform simple object and edge detection tasks. It remains tempting to compare these primitive eyes to today's convolutional neural networks. After all, both can perform basic edge detection and recognition tasks. Furthermore, neural networks, when trained on the right data, can outperform humans on highly specific tasks, such as quickly counting all of the faces in a frame. However, an artificial agent whose sensors take in the same visual inputs that you or I or the fruit fly would observe would fail to encode the generalizable information present in past memory. Even reinforcement learning (admittedly a topic I need to explore and write about more), which rewards the learning agent for successfully completing tasks on partially unknown data, still fails to generalize to drastically different tasks.

In contrast, the attuned eyes of Homo sapiens have been evolving and fine-tuning over epochs (pun intended?) far longer than any artificial system has. While, of course, we can supercharge the training process of an artificial learning agent, we certainly don't have the compute power to create an agent that perceives at the granularity of a biological system and retrieves relevant information as needed.
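To make the edge detection comparison concrete, here is a minimal sketch of the kind of operation a convolutional layer performs. It assumes a hand-picked Sobel kernel rather than learned weights, and a toy image that is dark on the left and bright on the right; the helper name `correlate2d` is made up for this illustration. Like most deep learning libraries, it slides the kernel without flipping it (strictly speaking, cross-correlation).

```python
# Horizontal-gradient Sobel kernel: responds to dark-to-bright vertical edges.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return
    the weighted sum at every position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            total = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(k_rows)
                for b in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output


# Toy 5x6 image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

edges = correlate2d(image, SOBEL_X)
for row in edges:
    print(row)  # each row prints [0, 4, 4, 0]
```

The response is zero over the flat regions and peaks where the window straddles the boundary. A trained network learns many such filters from data instead of having them written by hand, but each one is still this narrow: it detects a local pattern and nothing more.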

A Side Note

Quantum computing could offer a potential route to the compute power needed to create an artificial agent with the perceptual skills of a biological agent. We have seen that when we scale up models (think Foundation Models), we can generalize to linguistic representations by brute force. However, even now, large language models cost millions of dollars to train. Unless we can train models of perception or RL more efficiently, we will run into computational infeasibility problems. Training such models is also harmful to the environment in the long run. Still, even if we cannot devise an optimal way to build such a model on quantum hardware, perhaps we can 'brute force' our way through by building a quantum system big enough for the task; well, let's see.

However, we shouldn't treat our modern research as moot if this is the case. Again, artificial intelligence is wonderful at complicated processes. As a collective, we benefit whenever a neural network can recognize cancerous cells in patients more effectively than a trained eye, identify fraudulent financial transactions, or translate between languages. We shouldn't worry if we cannot achieve a masterful intelligence yet - we can do wonderful things with what we have.