Perception or Distortion?

Why should we care about this?

Don't be fooled by the dense-sounding names. This is one of the most interesting and counter-intuitive principles in computer vision. One way to think of it is as a tradeoff between memory and imagination.

What if I told you that the better your ML model gets at getting each pixel value right, the worse it gets at getting the whole image right? That makes no intuitive sense, but somehow it is true, so let's see how it's possible.

What is Distortion?

Say we have a dataset of captioned images and we want to build a model that takes the caption as input and predicts what the image looks like. Then distortion is simply the dissimilarity between the true_image and the predicted_image.

Dissimilarity can be measured in many ways. The simplest measures are pixel-by-pixel differences, for example the mean squared error (MSE): the average of the squared differences between corresponding pixels in true_image and predicted_image.
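As a minimal sketch of the MSE described above, assuming grayscale images stored as nested lists of pixel values in the range 0 to 1 (the function name `mse` and the tiny 2x2 images are made up for illustration):

```python
def mse(true_image, predicted_image):
    """Mean squared error between two equally sized grayscale images.

    Each image is a list of rows, and each row is a list of pixel values.
    """
    if len(true_image) != len(predicted_image):
        raise ValueError("images must have the same number of rows")
    total = 0.0
    count = 0
    for true_row, predicted_row in zip(true_image, predicted_image):
        if len(true_row) != len(predicted_row):
            raise ValueError("rows must have the same number of pixels")
        for t, p in zip(true_row, predicted_row):
            total += (t - p) ** 2  # squared difference of one pixel pair
            count += 1
    return total / count


# Hypothetical 2x2 images, purely for illustration.
true_image = [[0.0, 1.0], [1.0, 0.0]]
predicted_image = [[0.0, 0.5], [1.0, 0.5]]

print(mse(true_image, predicted_image))  # 0.125
print(mse(true_image, true_image))       # 0.0
```

A perfect prediction has zero distortion, and the value grows as individual pixels drift away from the reference.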

What is Perception?

The name says it all: perception is the quality of an image as perceived by humans, i.e. the naturalness of the image.

The only "correct" way to measure perceptual quality is by human evaluation, but there are many mathematical proxy metrics that are highly correlated with human perception.

For example, a model trained to differentiate between natural images and computer-generated images can be used as a proxy for human evaluation. Then the probability of an image being natural as measured by this model could be used as a perceptual metric.

Improving perception or distortion?

Naively, you might think a model that matches each pixel as closely as possible to the real images would produce the most realistic images, i.e. that models trained to minimize distortion loss should have high perceptual quality.

But it turns out this is only true up to a point. After that, images with lower distortion look unrealistic and images with higher distortion look more realistic. In 2017, the paper "The Perception-Distortion Tradeoff" summarized the progress and made a surprising claim: it is impossible to optimize both perception and distortion at the same time, regardless of how you choose to measure perception and distortion.
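For readers who like symbols, here is a rough sketch of how the paper states the tradeoff. The notation is an assumption for this post: X is the true image, Y is the model's input, X-hat is the model's output, Delta is any distortion measure (MSE, for instance), and d is any divergence between the distribution of real images and the distribution of generated images, with d = 0 meaning the outputs are statistically indistinguishable from real images.

```latex
P(D) \;=\; \min_{p_{\hat{X} \mid Y}} \; d\!\left(p_X,\, p_{\hat{X}}\right)
\quad \text{subject to} \quad
\mathbb{E}\!\left[\Delta\!\left(X, \hat{X}\right)\right] \le D
```

Read it as: P(D) is the best perceptual quality any model can reach if its average distortion must stay below D. The paper's claim is that P(D) never increases as D grows, so tightening the distortion budget can only hold perceptual quality steady or make it worse.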

Intuition : How is this possible?

Framed in these mathematical ways, the perception-distortion tradeoff seems counter-intuitive. However, once you train adversarial generative models such as GANs, you can see this effect in practice. The intuition becomes clearer once you consider two factors.

Factor 1 : Distortion measures the accuracy of an image with respect to a reference image, while measuring perceptual quality needs no reference image. This means that it is possible to have perfect perceptual quality with terrible distortion, e.g. the model could produce very realistic and natural images that just don't look anything like the reference image.

Factor 2 : For generative models, the input has less information than the output. Therefore the model has to learn to "imagine" new information. This means for any given input there can be many outputs with high perceptual quality but only 1 (matching the reference image) that minimizes the distortion.

From the above two factors it should be clear that models that maximize perception do not automatically minimize distortion. But can we explain why models that do minimize distortion can't have high perceptual quality?

The proof described in the paper is simple but technical. With some liberties of nuance, it can be paraphrased thus -

Let us assume an ML model is able to generate the exact reference images for the exact inputs in the eval data, i.e. it has low distortion loss. Then it is always possible to find a new input that lies in between two eval inputs such that the output image is the average of the two eval output images. But the average of two natural images is not a natural image, and hence we get blurry, unrealistic images. Thus models with low distortion loss cannot have high perceptual quality.

(Note : MSE optimized models yield blurry averages of reference images. Training using other distortion losses results in images that are unrealistic in other ways, not necessarily blurry averages, but the underlying principle of finding intermediate inputs is the same.)
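The blurry-average effect can be seen in a toy example. This is only a sketch under made-up assumptions: each "image" is a row of four pixels, one input has two equally plausible natural outputs (`image_a` and `image_b`), and "natural" simply means "is one of those two sharp images". None of this comes from the paper.

```python
def mse(a, b):
    """Mean squared error between two flat lists of pixel values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Two equally plausible sharp images for the same input (hypothetical data).
image_a = [0.0, 0.0, 1.0, 1.0]  # dark on the left, bright on the right
image_b = [1.0, 1.0, 0.0, 0.0]  # bright on the left, dark on the right
references = [image_a, image_b]


def expected_mse(prediction):
    """Average distortion when either reference is equally likely to be the truth."""
    return sum(mse(prediction, ref) for ref in references) / len(references)


# The pixel-wise average of the two sharp images is a flat gray row.
average = [(a + b) / 2 for a, b in zip(image_a, image_b)]

print(average)                # [0.5, 0.5, 0.5, 0.5]
print(expected_mse(image_a))  # 0.5
print(expected_mse(average))  # 0.25
print(average in references)  # False: the gray row is not a natural image here
```

Committing to one sharp image costs more distortion on average (0.5) than outputting the gray blur (0.25), so a model trained purely on MSE is pushed toward the blur even though it looks like neither plausible image.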

Corollary : Memory vs Imagination

This principle is so fundamental to how vision models are trained that it enables some far reaching conjectures about human brains.

Question 1 : If low distortion == good memory and high perception == vivid imagination, does this mean memory and imagination are opposing forces?

Conjecture 1 :

For ML models, yes. For humans, maybe not.

Human brains very likely operate in the regime of very high perceptual quality and imagination. Most people can't remember things anywhere close to pixel level accuracy, so even the people with so-called photographic memory have very high distortion models for brains.

Question 2 : Human perception is perfect, so our memory should be 0. But it isn't. Does that mean the human brain is not an ML model and is not bound by this principle?

Conjecture 2 :

Let's do a thought experiment. Say we had a very, very large model and lots and lots of data, and we trained our model to maximize perception while keeping distortion below a large upper bound. The key thing to note is that the high distortion means the model cannot have an exact memory of the data at the pixel level, but it can have a very large memory that is close enough to capture all the important information while differing in minor ways from the reference.

The human brain is much larger and more powerful than current ML models, and carries the genetic memory of a billion years of evolution. "Very large but inexact, yet close enough" describes human memory perfectly.

Now this implies there is a limit to the perceptual quality of the human brain. In theory, there should exist new sensory inputs to our brain that produce blurry or otherwise unrealistic outputs. However, in practice, since we are trained on almost infinite data and real-world sensory inputs are constrained by the laws of physics, we might never encounter these truly 'new' data points.

So it is still possible that the brain is bound by the same principle: our perception has a limit, but we rarely ever experience it.

But that makes you wonder: what if drugs like LSD simply distort our sensory inputs into an unexpected range outside our memory, making our generative brains hallucinate and see unnatural and unrealistic visions?

————————————————————————————-

There you have it: my intuitions for why computer vision models trade off memory against imagination, and how this principle might apply to human cognition.