Intelligence is Compression

But is compression intelligence?

Last week a paper was released on arXiv that caused a big stir in the AI community. Researchers from the University of Waterloo published a paper titled “‘Low-Resource’ Text Classification: A Parameter-Free Classification Method with Compressors”, claiming that compressing sentences with plain gzip compression (available on every laptop) and then running a k-nearest-neighbor algorithm (a classic ML algorithm, often the first one taught to students) on these compressed sentences beats or matches top deep learning language models like BERT on many benchmarks.
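For the curious, here is a minimal sketch of the core recipe described in the paper: the distance between two texts is the normalized compression distance (NCD) computed from gzip-compressed lengths, and a query is labeled by a majority vote among its nearest training texts. The helper names and the toy dataset below are mine, not the paper’s code, so treat this as an illustration rather than a faithful reproduction.

```python
import gzip

def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when a and b share structure."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Label the query by majority vote among its k nearest training texts (by NCD)."""
    neighbors = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

# Toy, made-up training set just to show the mechanics.
train = [
    ("the team won the championship game last night", "sports"),
    ("the striker scored twice in the second half", "sports"),
    ("the goalkeeper saved a penalty in extra time", "sports"),
    ("the central bank raised interest rates again", "finance"),
    ("stock markets fell after the earnings report", "finance"),
    ("investors worry about inflation and bond yields", "finance"),
]

print(classify("the midfielder was injured during the match", train))  # -> "sports" (hopefully)
```

Note that no training in the neural-network sense happens here; all the “modeling” is done by gzip finding shared substrings between the query and each training example.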

This might come as a shock to people outside the field, but I noticed that most ML experts online weren’t so much shocked as amused and curious to know more. Why doesn’t this flip everything we have been doing in machine learning? The reason is a deep intuition connecting intelligence and compression.

Intuition: Intelligence is Compression

Intelligence is the capacity to acquire and apply knowledge. It is how we understand the world. In theory, there are other ways to understand the world, for example, by derivation or simulation. However, these methods are not as powerful or useful as intelligence.

  1. Derivation - Let’s say we have a small set of first principles and we can predict any phenomenon by deriving it from those principles. All the myriad complexity, everything that has happened, can happen and will happen, could be known with perfect precision using the small number of initial conditions and laws that govern the evolution of the universe. Well, right away, we see that this method only works for systems we have created ourselves; in the real world, we don’t know what the initial conditions or first principles are.

  2. Simulation - Alternatively, we could create a model of the phenomenon we are trying to understand, simulate different scenarios, and thus be able to predict how that phenomenon will behave. This works, but it doesn’t scale. As we increase the precision of the required predictions, we have to simulate the phenomenon in greater and greater detail, until at some point the only way to understand a thing is to build an exact, controllable copy of the thing. This works for pendulums but does not scale to understanding tornadoes or galaxies.

Intelligence is different. Intelligence is the capacity to turn information into knowledge. Information can be empirical data, derived or postulated principles, or a combination of the two. Knowledge does not have a good non-circular definition, but for our purposes here we can say that if we can predict what an object will do in the future, then we “know” it. Thus intelligence takes information, extracts “knowledge”, and then applies this “knowledge” to predict future information.

By this definition, knowledge is neither the underlying principle nor a full-scale replica of the phenomenon, though it can be either of those things in a specific instance. More generally, knowledge must be what is left after discarding the information that is not useful for predicting the future and retaining the information and connections that are needed to “know” that future. Thus what intelligence does is filtering, or in other words, compression.

Counterintuition: Compression is Intelligence

This is the really controversial claim. That intelligence is compression seems obvious to most experts, but is all compression some form of intelligence? Most people’s intuition says there should be more to intelligence than that.

Compression (of information) is the process of encoding information with fewer bits than the original. While intelligence seems like a mysterious thing with many moving parts and many unknown unknowns, we know a lot more about compression. And extrapolating ideas from compression back to intelligence leads to some trippy and unproven ideas -

  1. We know, for example, that compression removes statistical or deterministic redundancy in information to create smaller packets of information. Thus finding patterns in information is compression. The human brain is very good at finding some types of patterns, such as repeated cycles, and very bad at spotting other types of patterns such as prime numbers. Is it possible to have alien or artificial intelligence that is good at spotting those patterns? Would they be superintelligent or are there some tradeoffs between types of patterns?

  2. We know that there is an upper bound to compression: for a given piece of information and a given encoding system, we can exactly calculate the smallest number of bits it can be compressed down to (see the small illustration after this list). Reversing the logic back to intelligence, does that mean we can calculate exactly how big a brain we need to understand an object (say, the universe) even though we don’t know how to build that brain?

  3. We also know that we can surpass the Shannon limit, as the above bound is called, if we allow lossy compression, where the original information is not perfectly retained. What information is retained and what is lost is then a function of the priorities of the compression algorithm and the characteristics of the input information. First of all, is that what having hazy, fallible memories is like? Secondly, is there a limit to such compression for our world? If there is, does that mean intelligence is bounded, and if there isn’t, does it mean intelligence can be infinite?
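As promised, here is a small, hand-rolled illustration of the first two points above. It computes the per-character Shannon entropy of a text, turns it into an approximate size bound for any code that treats characters independently, and compares that with what gzip actually achieves. The example strings and helper names are made up for the sake of the demo.

```python
import gzip
import math
import random
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def report(name: str, text: str) -> None:
    raw = len(text.encode("utf-8"))
    packed = len(gzip.compress(text.encode("utf-8")))
    # Size bound (in bytes) for a code that treats every character independently.
    bound = char_entropy(text) * len(text) / 8
    print(f"{name}: {raw} bytes raw, {packed} bytes gzipped, ~{bound:.0f} bytes entropy bound")

random.seed(0)
report("patterned", "the cat sat on the mat. " * 200)
report("random", "".join(random.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(4800)))
```

The patterned text compresses far below its raw size, and even below the simple per-character bound, because gzip also exploits repeated substrings, i.e., higher-order patterns the memoryless model ignores; the near-random text cannot be squeezed meaningfully below its entropy no matter which lossless compressor we use.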

Update

As it turned out, after a week and further analysis, the results in the paper were found to be inflated by a bug. Once the bug was fixed, gzip+knn still performed like a pretty good language model, but instead of beating the large neural nets, it now trails them. Check out this stellar piece of investigation into the paper by Ken Schutte for more info.

Conclusion

gzip+knn works by first making a lossless compression of the input text and then creating an embedding space where similar inputs are closer in this space than dissimilar inputs. This is almost exactly what neural networks do, except for two key differences -

  • gzip+knn has two predetermined, independent processes (compression and embedding), whereas in neural nets both are learned from scratch in one training process. The combined learning process should find optimizations for the dataset that predetermined processes cannot.

  • neural networks don’t memorize their inputs perfectly and are therefore a lossy compression. As a result their reproductions are not exact, but this also allows their predictions to be more imaginative and the amount of information they can store to be larger.

Given the similarities, we should expect that gzip+knn would act like a language model. Given the differences, we should expect that even if it beats weaker models, deep learning models should eventually surpass it. We should also expect that there might be certain tasks that gzip+knn solves perfectly, but neural nets should generalize to more tasks. When confronted with the news that gzip+knn beats a top model on some benchmark, we should update our beliefs about how good the top model is and/or how good a test that particular benchmark is. But unless gzip+knn beats the top models on all the benchmarks, I don’t think it would shatter our understanding of machine learning. That is why, while there was breathless coverage by AI influencers on some social media, most serious ML researchers expressed interest and curiosity rather than reflexively dismissing the results.

Ted Chiang (the excellent sci-fi novelist) wrote a New Yorker essay some time ago titled “ChatGPT Is a Blurry JPEG of the Web”. The essay is quite nuanced and definitely worth a read. Its main point is that today’s LLMs are not intelligent and don’t capture the beauty, subtlety or even the veracity of the language they model. While I agree with that statement, “blurry JPEG of the web” has entered the lexicon as a general critique of all AI, current or future, with some people using “lossy compression” to mean the opposite of intelligent.

A view that might turn out to be quite ironic.

———————————————————————————————————
There you have it: my intuitions on how a surprising recent paper highlights the deep relationship between intelligence and compression. For more such intuitions on AI/ML, subscribe and follow me on Twitter. You can also check out my other projects on nirsd.com.