Infusing Imagination Into Intelligence

Where text-to-image models get their creativity

The news cycle has been dominated by large language models over the last month. The breakneck speed of LLM research, the battle between competing tech giants, and the debates about their risks and safety can be exhausting. If you feel the same, then I have some fun respite for you.

One of my early posts on diffusion models was “What makes StableDiffusion and Dalle-2 so good?”. In it, I focused on why diffusion models produce much higher quality images than other generative models such as GANs and VAEs. Last month, I wrote “Is my AI art original or copied?”, covering a practical experiment developed by Berkeley researchers to test whether individual images produced by a diffusion model were original or copied from the training data. While writing about some newer developments, I realized there was a much more fundamental intuition yet unexplored, namely - “Do text-to-image models have imagination? Why/why not?”

Short answer: yes.

Long answer: AI does not have agency (not yet at least). So when I say these models have imagination, what I mean is that if you (the human) have an idea in your head, you can make the model imagine it for you. Let's unpack this claim.

Background

Fundamentally, all generative models, be they VAEs, GANs, or diffusion models, follow the same three underlying steps (sketched in code right after this list) -

  1. create a training dataset of images. Our objective is to generate new images that would fit right in with this set of images.

  2. set up a problem for the model to solve. The problem is designed such that the images in the training dataset are a valid solution to the problem.

  3. train the model to generate images that solve the problem.
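
To make these steps concrete, here is a rough PyTorch-style training loop. The model, dataset and `loss_fn` are placeholders I made up for illustration, not any particular library's API; the actual loss is what differs between the three families.

```python
import torch
from torch.utils.data import DataLoader

def train_generative_model(model, dataset, loss_fn, epochs=10):
    # Step 1: the training dataset defines the distribution we want to imitate.
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for _ in range(epochs):
        for images in loader:
            # Step 2: the loss poses a problem for which the real images are a
            # valid solution (reconstruction for VAEs, fooling a discriminator
            # for GANs, denoising for diffusion).
            loss = loss_fn(model, images)
            # Step 3: train the model to solve that problem.
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```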

Unconditional generation

An unconditional generative model generates images from the distribution it was trained on. For example, if you trained your model on images of dogs, the model will generate images of dogs. The model just spits out a random image each time, with no way to control it.
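
As a rough sketch, unconditional sampling from a diffusion-style model looks something like the loop below. The `denoise_step` method is a hypothetical stand-in for one reverse-diffusion step in a real implementation; the point is that nothing in the loop takes our wishes into account.

```python
import torch

@torch.no_grad()
def sample_unconditional(model, steps=50, shape=(1, 3, 64, 64)):
    # Start from pure noise and let the model iteratively denoise it.
    # Which dog we get depends only on the random seed and the
    # distribution the model was trained on.
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        x = model.denoise_step(x, t)  # hypothetical single denoising step
    return x
```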

Conditional generation

Now, probability distributions can be sliced up into conditional distributions. For example, the distribution of animals can be split into dogs, dragons, and other animals. Text-to-image models use this property. During training, a text label is provided along with each image: images of dragons are paired with captions mentioning dragons, and images of bulldogs with captions mentioning bulldogs. The model now learns multiple conditional distributions: it associates the text "dragon" with the distribution of dragon images and the text "bulldog" with bulldog images. So far so good.
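
The conditional version of the sampling loop is almost identical, except the label now rides along into every step. Again, `denoise_step` and its `condition` argument are hypothetical stand-ins, just to show where the label enters.

```python
import torch

@torch.no_grad()
def sample_conditional(model, label, steps=50, shape=(1, 3, 64, 64)):
    # Same loop as before, but the label ("dragon", "bulldog", ...) is fed
    # into every denoising step, steering the sample towards that slice of
    # the training distribution.
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        x = model.denoise_step(x, t, condition=label)  # hypothetical API
    return x
```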

It turns out that neither the unconditional model nor the simple conditional model is all that imaginative. Both only produce images very much like their training dataset.

Intuition - Constraints can set you free

The real magic lies not in learning the raw text labels directly, but in first passing each text label through a large language model and conditioning on the resulting embedding.

LLMs know things.

Since the LLM has been trained on large datasets which contain Wikipedia and other knowledge bases, the LLM knows what dragons and bulldogs are. What I mean by "know" is that instead of treating words as independent, the LLM associates dragons with hundreds of other words such as scales and wings, and bulldogs with sour faces and skin folds. When you give a word as input to the LLM, it produces an embedding vector that encodes the relationships of this word to other words.

By using this embedding instead of the text label, we replace the single word label "dragon" with a representation that essentially contains everything there is to know about dragons. It is like training each image with hundreds or thousands of descriptive conditional labels. Now, instead of learning two independent distributions, one for dragons and one for bulldogs, the model learns one really complex distribution over all images. This distribution internally represents what the features of dragons and bulldogs are, which features they share with each other, and what makes each unique. As a result, our learnt distribution is made up of millions of overlapping conditional distributions corresponding to all kinds of features and attributes such as short, big, furry, scaly, etc.
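
You can poke at this yourself with any off-the-shelf text encoder. The sketch below assumes the sentence-transformers library and the "all-MiniLM-L6-v2" checkpoint (my choice for illustration, not anything specific to diffusion models), and simply checks that "dragon" lands closer to "scales" and "wings" than to "bulldog" in embedding space.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any text encoder works; this checkpoint is just an assumption for the demo.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = {w: encoder.encode(w) for w in ["dragon", "scales", "wings", "bulldog"]}

# Related words sit close together because the encoder absorbed those
# relationships from its training text.
print(cosine(emb["dragon"], emb["scales"]))   # relatively high
print(cosine(emb["dragon"], emb["bulldog"]))  # lower
```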

Imagination unlocked

When there is no way to direct the model, it will default to producing the well-populated regions of the distribution, i.e. the ones it saw during training. But with text conditioning, we can easily push the model into novel regions of the distribution. For example, you could prompt the model for “a mythical beast with the face of a bulldog and the body of a dragon”, something the model has never seen before. Since the language model understands the text prompt (i.e. its meaning as defined by its relationships to other words and sentences), it can turn these ideas into embeddings that make sense to the diffusion model. As a result, the diffusion model can imagine images that never existed in the training dataset.
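
This is exactly what happens when you run such a prompt through an off-the-shelf pipeline. The snippet below uses Hugging Face diffusers with the "runwayml/stable-diffusion-v1-5" checkpoint as one example of such a model; the checkpoint and the GPU are my assumptions, not something from this post.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a mythical beast with the face of a bulldog and the body of a dragon"
# The text encoder turns the prompt into embeddings the diffusion model
# understands, pushing the sampler into a region it never saw in training.
image = pipe(prompt).images[0]
image.save("bulldog_dragon.png")
```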

In summary, for any individual prompt, the text condition acts as a very strong constraint, forcing the model to produce a specific type of output. However, it is this interaction between the language model and the image model that unlocks access to completely new areas of visual imagination.

Does this qualify as imagination?

I think it does. This is my subjective opinion, but it seems very similar to how human inspiration and imagination work. When you read a book, the text describes the scene. If you have an active imagination, you can picture the scene in your mind. Your imagination preserves the information in the text and fills in the remaining details. These extra details that were not explicitly present in the text are usually inspired by previous life experiences or movies you have watched before, which serve as your mental training data. Sounds like text-to-image generation to me.

Think about it. Have we solved visual imagination?

————————————————————————————————————
There you have it, my intuitions for where text-to-image generative models get their incredible imagination. For more intuitions on AI/ML, subscribe below and follow me on Twitter. You can also check out my other blog and projects on nirsd.com.