Why do DallE/SD produce beautiful images but garbled text?

Teaching image models to talk.

In 2022, shortly after DallE-2 was released, people noticed that it was terrible at spelling. It is strange that a model which can produce breathtakingly beautiful photographs cannot spell a single word. This is not just a problem with DallE-2; StableDiffusion and other generative models have the same issue. Here is StableDiffusion trying really hard to spell "vegetables" on a sign.

Gibberish has meaning!

As it turned out, DallE was not just throwing out random text. Alex Dimakis (an excellent MLfluencer to follow) and his student, Giannis Daras, took the gibberish text that DallE produced and fed it back to the model as input prompts. The model interpreted the gibberish as completely legitimate text with consistent meaning. According to DallE, "Wa ch zod ahaakes rea" means shrimp and fish. So, did DallE just create a whole new language for itself?

DallE has its own gibberish language.

Deciphering the secret language

Giannis Daras and Alex Dimakis published this finding in a short arXiv note titled "Discovering the Hidden Vocabulary of DALLE-2". Using the same method of feeding DallE-2's output back in as an input prompt, they tried to guess the meaning of the gibberish. They found many interesting phrases, as this example shows.

So what is going on? This is a fun and actually still unsolved mystery. We do have a working theory though.

Intuition 1 - Spaces are continuous, concepts are not

Models store information as sequences of floating point numbers in what are called latent or embedding spaces. Every concept the model has learnt is represented as a region in this space of numbers. A continuous latent space means there is no clear categorical barrier between two objects. In most cases, however, concepts are discrete. If the model is trained well, the average embedding of 'birds' should be well separated from the average embedding of 'bees', since they are two distinct concepts. But in a high dimensional space, we can usually find a region where the birds cluster is adjacent to the bees cluster. At the boundary between the two concepts you will have regions of embedding space that mash the two concepts together. Concepts need not be mutually exclusive either; for example, 'blue birds' must lie in a region that overlaps with the concepts of 'bird' and 'blue'.
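To make the boundary idea concrete, here is a minimal sketch that interpolates between the embeddings of two concepts and checks how close each intermediate point is to either end. It assumes the sentence-transformers package and its all-MiniLM-L6-v2 text encoder, which are stand-ins for whatever encoder DallE actually uses.

```python
# Sketch: walk along the line between two concept embeddings and watch the
# "in-between" region appear. Assumes the sentence-transformers package.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text encoder illustrates the idea
bird, bee = model.encode(["a photo of a bird", "a photo of a bee"])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for alpha in np.linspace(0.0, 1.0, 5):
    mix = (1 - alpha) * bird + alpha * bee
    print(f"alpha={alpha:.2f}  sim(bird)={cosine(mix, bird):.3f}  sim(bee)={cosine(mix, bee):.3f}")

# Around alpha ~ 0.5 the point sits roughly equidistant from both concepts --
# a region of the space that is neither clearly 'bird' nor clearly 'bee'.
```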

One hypothesis is that these generative models mash up words and phrases that represent the concept they want to output, i.e. the gibberish is actually a mashup of known words.

But this raises more questions: a) why doesn't this mashup happen in generative text models such as GPT-3? b) why aren't the mashed-up words at least partially understandable? and c) why doesn't the model produce mashed-up images, for example bird-bee hybrids? All these questions have the same answer: decoders.

Intuition 2 - Embeddings need decoders

Language models don't understand words per se; instead they map tokens to concepts. You can play with OpenAI's tokenizer, the one used to train GPT-3 and DallE, online, where colors show how words are broken up into tokens.

Small words can sometimes be a single token, but most big words get split into many tokens.

Each step of the transformer's input and output corresponds not to a word but to a token. That is the first clue: the model is likely not mashing up words but tokens, many of which could be unrecognizable fragments of words.
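If you want to see the token boundaries for yourself, here is a small sketch using the tiktoken package. It assumes the r50k_base BPE vocabulary from the GPT-3 era; DallE's own tokenizer differs in its details, but it splits text into sub-word fragments in the same spirit.

```python
# Sketch: inspect how words break into tokens. Assumes the tiktoken package
# and the r50k_base vocabulary (an assumption, not something from the paper).
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for text in ["cat", "vegetables", "Apoploe vesrreaitais"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")

# Short common words are often a single token; longer or made-up strings
# decompose into fragments that are not words on their own.
```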

Secondly, the concept the model is trying to spell out is not necessarily something we can recognize. Consider the example above, where "Apoploe vesrreaitais" seems to mean 'birds' to DallE. It turns out that is not accurate. The same prompt "Apoploe vesrreaitais" in different contexts can produce images of birds or of insects.

So clearly, "Apoploe vesrreaitais" does not mean 'birds'. It might mean something between bird and insect, or perhaps something that flies with wings or something more inscrutable. Therefore the mashed up words and tokens could mean a mashup of concepts that may not correspond to any discrete concept that we recognize.

These two facts together are the reason language models such as GPT-3 are built around an encoding and a decoding step. The encoder's job is to break the input language into tokens and map them onto concepts in the model's embedding space. The decoder's job is to take the embeddings and convert them back into human-understandable language. But DallE and StableDiffusion have no language decoder. Both are built from a language encoder combined with an image-generating diffusion model. The language encoder encodes the input text into embeddings, and the diffusion model acts as an image decoder, decoding the embeddings into beautiful images.

However, the image decoder is trained to produce clean, aesthetically pleasing images. That is why we get images of either a bird or a bee, which humans like, and not monstrous bird-bee hybrids. The image decoder is not trained to produce text in any language. The only reason we get anything that looks like text at all is that the training dataset for DallE must contain some images with text in them. Still, an image diffusion model makes a poor language decoder and spits out gibberish that only superficially looks like text.
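You can see this division of labour in open source models. The sketch below lists the components of a StableDiffusion pipeline, assuming the Hugging Face diffusers library and the runwayml/stable-diffusion-v1-5 checkpoint (both are assumptions of mine, not something from the post): there is a text encoder on the way in, an image decoder on the way out, and no text decoder anywhere.

```python
# Sketch: peek inside a StableDiffusion pipeline to see the pieces described above.
# Assumes the diffusers library and the runwayml/stable-diffusion-v1-5 checkpoint.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# A language encoder: CLIP's text model maps token ids to embeddings.
print(type(pipe.tokenizer).__name__)     # CLIPTokenizer
print(type(pipe.text_encoder).__name__)  # CLIPTextModel

# An image decoder: a diffusion U-Net plus a VAE turn those embeddings into pixels.
print(type(pipe.unet).__name__)          # UNet2DConditionModel
print(type(pipe.vae).__name__)           # AutoencoderKL

# Notice what is missing: no component maps embeddings back to text. The only
# decoder is an image decoder, so any "text" in the output is just whatever
# the image model learnt to paint from text that appeared in training images.
```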

Can we fix this?

Well, sort of.

So far we have no principled way to combine a dedicated language decoder and a dedicated image decoder. But who needs principles when you can just throw more data and compute at the problem!!

A team at Google trained Parti, another generative text-to-image model, and showed that by training bigger models with more data, the image decoder eventually becomes a decent language decoder. The prompt "A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!" produces the following images for four different model sizes.

As we see, Parti gets better at decoding text in images as the size of the model gets larger.

To be honest, I have my doubts that a language decoder trained through images can be as good as one trained on language. My guess is that Parti is probably better than DallE but still can't produce text as well as GPT-3. We don't know for sure because Parti is not open to the public. So rendering convincing text in images is likely still an open research problem.

————————————————————————————

There you have it: my intuitions for why DallE-2 and StableDiffusion produce text that looks like gibberish, which is in fact the internal encoded language that every model has to invent to name its own concepts and information. For more intuitions on AI/ML, subscribe below and follow me on Twitter.