How foundational are foundation models?

Ready, Set, Generalize!

When asked “what are foundation models”, ChatGPT says -

Foundation models refer to pre-trained deep learning models that can be fine-tuned or used as a starting point for creating new models for various natural language processing tasks, such as sentiment analysis, text classification, question-answering, etc. Examples of such models include BERT, GPT-2, and ELMO, which have been trained on vast amounts of data and can be utilized as building blocks for creating customized NLP models.

Wikipedia describes foundation models as -

a large artificial intelligence model trained on a vast quantity of unlabeled data at scale (usually by self-supervised learning) resulting in a model that can be adapted to a wide range of downstream tasks. Early examples of foundation models were large pre-trained language models including BERT and GPT-3. Subsequently, several multimodal foundation models have been produced including DALL-E, Flamingo, Florence and NOOR.

The term was recently popularized by The Stanford Institute for Human-Centered Artificial Intelligence's (HAI) Center for Research on Foundation Models (CRFM). The CRFM blog describes foundation models as -

...models trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks. These models demonstrate surprising emergent capabilities and substantially improve performance on a wide range of downstream tasks.

Stanford got a lot of criticism for trying to corporatize research, put some companies’ work on a pedestal, and control the direction of future research. I won’t say much else about that debate; it was much ado about nothing. Instead, let us zoom in on the intuition of what makes a foundation model.

Is my model foundation?

From the above definitions, there are many soft criteria but no hard ones for calling a model ‘foundation’. Let me add some numbers -

  • Large model - The definition of ‘large’ changes quite quickly.

    • BERT (2018) - 110 million parameters

    • GPT3 (2020) - 175 billion parameters

      Today it is safe to say you need at least a few billion parameters to be considered large.

  • Large dataset - You need at least a few TB of data to be considered large.

    • GPT3 was trained on ~500 billion tokens (text fragments). For context, all of English Wikipedia is only about 3 billion tokens.

    • DALL-E 2 was trained on 250 million images. Reportedly, OpenAI discarded millions of noisy or poorly labeled images.

  • Self-supervised objective - Supervised models only learn the features needed to predict their labels. To maximize generalization, foundation models are trained with label-free objectives such as next-token/next-sentence prediction or text-to-image generation (a sketch of the next-token objective follows this list).
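To make that last point concrete, below is a minimal sketch of a next-token prediction objective in PyTorch. The names `model` and `next_token_loss` are placeholders I am assuming here (any network that maps token ids to per-position vocabulary logits would do); the point is that the labels are just the input text shifted by one position, so no human annotation is needed.

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) tensor of integer token ids from raw text.
    # Inputs are every token except the last; targets are the same sequence
    # shifted left by one, so each position predicts the token that follows it.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten positions
        targets.reshape(-1),                  # flatten the matching targets
    )
```

A supervised objective would replace `targets` with human-provided labels and learn only the features needed to predict them; here the raw text itself supplies the supervision.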

Intuition 1: Inference is the real criterion, not training

Anyone who has trained a model before knows you can satisfy all of those criteria and still get a garbage model out. The criteria are just a description of how the current crop of foundation models happened to be built.

So I will reframe the whole concept. Given a model that already exists, there is really only one criterion for calling it ‘foundation’.

If a model transfers to new tasks in its domain at/near state-of-the-art performance (with/without finetuning), it is a foundation model. 

GPT3 is considered state of the art for the following language tasks -

  • Language Modelling (completing the last sentence of a short paragraph)

  • Closed Book QA (answering trivia questions)

  • Language translation from English to multiple languages and back.

  • Winograd-style tasks (matching pronoun-noun pairs in ambiguously worded sentences)

  • Common Sense Reasoning (answering questions about how the physical world works)

  • Qualitative tasks (solving arithmetic, word scrambling and manipulation, SAT analogies)

  • Reading comprehension (answering questions about an essay)

  • Grammar correction

  • Turing-style tests (generating articles that humans cannot reliably distinguish from real articles)

What’s more, it is state of the art both with no task-specific examples at all (zero-shot) and with just a handful of examples placed in the prompt rather than used for finetuning (few-shot). In the last 2 years, some models have beaten GPT3 on individual tasks, yet no single model can generalize to as many tasks as GPT3. That is the real reason GPT3 is deemed a foundation language model.
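To make the zero-shot/few-shot distinction concrete, here is a rough sketch of the two prompting styles. Note that no weights are updated in either case: the few-shot ‘examples’ simply live in the prompt. I am using GPT-2 via Hugging Face transformers purely as a freely downloadable stand-in for GPT3, and the translation prompt mirrors the format used in the GPT3 paper.

```python
from transformers import pipeline

# GPT-2 stands in for a large foundation model; the prompting pattern is identical.
generator = pipeline("text-generation", model="gpt2")

# Zero-shot: the task is described, but no solved examples are given.
zero_shot = "Translate English to French:\ncheese =>"

# Few-shot: a handful of solved examples are placed directly in the prompt.
# The model's weights are never updated; it only conditions on this context.
few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

for prompt in (zero_shot, few_shot):
    completion = generator(prompt, max_new_tokens=5, do_sample=False)
    print(completion[0]["generated_text"])
```

GPT-2 will not translate reliably, of course; the sketch only shows that ‘few-shot’ here means conditioning on in-context examples, not gradient-based finetuning.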

So if you have a revolutionary new model that uses only 20 parameters and beats GPT3 on all or most of these benchmarks, congratulations on creating a foundation model (and on your Nobel Prize/billions of $).

Intuition 2: Are they really foundational?

In one word, no.

Given that within 2 years BERT was beaten on all language benchmarks by GPT3, a model more than 1,000x its size, it is clear that the models themselves are not a stable ‘foundation’ by any metric. Rather, we should understand foundation models as the set of models that currently serve as the benchmark for generalizability. As new models beat that benchmark, they will displace the old ones and become the new foundation models.

Researchers studying foundation models are therefore not studying a specific model or method. Rather, they are trying to understand what makes some models generalize better than others and how that generalization can be pushed further.

Conjecture: What if we keep making models bigger?

Recently, DeepMind published a paper claiming that language models still have room to improve given more compute and data. (More details on this in a future post.)

It is not impossible that the scaling potential of these models goes beyond the budget and resources of any one company. If that turns out to be true, we might see large international collaborations, similar to CERN or the International Space Station, where nations pool their money and manpower to train giant models. Individuals and companies could then connect to this planet-scale “foundation” model and finetune it for their own purposes.

————————————————————————————-

There you have it: my intuitions for how to define foundation models and where such models are headed. For more intuitions on AI/ML, subscribe below and follow me on Twitter.