Is my large language model too large?

Scaling for the win!

When we talk about the biggest neural networks, we are inevitably talking about language models. The reason is data. Small models don't have enough capacity to learn everything that large datasets have to offer, while large models trained on small datasets are a waste of compute and time: the model will simply memorize the dataset and overfit. Therefore the largest models can only be built on the largest datasets. Image and audio datasets are comparatively small because such data is usually under copyright and often raises privacy concerns. Text, on the other hand, is the default data type of the internet, and one can crawl essentially the entire web and collect all the text one can find. That makes it cheap to build giant internet-scale text datasets for massive models to learn from. But how big should our network be so that it is not too small, not too big, but just right? That is what we will discuss today.

In 2022, DeepMind published what is now known as the Chinchilla paper, all about 'scaling laws': the rules that determine the optimal model and dataset sizes. DeepMind wasn't the first to realize that there was something here; back in 2020, researchers at OpenAI had published a paper on scaling laws. However, DeepMind found a key intuition that led them to a surprising new result.

Intuition: The optimal learning rate schedule depends on dataset size

The OpenAI paper tested model performance as a function of dataset size, model size and total compute (roughly model size x dataset size). However, they kept the learning rate schedule fixed across all of their experiments. DeepMind found that this significantly underestimates performance: a cosine learning rate schedule whose length matches the amount of training data gives the best performance for models of every size. Fixing the schedule therefore understated the benefit of dataset size in the OpenAI analysis. This led the OpenAI folks to the erroneous conclusion that, given a fixed amount of compute, scaling model size is roughly twice as important as scaling dataset size.

OpenAI's analysis, showing that as the compute budget increases, the performance curves get flatter, i.e. larger models keep performing better. Note that OpenAI only tested models up to around 1B parameters.
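To make the schedule idea concrete, here is a minimal sketch of a cosine learning rate schedule whose decay horizon is tied to the total number of training steps, i.e. to how much data the run will actually see, rather than to one fixed horizon shared across all runs. The peak learning rate, minimum learning rate and warmup length below are illustrative values, not numbers from either paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=1000):
    """Cosine learning-rate schedule whose decay horizon is set by the total
    number of training steps, i.e. by how much data this run will see."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps        # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Two runs over different amounts of data: each schedule finishes decaying
# exactly when its own training run ends, instead of at a fixed step count.
short_run = [cosine_lr(s, total_steps=10_000) for s in range(10_000)]
long_run = [cosine_lr(s, total_steps=100_000) for s in range(100_000)]
```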

By testing models up to around 16B parameters and matching the learning rate schedule to the training data, DeepMind showed that the curves never flatten, even for large models: for a given compute budget, performance is worse (loss goes up) for models both smaller and larger than the optimal size.

Chinchilla

The central claim of DeepMind's paper is very simple - all contemporary large language models (LLMs) are trained sub-optimally. The models are too big and their training datasets too small.

Once you correct for the learning rate schedule, it turns out that scaling dataset size is just as important as scaling model size. To prove this, DeepMind trained a new model named Chinchilla: compared to their earlier 280B-parameter Gopher, they cut the model size by 4x (to 70B parameters) and increased the training data by more than 4x (to 1.4 trillion tokens), using roughly the same training compute. Summarized in one image -

They show that despite being a fraction of the size, Chinchilla beats the previous state-of-the-art models on most benchmarks.
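As a rough illustration of what 'compute-optimal' means in practice, here is a back-of-the-envelope sketch. It assumes the common approximations that training compute is about 6 x parameters x tokens and that the roughly-equal-scaling result works out to about 20 training tokens per parameter; the exact fitted laws in the paper differ slightly, so treat this as a ballpark, not the paper's method.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Back-of-the-envelope compute-optimal split of a training budget,
    assuming C ~= 6 * N * D and D ~= 20 * N (tokens per parameter)."""
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * r))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A Gopher-sized training budget of roughly 5.8e23 FLOPs lands close to
# Chinchilla's actual configuration: ~70B parameters and ~1.4T tokens.
n, d = chinchilla_optimal(5.8e23)
print(f"~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```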

Why are scaling laws important?

How much do you think it costs to train a big neural network?

Go ahead, take a second and guess.

Training GPT-3 once is estimated to have cost OpenAI 4.6 million dollars!!

In their paper, OpenAI mention that despite discovering mistakes in GPT-3's training process (including a bug in their data filtering), they decided not to train another model due to the prohibitive cost.

DeepMind themselves trained Gopher which, at 280 billion parameters to GPT-3's 175 billion, must have cost even more.

It is too expensive to tune or retrain such models multiple times with different hyperparameters. So you train a bunch of smaller models, work out how performance relates to model and dataset size, pick your optimal configuration, and then take your one shot at training the optimal large model. The right scaling law is therefore literally worth millions of dollars.
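As a sketch of what that workflow can look like, here is a toy example that fits a saturating power law to results from small pilot runs and extrapolates it to a much larger model. The parameter counts and loss values are invented purely for illustration, and the functional form is just one common choice, not the fit used in either paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, final validation loss) pairs from small
# pilot runs; these numbers are made up purely to illustrate the fitting step.
params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.20, 3.85, 3.52, 3.24, 3.01])

def power_law(n, a, b, c):
    # loss(N) ~= a * N^(-b) + c: improves with model size, saturating at c.
    return a * n ** (-b) + c

(a, b, c), _ = curve_fit(power_law, params, losses, p0=[10.0, 0.1, 2.0], maxfev=10_000)

# Extrapolate to the large model you can only afford to train once.
print(f"predicted loss at 100B params: {power_law(1e11, a, b, c):.2f}")
```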

Moreover, DeepMind's finding is good news for designing practical LLMs. The data requirement only applies during training; the model size, however, lives on after training as the compute required for inference and usage. Smaller models can therefore run on cheaper machines and cost less to serve. So while ChatGPT is all the rage these days, we should expect that a model of the same size as ChatGPT but trained on a lot more data would cost the same to run but perform much better.
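A quick back-of-the-envelope illustration of that inference argument, using the common approximation that a forward pass costs about 2 FLOPs per parameter per generated token (ignoring attention overhead and other details):

```python
def inference_flops_per_token(n_params):
    # Rough forward-pass cost: ~2 FLOPs per parameter per generated token.
    return 2 * n_params

gopher = inference_flops_per_token(280e9)       # ~5.6e11 FLOPs per token
chinchilla = inference_flops_per_token(70e9)    # ~1.4e11 FLOPs per token
print(f"Chinchilla is ~{gopher / chinchilla:.0f}x cheaper per generated token")
```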

We can look forward to scaling laws for image and audio models in the future. Will they behave the same way as language models? We will just have to find out.

————————————————————————————————————

There you have it, my intuitions for how understanding scaling laws can help us train smaller, cheaper and better models. For more intuitions on AI/ML, subscribe below and follow me on Twitter. You can also check out my other blog and projects on nirsd.com.