RLHF

How did OpenAI turn GPT3 into ChatGPT?

As the capabilities of deep learning models have grown since ImageNet in 2012, the models themselves have become more inscrutable. Deep learning has never had a satisfactory theoretical grounding, but at least shallow CNNs and RNNs made intuitive, if handwavy, sense. Giant transformers replaced that engineered intuition with the brute efficiency of parallelized matrix multiplications. This would be fine if these neural networks always behaved as we expected, but they often don’t. Anyone who has trained a network knows that gradient descent pushes it to exploit every loophole in the objective definition, every peculiarity in the dataset, and every unforeseen shortcut in training to achieve low loss, even if that means learning strange things you never intended the model to learn.

This has some people worried that even though we choose the data, the architecture and the training objective, we still cannot ensure that the machine learns what we want it to learn and not some other shortcut to minimize the loss. In the extreme case, the argument suggests that powerful AIs might form objectives beyond our control that may be opposed to humanity’s objectives. Thus is born the idea of AI alignment, a field of research dedicated to finding ways to test and build AI whose objectives are aligned with ours.

RLHF, or Reinforcement Learning from Human Feedback, is the biggest, and perhaps the only, advancement in the nascent field of AI alignment. The technique was first published in 2017 by researchers from OpenAI and DeepMind.

Reinforcement Learning

To understand RLHF, we need a quick intro to RL. Typically RL involves an agent that can explore different actions and a reward function that rewards the agent for certain actions.

Reinforcement Learning can be useful when:

  • the optimal behavior is not known

  • sequential decision-making is required

  • the optimal end state is known

A classic RL problem is learning to walk. You might have seen hilarious videos of virtual bots learning to walk/run/jump using RL. Why can’t we teach a bot to walk using standard supervised learning? The three criteria above tell us why. We can tell if a robot has walked from one place to another, i.e., we can recognize the end state. But we cannot really label what a robot should do to walk, so we don’t have the ‘labels’ to train a model. Instead, we let the robot try different things, learn from past trials, and chain together a sequence of actions, known as a trajectory, that achieves the end state.
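To make the pieces concrete, here is a minimal sketch of the agent–environment loop that collects such a trajectory. It assumes the Gymnasium library and its BipedalWalker-v3 environment (which needs the box2d extra installed); the random policy is just a stand-in for a learned one.

```python
# A minimal sketch of the agent-environment loop, assuming the Gymnasium API.
# BipedalWalker-v3 is a toy "learning to walk" task (requires the box2d extra).
import gymnasium as gym

env = gym.make("BipedalWalker-v3")
observation, info = env.reset(seed=0)

trajectory = []                               # the sequence of (obs, action, reward) steps
for _ in range(200):
    action = env.action_space.sample()        # placeholder: a trained policy would go here
    next_observation, reward, terminated, truncated, info = env.step(action)
    trajectory.append((observation, action, reward))
    observation = next_observation
    if terminated or truncated:               # the episode ended (fell over or timed out)
        break

env.close()
```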

The agent is designed to maximize a reward function that we design. The reward function must be general enough to let the agent explore and find the best paths, but narrow enough to prevent it from getting lost and making no progress towards the end state we want. Designing this reward function is therefore the most important task in RL.
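As an illustration, a hand-designed reward for a walking robot might look something like the hypothetical function below; the terms and weights are made up, but they show the balancing act between a reward that is too loose and one that is too narrow.

```python
# A hypothetical, hand-designed reward for a walking robot -- purely illustrative.
# Too loose (e.g., only rewarding "not falling") and the agent may just stand still;
# too narrow (e.g., rewarding an exact joint sequence) and it cannot explore.
def walking_reward(forward_velocity: float, energy_used: float, has_fallen: bool) -> float:
    reward = 1.0 * forward_velocity      # encourage progress toward the goal
    reward -= 0.05 * energy_used         # mild penalty to discourage flailing
    if has_fallen:
        reward -= 100.0                  # large penalty for failing the task outright
    return reward
```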

Reinforcement learning with human feedback

Another classic RL problem is playing games, the most famous example being AlphaGo built by DeepMind. For these agents, the game score is an easy reward function that the agent needs to maximize. But for some tasks, the reward function is harder to define.

Alternatively, we can extract a reward function from demonstrations of the task. This can work for learning a behavior such as walking, but only if the robot is a humanoid. What if the robot has a non-humanoid shape, like a rover? In such cases, we have neither demonstrations nor a well-defined reward function.

Conversation lies in this strange uncanny zone. There is no single mathematical way to define a good conversation, because every conversation is different. Imitation is also impractical, since it would require huge amounts of data to demonstrate every type of conversation. Yet people can easily recognize a good, friendly, productive conversation when they see or hear it.

Thus in RLHF, instead of a hand-designed reward function or demonstrations, humans evaluate the performance of the agent, and this evaluation is used to improve the agent.

Intuition: Don’t learn the agent, learn the objective to learn the agent

The key insight of RLHF solves both problems (we don’t know the reward function, and imitation learning is not label-efficient) with just one weird trick: use the human feedback to learn a reward function, which is then used to train the agent.

Laying it out in steps -

  1. Create a reward function, r, parameterized by learnable weights (in practice, a small neural network). For a specific choice of these parameters, the agent tries specific actions, learns using RL, and ends up following a specific set of action trajectories.

  2. Next, create a dataset of pairs of action trajectories and get humans to evaluate which trajectory is better in each pair.

  3. Use gradient descent to fit the reward function so that it assigns higher total reward to the trajectories the humans preferred, then train the agent with RL against this learned reward (a sketch follows below).
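A minimal sketch of step 3 might look like the following, assuming PyTorch; the RewardModel architecture, tensor shapes, and the random trajectory segments are hypothetical stand-ins. The loss is the Bradley–Terry-style cross-entropy from the 2017 paper: the probability that segment A is preferred over segment B is a softmax over the two segments' total predicted rewards.

```python
# Sketch of fitting a reward model from human preference pairs (Bradley-Terry loss).
# RewardModel, obs_dim, and the random segments below are illustrative assumptions.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, segment):                  # segment: (timesteps, obs_dim)
        return self.net(segment).sum()           # total predicted reward of the segment

reward_model = RewardModel(obs_dim=8)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(seg_a, seg_b, human_prefers_a: bool) -> torch.Tensor:
    # P(a preferred over b) = softmax over the two segments' total rewards;
    # cross-entropy against the human label pushes reward toward preferred behavior.
    r_a, r_b = reward_model(seg_a), reward_model(seg_b)
    logits = torch.stack([r_a, r_b]).unsqueeze(0)          # shape (1, 2)
    target = torch.tensor([0 if human_prefers_a else 1])   # index of preferred segment
    return nn.functional.cross_entropy(logits, target)

# One gradient step on a single labeled pair of trajectory segments
seg_a, seg_b = torch.randn(25, 8), torch.randn(25, 8)      # stand-ins for real rollouts
loss = preference_loss(seg_a, seg_b, human_prefers_a=True)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The learned reward model then plays the role of the game score: a standard RL algorithm trains the agent against it, while fresh preference labels keep refining the reward model.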

In the original RLHF paper, the agent was trained to play Atari games, so the human raters evaluated which of two trajectories showed a better-played game. In the case of GPT training, the raters evaluated which conversations were more human-like and more pleasant.

RLHF reduces both the amount of human feedback needed for RL and the expertise it demands. Designing a complex reward function that describes good conversation would require an ML expert, but even a non-expert can compare two conversations and choose which one is better. And while training an RL agent by imitation would take a huge amount of labeled data, learning a reward function that is then used for RL greatly amplifies the effect of each label, enabling learning with less than 1% of the labels otherwise required.

Comments: Better to be rich than right

RLHF is still controversial in ML circles. On the one hand, it was the secret ingredient that turned the complex heap of probabilities that was GPT3 into ChatGPT, the most successful product launch in software history. On the other hand, there is evidence that GPT4 was a well-calibrated model before RLHF, whereas after RLHF it is not.

What does that mean? A calibrated model’s confidence in an output matches how often that output actually turns out to be correct; in that sense, its probabilities are a faithful reflection of what it learned from its training data. RLHF, however, distorts this calibration to match the preferences of the humans who provide the feedback. This gives the feedback providers control over what the model does, raising issues of control, bias, and censorship. It has also sparked new lines of inquiry, for instance whether an RLHF-ed model is better at manipulating humans by saying what they want to hear, or whether it is more likely to hallucinate fake information.
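As a toy illustration of what calibration means, the sketch below buckets a handful of made-up answers by the model’s stated confidence and compares that to how often the answers are actually correct; for a calibrated model the two numbers track each other.

```python
# Toy calibration check: compare stated confidence to observed accuracy per bucket.
# The confidences and correctness flags are made-up stand-ins for real eval data.
import numpy as np

confidences = np.array([0.95, 0.90, 0.80, 0.80, 0.60, 0.55, 0.30, 0.25])
correct     = np.array([1,    1,    1,    0,    1,    0,    0,    1   ])

bins = np.linspace(0.0, 1.0, 6)                     # five confidence buckets
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidences >= lo) & (confidences < hi)
    if mask.any():
        # For a calibrated model, mean confidence ~= observed accuracy in each bucket
        print(f"{lo:.1f}-{hi:.1f}: confidence {confidences[mask].mean():.2f}, "
              f"accuracy {correct[mask].mean():.2f}")
```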

———————————————————————————————————
There you have it, my intuitions on how RLHF works and turns a seemingly dumb next-token predictor into an uncannily human-like conversational agent. For more such intuitions on AI/ML, subscribe and follow me on Twitter. You can also check out my other projects on nirsd.com.