Glossary · Term

RLHF

Also known as: Reinforcement Learning from Human Feedback

RLHF is a training technique in which humans rate an AI model's answers and the model is then fine-tuned on those ratings so that it gives more useful and safer answers. It is considered a key technique in making ChatGPT usable.

RLHF (Reinforcement Learning from Human Feedback) is a training technique in which people rank multiple answers generated by an AI model, and the model is then fine-tuned on that preference data so that it produces the kind of answer people prefer. It can be likened to a chef who keeps revising a recipe based on feedback from the tasters of a new menu.
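As a rough illustration of how a ranking becomes a training signal, the sketch below expands one ranking into (preferred, rejected) pairs and fits a scalar reward per answer with a pairwise logistic loss. This is a toy under stated assumptions, not how any particular system is implemented: the function names and the sample ranking are hypothetical, and in practice the reward comes from a neural network that scores text, after which the language model itself is optimized against that reward with reinforcement learning.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_answers):
    """Expand one ranking (best answer first) into (preferred, rejected) pairs."""
    return list(combinations(ranked_answers, 2))


def preference_loss(reward_preferred, reward_rejected):
    """Pairwise logistic loss: low when the preferred answer has the higher reward."""
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))


def fit_rewards(pairs, steps=200, learning_rate=0.1):
    """Fit one scalar reward per answer by gradient descent on the pairwise loss."""
    rewards = {answer: 0.0 for pair in pairs for answer in pair}
    for _ in range(steps):
        for preferred, rejected in pairs:
            margin = rewards[preferred] - rewards[rejected]
            # Magnitude of the loss gradient with respect to the margin.
            push = 1.0 / (1.0 + math.exp(margin))
            rewards[preferred] += learning_rate * push
            rewards[rejected] -= learning_rate * push
    return rewards


# Hypothetical ranking from one evaluator, best answer first.
ranking = ["polite and correct", "correct but curt", "rude"]
pairs = ranking_to_pairs(ranking)
rewards = fit_rewards(pairs)

for answer in ranking:
    print(f"{answer}: {rewards[answer]:+.2f}")
```

The fitted rewards end up ordered the same way as the evaluator's ranking, which is the signal a later reinforcement learning step would push the model toward.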

A model that has only completed pretraining knows a great deal, but it can give rude or dangerous answers without hesitation. RLHF is a finishing step that turns such rough models into useful and safe assistants. It is widely considered the key to ChatGPT's popular success and has since become a standard training step for conversational AI.

However, adapting a model to human preferences has side effects. Two representative examples: the tastes and biases of the evaluators seep into the model, and the model develops a tendency to flatter, giving answers that sound good rather than answers that are accurate.

✅ Why it matters

Transforms a knowledgeable model into a polite and useful conversationalist
Acts as a safeguard to reduce dangerous or harmful answers
Has become the standard training process for conversational AI since ChatGPT

⚠️ Limits and debates

The evaluator's tastes and biases can seep into the model
Can create a tendency to flatter, favoring answers that sound good over accurate ones
Requires large-scale human evaluation, which is expensive and makes quality control difficult