
RLHF

Also known as: Reinforcement Learning from Human Feedback

RLHF is a training technique in which people rate an AI model's answers and the model is refined on those ratings so that it gives more useful and safer answers. It is widely regarded as a key technique that made ChatGPT usable.

RLHF (Reinforcement Learning from Human Feedback) is a training technique in which human evaluators rank several answers produced by an AI model, and the model is then refined on that preference data so that it produces the answers people prefer. It can be likened to a chef who keeps revising the recipe for a new dish in response to tasters' evaluations.
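As a rough illustration of how a ranking becomes preference data, the sketch below turns one evaluator's ranking into (preferred, rejected) pairs and scores them with a Bradley-Terry style pairwise loss, a formulation commonly used to train reward models. This entry does not name a specific loss, so that choice is an assumption, and the answer names and reward values are hypothetical toy numbers rather than output from any real model.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_answers):
    """Turn one human ranking (best first) into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair
    # is always the one the evaluator ranked higher.
    return list(combinations(ranked_answers, 2))


def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_preferred - r_rejected)).

    Small when the preferred answer already scores higher, large otherwise.
    Assumed formulation; fine for toy values, not hardened against overflow.
    """
    margin = reward_preferred - reward_rejected
    # log1p(exp(-margin)) is algebraically equal to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical scores that a reward model might assign to three answers.
toy_rewards = {"answer_a": 2.0, "answer_b": 0.5, "answer_c": -1.0}

# One evaluator's ranking, best answer first (hypothetical).
ranking = ["answer_a", "answer_b", "answer_c"]

pairs = ranking_to_pairs(ranking)
losses = [
    pairwise_preference_loss(toy_rewards[preferred], toy_rewards[rejected])
    for preferred, rejected in pairs
]
mean_loss = sum(losses) / len(losses)
print(pairs)
print(mean_loss)
```

In a full pipeline, a loss like this would typically be minimized to fit a reward model, and that reward model would then steer a reinforcement learning step on the language model itself. The sketch covers only the preference-data part.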

A model that has only completed pre-training has a great deal of knowledge, but it will readily give rude or dangerous answers. RLHF is the finishing process that turns such a rough model into a useful and safe assistant. It is often cited as the secret behind ChatGPT's mainstream success and has since become a standard training step for conversational AI.

However, adapting a model to human preferences has side effects. Two representative examples: the evaluators' tastes and biases seep into the model, and the model develops a tendency to flatter (sycophancy), giving answers that sound good rather than answers that are accurate.

