PPO (Proximal Policy Optimization) is the classic RLHF algorithm, and RLOO (REINFORCE Leave-One-Out) is a REINFORCE-based algorithm that simplifies the PPO-style RLHF pipeline. TRL provides a PPOTrainer and an RLOOTrainer for the two. Below we walk through their similarities and differences.
1. Models
PPO needs to load four models: 1) the policy model, 2) the reference policy model, 3) the reward model, and 4) the value model. RLOO drops the value model and keeps only the other three, so RLOO is necessarily lighter on GPU memory than PPO.
PPO wraps the policy and the value model into a single module: one forward pass returns both models' outputs, and both models are updated together during training (a stripped-down toy version of this pattern follows the snippet below).
PPOTrainer
class PolicyAndValueWrapper(nn.Module):
    def __init__(self, policy, value_model) -> None:
        super().__init__()
        self.policy = policy
        self.value_model = value_model
        # the value model's transformer backbone, looked up via its base_model_prefix
        self.critic_backbone = getattr(value_model, value_model.base_model_prefix)

    def forward(self, **kwargs):
        output = self.critic_backbone(
            **kwargs,
        )
        # the value head maps the last hidden states to per-token value estimates
        logits = self.value_model.score(output.hidden_states[-1])
        return self.policy(**kwargs), logits
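A stripped-down toy of the same pattern (illustrative only, not TRL code; ToyWrapper and the two nn.Linear stand-ins are made up for this sketch): wrapping two modules so that one forward pass yields both outputs and a single optimizer step updates both sets of weights.

import torch
import torch.nn as nn

class ToyWrapper(nn.Module):
    def __init__(self, policy, value_model):
        super().__init__()
        self.policy = policy            # stands in for the causal LM producing logits
        self.value_model = value_model  # stands in for the backbone + scalar value head

    def forward(self, x):
        return self.policy(x), self.value_model(x)

wrapper = ToyWrapper(nn.Linear(8, 16), nn.Linear(8, 1))

# one optimizer over the wrapper's parameters trains policy and value model together
optimizer = torch.optim.AdamW(wrapper.parameters(), lr=1e-3)
logits, values = wrapper(torch.randn(2, 8))
(logits.sum() + values.sum()).backward()  # gradients flow into both sub-modules
optimizer.step()
print(logits.shape, values.shape)  # torch.Size([2, 16]) torch.Size([2, 1])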
2. Computing the reward
In both methods the reward combines the environment reward, i.e. the reward model's score, with a KL-divergence penalty against the reference policy, but the two compute it differently.
PPO treats every completion token as a separate action, yet only the EOS token receives the actual reward-model score; the resulting rewards tensor has shape [batch_size, seq_len].
PPOTrainer
# 4. compute rewards
kl = logprobs - ref_logprobs                    # per-token KL between policy and reference
non_score_reward = -args.kl_coef * kl           # KL penalty at every completion token
rewards = non_score_reward.clone()              # shape [batch_size, seq_len]
actual_start = torch.arange(rewards.size(0), device=rewards.device)
actual_end = torch.where(sequence_lengths_p1 < rewards.size(1), sequence_lengths_p1, sequence_lengths)
rewards[[actual_start, actual_end]] += scores   # reward-model score added only at the EOS position
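A tiny example with dummy numbers (not TRL code; every tensor below is made up) makes the shape concrete: one completion of three tokens whose EOS sits at index 2 gets the KL penalty at every token, plus the score only at the EOS position.

import torch

kl_coef = 0.1
score = torch.tensor([1.0])                    # reward-model score for the whole completion
kl = torch.tensor([[0.2, 0.4, 0.1]])           # per-token KL, shape [batch=1, seq_len=3]
rewards = -kl_coef * kl                        # KL penalty at every token
eos_index = torch.tensor([2])                  # position of the EOS token
rewards[torch.arange(1), eos_index] += score   # the score lands only on the EOS token
print(rewards)  # tensor([[-0.0200, -0.0400,  0.9900]]) -> shape [batch_size, seq_len]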
RLOO instead treats the whole completion as a single action and attributes the EOS reward to the entire completion, so the RLOO reward is a single scalar per completion, of shape [batch_size].
RLOOTrainer
# 4. compute rewards
kl = logprobs - ref_logprobs                    # per-token KL between policy and reference
non_score_reward = (-args.kl_coef * kl).sum(1)  # KL penalty summed over the completion
rlhf_reward = scores + non_score_reward         # one scalar reward per completion
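The same dummy numbers under RLOO: the KL penalty is summed over the completion before the score is added, leaving one scalar per completion.

import torch

kl_coef = 0.1
scores = torch.tensor([1.0])                   # reward-model score for the whole completion
kl = torch.tensor([[0.2, 0.4, 0.1]])           # per-token KL, shape [batch=1, seq_len=3]
non_score_reward = (-kl_coef * kl).sum(1)      # summed over tokens -> shape [batch_size]
rlhf_reward = scores + non_score_reward
print(rlhf_reward)  # tensor([0.9300]) -> one scalar per completion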
3. Computing the advantage
In PPO, the advantage is the action value minus the state value, i.e. A(s, a) = Q(s, a) - V(s). The advantage is estimated with Generalized Advantage Estimation (GAE), and the return (the action-value target for the value model) falls out of the same computation.
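Written with the same names as the snippet below (gamma and lam correspond to args.gamma and args.lam), GAE runs this recursion backwards over the completion tokens:

delta_t     = rewards_t + gamma * V_{t+1} - V_t
advantage_t = delta_t + gamma * lam * advantage_{t+1}
return_t    = advantage_t + V_t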
PPOTrainer
# 6. compute advantages and returns
lastgaelam = 0
advantages_reversed = []
gen_length = responses.shape[1]
for t in reversed(range(gen_length)):
    nextvalues = values[:, t + 1] if t < gen_length - 1 else 0.0
    delta = rewards[:, t] + args.gamma * nextvalues - values[:, t]  # TD error
    lastgaelam = delta + args.gamma * args.lam * lastgaelam         # GAE recursion
    advantages_reversed.append(lastgaelam)
advantages = torch.stack(advantages_reversed[::-1], axis=1)
returns = advantages + values                                       # targets for the value model
advantages = masked_whiten(advantages, ~padding_mask)
advantages = torch.masked_fill(advantages, padding_mask, 0)
In RLOO, the advantage is the total reward minus a virtual baseline. For each sample, the baseline is the average reward of the other samples drawn for the same prompt, excluding the sample itself, which is where the name Leave-One-Out comes from. "This sample's reward minus the average reward of the other samples" is, in expectation, equivalent to "this action's value minus the average value over actions". Here rloo_k is the total number of samples drawn per prompt (a numeric example follows the snippet below).
RLOOTrainer
# vectorized RLOO advantages implementation
rlhf_reward = rlhf_reward.reshape(args.rloo_k, -1)                 # group the rloo_k samples of each prompt
baseline = (rlhf_reward.sum(0) - rlhf_reward) / (args.rloo_k - 1)  # mean reward of the other k-1 samples
advantages = rlhf_reward - baseline
advantages = advantages.flatten()
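A quick numeric check with dummy numbers: with rloo_k = 3 samples of the same prompt scoring 1.0, 2.0 and 3.0, each sample's baseline is the mean of the other two.

import torch

rloo_k = 3
rlhf_reward = torch.tensor([1.0, 2.0, 3.0]).reshape(rloo_k, -1)    # [rloo_k, prompts_per_batch=1]
baseline = (rlhf_reward.sum(0) - rlhf_reward) / (rloo_k - 1)       # mean reward of the other k-1 samples
advantages = (rlhf_reward - baseline).flatten()
print(advantages)  # tensor([-1.5000,  0.0000,  1.5000])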
4. Computing the loss
Both methods use the clipped surrogate objective for the policy-model loss.
On top of that, PPO also computes a value-model loss, so in the PPO training loop both the policy model and the value model are updated.
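Spelled out with the same names as the snippet below (vpredclipped is the value prediction clipped to stay close to the old value estimate), the two losses are:

pg_loss = mean( max( -advantage * ratio, -advantage * clip(ratio, 1 - cliprange, 1 + cliprange) ) )
vf_loss = 0.5 * mean( max( (vpred - return)^2, (vpredclipped - return)^2 ) )
loss    = pg_loss + vf_coef * vf_loss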
PPOTrainer
vf_losses1 = torch.square(vpred - mb_return)
vf_losses2 = torch.square(vpredclipped - mb_return)
vf_loss_max = torch.max(vf_losses1, vf_losses2)     # clipped value loss
vf_loss = 0.5 * masked_mean(vf_loss_max, ~padding_mask_p1[micro_batch_inds])
vf_clipfrac = masked_mean(
    (vf_losses2 > vf_losses1).float(), ~padding_mask_p1[micro_batch_inds]
)
logprobs_diff = new_logprobs - mb_logprobs
ratio = torch.exp(logprobs_diff)                    # per-token probability ratio
pg_losses = -mb_advantage * ratio
pg_losses2 = -mb_advantage * torch.clamp(ratio, 1.0 - args.cliprange, 1.0 + args.cliprange)
pg_loss_max = torch.max(pg_losses, pg_losses2)      # clipped policy loss
pg_loss = masked_mean(pg_loss_max, ~padding_mask[micro_batch_inds])
loss = pg_loss + args.vf_coef * vf_loss             # policy loss plus weighted value loss
RLOO, by contrast, only computes the policy-model loss.
RLOOTrainer
new_ratio = (new_logprobs - mb_logprobs).exp()      # per-token ratio; the loss below uses the sequence-level ratio instead
new_logprobs = new_logprobs.sum(1)                  # sum log-probs over the completion
mb_logprobs = mb_logprobs.sum(1)
logprobs_diff = new_logprobs - mb_logprobs
ratio = torch.exp(logprobs_diff)                    # one ratio per completion
pg_losses = -mb_advantage * ratio
pg_losses2 = -mb_advantage * torch.clamp(ratio, 1.0 - args.cliprange, 1.0 + args.cliprange)
pg_loss_max = torch.max(pg_losses, pg_losses2)      # clipped policy loss
pg_loss = pg_loss_max.mean()
loss = pg_loss                                      # no value-model loss in RLOO
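A quick check with dummy numbers (illustrative only, not TRL code) of why the log-probs are summed first: exponentiating the summed difference gives a single sequence-level ratio, equal to the product of the per-token ratios, which matches RLOO's view of the whole completion as one action.

import torch

new_logprobs = torch.tensor([[-0.9, -1.1, -0.8]])   # [batch=1, seq_len=3]
mb_logprobs = torch.tensor([[-1.0, -1.0, -1.0]])

token_ratios = (new_logprobs - mb_logprobs).exp()              # one ratio per token (PPO-style)
seq_ratio = (new_logprobs.sum(1) - mb_logprobs.sum(1)).exp()   # one ratio per completion (RLOO)
print(token_ratios.prod(1), seq_ratio)  # both tensor([1.2214]): the sequence ratio is the product of the token ratios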