Notes
2026.01.15
RL has limited effectiveness when applied to extremely underfit or overfit initial checkpoints [1].
Despite RL's superior generalization, [1] also finds that SFT is still helpful for effective RL training: SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains.
[1] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training.
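A minimal toy sketch of this recipe (not the paper's actual setup; the vocabulary, `ToyPolicy`, and `format_reward` are all hypothetical): stage 1 fits a tiny policy to a format-following demonstration by maximum likelihood (SFT), and stage 2 runs REINFORCE with a reward that is only informative when the output format parses. The point is that once SFT has locked in the format, RL receives a usable signal and can improve the answer itself.

```python
# Toy sketch, assuming a 3-token output of the form <ans> digit </ans>.
import torch
import torch.nn.functional as F

VOCAB = ["<ans>", "</ans>", "0", "1", "2", "3"]          # toy token set
TOK = {t: i for i, t in enumerate(VOCAB)}
SEQ_LEN = 3                                              # <ans> digit </ans>

class ToyPolicy(torch.nn.Module):
    """Position-wise categorical policy over the toy vocabulary."""
    def __init__(self):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(SEQ_LEN, len(VOCAB)))

    def sample(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        ids = dist.sample()
        return ids, dist.log_prob(ids).sum()

def format_reward(ids, target_digit="2"):
    """No reward unless the output parses as <ans> d </ans>; full reward if d is correct."""
    toks = [VOCAB[i] for i in ids]
    if toks[0] != "<ans>" or toks[-1] != "</ans>":
        return 0.0                                       # unparseable output -> no signal
    return 1.0 if toks[1] == target_digit else 0.1       # partial credit for correct format

policy = ToyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=0.1)

# Stage 1: SFT on a formatted demonstration, which stabilizes the output format.
demo = torch.tensor([TOK["<ans>"], TOK["1"], TOK["</ans>"]])
for _ in range(50):
    loss = F.cross_entropy(policy.logits, demo)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: REINFORCE. Because sampled outputs already parse, the reward is
# informative and RL can push the answer toward the correct digit.
for _ in range(200):
    ids, logp = policy.sample()
    loss = -format_reward(ids) * logp
    opt.zero_grad(); loss.backward(); opt.step()

ids, _ = policy.sample()
print("sample after RL:", [VOCAB[i] for i in ids])
```

Without the SFT stage, almost every sampled sequence fails the format check, the reward is near-constant zero, and the policy gradient carries essentially no learning signal; with it, RL only has to adjust the answer token.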
2026.01.05
Top-down learning strategy: start from a problem you want to solve, and learn whatever knowledge solving it requires.