Notes
2026.01.15
RL has limited effectiveness when applied to extremely underfit or overfit initial checkpoints [1].
Despite RL's superior generalization, [1] also finds that SFT is still helpful for effective RL training: SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains.
[1] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training.
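A minimal toy sketch of this recipe (not the paper's actual setup; the vocabulary, `ToyPolicy`, and `format_reward` are all hypothetical): stage 1 fits a tiny policy to a format-following demonstration by maximum likelihood (SFT), and stage 2 runs REINFORCE with a reward that is only informative when the output format parses. The point is that once SFT has locked in the format, RL receives a usable signal and can improve the answer itself.

```python
# Toy sketch, assuming a 3-token output of the form <ans> digit </ans>.
import torch
import torch.nn.functional as F

VOCAB = ["<ans>", "</ans>", "0", "1", "2", "3"]          # toy token set
TOK = {t: i for i, t in enumerate(VOCAB)}
SEQ_LEN = 3                                              # <ans> digit </ans>

class ToyPolicy(torch.nn.Module):
    """Position-wise categorical policy over the toy vocabulary."""
    def __init__(self):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(SEQ_LEN, len(VOCAB)))

    def sample(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        ids = dist.sample()
        return ids, dist.log_prob(ids).sum()

def format_reward(ids, target_digit="2"):
    """No reward unless the output parses as <ans> d </ans>; full reward if d is correct."""
    toks = [VOCAB[i] for i in ids]
    if toks[0] != "<ans>" or toks[-1] != "</ans>":
        return 0.0                                       # unparseable output -> no signal
    return 1.0 if toks[1] == target_digit else 0.1       # partial credit for correct format

policy = ToyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=0.1)

# Stage 1: SFT on a formatted demonstration, which stabilizes the output format.
demo = torch.tensor([TOK["<ans>"], TOK["1"], TOK["</ans>"]])
for _ in range(50):
    loss = F.cross_entropy(policy.logits, demo)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: REINFORCE. Because sampled outputs already parse, the reward is
# informative and RL can push the answer toward the correct digit.
for _ in range(200):
    ids, logp = policy.sample()
    loss = -format_reward(ids) * logp
    opt.zero_grad(); loss.backward(); opt.step()

ids, _ = policy.sample()
print("sample after RL:", [VOCAB[i] for i in ids])
```

Without the SFT stage, almost every sampled sequence fails the format check, the reward is near-constant zero, and the policy gradient carries essentially no learning signal; with it, RL only has to adjust the answer token.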
2026.01.05
Top-down learning strategy: start from a problem you want to solve, and learn whatever knowledge solving it requires.