Notes

2026.01.15

  1. RL has limited effectiveness when applied to extremely underfit or overfit initial checkpoints [1].

  2. Despite RL’s superior generalization, we show that SFT is still helpful for effective RL training: SFT stabilizes the model’s output format, enabling subsequent RL to achieve its performance gains.

[1] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training.
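A minimal sketch of why format stabilization matters for RL: rule-based RL rewards often include a format check, and if the base model rarely emits the expected template, the reward is near-zero and training stalls. The `<think>/<answer>` tag scheme and the `format_reward` helper below are hypothetical illustrations, not taken from the note or the cited paper.

```python
import re

# Hypothetical template: reasoning in <think> tags, final result in <answer> tags.
FORMAT_RE = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion matches the expected template, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0

# After SFT the model reliably emits the template, so RL receives usable signal:
print(format_reward("<think>2+2=4</think><answer>4</answer>"))  # 1.0
print(format_reward("The answer is 4."))                        # 0.0
```

Under this view, SFT's role is to move the policy into the region where such sparse rewards fire at all; RL then sharpens performance within that region.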

2026.01.05

  • Top-down learning strategy: start from a concrete problem to solve, and learn whatever knowledge that problem requires along the way.