Reference
Inference-time Computing
[1] AlphaZero-like tree-search can guide large language model decoding and training.
[2] Reasoning with language model is planning with world model.
[3] Scaling LLM test-time compute optimally can be more effective than scaling model parameters.
[4] Think before you speak: Training language models with pause tokens.
From Outcome Supervision to Process Supervision
[1] Training verifiers to solve math word problems.
[2] Solving math word problems with process- and outcome-based feedback.
[3] Let's verify step by step.
[4] Making large language models better reasoners with step-aware verifier.
[5] OVM, outcome-supervised value models for planning in mathematical reasoning.
[6] Generative verifiers: Reward modeling as next-token prediction.
Data Acquisition
[1] STaR: Bootstrapping reasoning with reasoning.
[2] Quiet-STaR: Language models can teach themselves to think before speaking.
[3] Improve mathematical reasoning in language models by automated process supervision.
[4] Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations.