Reference
Inference-time Computing
[1] AlphaZero-like tree-search can guide large language model decoding and training.
[2] Reasoning with language model is planning with world model.
[3] Scaling LLM test-time compute optimally can be more effective than scaling model parameters.
[4] Think before you speak: Training language models with pause tokens.
From Outcome Supervision to Process Supervision
[1] Training verifiers to solve math word problems.
[2] Solving math word problems with process- and outcome-based feedback.
[3] Let's verify step by step.
[4] Making large language models better reasoners with step-aware verifier.
[5] OVM, outcome-supervised value models for planning in mathematical reasoning.
[6] Generative verifiers: Reward modeling as next-token prediction.
Data Acquisition
[1] STaR: Bootstrapping reasoning with reasoning.
[2] Quiet-STaR: Language models can teach themselves to think before speaking.
[3] Improve mathematical reasoning in language models by automated process supervision.
[4] Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations.