Reasoning
Once a high-quality process reward model is trained, we integrate it into the decoding process alongside the language model, enabling guided search and scoring or voting across multiple generations.
OpenR supports various search algorithms, such as beam search, best-of-N selection, and Monte Carlo Tree Search (MCTS). We will continually update the framework to include more classic reasoning methods.
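For intuition, below is a minimal sketch of PRM-scored best-of-N selection: sample several candidate solutions, score each step with the reward model, and keep the highest-scoring candidate. The generate_candidates and prm_score_steps callables are hypothetical stand-ins for the policy- and reward-model services, not functions from the OpenR codebase.

# Minimal sketch of PRM-guided best-of-N selection (illustrative, not the OpenR API).
# `generate_candidates` and `prm_score_steps` are hypothetical placeholders for the
# policy-model and reward-model services.

from typing import Callable, List

def best_of_n(
    question: str,
    generate_candidates: Callable[[str, int], List[List[str]]],  # returns n solutions, each a list of steps
    prm_score_steps: Callable[[str, List[str]], List[float]],    # returns one PRM score per step
    n: int = 8,
) -> List[str]:
    """Generate n candidate solutions and return the one the PRM scores highest."""
    candidates = generate_candidates(question, n)
    best_solution, best_score = [], float("-inf")
    for steps in candidates:
        step_scores = prm_score_steps(question, steps)
        # Aggregate step-level scores; taking the minimum is one common choice,
        # since a single bad step can invalidate the whole solution.
        score = min(step_scores) if step_scores else float("-inf")
        if score > best_score:
            best_solution, best_score = steps, score
    return best_solution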
Basic Command
To run inference using Qwen2.5-Math-1.5B-Instruct as the policy model and the Mistral-7B-based Math-Shepherd PRM as the reward model on the MATH dataset, first set up the two LLM services by running the following command:
sh reason/llm_service/create_service_qwen2.5_math_hf.sh
Next, run the evaluation script:
python reason/evaluation/evaluate.py \
--LM Qwen2.5-Math-1.5B-Instruct \
--RM math-shepherd-mistral-7b-prm \
--task_name MATH \
--temperature 0.7 \
--num_sequence 8 \
--max_new_tokens 2048 \
--method best_of_n
Parameters
--LM: Specifies the policy model. Available models include Qwen2.5-Math-1.5B-Instruct and others; replace with the desired model path or name. The code is compatible with various base models, including the Llama, Qwen, and Mistral series.
--RM: Sets the reward model. In this example, math-shepherd-mistral-7b-prm is used, which can be replaced with any supported reward model.
--task_name: Sets the dataset/task name. Examples include MATH, gsm8k, etc.
--temperature: Controls the randomness of the model output. A higher temperature increases randomness, while a lower value makes the output more deterministic.
--num_sequence: Number of sequences generated per inference step. Useful for methods like best_of_n, where multiple outputs are generated and evaluated.
--max_new_tokens: Specifies the maximum number of new tokens to generate.
--method: Chooses the search strategy. Options include:
  vanila_mcts - Vanilla Monte Carlo Tree Search
  beam_search - Beam Search
  best_of_n - Select the best sequence from n generated options
--tree_max_depth: Specifies the maximum depth of the search tree. Primarily useful when using tree-based methods like MCTS or beam search.
--tree_max_width: Limits the number of child nodes each tree node can have. Useful for controlling memory usage and search complexity.
--save_dir: Directory where model checkpoints or results will be saved. If not specified, results are not saved.
--resume_dir: Directory containing previous checkpoints to resume training or evaluation from. If specified, the script will continue from the last saved state.
--local: When set, runs the script in local debug mode, reducing parallelism and simplifying setup.
--num_worker: Specifies the number of parallel worker processes. Useful for speeding up processing with large datasets or multi-agent environments.
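As a usage sketch, a tree-based run exercising the flags above might look like the following. The flag names are those documented in this list; the specific values (depth, width, worker count, output directory) are illustrative choices, not recommended defaults.
python reason/evaluation/evaluate.py \
--LM Qwen2.5-Math-1.5B-Instruct \
--RM math-shepherd-mistral-7b-prm \
--task_name MATH \
--temperature 0.7 \
--max_new_tokens 2048 \
--method beam_search \
--tree_max_depth 10 \
--tree_max_width 4 \
--num_worker 16 \
--save_dir results/beam_search_math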