Quick Start

Welcome to the OpenR Manual, designed to guide you through the process of training large language models (LLMs) to reason effectively. Here we provide a quick guide on how to get the OpenR codebase up and running.

Table of contents

  1. Prerequisites
  2. Installation
    1. Create Environment using Conda
    2. Download Base Models
  3. Start an LLM Service
  4. Run Inference
  5. Run Training
  6. Run PRM Training

Prerequisites

We have tested our code on machines with a minimum of 2 x A800 GPUs, each with 80 GB of memory. For optimal performance, we recommend running the project on a machine with at least 80 GB of GPU memory.
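
To quickly check what your machine provides (assuming the NVIDIA driver and nvidia-smi are available), you can list the GPUs and their memory:

nvidia-smi --query-gpu=name,memory.total --format=csv
# each listed GPU should report roughly 80 GB (about 81920 MiB) for the recommended setup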

Installation

Create Environment using Conda

  1. Create and activate a new conda environment
conda create -n open_reasoner python=3.10
conda activate open_reasoner
  2. Install dependencies
pip install -r requirements.txt
pip3 install "fschat[model_worker,webui]"
pip install -U pydantic
cd envs/MATH/latex2sympy
pip install -e .
cd -
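
As a quick sanity check of the environment (a minimal sketch, assuming FastChat, vLLM and PyTorch were pulled in by the steps above), you can try importing the key packages:

python -c "import torch, fastchat, vllm; print(torch.__version__, fastchat.__version__, vllm.__version__)"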

Download Base Models

Before running the project, please ensure that all required base models are downloaded. The models used in this project include:

  • Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct
  • Qwen2.5-Math-RM-72B
  • peiyi9979/mistral-7b-sft
  • peiyi9979/math-shepherd-mistral-7b-prm

To download these models, please refer to the Hugging Face model downloading tutorial for step-by-step guidance on downloading models from the Hugging Face Hub.
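
For example, one common way to fetch a model is the huggingface-cli tool from the huggingface_hub package; the target directories below are only illustrative, so place the files wherever your setup expects them:

pip install -U "huggingface_hub[cli]"
huggingface-cli download peiyi9979/mistral-7b-sft --local-dir $MODEL_BASE/mistral-7b-sft
huggingface-cli download peiyi9979/math-shepherd-mistral-7b-prm --local-dir $MODEL_BASE/math-shepherd-mistral-7b-prm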

Ensure that all models are saved in the directories expected by the project setup before proceeding.

Start an LLM Service

Before running inference, please modify the following variables in the reason/llm_service/create_service_math_shepherd.sh script to set the appropriate base models for your usage:

  • $MODEL_BASE: Set this to the directory where your models are stored.
  • $POLICY_MODEL_NAME: Set this to the name of the policy model you wish to use.
  • $VALUE_MODEL_NAME: Set this to the name of the value model you wish to use.
  • $NUM_LM_WORKER: Set this to the number of language model (LM) workers to start.
  • $NUM_RM_WORKER: Set this to the number of reward model (RM) workers to start.
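
For example, with the Math-Shepherd models from the previous step, the settings might look like this (paths and worker counts are illustrative; adjust them to your machine):

MODEL_BASE=/path/to/your/models                  # directory holding the downloaded models
POLICY_MODEL_NAME=mistral-7b-sft                 # generator (policy) model
VALUE_MODEL_NAME=math-shepherd-mistral-7b-prm    # process reward (value) model
NUM_LM_WORKER=1                                  # number of LM workers to start
NUM_RM_WORKER=1                                  # number of RM workers to start

Then launch the services: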
sh reason/llm_service/create_service_math_shepherd.sh

This script starts two separate LLM services on your hardware, one for generation (the policy model) and one for value/reward inference. After the script runs successfully, you should see the services listed like this:

$ ps -ef | grep openr

xxx    623984  175535 17 18:47 pts/9  00:00:08
/home/yanxue/anaconda3/envs/openr/bin/python3 -m fastchat.serve.controller --port 28777 --host 0.0.0.0

xxx    624073  623967 99 18:47 pts/10   00:04:37 
anaconda3/envs/openr/bin/python3 -m reason.llm_service.workers.reward_model_worker --model-path /mnt/nasdata/xxx/llms/huggingface/math-shepherd-mistral-7b-prm --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30011 --worker-address http://0.0.0.0:30011


xxx 624074  623975 99 18:47 pts/11   00:01:44 
anaconda3/envs/openr/bin/python3 -m reason.llm_service.workers.vllm_worker --model-path /mnt/nasdata/xxx/llms/huggingface/mistral-7b-sft --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30010 --worker-address http://0.0.0.0:30010 --dtype bfloat16 --swap-space 32

Run Inference

⚠️ Tip: Make sure the inputs (--LM, --RM) in the evaluation script align with the variables ($POLICY_MODEL_NAME, $VALUE_MODEL_NAME) set for the running workers!
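One quick way to verify this (a simple check, not part of the original scripts) is to grep the evaluation script for the model-name flags:

grep -n -e "--LM" -e "--RM" scripts/eval/cot_greedy.sh

Then run the evaluation scripts: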
export PYTHONPATH=$(pwd)
sh scripts/eval/cot_greedy.sh

# Method: cot. Average result: ({'majority_vote': 0.734, 'total_completion_tokens': 559.13},)

sh scripts/eval/cot_rerank.sh

# Method: best_of_n. Average result: ({'majority_vote': 0.782, 
#                                       'prm_min_max': 0.772, 
#                                       'prm_min_vote': 0.792, 
#                                       'prm_last_max': 0.776, 
#                                       'prm_last_vote': 0.792, 
#                                       'total_completion_tokens': 4431.268},)

sh scripts/eval/beam_search.sh

# Method: beam_search. Average result: ({'majority_vote': 0.74, 'total_completion_tokens': 2350.492},)


Run Training

Before training, please modify $dataset_path, $model_name_or_path, and $prm_name_or_path in train/mat/scripts/train_llm.sh.
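
For example (all values below are placeholders; point them at your own dataset and the models downloaded earlier):

dataset_path=/path/to/your/training_data
model_name_or_path=$MODEL_BASE/Qwen2.5-Math-1.5B-Instruct
prm_name_or_path=$MODEL_BASE/math-shepherd-mistral-7b-prm

Then start training: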

cd train/mat/scripts
bash train_llm.sh

Run PRM Training

cd prm/code

# single GPU
python finetune_qwen_single_gpu.py --model_path $YOUR_MODEL_PATH \
                                   --train_data_path $TRAIN_DATA_PATH \
                                   --test_data_path $TEST_DATA_PATH


# multi GPU
torchrun --nproc_per_node=2 finetune_qwen.py --model_path $YOUR_MODEL_PATH \
                                             --data_path $YOUR_DATA_FOLDER_PATH \
                                             --datasets both