Supervised Training for PRMs
In Process-supervised Reward Models (PRMs), the goal is to judge whether a solution process is on the right track after each step, so the model should output a binary indicator of correctness.
OpenR trains PRMs through supervised fine-tuning of an LLM, with the correct/incorrect distinction serving as the classification label.
Data Preprocessing
The datasets we use to train our PRM include PRM800K, Math-Shepherd, and our own dataset, MATH-APS. Each example in these datasets is structured into three parts:
- Question:
"question" : "Three pencils and a jumbo eraser cost $\\$1.24$. Five pencils and a jumbo eraser cost $\\$1.82$. No prices include tax. In cents, what is the cost of a pencil?"
- Process: the solution is broken down into multiple steps, each ending with a special step token, \n\n\n\n\n, which marks the end of a step and is the point at which the PRM can make a prediction.
"process" :
"Step: 1: Let's call the price of a pencil p and the price of a jumbo eraser e. Then we can write two equations. \n\n\n\n\n
Step: 2: The first equation is $3p+e=124$. \n\n\n\n\n
Step: 3: To solve this system, let's subtract the first equation from the second equation. This will eliminate e. \n\n\n\n\n
Step: 4: $5p+e-3p-e=1.82-1.24$. \n\n\n\n\n
Step: 5: This simplifies to $2p=0.58$. So $p=0.29$. \n\n\n\n\n
Step: 6: We could also solve this system by substitution. \n\n\n\n\n"
- Label: the classification for each step within the process; every entry is either a + or a - depending on that step's correctness (a sketch of how steps pair with labels follows the example below).
"label" : ["+", "-", "+", "+", "+", "+"]
More details of data preprocessing can be found in Datasets.
Evaluation & Fine-tuning
Our method defines a special step token, denoted as \n\n\n\n\n, along with two additional tokens representing positive and negative feedback, denoted as + and -. We then use the LLM to predict the token that follows each step token (implemented by preprocess_function() in prm/supervise/evaluate.py).
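The exact logic lives in preprocess_function(); as a hedged sketch of the underlying idea (token ids below are placeholders, and details may differ from OpenR's actual preprocessing), supervision can be restricted to the prediction made at each step token, with every other position masked out by the usual -100 ignore index:
IGNORE_INDEX = -100  # positions labeled -100 are excluded from the LM loss

def build_step_labels(input_ids, step_token_id, step_labels, pos_id, neg_id):
    # Start with everything masked; only the step predictions carry loss.
    labels = [IGNORE_INDEX] * len(input_ids)
    k = 0
    for i, tok in enumerate(input_ids[:-1]):
        if tok == step_token_id and k < len(step_labels):
            # Hugging Face shifts labels internally, so labels[i + 1]
            # supervises the token predicted right after position i.
            labels[i + 1] = pos_id if step_labels[k] == "+" else neg_id
            k += 1
    return labels

# Placeholder ids: step token = 9, "+" = 10, "-" = 11.
print(build_step_labels([5, 6, 9, 7, 9, 8], 9, ["+", "-"], 10, 11))
# -> [-100, -100, -100, 10, -100, 11]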
From the logits of the positive and negative tokens at that position, we apply a softmax and use the score of the + token as the prediction result (retrieved by preprocess_logits_for_metrics() in prm/supervise/evaluate.py).
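Concretely, that scoring step can be sketched as follows (illustrative, with placeholder token ids):
import torch

def step_scores(logits, step_positions, pos_id, neg_id):
    # logits: (seq_len, vocab_size) for one sequence. The logits at a
    # step-token position are the model's distribution over the next
    # token, which is where "+" or "-" is expected.
    two = logits[step_positions][:, [pos_id, neg_id]]  # (n_steps, 2)
    probs = torch.softmax(two, dim=-1)
    return probs[:, 0]  # column 0 is the "+" probability for each step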
Then we can either evaluate or train through the Hugging Face Trainer:
from transformers import Trainer

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],  # Replace with a validation set if available
    data_collator=data_collator,
    tokenizer=tokenizer,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    compute_metrics=compute_metrics,
)

# Run evaluation, or launch fine-tuning
trainer.evaluate()
trainer.train()
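The compute_metrics callable above is user-supplied; a minimal illustrative version (not necessarily OpenR's exact metric) could threshold the + probability at 0.5 and report step-level accuracy:
import numpy as np

def compute_metrics(eval_pred):
    # Assumes predictions are the P("+") scores produced by
    # preprocess_logits_for_metrics and label_ids are 0/1 per step.
    scores, labels = eval_pred
    preds = (np.asarray(scores) > 0.5).astype(int)
    return {"accuracy": float((preds == np.asarray(labels)).mean())}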
OpenR currently provides training code for both Llama and Qwen base models:
# single GPU
python finetune_qwen.py --total_batch_size 256 \
    --learning_rate 1e-4 \
    --datasets all

# multi GPU
torchrun --nproc_per_node=2 finetune_qwen.py --total_batch_size 256 \
    --learning_rate 1e-4 \
    --datasets all
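Once fine-tuned, the PRM scores a new solution exactly as described above. A hedged end-to-end sketch (the model path is a placeholder, and it assumes the step, +, and - tokens are all in the fine-tuned tokenizer's vocabulary):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-prm")  # placeholder path
model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-prm").eval()

question = "Three pencils and a jumbo eraser cost $1.24. ..."
process = "Step: 1: Let p be the pencil price and e the eraser price. \n\n\n\n\n"
inputs = tokenizer(question + " " + process, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0]  # (seq_len, vocab_size)

step_id = tokenizer.convert_tokens_to_ids("\n\n\n\n\n")
pos_id, neg_id = tokenizer.convert_tokens_to_ids(["+", "-"])
positions = (inputs["input_ids"][0] == step_id).nonzero(as_tuple=True)[0]
probs = torch.softmax(logits[positions][:, [pos_id, neg_id]], dim=-1)
print(probs[:, 0])  # P("+") for each completed step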