OlmoLogic: Boosting Reasoning via RLVR with Inductive Logic Programming
💻 Code: OpenInstruct (training) | Olmes (eval) | SLR
🧩 Paper: SLR | Reward Hacking
Authors: Lukas Helff, Felix Friedrich, Sebastian Sztwiertnia, Antonia Wüst, David Steinmann, Hikaru Shindo, Ahmad Omar, Quentin Delfosse, Ruben Härle, Rupert Mitchell, Tim Woydt, Wolfgang Stammer, Patrick Schramowski, Kristian Kersting. (See Acknowledgments)
Teaching reasoning models to think logically. While open RLVR recipes predominantly focus on math and code, logical reasoning is often left behind. We integrated inductive logic programming (ILP) into the Olmo-3 RLVR recipe, grounding rewards directly in the execution of logic programs. A Prolog interpreter runs the model's proposed rules against the task, exactly like a Python interpreter running code against test cases. The resulting model, OlmoLogic, was trained on 56 H100 GPUs for six days (3,350 optimization steps), starting from the Olmo-3-7B-Think-DPO checkpoint. It triples SLR-Bench accuracy (15.1 → 45.1) and lifts Olmo's reasoning capabilities across a wide range of other logic benchmarks, while holding math, code, and instruction-following (IF) performance steady.
SLR-Bench for RLVR
We extend the Olmo-3 RLVR mix (Dolci-Think-RL-7B) with inductive logic programming tasks from Scalable Logical Reasoning (SLR), a framework that synthesizes logical reasoning tasks automatically. Each task embeds a reasoning problem at a controlled complexity level and ships with an automatic Prolog verifier. SLR-Bench, the static benchmark, organizes 19k tasks into a 20-level curriculum from Basic to Hard, scaling rule length, vocabulary, problem size, and several other factors. Crucially, a Prolog backend executes any proposed rule on the task to check whether it solves it. This same execution feedback becomes the RLVR reward.
SLR Tasks. SLR follows the classical ILP paradigm. The model is given background knowledge B describing a set of trains via car properties (e.g., colors, lengths, wall types), plus labeled examples: eastbound(t) (positive, E+) and westbound(t) (negative, E-). The objective is to induce a hypothesis H: a minimal logic rule explaining the labeling by abstracting relational patterns over car properties. The rule must cover every positive sample (completeness) and no negative sample (consistency). A task looks like this:
You are a train classifier observing trains traveling either east- or westbound. Each train is
composed of one or more cars, and each car is characterized by a set of properties, represented
as ground atoms over a fixed set of predicates. The direction (eastbound or westbound) of a
train is to be determined from its composition.
To describe the trains we define a set of predicates and grounding domains:
'has_car(Train, Car)': Car is part of train Train.
'car_num(Car, CarNumber)': position of the car within its train (positive integer).
'car_color(Car, Color)': red, blue, green, yellow, or white.
'car_len(Car, Length)': short or long.
'has_wall(Car, WallType)': full or railing.
You are provided with positive and negative examples in the form of eastbound(t) or westbound(t)
for each train t, together with background knowledge consisting of ground facts over the above
predicates that describe its composition.
eastbound(train0). westbound(train1).
has_car(train0, car0_1). has_car(train1, car1_1).
car_num(car0_1, 1). car_num(car1_1, 1).
car_color(car0_1, red). car_color(car1_1, blue).
car_len(car0_1, long). car_len(car1_1, short).
has_wall(car0_1, railing). has_wall(car1_1, railing).
Your task is to formulate a hypothesis, i.e. a Prolog rule of the form 'eastbound(Train) :- Body.'
that correctly distinguishes eastbound from westbound trains. The hypothesis must be true for all
positive examples and false for all negative examples. Aim for the shortest correct rule: one
that uses the fewest possible body literals subject to the prior constraints. The rule must use
only predicates defined above and must perfectly separate eastbound from westbound trains.
A valid hypothesis: "A train is eastbound if it carries a red car." Expressed as Prolog (eastbound(T) :- has_car(T, C), car_color(C, red).), it can be executed directly on the problem to check completeness and consistency. Here it correctly labels train0 as eastbound and does not label train1, the westbound train, as eastbound.
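As a minimal sketch of what this check amounts to, the snippet below mirrors the toy task in plain Python. It is an illustration only: the actual verifier executes Prolog, and the Python names used here (`has_car`, `car_color`, `eastbound`, `complete`, `consistent`) are stand-ins chosen for this example.

```python
# Toy re-implementation of the completeness/consistency check.
# Assumption: this only mimics the Prolog rule
#   eastbound(T) :- has_car(T, C), car_color(C, red).
# on the two-train example; the real pipeline runs a Prolog interpreter.

# Background knowledge B, restricted to the facts the rule needs.
has_car = {("train0", "car0_1"), ("train1", "car1_1")}
car_color = {("car0_1", "red"), ("car1_1", "blue")}

positives = ["train0"]  # eastbound examples (E+)
negatives = ["train1"]  # westbound examples (E-)


def eastbound(train: str) -> bool:
    """True if the train has at least one red car."""
    return any(
        (car, "red") in car_color
        for owner, car in has_car
        if owner == train
    )


# Completeness: every positive example is covered.
complete = all(eastbound(t) for t in positives)
# Consistency: no negative example is covered.
consistent = not any(eastbound(t) for t in negatives)

print(f"complete={complete}, consistent={consistent}")
```

A rule that is both complete and consistent solves the task; anything else earns at most partial credit.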
Training Recipe
We start from allenai/Olmo-3-7B-Think-DPO and run GRPO via a slurm-adapted open-instruct setup.
Data mix. allenai/Dolci-Think-RL-7B (the existing Olmo-3 RLVR mix) and AIML-TUDA/SLR-Bench:v1-All at 1:1 dataset weights. SLR's background theories make its prompts comparatively long, so we raise the prompt cap to 5,000 tokens, up from Olmo 3's published 2,048, admitting tasks more than twice that length. Rewards are routed by source: SLR uses the Prolog symbolic verifier (isomorphic variant, see Reward design); Dolci tasks keep their original code, math, and judge verifiers, with Qwen/Qwen3-32B as the LLM judge. In the resulting mix used for OlmoLogic, SLR contributes 8.4% of the prompts. Full dataset breakdown:
| Source | Prompts | Share |
|---|---|---|
| IF Multi-Constraint | 29,813 | 26.8% |
| OMEGA Math | 15,000 | 13.5% |
| AceCoder | 10,107 | 9.1% |
| SLR-Bench | 9,402 | 8.4% |
| Tulu 3 Rewritten | 7,109 | 6.4% |
| Multi-Subject RLVR | 7,106 | 6.4% |
| AceReason-Math | 6,598 | 5.9% |
| WildChat English | 6,421 | 5.8% |
| KlearReasoner Code | 6,272 | 5.6% |
| SYNTHETIC-2 / PrimeIntellect | 3,000 | 2.7% |
| MathSub-30K | 2,999 | 2.7% |
| ORZ Math | 2,999 | 2.7% |
| DAPO-Math | 2,584 | 2.3% |
| Llama-Nemotron Post-Training | 2,006 | 1.8% |
| Total | 111,416 | 100% |
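The shares in the table follow directly from the prompt counts. The short sketch below recomputes them; the dictionary simply copies the table, and the variable names are chosen for this example.

```python
# Recompute the per-source shares from the prompt counts in the table above.
prompts = {
    "IF Multi-Constraint": 29_813,
    "OMEGA Math": 15_000,
    "AceCoder": 10_107,
    "SLR-Bench": 9_402,
    "Tulu 3 Rewritten": 7_109,
    "Multi-Subject RLVR": 7_106,
    "AceReason-Math": 6_598,
    "WildChat English": 6_421,
    "KlearReasoner Code": 6_272,
    "SYNTHETIC-2 / PrimeIntellect": 3_000,
    "MathSub-30K": 2_999,
    "ORZ Math": 2_999,
    "DAPO-Math": 2_584,
    "Llama-Nemotron Post-Training": 2_006,
}

total = sum(prompts.values())
# Share of each source in percent, rounded to one decimal as in the table.
shares = {name: round(100 * n / total, 1) for name, n in prompts.items()}

for name, share in shares.items():
    print(f"{name:32s} {prompts[name]:>7,d} {share:5.1f}%")
print(f"{'Total':32s} {total:>7,d}")
```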
Optimization. We follow the Olmo-3 GRPO defaults: β = 0 (no KL anchor), constant LR 1e-6, advantage normalization centered on the prompt mean, truncated importance-sampling ratio cap 2.0, clip-higher 0.272, vLLM temperature 1.0. The global batch size is 64 prompts × 8 rollouts = 512. Max prompt length is 5k tokens and max response length 25k, packed to 35.8k. We train for 3,350 steps (~2 epochs). While Olmo 3 trained for only 1,500 steps, we observe that our rewards keep climbing well past the one-epoch mark and only level off in the second epoch.
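The batch and epoch arithmetic, together with the advantage centering, can be sketched as follows. Two assumptions are made here and are not confirmed by the text: every step consumes exactly 64 fresh prompts from the 111,416-prompt mix, and "centered on the prompt mean" means subtracting the per-prompt group mean without dividing by the standard deviation. The example rewards are invented for illustration.

```python
# Budget arithmetic for the run (assumes 64 fresh prompts per step).
PROMPTS_PER_STEP = 64
ROLLOUTS_PER_PROMPT = 8
MIX_SIZE = 111_416      # total prompts in the data mix
TOTAL_STEPS = 3_350

rollouts_per_step = PROMPTS_PER_STEP * ROLLOUTS_PER_PROMPT
steps_per_epoch = MIX_SIZE / PROMPTS_PER_STEP
epochs = TOTAL_STEPS / steps_per_epoch

print(f"rollouts per step: {rollouts_per_step}")
print(f"steps per epoch:   {steps_per_epoch:.0f}")
print(f"epochs trained:    {epochs:.2f}")


def centered_advantages(group_rewards):
    """Subtract the group mean from each rollout reward.

    Assumption: mean-centering only, no division by the group std.
    """
    mean = sum(group_rewards) / len(group_rewards)
    return [r - mean for r in group_rewards]


# Hypothetical rewards for the 8 rollouts of a single prompt.
example_group = [0.0, 0.0, 0.0, 9.5, 0.0, 2.0, 0.0, 0.5]
advantages = centered_advantages(example_group)
print(advantages)
```

Under these assumptions one pass over the mix takes roughly 1,741 steps, so 3,350 steps correspond to a little under two epochs.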
Throughput. 7× H100 nodes (8 GPUs each). One judge node (Qwen3-32B vLLM + Prolog verifier API), one Ray head running the trainer (DeepSpeed ZeRO-3, 8 learners), and five worker nodes hosting 40 vLLM inference engines. Async rollouts with 8-step lookahead and inflight weight updates.
Reward design
Rewards come from Prolog execution: we run the proposed hypothesis on the validation program of the task. We use the isomorphic variant (from LLMs Gaming Verifiers) to block reward hacking. Execution returns completeness and consistency, giving overall rule-classification accuracy (TP+TN over all samples). Every SLR task is binary classification, so a universally satisfied rule trivially hits 50%, as does an unsatisfiable one. Anything below 50% is information-negative and maps to a reward of 0.
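To make the 50% baseline concrete, the sketch below computes rule-classification accuracy as (TP+TN) over all samples for two trivial rules on a balanced toy task. The function and variable names are chosen for this example and are not taken from the verifier.

```python
def rule_accuracy(predicted_eastbound, is_eastbound):
    """Fraction of trains classified correctly: (TP + TN) / N."""
    pairs = list(zip(predicted_eastbound, is_eastbound))
    tp = sum(1 for pred, gold in pairs if pred and gold)
    tn = sum(1 for pred, gold in pairs if not pred and not gold)
    return (tp + tn) / len(pairs)


# Balanced toy task: two eastbound and two westbound trains.
labels = [True, True, False, False]

always_true = [True] * len(labels)    # rule satisfied by every train
always_false = [False] * len(labels)  # unsatisfiable rule
perfect = list(labels)                # complete and consistent rule

print(rule_accuracy(always_true, labels))
print(rule_accuracy(always_false, labels))
print(rule_accuracy(perfect, labels))
```

Both trivial rules land at exactly 0.5 on a balanced task, which is why accuracy below that threshold carries no useful signal.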
Reward shape. The SLR reward combines four ingredients: the rule-classification accuracy p, a simplicity bonus, a gate, and an exponent.
The shape does three things. First, the fourth-power compression on partial credit (p^4) means 50% accuracy maps to ~0.016, 80% to 0.26, and 95% to 0.74: the model is rewarded for getting almost everything right, not for getting most things right. Second, the hard gate at p = 0.5 zeros out anything worse than a coin flip on a balanced binary task. Third, the strict separation between partial (capped at 9.0) and fully correct (starting at 9.5) rewards ensures that within a GRPO group of 8 rollouts, a fully correct rollout's advantage is never undercut by a slightly simpler but only partially correct one. The simplicity bonus enters multiplicatively, nudging the model towards simpler rules. Syntax validity is tracked as a metric but contributes no positive reward, since any rule that executes and returns p > 0 is syntactically valid by definition.
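The gate and the power compression can be sketched in a few lines. This is a hedged illustration, not the released reward function: `gated_power` and its parameters are names chosen here, the simplicity bonus and the 9.0/9.5 scaling are left out, and the exponent `k` is kept as a parameter. Under a pure power law p^k, the quoted values (0.016, 0.26, 0.74) correspond to k = 6, whereas k = 4 would give 0.0625, 0.41, and 0.81, so the full formula may contain additional factors not captured in this sketch.

```python
def gated_power(p: float, k: int, gate: float = 0.5) -> float:
    """Return p**k, or 0.0 when accuracy p falls below the gate.

    Assumption: a plain power law with a hard gate; the simplicity
    bonus and the partial/full reward scaling are not modelled.
    """
    if p < gate:
        return 0.0
    return p ** k


accuracies = (0.4, 0.5, 0.8, 0.95)
for k in (4, 6):
    values = [gated_power(p, k) for p in accuracies]
    print(f"k={k}: " + ", ".join(f"{v:.3f}" for v in values))
```

Either exponent produces the intended effect: credit stays small until accuracy is close to 1, and anything below the gate is zeroed out.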
Dolci tasks keep their original verifiers: code-execution reward for code tasks (code_pass_rate_reward_threshold=0.99), the verifiable math and reasoning verifiers, and Qwen/Qwen3-32B as the LLM judge where one is needed. SLR and Dolci verifiers run side by side, routed by source field.
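Routing by source field can be pictured as a simple dispatch table. The sketch below is hypothetical throughout: `make_router`, the source keys, the sample fields, and the placeholder verifiers are invented for illustration and do not reflect the open-instruct API or the real verifiers.

```python
from typing import Callable, Dict

# A verifier maps one sample (prompt metadata plus model response) to a reward.
Verifier = Callable[[dict], float]


def make_router(verifiers: Dict[str, Verifier]) -> Verifier:
    """Build a reward function that dispatches on sample['source']."""
    def route(sample: dict) -> float:
        source = sample["source"]
        if source not in verifiers:
            raise KeyError(f"no verifier registered for source {source!r}")
        return verifiers[source](sample)
    return route


# Placeholder verifiers (hypothetical): exact-match stand-ins for the
# Prolog, math, and code checkers, with made-up reward values.
def fake_prolog_verifier(sample: dict) -> float:
    return 9.5 if sample["response"] == sample["reference"] else 0.0


def fake_math_verifier(sample: dict) -> float:
    return 1.0 if sample["response"] == sample["reference"] else 0.0


router = make_router({
    "slr": fake_prolog_verifier,
    "math": fake_math_verifier,
})

print(router({"source": "slr", "response": "r", "reference": "r"}))
print(router({"source": "math", "response": "41", "reference": "42"}))
```

Keeping the dispatch in one place means adding a dataset amounts to registering one more verifier, without touching the training loop.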
Training dynamics
Curves below come from W&B run helff/Reward-Shortcut/RewardHacking-Olmo3-SLR-IsoRL (3,350 steps). Raw per-step values are in light gray; the blue line is a 30-step rolling mean.
The top row tracks what we care about. Prompts solved per batch climbs from ~2/64 to ~15/64. Overall verifiable reward climbs from 4.5 to nearly 6.5. SLR-Bench reward nearly reaches 8. The other heads (math, code, IF-eval, general-quality) stay stable; none of the Dolci reward signals collapse as SLR enters the mix.
Two stretches of turbulence stand out: around step 900 and after step 2,500. The latter followed a forced termination: the run resumed at step 2,500, after which the logprob drift reappeared. Both stretches show divergence between trainer and inference logprobs, and step 900 brings a sharp blow-up in response length accompanied by degenerate repetitions. Reward signals dip alongside. We chose not to restart, as restarts at this scale are expensive and the run was still trending in the right direction, though a targeted restart might have cleared the logprob drift more cleanly. As it stands, after several hundred steps, rewards are back on track and continue climbing, and the logprob gap normalizes.
Going two epochs. The published Olmo-3 recipe runs one epoch (~1,500 steps); we train for two (3,350 steps). Two reasons. First, a non-trivial chunk of epoch 1 was effectively spent recovering from the logprob fluctuations above rather than learning new behavior, so the effective learning budget was smaller than the step count suggested. Second, rewards only start to converge in the second epoch — the SLR head and overall verifiable reward both keep climbing well past the one-epoch mark and only level off later. The post-recovery trajectory continues to improve, so we keep the full 3,350-step run as the released checkpoint.
Introducing Olmo 3.1 7B Think
Going to two epochs raised two questions. Can we further push Olmo-3-Think by continuing RLVR on the original mix? And does the extra compute alone explain OlmoLogic's gains, or is SLR doing the work? To answer both, we trained Olmo 3.1 7B Think, which extends the official one-epoch Olmo 3 7B Think with ~1 additional epoch (1,850 RLVR steps) using the original Olmo-3 RLVR mix without SLR. We reuse the default Olmo-3 RLVR settings. This gives Olmo 3.1 7B Think the same total step budget as OlmoLogic (3,350 steps), making it a clean compute-matched control for the SLR ablation.
As a standalone release, Olmo 3.1 7B Think improves upon Olmo 3 7B Think on instruction following (+6.6), safety (+3.8), and several individual reasoning benchmarks. The main regression is on Chat (−10.5), a known cost of extended RLVR on verifiable-reward tasks; we observe similar behavior on OlmoLogic (−7.6). We hypothesize that increasing the proportion of LLM-judge-style rewards in the mix might help regularize this. Smaller shifts on code (−1.6) and knowledge (−0.5) are within noise. For groups building on Olmo-3-Think that care more about reasoning and instruction following than open-ended chat, it's a drop-in upgrade with a fully open recipe.
As a control, it lets us cleanly attribute the SLR-specific gains. At the same step budget, the SLR mixture helps to teach logical reasoning. Olmo 3.1 7B Think barely moves on SLR-Bench (+0.6) and shows no meaningful movement on the held-out logic suite (LogiGLUE, KOR-Bench, bAbI 16, CLUTRR, FOLIO, ProntoQA, RuleBERT, abductive reasoning). OlmoLogic, with the same step budget but SLR in the mix, reaches 45.1 on SLR-Bench (+30.0) and gains broadly across the held-out logic benchmarks (+5.4 on average).
| Benchmark | Olmo-3-7B-Think | Olmo 3.1 7B Think | OlmoLogic 7B Think |
|---|---|---|---|
| SLR-Bench | 15.1 | 15.7 (+0.6) | 45.1 (+30.0) |
| Math (avg) | 71.1 | 70.5 (−0.5) | 73.0 (+1.9) |
| Reasoning (avg) | 75.8 | 76.7 (+0.9) | 76.6 (+0.8) |
| Logic (avg) | 59.1 | 59.1 (+0.0) | 64.4 (+5.4) |
| Coding (avg) | 76.6 | 75.0 (−1.6) | 74.8 (−1.8) |
| IF (avg) | 64.9 | 71.5 (+6.6) | 66.6 (+1.7) |
| Knowledge (avg) | 49.2 | 48.7 (−0.5) | 49.5 (+0.3) |
| Chat (avg) | 52.1 | 41.6 (−10.5) | 44.5 (−7.6) |
| Safety (avg) | 70.7 | 74.5 (+3.8) | 74.0 (+3.3) |
Evaluation
All numbers come from a single, reproducible pipeline. We use the OLMES suite with its default configuration and hyperparameters, extending it with several reasoning benchmarks not previously covered (KOR-Bench, LogiGLUE, LogiQA, and others). We reuse the scores from the Olmo-3 paper when our setup matches theirs. For OMEGA and BBH our reruns diverged from the paper, so we report the OLMES default for those benchmarks across all models in the table.
Takeaways
- SLR teaches broad logical reasoning capabilities via RLVR. One dataset added, one verifier wired in. No training-stack changes.
- SLR induces strong reasoning transfer. OlmoLogic clearly surpasses the base Olmo model on reasoning benchmarks, including ones it was never trained on.
- Logic-program execution is an efficient and faithful gold-standard oracle. No judge model, no learned reward, no proxy.
- Two checkpoints released. OlmoLogic 7B Think for logical reasoning; Olmo 3.1 7B Think as a stronger Olmo-3-Think base for downstream usage.
Citation
This work is based on the following two papers. If you build on it, please cite them.
For SLR-Bench, please cite:
@inproceedings{helff2025slr,
title = {{SLR: Automated Synthesis for Scalable Logical Reasoning}},
author = {Helff, Lukas and Omar, Ahmad and Friedrich, Felix and W{\"u}st, Antonia
and Shindo, Hikaru and Woydt, Tim and Mitchell, Rupert
and Schramowski, Patrick and Stammer, Wolfgang and Kersting, Kristian},
booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)},
year = {2026},
url = {https://openreview.net/forum?id=omMnuTTEn7}
}
For the Reward Hacking paper, please cite:
@inproceedings{helff2026llms,
title = {{LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking}},
author = {Lukas Helff and Quentin Delfosse and David Steinmann and Ruben H{\"a}rle
and Hikaru Shindo and Patrick Schramowski and Wolfgang Stammer
and Kristian Kersting and Felix Friedrich},
booktitle = {ICLR 2026 Workshop on Logical Reasoning of Large Language Models},
year = {2026},
url = {https://openreview.net/forum?id=4B3WfRNqe3}
}
Acknowledgments
We would like to acknowledge the contributions of all authors of this work: Lukas Helff, Felix Friedrich, Sebastian Sztwiertnia, Antonia Wüst, David Steinmann, Hikaru Shindo, Ahmad Omar, Quentin Delfosse, Ruben Härle, Rupert Mitchell, Tim Woydt, Wolfgang Stammer, Patrick Schramowski, and Kristian Kersting.
We acknowledge support from the DFKI and hessian.AI Innovation Lab (funded by the Federal Ministry of Research, Technology and Space, BMFTR, grant no. 16IS22091), the hessian.AISC Service Center (funded by the Federal Ministry of Education and Research, BMBF, grant no. 01IS22091), and the Center for European Research in Trusted AI (CERTAIN). Further, this work benefited from the ICT-48 Network of AI Research Excellence Centers "TAILOR" (EU Horizon 2020, GA No 952215), the Hessian research priority program LOEWE within the project "WhiteBox", the HMWK cluster projects "Adaptive Mind" and "Third Wave of AI", and from the NHR4CES. This work has also benefited from the BMWK project "Sovereign Open Source Foundational Models for European Intelligence (SOOFI)," 13IPC040G, and from early stages of the Cluster of Excellence "Reasonable AI" funded by the German Research Foundation (DFG) under Germany's Excellence Strategy — EXC-3057; funding will begin in 2026. This work was supported by the Priority Program (SPP) 2422 in the subproject "Optimization of active surface design of high-speed progressive tools using machine and deep learning algorithms" funded by the DFG. Further, this work was funded by the AlephAlpha Collaboration Lab 1141. This work was supported in part by OpenAI Research Credits.




