SkeptiSTEM-4B-v2 Stage CD (Chat + DPO LoRA)

This is the Stage CD LoRA combining chat restoration (Stage C) and preference tuning (Stage D).

Purpose

Restores normal conversational ability after GRPO training while maintaining:

Verification skills for suggested answers
Structured reasoning when appropriate
Helpful, accurate responses

Training Details

Stage C: Chat SFT

Dataset: ultrachat_200k (~15,000 examples)
Purpose: Restore conversational ability
Epochs: 1

Stage D: DPO

Dataset: ultrafeedback_binarized (~59,916 preference pairs)
Purpose: Preference alignment
Beta: 0.1

Expected Load Order

Base: HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit
Merge/apply R2: HallD/SkeptiSTEM-4B-v2-stageR2-format-lora
Merge/apply R3: HallD/SkeptiSTEM-4B-v2-stageR3-grpo-lora
Apply this CD adapter

Usage

from unsloth import FastLanguageModel
from peft import PeftModel

# Load base
base, tok = FastLanguageModel.from_pretrained(
    "HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Merge R2 + R3
base = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageR2-format-lora")
base = base.merge_and_unload()
base = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageR3-grpo-lora")
base = base.merge_and_unload()

# Apply CD
model = PeftModel.from_pretrained(base, "HallD/SkeptiSTEM-4B-v2-stageCD-chat-dpo-lora")

FastLanguageModel.for_inference(model)

Behavior

The model now:

Responds conversationally by default (no format tags unless asked)
Still verifies suggestions when present in prompts
Provides helpful, accurate, preference-aligned responses

Trained with Unsloth.

Downloads last month: 3

Model tree for HallD/SkeptiSTEM-4B-v2-stageCD-chat-dpo-lora

Base model

Qwen/Qwen3-4B-Base

Finetuned

unsloth/Qwen3-4B-Base

Finetuned

HallD/SkeptiSTEM-4B-v2-stageR1-merged-16bit

Adapter

(3)

this model