RL Training with OpenEnv: 2048 Game

This tutorial covers training a language model to play the 2048 game using reinforcement learning with GRPO (Group Relative Policy Optimization).

Time: ~45 minutes | Difficulty: Advanced | GPU Required: Yes (T4 or better)

What You’ll Learn

Model Setup: Load and configure LLMs with Unsloth for efficient RL
Environment Connection: Connect to the 2048 OpenEnv environment
Reward Design: Create effective reward functions
GRPO Training: Train models with reinforcement learning
Deployment: Save and deploy trained models

Prerequisites

Before starting this tutorial, you should have completed the Getting Started series to understand:

How OpenEnv environments work
The reset/step/state API pattern
How to connect to environments

You’ll also need:

A GPU (free T4 on Google Colab works)
Basic understanding of PyTorch
~30 minutes for training

Part 1: Environment Setup

Installation

# Install required packages
!pip install -q unsloth openenv trl

# For Google Colab, also run:
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Imports

import torch
from dataclasses import dataclass
from typing import List, Optional, Dict, Any
import random

# Check GPU availability
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Part 2: Model Configuration

We use Unsloth for memory-efficient training with LoRA adapters.

Configuration Classes

@dataclass
class ModelConfig:
    """Configuration for loading LLM models."""
    model_name: str = "unsloth/Qwen2.5-1.5B"
    max_seq_length: int = 768
    load_in_4bit: bool = True
    dtype: Optional[str] = None  # Auto-detect


@dataclass
class LoRAConfig:
    """Configuration for LoRA fine-tuning."""
    r: int = 16
    lora_alpha: int = 32
    target_modules: Optional[List[str]] = None  # Filled in by __post_init__
    lora_dropout: float = 0.0

    def __post_init__(self):
        if self.target_modules is None:
            self.target_modules = [
                "q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj",
            ]

Loading the Model

from unsloth import FastLanguageModel

# Create configurations
model_config = ModelConfig()
lora_config = LoRAConfig()

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_config.model_name,
    max_seq_length=model_config.max_seq_length,
    load_in_4bit=model_config.load_in_4bit,
    dtype=model_config.dtype,
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_config.r,
    target_modules=lora_config.target_modules,
    lora_alpha=lora_config.lora_alpha,
    lora_dropout=lora_config.lora_dropout,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Check parameter counts
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.2f}%)")
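
The trainable count printed above can be sanity-checked by hand. A rank-r LoRA adapter on a linear layer with d_in inputs and d_out outputs adds two small matrices (r x d_in and d_out x r), so it contributes r * (d_in + d_out) parameters. The sketch below is a minimal illustration; `lora_param_count` is a hypothetical helper, and the layer shapes in the usage lines are assumed for illustration only, not read from the model. Take real shapes from `model.config`.

```python
from typing import Iterable, Tuple

def lora_param_count(shapes: Iterable[Tuple[int, int]], r: int) -> int:
    """Parameters added by rank-r LoRA adapters on a set of linear layers.

    Each (d_in, d_out) layer gains A (r x d_in) and B (d_out x r),
    i.e. r * (d_in + d_out) extra trainable parameters.
    """
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Hypothetical shapes for three adapted layers (assumed, not taken from the model)
example_block = [(1024, 1024), (1024, 4096), (4096, 1024)]
print(lora_param_count(example_block, r=16))  # 196608
```

Multiplying the per-block figure by the number of transformer blocks should land close to the "Trainable" number, since `bias="none"` adds no bias parameters.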

Part 3: The 2048 Environment

Game Overview

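2048 is a sliding-tile puzzle played on a 4x4 grid. Each move slides every tile in one direction, and two equal tiles that collide merge into a single tile with their sum. The aim is to keep merging until a 2048 tile appears.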

Actions:

0 = UP
1 = RIGHT
2 = DOWN
3 = LEFT

Goal: Create a tile with value 2048 (or higher!)
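
To make the merge rule concrete, here is a small local illustration of what a LEFT move does to one row under the standard 2048 rules. It is a sketch for intuition only: `merge_row_left` is a hypothetical helper, and the real game logic lives inside the environment, which this tutorial does not show.

```python
from typing import List

def merge_row_left(row: List[int]) -> List[int]:
    """Slide one row to the left, merging each pair of equal neighbours once."""
    tiles = [v for v in row if v != 0]  # drop empty cells
    merged: List[int] = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # two equal tiles become one
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))  # pad back to full width

print(merge_row_left([2, 2, 4, 0]))  # [4, 4, 0, 0]
print(merge_row_left([2, 2, 2, 2]))  # [4, 4, 0, 0]
```

Note that a freshly merged tile does not merge again in the same move, which is why `[2, 2, 2, 2]` becomes `[4, 4, 0, 0]` rather than `[8, 0, 0, 0]`.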

Connecting to the Environment

from envs.openspiel_env import OpenSpielEnv, OpenSpielAction

# Connect to 2048 environment
# Option 1: From Hub
env = OpenSpielEnv.from_hub("openenv/openspiel-env")

# Option 2: From running server
# env = OpenSpielEnv(base_url="http://localhost:8000")

# Test connection
with env:
    result = env.reset()
    print("Game started!")
    print(f"Legal actions: {result.observation.legal_actions}")

    # Take a test action
    action = OpenSpielAction(action_id=0, game_name="2048")
    result = env.step(action)
    print(f"After UP: reward={result.reward}, done={result.done}")
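
Before training anything, it can help to know how a random policy scores, so later results have a baseline. The sketch below assumes only what the connection test uses: `reset()` and `step()` return a result with `.done`, `.reward`, and `.observation.legal_actions`. The function name `random_episode` and the `make_action` callback are hypothetical.

```python
import random
from typing import Any, Callable, Optional

def random_episode(
    env: Any,
    make_action: Callable[[int], Any],
    max_steps: int = 1000,
    seed: Optional[int] = None,
) -> float:
    """Play one episode with uniformly random legal moves; return total reward."""
    rng = random.Random(seed)
    result = env.reset()
    total = 0.0
    for _ in range(max_steps):
        legal = list(result.observation.legal_actions)
        if result.done or not legal:
            break
        result = env.step(make_action(rng.choice(legal)))
        total += result.reward or 0.0  # treat a missing reward as zero
    return total

# Usage with the environment from this section (not executed here):
# score = random_episode(env, lambda a: OpenSpielAction(action_id=a, game_name="2048"))
```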

Board Utilities

import numpy as np
from typing import List

def info_state_to_board(info_state: List[int], size: int = 4) -> List[List[int]]:
    """Convert flat info_state to 2D board."""
    return np.array(info_state, dtype=int).reshape(size, size).tolist()

def render_board(board: List[List[int]]) -> str:
    """Render board as ASCII string."""
    lines = ["+------" * len(board[0]) + "+"]
    for row in board:
        cells = [f"{v:5d}" if v > 0 else "    ." for v in row]
        lines.append("|" + " |".join(cells) + " |")
        lines.append("+------" * len(row) + "+")
    return "\n".join(lines)

def get_max_tile(board: List[List[int]]) -> int:
    """Get highest tile value."""
    return max(cell for row in board for cell in row)
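
`info_state_to_board` assumes the observation is a flat, row-major list of exactly `size * size` values. If the environment ever returns a different layout, the NumPy reshape fails with a fairly opaque error. A defensive pure-Python variant, shown here as a sketch (`to_board_checked` is a hypothetical name), makes that assumption explicit:

```python
from typing import List, Sequence

def to_board_checked(flat: Sequence[float], size: int = 4) -> List[List[int]]:
    """Row-major reshape of a flat observation into a size x size board."""
    if len(flat) != size * size:
        raise ValueError(f"expected {size * size} values, got {len(flat)}")
    return [[int(v) for v in flat[r * size:(r + 1) * size]] for r in range(size)]

flat = [2, 0, 0, 0,  0, 4, 0, 0,  0, 0, 8, 0,  0, 0, 0, 16]
board = to_board_checked(flat)
print(board[1])  # [0, 4, 0, 0]
```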

Part 4: Reward Function Design

The reward function defines what the policy is actually optimized for, so it should reflect every outcome we care about. We consider:

Success: Did we reach 2048?
Progress: What’s the highest tile achieved?
Code Quality: Did the generated code execute correctly?

Reward Implementation

import math

def calculate_reward(
    max_tile: int,
    success: bool,
    code_error: bool = False
) -> float:
    """
    Calculate reward for a 2048 game outcome.

    Args:
        max_tile: Highest tile achieved (2, 4, 8, ..., 2048)
        success: Whether we reached 2048
        code_error: Whether generated code had errors

    Returns:
        Float reward value
    """
    if code_error:
        return -0.5  # Penalty for invalid code

    if success:
        return 1.0  # Full reward for winning

    # Progress reward: log scale from 0 to 0.9
    if max_tile > 0:
        progress = math.log2(max_tile) / math.log2(2048)
        return min(0.9, progress)

    return 0.0

# Test reward function
test_cases = [
    (2048, True, False, "Won!"),
    (1024, False, False, "Got to 1024"),
    (512, False, False, "Got to 512"),
    (64, False, False, "Early game"),
]

for max_tile, success, error, desc in test_cases:
    reward = calculate_reward(max_tile, success, error)
    print(f"{desc:20s} -> Reward: {reward:+.3f}")
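
The progress term uses a log scale because tile values double with each merge. A short comparison, written as a standalone sketch with hypothetical helper names, shows the difference from a linear scale:

```python
import math

def log_progress(max_tile: int, target: int = 2048) -> float:
    """Progress on a log2 scale: every doubling adds the same increment."""
    return math.log2(max_tile) / math.log2(target)

def linear_progress(max_tile: int, target: int = 2048) -> float:
    """Progress proportional to the raw tile value."""
    return max_tile / target

for tile in (64, 512, 1024):
    print(f"{tile:5d}  log={log_progress(tile):.2f}  linear={linear_progress(tile):.2f}")
```

On the log scale a 64 tile already earns 6/11 of the progress signal, while the linear scale gives it 64/2048, which is close to zero. For 1024 the log value is 10/11, slightly above 0.9, so the `min(0.9, progress)` cap in `calculate_reward` is what keeps a near-win below the reward for actually winning.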

Part 5: Strategy Generation

We’ll train the model to generate Python strategy functions.

Prompt Template

SYSTEM_PROMPT = """You are an expert at playing 2048. Generate a Python function
that takes a board state and returns the best action (0=UP, 1=RIGHT, 2=DOWN, 3=LEFT).

The board is a 4x4 list of integers. Empty cells are 0.
Your function should analyze the board and return an optimal move.
"""

def create_prompt(board: List[List[int]]) -> str:
    """Create prompt for strategy generation."""
    board_str = "\n".join(str(row) for row in board)
    return f"""{SYSTEM_PROMPT}

Current board:
{board_str}

Generate a strategy function:
```python
def strategy(board):
    # Your code here
    return action  # 0, 1, 2, or 3
```"""

Executing Generated Strategies

import ast
from typing import Callable

def extract_and_execute_strategy(
    generated_code: str,
    board: List[List[int]],
    timeout: float = 5.0
) -> tuple[int, bool]:
    """
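    Extract and execute a generated strategy function.

    Note: `timeout` is not enforced in this function, and plain exec() is
    not a sandbox. See "Preventing Reward Hacking" for execution limits.

    Returns:
        (action, success): The action to take and whether execution succeeded
    """
    try:
        # Extract the first fenced Python block, if there is one
        if "```python" in generated_code:
            code = generated_code.split("```python")[1].split("```")[0]
        else:
            code = generated_code

        # Parse first so syntax errors surface before anything runs
        tree = ast.parse(code)

        # Run the definitions in a fresh namespace (not isolated from the host process)
        namespace = {"board": board}
        exec(compile(tree, "<strategy>", "exec"), namespace)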

        # Call the strategy function
        if "strategy" in namespace:
            action = namespace["strategy"](board)
            if action in [0, 1, 2, 3]:
                return action, True

        return 0, False  # Default action on failure

    except Exception as e:
        print(f"Strategy execution error: {e}")
        return 0, False
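
`ast.parse` only checks that the code is syntactically valid. If you also want to reject obviously unwanted constructs before running anything, one option is to walk the tree and apply a few conservative checks. The sketch below is an assumption about what such a check could look like, not part of OpenEnv or this tutorial's pipeline. The forbidden lists are illustrative and far from complete, so treat this as a filter, not a security boundary.

```python
import ast

# Illustrative deny-lists (assumed for this sketch; extend as needed)
FORBIDDEN_NODES = (ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal)
FORBIDDEN_NAMES = {"exec", "eval", "open", "compile", "__import__", "input"}

def validate_strategy_ast(code: str) -> bool:
    """Return True if `code` parses, defines `strategy`, and passes basic checks."""
    try:
        tree = ast.parse(code)
    except (SyntaxError, ValueError):
        return False

    has_strategy = False
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN_NODES):
            return False  # no imports or scope tricks
        if isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            return False  # no dynamic execution or file access by name
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False  # no dunder attribute access
        if isinstance(node, ast.FunctionDef) and node.name == "strategy":
            has_strategy = True
    return has_strategy
```

Calling this before `exec` lets you assign the code-error penalty without ever running the rejected code.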

Part 6: GRPO Training

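GRPO (Group Relative Policy Optimization) samples a group of completions for each prompt and scores each one relative to the rest of its group, so it does not need a separately trained value model. That keeps memory use lower than PPO-style training, which matters when fine-tuning a language model on a single GPU.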
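
The "group relative" part of the name refers to how advantages are computed: several completions are sampled for the same prompt, and each completion's reward is compared with the mean of its group. The sketch below illustrates that normalization in plain Python. The function name and the `eps` value are assumptions, and TRL's internal implementation may differ in details such as the standard-deviation estimator.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for the same board, scored with calculate_reward-style values
advantages = group_relative_advantages([0.55, 0.82, -0.5, 0.55])
print([round(a, 2) for a in advantages])
```

Completions that beat their group average get a positive advantage and are reinforced; those below it are discouraged. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason to design rewards that distinguish between outcomes.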

Training Configuration

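from trl import GRPOConfig, GRPOTrainer
from unsloth import is_bfloat16_supported

grpo_config = GRPOConfig(
    # Learning rate
    learning_rate=2e-6,

    # Batch sizes
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,

    # Training duration
    max_steps=200,

    # Memory optimization: use bf16 where the GPU supports it, fp16 otherwise
    # (a T4 has no native bf16 support)
    bf16=is_bfloat16_supported(),
    fp16=not is_bfloat16_supported(),
    gradient_checkpointing=True,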

    # Logging
    logging_steps=1,
    output_dir="./2048_grpo_output",
    report_to="none",
)

Training Loop

def train_2048_agent(
    model,
    tokenizer,
    env,
    config: GRPOConfig,
    num_episodes: int = 100,
):
    """
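    Roll out 2048 episodes with the current model and collect rewards.

    Note: this loop only gathers (prompt, response, reward) records. It does
    not run a GRPO update itself, and `config` is not used inside it.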
    """
    # Prepare model for training
    FastLanguageModel.for_training(model)

    training_data = []

    for episode in range(num_episodes):
        # Reset environment
        result = env.reset()
        board = info_state_to_board(result.observation.info_state)

        episode_reward = 0
        steps = 0

        while not result.done and steps < 1000:
            # Generate strategy
            prompt = create_prompt(board)
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.7,
                do_sample=True,
            )

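            # Decode only the newly generated tokens. outputs[0] also contains the
            # prompt, whose template code block would otherwise be extracted instead.
            prompt_len = inputs["input_ids"].shape[1]
            generated = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)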

            # Execute strategy
            action, success = extract_and_execute_strategy(generated, board)

            # Take action in environment
            env_action = OpenSpielAction(action_id=action, game_name="2048")
            result = env.step(env_action)

            # Update board
            board = info_state_to_board(result.observation.info_state)
            episode_reward += result.reward if result.reward else 0
            steps += 1

        # Calculate final reward
        max_tile = get_max_tile(board)
        final_reward = calculate_reward(max_tile, max_tile >= 2048)

        # Store for training
        training_data.append({
            "prompt": prompt,
            "response": generated,
            "reward": final_reward,
        })

        if episode % 10 == 0:
            print(f"Episode {episode}: Max tile={max_tile}, Reward={final_reward:.3f}")

    return training_data
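
The loop above returns rollout records but does not itself call `GRPOTrainer`. In TRL, the trainer samples completions on its own and asks user-supplied reward functions to score them, where a reward function receives a list of completions (plus keyword arguments) and returns one float per completion. The sketch below shows one way the tutorial's scoring could be wrapped into that shape. `make_reward_func`, `extract_code`, and the `score_code` callback are hypothetical names, and the trainer wiring in the trailing comment is untested and may need adjusting for your TRL version.

```python
from typing import Callable, List

FENCE = "`" * 3  # three backticks, built this way to avoid nesting code fences

def extract_code(completion: str) -> str:
    """Return the first fenced Python block, or the whole text if none is found."""
    marker = FENCE + "python"
    if marker in completion:
        return completion.split(marker, 1)[1].split(FENCE, 1)[0]
    return completion

def make_reward_func(score_code: Callable[[str], float]) -> Callable[..., List[float]]:
    """Wrap a per-strategy scorer into a batch reward function.

    `score_code` is a hypothetical callback that plays the game with the
    extracted code and returns a scalar reward (e.g. via calculate_reward).
    """
    def reward_func(completions: List[str], **kwargs) -> List[float]:
        rewards = []
        for completion in completions:
            try:
                rewards.append(float(score_code(extract_code(completion))))
            except Exception:
                rewards.append(-0.5)  # same penalty the tutorial uses for code errors
        return rewards
    return reward_func

# Untested wiring sketch (my_scorer and prompt_dataset are placeholders):
# trainer = GRPOTrainer(
#     model=model,
#     reward_funcs=[make_reward_func(my_scorer)],
#     args=grpo_config,
#     train_dataset=prompt_dataset,  # dataset with a "prompt" column
# )
# trainer.train()
```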

Part 7: Deployment

After training, save and deploy your model.

Saving the Model

# Save LoRA adapters only
model.save_pretrained("./2048_strategy_model")
tokenizer.save_pretrained("./2048_strategy_model")

# Save merged model for inference
model.save_pretrained_merged(
    "./2048_strategy_model_merged",
    tokenizer,
    save_method="merged_16bit",
)

Push to Hugging Face Hub

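# Push the merged model to the Hub using Unsloth's merged-upload helper
# (log in first, or pass token="...")
model.push_to_hub_merged(
    "your-username/2048-strategy-model",
    tokenizer,
    save_method="merged_16bit",
    private=False,
)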

print("Model deployed to: huggingface.co/your-username/2048-strategy-model")

Using the Trained Model

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load trained model
model = AutoModelForCausalLM.from_pretrained("your-username/2048-strategy-model")
tokenizer = AutoTokenizer.from_pretrained("your-username/2048-strategy-model")

# Generate strategy
def get_action(board: List[List[int]]) -> int:
    prompt = create_prompt(board)
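    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Decode only the completion; the prompt itself contains a template code block
    prompt_len = inputs["input_ids"].shape[1]
    generated = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)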
    action, _ = extract_and_execute_strategy(generated, board)
    return action

# Play a game
with OpenSpielEnv.from_hub("openenv/openspiel-env") as env:
    result = env.reset()
    board = info_state_to_board(result.observation.info_state)

    while not result.done:
        action = get_action(board)
        result = env.step(OpenSpielAction(action_id=action, game_name="2048"))
        board = info_state_to_board(result.observation.info_state)

    print(f"Final max tile: {get_max_tile(board)}")
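
The play loop above passes whatever the strategy returns straight to the environment, and the failure default is always action 0. The tutorial does not say how the environment handles an action that is not currently legal, so a cautious option is to check against `legal_actions` (the field printed in the connection test) and fall back to a random legal move. A minimal sketch, with `choose_legal_action` as a hypothetical helper:

```python
import random
from typing import Optional, Sequence

def choose_legal_action(
    proposed: int,
    legal_actions: Sequence[int],
    rng: Optional[random.Random] = None,
) -> Optional[int]:
    """Keep the proposed action if it is legal, else pick a random legal one."""
    if not legal_actions:
        return None  # nothing can be played; the caller should stop
    if proposed in legal_actions:
        return proposed
    return (rng or random).choice(list(legal_actions))

print(choose_legal_action(2, [0, 2, 3]))  # 2
```

In the loop, this would sit between `get_action(board)` and `env.step(...)`, using `result.observation.legal_actions` as the second argument.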

Preventing Reward Hacking

Be aware of potential reward hacking strategies:

Code that modifies rewards - Run in sandboxed environment
Infinite loops - Set execution timeouts
Memory exhaustion - Limit resource usage

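import resource
import signal

def safe_execute(code: str, board: List[List[int]], timeout: float = 5.0) -> int:
    """Execute strategy with safety limits (Unix only, main thread only)."""

    def handler(signum, frame):
        raise TimeoutError("Strategy timed out")

    # Set timeout; setitimer accepts fractional seconds, unlike signal.alarm
    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, timeout)

    # RLIMIT_AS applies to the whole current process, not just the strategy,
    # so remember the old limits and restore them afterwards. With a model
    # loaded in the same process, call this from a separate worker process.
    old_limits = resource.getrlimit(resource.RLIMIT_AS)

    try:
        # Set memory limit (100MB soft limit, hard limit unchanged)
        resource.setrlimit(resource.RLIMIT_AS, (100 * 1024 * 1024, old_limits[1]))

        # Execute in restricted namespace (reduces accidents; not a security boundary)
        safe_builtins = {"len": len, "max": max, "min": min,
                         "range": range, "sum": sum, "abs": abs}
        namespace = {"board": board, "__builtins__": safe_builtins}
        exec(code, namespace)

        return namespace.get("strategy", lambda b: 0)(board)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
        resource.setrlimit(resource.RLIMIT_AS, old_limits)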
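
`safe_execute` runs the strategy inside the same process as the model, so a crash or a resource limit there also affects training. An alternative is to run each strategy in a fresh interpreter and kill it on timeout. The sketch below uses only the standard library. `run_strategy_isolated` and the runner script are hypothetical, and this variant enforces a timeout but no memory cap.

```python
import json
import subprocess
import sys
from typing import List

# Runner executed in a fresh interpreter: reads {"code", "board"} from stdin,
# defines the strategy, and prints the chosen action as JSON.
_RUNNER = """
import json, sys
payload = json.load(sys.stdin)
namespace = {}
exec(payload["code"], namespace)
print(json.dumps(namespace["strategy"](payload["board"])))
"""

def run_strategy_isolated(
    code: str,
    board: List[List[int]],
    timeout: float = 5.0,
    default: int = 0,
) -> int:
    """Run `strategy(board)` in a child process; return `default` on any failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _RUNNER],
            input=json.dumps({"code": code, "board": board}),
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
        )
        action = json.loads(proc.stdout)
    except (subprocess.TimeoutExpired, ValueError):
        return default  # timed out, crashed, or printed something unparseable
    # Accept only plain ints in the valid action range
    return action if type(action) is int and 0 <= action <= 3 else default
```

Starting an interpreter per call is slow, so for training you would likely batch several boards per child or keep a worker pool. The point here is the isolation boundary: an infinite loop or crash in generated code cannot take the trainer down with it.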

Summary

In this tutorial, you learned:

Model Setup: Loading LLMs with Unsloth and LoRA
Environment Connection: Using OpenEnv’s 2048 environment
Reward Design: Creating balanced reward functions
GRPO Training: Training with reinforcement learning
Deployment: Saving and sharing trained models

Next Steps

Try different model architectures
Experiment with reward function designs
Train on other OpenEnv environments
Share your trained models on Hugging Face Hub!
