Caro5: Day 5: Generating dataset and first try at ONNX

Community Article
Published June 12, 2026

Training on bad data: Fixing bias, scaling network from 32→64 channels and 2→6 blocks (top1 41%→47%), but now my laptop is maxed out.

┌───────┬──────────┬───────────┬──────────┬────────┬────────┐
│ epoch │ valLoss  │ valPolicy │ valValue │ top1   │ top3   │
├───────┼──────────┼───────────┼──────────┼────────┼────────┤
│ 1     │ 2.914763 │ 2.90384   │ 0.010923 │ 0.3585 │ 0.5571 │
│ 2     │ 2.717413 │ 2.707082  │ 0.010331 │ 0.3804 │ 0.6013 │
│ 3     │ 2.635484 │ 2.625218  │ 0.010266 │ 0.4000 │ 0.6126 │
│ 4     │ 2.639141 │ 2.628439  │ 0.010702 │ 0.4053 │ 0.6218 │
│ 5     │ 2.613928 │ 2.602872  │ 0.011056 │ 0.4005 │ 0.6147 │
│ 6     │ 2.616899 │ 2.60628   │ 0.010619 │ 0.4056 │ 0.625  │
│ 7     │ 2.615699 │ 2.604882  │ 0.010816 │ 0.4088 │ 0.6215 │
│ 8     │ 2.606242 │ 2.59518   │ 0.011061 │ 0.3976 │ 0.6242 │
│ 9     │ 2.588667 │ 2.577354  │ 0.011313 │ 0.4022 │ 0.6171 │
│ 10    │ 2.589954 │ 2.578606  │ 0.011348 │ 0.4111 │ 0.6286 │
│ 11    │ 2.613155 │ 2.601768  │ 0.011387 │ 0.4073 │ 0.629  │
│ 12    │ 2.609741 │ 2.598176  │ 0.011565 │ 0.4063 │ 0.6157 │
└───────┴──────────┴───────────┴──────────┴────────┴────────┘

During training, the neural network has two outputs or "heads": the policy head (which suggests promising moves) and the value head (which estimates the win probability from the current position).

valPolicy loss measures how wrong the policy head's move probabilities are compared to the actual moves selected by MCTS—lower is better.

valValue loss measures how wrong the value head's win probability estimate is compared to the actual game outcome.

Top1 accuracy is the percentage of times the policy head's single highest-probability move matches the MCTS-chosen best move, while Top3 accuracy is the percentage of times the best move is among the three highest-probability moves. Both are intuitive measures of policy quality.
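To make the two accuracy metrics concrete, here is a minimal sketch of how they could be computed. The `PolicyExample` shape and the function names are hypothetical, not taken from the actual training code; it assumes the policy is a flat list of probabilities, one per cell.

```typescript
// Hypothetical example shape: the policy head's probabilities over all
// cells (flattened) plus the index of the move MCTS actually chose.
interface PolicyExample {
  probs: number[];
  mctsMove: number;
}

// Indices of the k highest-probability moves, best first.
function topKMoves(probs: number[], k: number): number[] {
  return probs
    .map((p, index) => ({ p, index }))
    .sort((a, b) => b.p - a.p)
    .slice(0, k)
    .map((entry) => entry.index);
}

// Fraction of examples whose MCTS move is among the top-k predictions.
function topKAccuracy(examples: PolicyExample[], k: number): number {
  if (examples.length === 0) return 0;
  let hits = 0;
  for (const ex of examples) {
    if (topKMoves(ex.probs, k).indexOf(ex.mctsMove) !== -1) hits += 1;
  }
  return hits / examples.length;
}

// Made-up demo data on a 4-cell "board".
const demoExamples: PolicyExample[] = [
  { probs: [0.1, 0.6, 0.25, 0.05], mctsMove: 1 }, // top-1 hit
  { probs: [0.5, 0.05, 0.3, 0.15], mctsMove: 2 }, // top-3 hit only
  { probs: [0.4, 0.3, 0.2, 0.1], mctsMove: 3 },   // miss for both
];
const demoTop1 = topKAccuracy(demoExamples, 1);
const demoTop3 = topKAccuracy(demoExamples, 3);
console.log({ demoTop1, demoTop3 });
```

Top3 can never be lower than top1 on the same data, which is a cheap sanity check on the table above.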

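Epoch 9 has the lowest validation loss (2.5887). Epoch 10 is the only plausible alternate checkpoint: nearly identical validation loss (2.5900), the best top-1 accuracy (0.4111), and a top-3 accuracy (0.6286) just behind the best (0.6290 at epoch 11).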

Positive signs

Top1 accuracy improved from 35.8% → 40–41% — the model is learning to pick better moves.

Top3 accuracy ~62–63% — in 3 tries, it picks the right move ~60% of the time.

valPolicy loss steadily decreasing (2.90 → 2.58) — policy learning is working.

Concerning signs

valValue loss is tiny (~0.011) and flat — this suggests the model is not learning to predict game outcomes well. It just predicts a constant or near-constant value (probably ~0.0).

valLoss is dominated by policy loss (2.58 of 2.59 total), meaning the value head is contributing almost nothing to the training signal.

Audit & Mitigation

Seat/color counts are basically balanced, but value means are biased by actor color.

Unbalanced train:

  • actorColor 0: 43,520, value mean -0.044516
  • actorColor 1: 43,580, value mean 0.044910
  • value gap: 0.089426
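The audit boils down to a per-color mean and a difference. A minimal sketch, assuming each example carries an `actorColor` and a `value` field (the `ValueExample` type and function names are my own for illustration):

```typescript
// Hypothetical example shape for the audit.
interface ValueExample {
  actorColor: 0 | 1; // color of the player to move
  value: number;     // value target, assumed to lie in [-1, 1]
}

// Mean value target for each actor color: [meanColor0, meanColor1].
function meanValueByColor(examples: ValueExample[]): [number, number] {
  let sum0 = 0;
  let count0 = 0;
  let sum1 = 0;
  let count1 = 0;
  for (const ex of examples) {
    if (ex.actorColor === 0) {
      sum0 += ex.value;
      count0 += 1;
    } else {
      sum1 += ex.value;
      count1 += 1;
    }
  }
  return [count0 > 0 ? sum0 / count0 : 0, count1 > 0 ? sum1 / count1 : 0];
}

// Absolute difference between the two per-color means.
function actorColorValueGap(examples: ValueExample[]): number {
  const [mean0, mean1] = meanValueByColor(examples);
  return Math.abs(mean1 - mean0);
}

// Made-up demo: color 0 leans negative, color 1 leans positive.
const auditDemo: ValueExample[] = [
  { actorColor: 0, value: -1 },
  { actorColor: 0, value: 0 },
  { actorColor: 0, value: 0 },
  { actorColor: 0, value: 0 },
  { actorColor: 1, value: 1 },
  { actorColor: 1, value: 0 },
  { actorColor: 1, value: 0 },
  { actorColor: 1, value: 0 },
];
console.log(actorColorValueGap(auditDemo));

// The gap reported above is the same arithmetic on the two means.
const reportedGap = 0.04491 - (-0.044516);
console.log(reportedGap.toFixed(6));
```

If the labels were independent of color, both means would sit close to each other and the gap would be near zero.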

Mitigated dataset balanced by actorColorValueSign:

  • total examples: 65,468
  • train: 59,379
  • val: 6,089
  • train actorColor value gap dropped to about 0.041674
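For reference, this is one plausible way to balance by actor color and value sign: bucket the examples by (color, sign) and, within each sign, keep the same number of examples for both colors. It is a sketch under that assumption, not the exact resampling used here; `ColorSample` and the function names are hypothetical.

```typescript
// Hypothetical example shape for resampling.
interface ColorSample {
  actorColor: 0 | 1;
  value: number;
}

function signLabel(value: number): string {
  if (value > 0) return "pos";
  if (value < 0) return "neg";
  return "zero";
}

// Within each value sign, keep equally many examples of each color.
// In practice the examples should be shuffled first; this sketch simply
// keeps the first ones so the result is deterministic.
function balanceByColorAndSign(samples: ColorSample[]): ColorSample[] {
  const buckets: { [key: string]: ColorSample[] } = {};
  for (const s of samples) {
    const key = s.actorColor + ":" + signLabel(s.value);
    const bucket = buckets[key] || [];
    bucket.push(s);
    buckets[key] = bucket;
  }
  let balanced: ColorSample[] = [];
  for (const sign of ["neg", "zero", "pos"]) {
    const color0 = buckets["0:" + sign] || [];
    const color1 = buckets["1:" + sign] || [];
    const keep = Math.min(color0.length, color1.length);
    balanced = balanced.concat(color0.slice(0, keep), color1.slice(0, keep));
  }
  return balanced;
}

// Made-up skewed data: color 0 mostly loses, color 1 mostly wins.
const skewedSamples: ColorSample[] = [
  { actorColor: 0, value: -1 }, { actorColor: 0, value: -1 },
  { actorColor: 0, value: -1 }, { actorColor: 0, value: 1 },
  { actorColor: 0, value: 0 }, { actorColor: 0, value: 0 },
  { actorColor: 1, value: 1 }, { actorColor: 1, value: 1 },
  { actorColor: 1, value: 1 }, { actorColor: 1, value: -1 },
  { actorColor: 1, value: 0 }, { actorColor: 1, value: 0 },
];
const balancedSamples = balanceByColorAndSign(skewedSamples);
console.log(skewedSamples.length, "->", balancedSamples.length);
```

Balancing by sign only equalizes the counts. If the targets are fractional, the per-color means can still differ, which would be consistent with a gap that shrinks but does not vanish.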

Next Steps

  1. I made a big mistake when preparing the dataset: I used color to label the value target, and it turned into a signal for the network. Color must not be taken into account there. Value should be from the example’s current actor/current-player perspective:
  +1 = current player wins
   0 = draw / neutral
  -1 = current player loses

Color can be an input feature, but it should not define the value target.
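A minimal sketch of that labeling rule. The `PlayerId` type and `valueTarget` function are hypothetical names; the point is that the function receives who won and who is to move, and has no color parameter at all.

```typescript
// Hypothetical player ids; they identify a participant, not a stone color.
type PlayerId = 0 | 1;

// Value target from the perspective of the player to move.
// winner === null means the game was a draw.
function valueTarget(winner: PlayerId | null, currentPlayer: PlayerId): number {
  if (winner === null) return 0;
  return winner === currentPlayer ? 1 : -1;
}

// Labeling every position of a game won by player 0:
// positions where player 0 moves get +1, the others get -1.
const playersToMove: PlayerId[] = [0, 1, 0, 1, 0];
const gameTargets = playersToMove.map((p) => valueTarget(0, p));
console.log(gameTargets);
```

With this rule the same board position gets the same label no matter which color the eventual winner happened to hold.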

  2. Flip/rotate boards for symmetry augmentation. For geometric flips/rotations, transform board planes and policy coordinates together.

  3. The mitigation helped, but the remaining gap is still substantial. The train actorColor value gap went from about 0.0894 to about 0.0417. That is better, but not “fixed.” It still suggests either distribution skew or a remaining target issue.

  4. Set value loss weight default to 5.

  5. Combined loss is now: loss = policy_loss + value_loss_weight * value_loss
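For the symmetry augmentation step, the key detail is that the input planes and the policy target must go through the same cell mapping. A minimal sketch, assuming a square board stored as flat row-major arrays and a policy with one entry per cell (all type and function names here are hypothetical):

```typescript
// Maps a cell (row, col) on a size x size board to its new position.
type CellMapper = (row: number, col: number, size: number) => [number, number];

const flipHorizontal: CellMapper = (row, col, size) => [row, size - 1 - col];
const rotate90Clockwise: CellMapper = (row, col, size) => [col, size - 1 - row];

// Applies a cell mapping to one flat row-major grid of size * size values.
function transformGrid(grid: number[], size: number, mapCell: CellMapper): number[] {
  const out: number[] = [];
  for (let i = 0; i < size * size; i++) out.push(0);
  grid.forEach((cellValue, index) => {
    const row = Math.floor(index / size);
    const col = index % size;
    const [newRow, newCol] = mapCell(row, col, size);
    out[newRow * size + newCol] = cellValue;
  });
  return out;
}

// Hypothetical training example: feature planes, policy target, value target.
interface AugmentableExample {
  planes: number[][];
  policy: number[];
  value: number;
}

// Planes and policy use the same mapping; the value target is unchanged.
function augmentExample(ex: AugmentableExample, size: number, mapCell: CellMapper): AugmentableExample {
  return {
    planes: ex.planes.map((plane) => transformGrid(plane, size, mapCell)),
    policy: transformGrid(ex.policy, size, mapCell),
    value: ex.value,
  };
}

// 3x3 demo: one stone and the full policy mass at row 1, col 0 (index 3).
const baseExample: AugmentableExample = {
  planes: [[0, 0, 0, 1, 0, 0, 0, 0, 0]],
  policy: [0, 0, 0, 1, 0, 0, 0, 0, 0],
  value: 1,
};
const flippedExample = augmentExample(baseExample, 3, flipHorizontal);
const rotatedExample = augmentExample(baseExample, 3, rotate90Clockwise);
console.log(flippedExample.policy.indexOf(1), rotatedExample.policy.indexOf(1));
```

Composing the flip with the four rotations gives the eight symmetries of a square board. If the policy has more than one channel, each channel would need the same mapping.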

Improvements

[Screenshot: bias mitigation metrics visualization]

That's a dramatic improvement. The train actorColor value gap dropped from 0.089 to 0.010, an 89% reduction. This is qualitatively different from the resampling attempt, which only brought it down to about 0.042.

One more run

Previous iter2 epoch 10: 19W / 6D / 25L = 44% 
Outcome-value  epoch 10: 18W / 6D / 26L = 42% 
Outcome-value  epoch 12: 17W / 6D / 27L = 40% 

This is a concerning result. The outcome-value model is weaker overall, and the role split tells me exactly why.

Look at the epoch 10 → epoch 12 swing:

Opener: 30% → 42% (massive gain)
Chooser: 54% → 38% (massive loss)
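The percentages in this section are consistent with two simple rules: a draw counts as half a win, and the overall score is the average of the Opener and Chooser scores (which assumes equally many games in each role). A small sketch of that arithmetic, with hypothetical function names:

```typescript
// Arena score: wins count 1, draws count 0.5, losses count 0.
function arenaScore(wins: number, draws: number, losses: number): number {
  const games = wins + draws + losses;
  return games === 0 ? 0 : (wins + 0.5 * draws) / games;
}

// Overall score when both roles play the same number of games.
function overallFromRoles(openerScore: number, chooserScore: number): number {
  return (openerScore + chooserScore) / 2;
}

console.log(arenaScore(19, 6, 25)); // previous iter2, epoch 10
console.log(arenaScore(18, 6, 26)); // outcome-value, epoch 10
console.log(arenaScore(17, 6, 27)); // outcome-value, epoch 12
console.log(overallFromRoles(0.30, 0.54).toFixed(2)); // epoch 10 role split
console.log(overallFromRoles(0.42, 0.38).toFixed(2)); // epoch 12 role split
```

Seen this way, the 2-point drop in overall score hides a 12-point gain in one role and a 16-point loss in the other.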

What to do now and how to teach the bot to learn

A blended approach is now clearly necessary. I am also reducing the value loss weight from 5 to 2: enough extra signal for the value head to learn, but not at the cost of policy.
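To get a feel for what the weight does, here is a small sketch of the combined loss, loss = policy_loss + value_loss_weight * value_loss, evaluated on the epoch 9 numbers from the first table. Holding the two losses fixed while changing the weight is only an illustration; in a real run both would move. The function names are hypothetical.

```typescript
// loss = policy_loss + value_loss_weight * value_loss
function combinedLoss(policyLoss: number, valueLoss: number, valueLossWeight: number): number {
  return policyLoss + valueLossWeight * valueLoss;
}

// Fraction of the combined loss that comes from the weighted value term.
function valueShare(policyLoss: number, valueLoss: number, valueLossWeight: number): number {
  const weighted = valueLossWeight * valueLoss;
  return weighted / (policyLoss + weighted);
}

// valPolicy and valValue at epoch 9 in the first table.
const epoch9PolicyLoss = 2.577354;
const epoch9ValueLoss = 0.011313;

for (const weight of [1, 2, 5]) {
  const total = combinedLoss(epoch9PolicyLoss, epoch9ValueLoss, weight);
  const share = valueShare(epoch9PolicyLoss, epoch9ValueLoss, weight);
  console.log("weight " + weight + ": loss " + total.toFixed(6) + ", value share " + (100 * share).toFixed(2) + "%");
}
```

With weight 1 the total reproduces the valLoss column for epoch 9, and the value term stays a small fraction of the total at any of these weights, because the raw value loss is so small compared to the policy loss.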

Also the model is tiny:

  • Input feature planes: 13
  • Stem conv: 13 -> 32
  • Residual blocks: 2
  • Each residual block: two 3x3 convs with 32 -> 32
  • Policy head: 32 -> 2 via 1x1
  • Value head: 32 -> 1 via 1x1

So the trunk is very small: 32 channels, 2 residual blocks. Going forward, I am moving to 64 channels and 6 blocks.
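A rough count of convolution weights shows how big the jump is. This sketch assumes the stem conv is 3x3 (the kernel size is not listed above) and ignores biases, normalization parameters, and any dense layers after the heads, so treat the totals as order-of-magnitude figures. The function names are hypothetical.

```typescript
// Weights in one square convolution, ignoring bias.
function convWeights(kernel: number, inChannels: number, outChannels: number): number {
  return kernel * kernel * inChannels * outChannels;
}

// Conv weights for the architecture listed above, for a given width and depth.
function countConvWeights(channels: number, blocks: number, inputPlanes: number = 13): number {
  const stem = convWeights(3, inputPlanes, channels);            // assumed 3x3 stem
  const trunk = blocks * 2 * convWeights(3, channels, channels); // two 3x3 convs per block
  const policyHead = convWeights(1, channels, 2);                // 1x1, channels -> 2
  const valueHead = convWeights(1, channels, 1);                 // 1x1, channels -> 1
  return stem + trunk + policyHead + valueHead;
}

const smallNetWeights = countConvWeights(32, 2);
const largeNetWeights = countConvWeights(64, 6);
console.log(smallNetWeights, largeNetWeights, (largeNetWeights / smallNetWeights).toFixed(1) + "x");
```

Under these assumptions the 64x6 network has roughly eleven times as many conv weights as the 32x2 one, which goes a long way toward explaining the laptop.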

Width (channels/filters) controls how many features the network can represent simultaneously at each position. 32 channels means the network has 32 "slots" to encode things like "there's an open three here", "this intersection is contested", "this direction is dangerous". 64 gives it twice as many.

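Depth (blocks) controls how far the network can see. Each 3×3 conv widens the receptive field by 2 cells, so a residual block with two 3×3 convs adds 4. Counting only the residual blocks, 2 blocks give a theoretical receptive field of about 9×9 around each point, 4 blocks about 17×17, and 6 blocks about 25×25. The effective receptive field is smaller in practice because distant cells contribute weakly, but the trend holds: more blocks let the network take in more of a stone's influence across the board in caro.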
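Receptive-field figures like these can be sanity-checked with a few lines of arithmetic. The sketch below computes the theoretical receptive field of a stack of stride-1 convolutions. The number of convs per block and the stem kernel are parameters, because the result depends heavily on both; the function name is hypothetical.

```typescript
// Theoretical receptive field (side length in cells) of a stack of
// stride-1, undilated convolutions: every k x k conv adds (k - 1).
function theoreticalReceptiveField(
  blocks: number,
  convsPerBlock: number,
  kernel: number = 3,
  stemKernel: number = 1, // 1 means "do not count the stem"
): number {
  const perConv = kernel - 1;
  return 1 + (stemKernel - 1) + blocks * convsPerBlock * perConv;
}

for (const blocks of [2, 4, 6]) {
  const oneConv = theoreticalReceptiveField(blocks, 1);
  const twoConvs = theoreticalReceptiveField(blocks, 2);
  const twoConvsWithStem = theoreticalReceptiveField(blocks, 2, 3, 3);
  console.log(blocks + " blocks: " + oneConv + " / " + twoConvs + " / " + twoConvsWithStem);
}
```

The three numbers per line are the side length with one conv per block, with two convs per block, and with two convs per block plus a 3x3 stem.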

Fix the blended targets with zero-heavy value distribution

The value head keeps collapsing to neutral because 87% of the training examples have value=0. The network learns "predict zero" and gets 87% of examples approximately right.
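A quick way to see why "predict zero" is so attractive is to compute the MSE of a constant predictor on a zero-heavy target set. The sketch below uses made-up data: 87 zeros to mirror the 87% figure, plus an invented 7/6 split of +1 and -1 targets. Real blended targets may be fractional, so the numbers are illustrative only; the function names are hypothetical.

```typescript
// Fraction of targets that are exactly zero.
function zeroFraction(targets: number[]): number {
  if (targets.length === 0) return 0;
  return targets.filter((t) => t === 0).length / targets.length;
}

// Mean squared error of always predicting the same value.
function constantPredictorMse(targets: number[], prediction: number): number {
  if (targets.length === 0) return 0;
  let total = 0;
  for (const t of targets) total += (t - prediction) * (t - prediction);
  return total / targets.length;
}

// Made-up targets: 87 zeros, 7 wins, 6 losses.
const demoTargets: number[] = [];
for (let i = 0; i < 87; i++) demoTargets.push(0);
for (let i = 0; i < 7; i++) demoTargets.push(1);
for (let i = 0; i < 6; i++) demoTargets.push(-1);

const demoZeroFraction = zeroFraction(demoTargets);
const zeroBaselineMse = constantPredictorMse(demoTargets, 0);
console.log(demoZeroFraction, zeroBaselineMse);
```

The constant-zero MSE is the baseline to beat: a value head whose validation MSE sits at or near this number has learned nothing beyond the label distribution.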

[Screenshot: model training metrics analysis]

The results

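┌─────────────────────────┬─────────┬────────┬────────┬────────┬─────────────────┐
│ Run                     │ valLoss │ top1   │ top3   │ MSE    │ Arena           │
├─────────────────────────┼─────────┼────────┼────────┼────────┼─────────────────┤
│ Iter2 vw2-32x2@epoch 7  │ 2.5709  │ 0.4105 │ 0.6279 │ 0.0870 │ 11W / 11D / 28L │
│ Iter2 vw2-64x6@epoch 11 │ 2.2997  │ 0.4671 │ 0.6804 │ 0.0781 │ 18W / 5D / 27L  │
└─────────────────────────┴─────────┴────────┴────────┴────────┴─────────────────┘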

Conclusion

My poor laptop is running at full speed and I’m going nowhere. The next step is to figure out how to use modal.com. The problem is that both the UI and the engine are in TS, so let's see.
