nllb200-formosan-zh-spm8k
Repo: FormosanBank/nllb200-formosan-zh-spm8k
Base model: facebook/nllb-200-distilled-600M
Direction: Formosan -> Traditional Chinese
Companion reverse-direction model: FormosanBank/nllb200-zh-formosan-spm8k
This is a directional NLLB-200 distilled 600M checkpoint for FormosanBank machine translation. It uses an 8k SentencePiece vocabulary extension plus FormosanBank metadata/control tags. Use the companion model for the reverse direction.
Supported Languages
| Language | NLLB code |
|---|---|
| Traditional Chinese | zho_Hant |
| Amis | ami_Latn |
| Bunun | bnn_Latn |
| Kavalan | ckv_Latn |
| Rukai | dru_Latn |
| Paiwan | pwn_Latn |
| Puyuma | pyu_Latn |
| Thao | ssf_Latn |
| Saaroa | sxr_Latn |
| Sakizaya | szy_Latn |
| Tao / Yami | tao_Latn |
| Atayal | tay_Latn |
| Seediq | trv_Latn |
| Tsou | tsu_Latn |
| Kanakanavu | xnb_Latn |
| Saisiyat | xsy_Latn |
Input Format
This model was trained and evaluated with metadata control tags. Prefix the source text in one of the supported Formosan languages with:
<to_zh> <src_LANG> <dom_BUCKET> <dialect_DIALECT>
Example with unknown metadata:
<to_zh> <src_ami> <dom_unknown> <dialect_default> Pa'araw cingra to demak nira.
If the source bucket or dialect is unknown, use <dom_unknown> and <dialect_default>. If you know the training source bucket or dialect, using the matching tag may improve quality.
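The prefix format above can be assembled with a small helper. This is a minimal sketch: `build_prefix`, `tag_source`, and `SUPPORTED_LANGS` are hypothetical names introduced for illustration, and the language codes are taken from the Supported Languages table. Bucket and dialect names are passed through unchecked because this card does not list the valid values.

```python
# Three-letter codes from the Supported Languages table (Formosan side only).
SUPPORTED_LANGS = {
    "ami", "bnn", "ckv", "dru", "pwn", "pyu", "ssf", "sxr",
    "szy", "tao", "tay", "trv", "tsu", "xnb", "xsy",
}


def build_prefix(lang: str, bucket: str = "unknown", dialect: str = "default") -> str:
    """Return the control-tag prefix for a Formosan -> Chinese request.

    Hypothetical helper: bucket and dialect are not validated here.
    """
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported source language code: {lang!r}")
    return f"<to_zh> <src_{lang}> <dom_{bucket}> <dialect_{dialect}>"


def tag_source(text: str, lang: str, **kwargs) -> str:
    """Prepend the prefix to the source sentence, separated by one space."""
    return f"{build_prefix(lang, **kwargs)} {text.strip()}"


print(tag_source("Pa'araw cingra to demak nira.", "ami"))
```

With the default arguments this reproduces the unknown-metadata example line shown above.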
Usage
Tested with transformers 4.56.x. Use the slow NllbTokenizer or AutoTokenizer.from_pretrained(model_id, use_fast=False). In transformers 4.56.x, fast-tokenizer added-token IDs can differ from the slow tokenizer IDs used during training.
For NLLB generation, keep decoder_start_token_id=tokenizer.eos_token_id and set forced_bos_token_id to the target language ID.
```python
import torch
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

model_id = "FormosanBank/nllb200-formosan-zh-spm8k"

# Slow tokenizer: its added-token IDs match the ones used during training.
tokenizer = NllbTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Three-letter tag code -> NLLB language ID used as tokenizer.src_lang.
FORMOSAN_TO_LID = {
    "ami": "ami_Latn", "bnn": "bnn_Latn", "ckv": "ckv_Latn", "dru": "dru_Latn",
    "pwn": "pwn_Latn", "pyu": "pyu_Latn", "ssf": "ssf_Latn", "sxr": "sxr_Latn",
    "szy": "szy_Latn", "tao": "tao_Latn", "tay": "tay_Latn", "trv": "trv_Latn",
    "tsu": "tsu_Latn", "xnb": "xnb_Latn", "xsy": "xsy_Latn",
}


def translate_formosan_to_chinese(
    text: str,
    lang_code: str,
    source_bucket: str = "unknown",
    dialect: str = "default",
) -> str:
    tokenizer.src_lang = FORMOSAN_TO_LID[lang_code]
    # Control tags come first, then the source sentence.
    prompt = f"<to_zh> <src_{lang_code}> <dom_{source_bucket}> <dialect_{dialect}> {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=384).to(model.device)
    outputs = model.generate(
        **inputs,
        # Force Traditional Chinese as the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hant"),
        decoder_start_token_id=tokenizer.eos_token_id,
        max_new_tokens=128,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=1.15,
        length_penalty=1.0,
        early_stopping=True,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]


print(translate_formosan_to_chinese("Pa'araw cingra to demak nira.", "ami"))
```
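The tokenizer note at the start of this section (fast-tokenizer IDs can differ from the slow-tokenizer IDs used in training) can be checked before running inference. The following is a rough sketch, not the validation the authors ran: `check_control_tags` is a hypothetical helper, and it treats "maps to something other than the unknown-token ID" as a proxy for "has a dedicated ID". It relies only on `convert_tokens_to_ids` and `unk_token_id`, which both slow and fast Transformers tokenizers expose.

```python
def check_control_tags(slow_tokenizer, tags, fast_tokenizer=None):
    """Report tags that lack a dedicated ID or whose slow/fast IDs differ.

    Returns (missing, mismatched):
      missing    - tags that the slow tokenizer maps to its unknown-token ID
      mismatched - {tag: (slow_id, fast_id)} when a fast tokenizer is given
    """
    missing, mismatched = [], {}
    for tag in tags:
        slow_id = slow_tokenizer.convert_tokens_to_ids(tag)
        if slow_id is None or slow_id == slow_tokenizer.unk_token_id:
            missing.append(tag)
        if fast_tokenizer is not None:
            fast_id = fast_tokenizer.convert_tokens_to_ids(tag)
            if fast_id != slow_id:
                mismatched[tag] = (slow_id, fast_id)
    return missing, mismatched


# Usage sketch (downloads tokenizer files, so it is left commented out;
# whether a fast tokenizer loads for this repo is an assumption):
# from transformers import AutoTokenizer
# model_id = "FormosanBank/nllb200-formosan-zh-spm8k"
# slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)
# fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# tags = ["<to_zh>", "<src_ami>", "<dom_unknown>", "<dialect_default>", "zho_Hant"]
# print(check_control_tags(slow, tags, fast))
```

An empty `missing` list and an empty `mismatched` dict suggest the tags resolve consistently; anything else is a reason to stay on the slow tokenizer.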
Training Setup
| Setting | Value |
|---|---|
| Corpus | FormosanBank Chinese Parallel Corpus, leakage-controlled in-domain hard split |
| Direction | f2zh |
| Base model | facebook/nllb-200-distilled-600M |
| Tokenizer | 8k Formosan SentencePiece extension |
| Steps | 300,000 |
| Batch size | 16 |
| Gradient accumulation | 4 |
| Effective batch size | 64 |
| Max sequence length | 384 |
| Learning rate | 2e-05 |
| Warmup steps | 4,000 |
| Precision | bf16 |
| Label smoothing | 0.1 |
| Easy-source weight | 0.05 |
| Language sampling alpha | 0.5 |
| Metadata tags | enabled and validated as single tokenizer IDs |
This repo publishes the final 300k-step checkpoint because it scored highest on the held-out hard test set among the evaluated final/best checkpoints.
Evaluation
Evaluation used the held-out in_domain_hard test split, which has no normalized source, target, or pair overlap with the training data. Scores on this split are expected to be lower than scores on leaky or near-duplicate splits; the split is intended as a harder MT benchmark.
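The overlap condition described above can be illustrated with a short check. This is a sketch under stated assumptions, not the procedure used to build the split: the card does not define "normalized", so the `normalize` function below (NFKC, case folding, whitespace collapsing) is an assumption, and `find_leaks` is a hypothetical name.

```python
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, casefold, collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def find_leaks(train_pairs, test_pairs):
    """Return test pairs whose normalized source or target occurs in train.

    A matching (source, target) pair implies a matching source, so checking
    the two sides separately also covers pair overlap.
    """
    train_src = {normalize(src) for src, _ in train_pairs}
    train_tgt = {normalize(tgt) for _, tgt in train_pairs}
    leaks = []
    for src, tgt in test_pairs:
        if normalize(src) in train_src or normalize(tgt) in train_tgt:
            leaks.append((src, tgt))
    return leaks
```

A leakage-controlled split in this sense is one for which `find_leaks(train, test)` returns an empty list under the chosen normalization.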
Global Metrics
| Direction | Samples | BLEU | chrF2 | TER |
|---|---|---|---|---|
| Formosan -> Traditional Chinese | 37,435 | 9.79 | 11.77 | 109.29 |
Per-Language Metrics
| Language | Code | Samples | BLEU | chrF2 | TER |
|---|---|---|---|---|---|
| Amis | ami_Latn | 4,866 | 9.68 | 11.26 | 107.56 |
| Bunun | bnn_Latn | 3,223 | 9.44 | 10.52 | 105.04 |
| Kavalan | ckv_Latn | 1,832 | 14.27 | 15.11 | 108.67 |
| Rukai | dru_Latn | 3,846 | 5.57 | 8.57 | 108.67 |
| Paiwan | pwn_Latn | 3,221 | 7.86 | 9.95 | 106.33 |
| Puyuma | pyu_Latn | 2,225 | 15.77 | 16.93 | 124.55 |
| Thao | ssf_Latn | 1,180 | 14.73 | 16.36 | 106.69 |
| Saaroa | sxr_Latn | 1,106 | 9.17 | 12.79 | 108.07 |
| Sakizaya | szy_Latn | 1,601 | 11.42 | 14.83 | 104.58 |
| Tao / Yami | tao_Latn | 1,455 | 6.93 | 10.38 | 110.85 |
| Atayal | tay_Latn | 4,293 | 6.83 | 9.07 | 108.89 |
| Seediq | trv_Latn | 4,573 | 10.09 | 11.96 | 109.35 |
| Tsou | tsu_Latn | 1,250 | 8.56 | 11.42 | 126.84 |
| Kanakanavu | xnb_Latn | 1,552 | 12.96 | 15.19 | 107.91 |
| Saisiyat | xsy_Latn | 1,212 | 14.45 | 16.32 | 106.80 |
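As a quick consistency check on the two tables above, the per-language sample counts can be summed and compared with the global sample count. The sketch below copies the sample counts and BLEU scores from the tables; the variable names are illustrative. Note that a sample-weighted mean of per-language BLEU is only a rough cross-check, because corpus-level BLEU is computed over pooled n-gram counts and is not an average of per-language scores.

```python
# (samples, BLEU) per language, copied from the Per-Language Metrics table.
PER_LANGUAGE = {
    "ami_Latn": (4866, 9.68), "bnn_Latn": (3223, 9.44), "ckv_Latn": (1832, 14.27),
    "dru_Latn": (3846, 5.57), "pwn_Latn": (3221, 7.86), "pyu_Latn": (2225, 15.77),
    "ssf_Latn": (1180, 14.73), "sxr_Latn": (1106, 9.17), "szy_Latn": (1601, 11.42),
    "tao_Latn": (1455, 6.93), "tay_Latn": (4293, 6.83), "trv_Latn": (4573, 10.09),
    "tsu_Latn": (1250, 8.56), "xnb_Latn": (1552, 12.96), "xsy_Latn": (1212, 14.45),
}
GLOBAL_SAMPLES = 37435  # from the Global Metrics table

total_samples = sum(n for n, _ in PER_LANGUAGE.values())
weighted_bleu = sum(n * bleu for n, bleu in PER_LANGUAGE.values()) / total_samples

print("sample counts match:", total_samples == GLOBAL_SAMPLES)
print("sample-weighted mean BLEU:", weighted_bleu)
```

The counts add up to the global 37,435, and the weighted mean lands close to, but not exactly on, the reported corpus-level BLEU of 9.79.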
Full source-bucket and length-bin breakdowns are available in eval/metrics.json.
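Because the layout of eval/metrics.json is not described in this card, a cautious first step is to list its top-level keys before relying on any particular field. The sketch below does only that; `summarize_metrics` and `fetch_metrics_path` are hypothetical helpers, and the download step assumes the file is reachable through `huggingface_hub.hf_hub_download` at the path named above.

```python
import json
from pathlib import Path


def summarize_metrics(path):
    """Return {top_level_key: type_name} for a metrics JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        # The root is not an object; report its type instead of its keys.
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}


def fetch_metrics_path(repo_id="FormosanBank/nllb200-formosan-zh-spm8k",
                       filename="eval/metrics.json"):
    """Download the file from the Hub and return its local cache path."""
    from huggingface_hub import hf_hub_download  # lazy import; needs network
    return hf_hub_download(repo_id=repo_id, filename=filename)


# Usage sketch (requires network access and the huggingface_hub package):
# print(summarize_metrics(fetch_metrics_path()))
```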
Intended Use
- Research, teaching, and prototyping for Formosan-language MT.
- Draft translation assistance where review by knowledgeable speakers is available.
- Comparative evaluation of low-resource MT methods on leakage-controlled FormosanBank splits.
Limitations
- Outputs can be incorrect, ungrammatical, incomplete, or culturally inappropriate.
- Generation into Formosan languages (the companion model's direction) is especially difficult; treat such output as draft-only.
- This model is not suitable for legal, medical, safety-critical, or authoritative community-facing use without expert review.
- Evaluation uses a hard split; BLEU should not be compared directly to older leaky or near-duplicate split results.
License
Released under cc-by-nc-4.0. Some underlying corpus sources may carry additional restrictions. Use this model only for non-commercial research and educational purposes unless you have confirmed broader rights for your use case.
Citation
@misc{formosanbank_nllb200_formosan_zh_spm8k,
title = {nllb200-formosan-zh-spm8k: Directional NLLB-200 MT for the FormosanBank Chinese Parallel Corpus, leakage-controlled in-domain hard split},
author = {FormosanBank contributors},
year = {2026},
url = {https://huggingface.co/FormosanBank/nllb200-formosan-zh-spm8k}
}