Instructions to use tiny-aya-translate/tr-hi-s2st-v0.2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use tiny-aya-translate/tr-hi-s2st-v0.2 with PEFT:
Task type is invalid.
- Moshi
How to use tiny-aya-translate/tr-hi-s2st-v0.2 with Moshi:
# pip install moshi # Run the interactive web server python -m moshi.server --hf-repo "tiny-aya-translate/tr-hi-s2st-v0.2" # Then open https://localhost:8998 in your browser
# pip install moshi import torch from moshi.models import loaders # Load checkpoint info from HuggingFace checkpoint = loaders.CheckpointInfo.from_hf_repo("tiny-aya-translate/tr-hi-s2st-v0.2") # Load the Mimi audio codec mimi = checkpoint.get_mimi(device="cuda") mimi.set_num_codebooks(8) # Encode audio (24kHz, mono) wav = torch.randn(1, 1, 24000 * 10) # [batch, channels, samples] with torch.no_grad(): codes = mimi.encode(wav.cuda()) decoded = mimi.decode(codes) - Notebooks
- Google Colab
- Kaggle
- TinyAya β TurkishβHindi Speech-to-Speech Translation (v0.2)
TinyAya β TurkishβHindi Speech-to-Speech Translation (v0.2)
β οΈ Research preview β this checkpoint overfit. Released for transparency and to document the full training trajectory. The recommended weights are
best_by_val(step 1,000), not the final step-15,000 checkpoint. See Evaluation & the overfitting story.
β οΈ Dataset disclosure β honest correction
This checkpoint was trained on
tiny-aya-translate/fleurs-tr-hi-mimi-encodedβ Mimi-encoded FLEURS TurkishβHindi read speech (~27% real FLEURS audio + ~73% multi-voice TTS over FLEURS text; β8.3k train / 929 val).This was not the dataset we intended to train on. The project's synthetic data pipeline β FLORES + OPUS-100 + machine-translated conversational text, rendered with multi-voice TTS into
tiny-aya-translate/tr-hi-mimi-encoded(~1.3M clips) β is the corpus our accompanying write-up describes. The training launcher'sHF_DATASETdefault pointed at thefleurs-sibling repo, so v0.2 silently trained on FLEURS read speech instead of the synthetic conversational corpus. We are disclosing this openly rather than quietly re-labeling the run.v0.3 corrects the data source (β
tr-hi-mimi-encoded) together with the codebase fixes (parallel-stream collator, regularization, deep-codebook weighting). Seetr-hi-s2st-v0.3.
Moshi-style simultaneous speech-to-speech translation for Turkish β Hindi: a LoRA-fine-tuned Cohere2 backbone fused with a frozen Moshi depth decoder, operating on Mimi audio codes in a parallel two-stream format.
- Developed by: tiny-aya-translate
- Funded by: Google TPU Research Cloud (TRC) β see Acknowledgements
- Model type: parallel two-stream S2ST (Cohere2 + LoRA β CB0; frozen Moshi depth decoder β CB1β7)
- Languages: Turkish (
tr), Hindi (hi) - License: Apache-2.0 (for this adapter + training code; see the base-model caveat below)
- Finetuned from:
CohereLabs/tiny-aya-base(+ frozen Moshi / Mimi from Kyutai)
Provenance: sweep β production run
This recipe was selected by a proxy-first W&B hyperparameter sweep (8 Bayesian
- hyperband trials), then trained to 15k steps on a single TPU v6e-8.
- π Sweep: https://wandb.ai/cataluna84/tinyaya-stage2-tpu/sweeps/9ba8h0ho
- π Training run: https://wandb.ai/cataluna84/tinyaya-stage2-tpu/runs/t1840nkd
- Code: the training repo (PR #8)
Training procedure
| Hardware | Google TPU v6e-8 (1 host Γ 8 chips), SPMD / FSDPv2-LoRA, bf16 |
| Duration | 15,000 steps in ~24.9 h (~5.1 s/step), single continuous run |
| Effective batch | 256 (batch 8 Γ grad-accum 4 Γ 8 chips), max_frames 300 |
| Stability | 0 non-finite / NaN / loss-spike alerts across the whole run |
| Recipe | lora.r=64, lora.alpha=128, lr_lora=4.6e-4, lr_depth=1.1e-4, text_weight=0.2, warmup=500, weight_decay=0.01 |
| Data | tiny-aya-translate/fleurs-tr-hi-mimi-encoded (Mimi-encoded FLEURS TRβHI read speech) β β οΈ not the intended synthetic corpus (see Dataset disclosure above) |
Evaluation & the overfitting story
The run was mechanically flawless but the recipe overfit: validation loss bottomed at step 1,000 and rose monotonically while train loss kept falling.
| step | train loss | val/loss | cb0 val acc |
|---|---|---|---|
| 1000 | 5.475 | 2.859 β best | 13.9% |
| 5000 | 3.231 | 3.376 | 14.3% |
| 10000 | 1.887 | 4.060 | 13.8% |
| 15000 | 1.566 | 4.197 (worst) | 13.8% |
Final per-codebook val accuracy: cb0 13.8%, cb1 3.9%, cb2 1.9%, cb3 1.4%,
cb4 0.8%, cb5 0.8%, cb6 0.6%, cb7 0.5%. The text stream effectively memorized
(train text loss β 0.39); audio is the bottleneck. Likely cause: the proxy
sweep optimized short-horizon val/audio_loss, which favored high capacity
(lora_r=64) that then memorized the train set over 15k steps.
Downstream metrics (to be measured)
No speech-quality metrics have been computed yet. Planned evaluation, with the
checkpoint to use = best_by_val (step 1,000):
| Metric | Measures | Status |
|---|---|---|
| ASR-BLEU (Whisper-Lv3 β SacreBLEU) | translation quality | β TODO |
| ASR-chrF / chrF++ | quality, more ASR-roundtrip-robust than BLEU | β TODO |
| BLASER 2.0 | text-free S2ST quality | β optional |
| COMET / COMET-Kiwi | semantic adequacy | β optional |
| DNSMOS / UTMOS / NISQA | audio naturalness (MOS) | β TODO |
| WER | intelligibility | β TODO |
Reporting protocol (when filled): ASR backend + version, decoding (greedy/beam), seed, text normalization, and number of eval samples will be stated.
Checkpoints in this repo
keep_last_n rotation during the run means only these survived (steps
2,000β12,000 were not retained β fixed for future runs):
| Folder | Step | val/loss | Use |
|---|---|---|---|
best_by_val/ |
1,000 | 2.859 | β recommended (full resumable state) |
checkpoints/step_13000/ |
13,000 | ~4.19 | overfit (trajectory) |
checkpoints/step_14000/ |
14,000 | ~4.19 | overfit (trajectory) |
checkpoints/step_15000/ |
15,000 | 4.197 | overfit (trajectory) |
checkpoints/step_15000_final/ |
15,000 | 4.197 | canonical final (overfit) |
Each checkpoint contains the composite components β depth_decoder.pt,
text_embed.pt, audio_heads.pt, model_audio_embed.pt, projection.pt,
metadata.json β plus the LoRA adapter under peft_adapter/
(adapter_model.safetensors + adapter_config.json). train_15k.log is the
full training log.
Intended uses & limitations
- Intended: research, reproduction, and studying the TRβHI S2ST + overfitting trajectory. A teaching artifact for sweepβrunβdiagnosis.
- Out of scope / limitations: not production-ready β overfit, low per-codebook accuracy, no human eval, narrow domain (FLEURS read speech). May hallucinate or produce low-naturalness audio. Do not deploy for real translation without re-training (see below).
How to use
This is a composite model (custom architecture), not a drop-in
transformers pipeline. Load via the training repo's
src/model/composite.py: base Cohere2 backbone + the LoRA adapter
(peft_adapter/) + the .pt components, then decode Mimi codes to audio.
See the repo README
for the loading + inference path. Use the best_by_val folder.
Recommended next run (fixing the overfit)
Lower capacity + regularization + early stopping: lora_r 16β32, higher
weight_decay/dropout, stop at ~1β2k steps (or use best-N checkpoint
averaging), and/or more/augmented data.
Bias, risks & limitations
Trained on FLEURS (read speech, limited speakers/domains); quality and fairness across dialects, accents, code-switching, and spontaneous speech are untested. Speech translation can mistranslate, omit, or fabricate content β outputs must not be relied upon for high-stakes communication.
License caveat (important)
The Apache-2.0 license here covers this LoRA adapter and the training
code. The model is built on CohereLabs/tiny-aya-base and Moshi/Mimi β your
use of the assembled model is governed by those upstream licenses (the Aya
family is often released under non-commercial terms). Check the base-model and
Moshi/Mimi licenses before any commercial use.
Acknowledgements
This model was trained on Cloud TPU v6e-8 hardware generously provided by Google's TPU Research Cloud (TRC) program. We thank the TRC team for supporting this research.
Citation
@misc{tinyaya_tr_hi_s2st_v0_2,
title = {TinyAya: Turkish-Hindi Speech-to-Speech Translation (v0.2)},
author = {tiny-aya-translate},
year = {2026},
note = {Cohere2 + frozen Moshi depth decoder, LoRA, trained on Google TRC TPU v6e-8},
url = {https://huggingface.co/tiny-aya-translate/tr-hi-s2st-v0.2}
}
- Downloads last month
- -
Model tree for tiny-aya-translate/tr-hi-s2st-v0.2
Base model
CohereLabs/tiny-aya-base
Task type is invalid.