Crane Nemo ASR

A streaming automatic speech recognition model for Luganda, Shona, and Swahili (with English retained), fine-tuned from nvidia/nemotron-3.5-asr-streaming-0.6b β€” a FastConformer Cache-Aware RNN-Transducer (~600M parameters). By CraneAI Labs.

The model transcribes conversational and read speech in real time (cache-aware streaming) and is conditioned on a language-ID prompt.

Results

Evaluated on a frozen, held-out set of 50 clips per language (excluded from training by both utterance ID and speaker ID). WER = word error rate, CER = character error rate (both computed under a shared normalization: NFKC, lowercased, punctuation-stripped, whitespace-collapsed), and Gemini = a meaning-preservation score from 1–5 (LLM-as-judge), which better reflects intelligibility for morphologically rich Bantu languages where word-level WER is harsh.

Language CER ↓ WER ↓ Gemini (1–5) ↑
Luganda (lg) 8.2% 33.8% 4.90
Shona (sn) 8.1% 37.7% 4.45
Swahili (sw) 8.1% 27.5% 4.15
English (en) 3.2% 7.4% 4.95

Starting point (before fine-tuning). On these African languages the base model is effectively unusable β€” roughly 100% WER: Luganda 99.7%, Shona 101.0%, Swahili 102.5% (CER 59–76%). Fine-tuning brings them to the figures above. (English is already supported by the base model and stays strong: 19.3% β†’ 7.4% WER.)

How WER/CER are calculated. Predictions and references are passed through one shared text normalizer β€” Unicode NFKC, lowercasing, punctuation removal, and whitespace collapsing β€” and then scored with jiwer, aggregated across the whole eval set (total edits Γ· total reference units), not averaged per clip.

Why normalize. Without it, WER counts casing and punctuation differences as full word errors even when the spoken content is transcribed correctly β€” and the references for these languages carry sentence-case and punctuation conventions the model isn't asked to reproduce. Normalization removes that bookkeeping penalty so the number reflects recognition quality rather than formatting. The effect is uneven (small for Swahili/English, larger for Luganda/Shona, whose references are more heavily cased/punctuated), which is exactly the formatting noise it is meant to factor out.

Note on metrics: CER and the Gemini meaning-score are the most faithful quality signals for these languages. WER runs higher than CER largely because rich agglutinative morphology and word-boundary conventions penalize the word-level metric even when the transcription is intelligible.

Training

  • Method: parameter-efficient fine-tuning β€” the encoder is frozen except its top 12 layers, which are trained together with the prediction network and joint network (~331M trainable parameters).
  • Optimizer: 8-bit AdamW, learning rate 1.2e-4, cosine schedule with 200 warmup steps, 6,000 steps.
  • Precision: fp32. Hardware: a single NVIDIA Tesla T4 (16 GB).
  • Decoding/inference: cache-aware streaming, fp32, attention context [56, 13] (~1.12 s lookahead).

Training data

Language Approx. hours
Luganda ~13.7h
Shona ~15.8h
Swahili ~9.2h
English ~2.0h (retention)

Data is read and conversational speech drawn from openly available African-language speech corpora; a small English slice is included to retain English performance.

Intended use

  • Transcription of Luganda, Shona, Swahili, and English speech, including low-latency streaming applications.
  • Suitable as a strong, deployable baseline for these languages and as a starting point for further fine-tuning.

Limitations

  • The encoder is largely frozen, so strongly out-of-domain acoustics (very noisy field recordings, far-field/distant microphones, heavy code-switching) are only partially handled.
  • Results are reported on 50 clips per language β€” directional, not a large-scale benchmark.
  • Cache-aware streaming inference is fp32-only.
  • Best results use the language-ID prompt set to auto-detect; Luganda and Shona do not have a dedicated prompt slot, so auto-detect is recommended for them.
  • Not validated for safety-critical or medical/legal transcription.

Usage

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("crane-nemo-asr.nemo")
transcripts = model.transcribe(["audio.wav"])
print(transcripts)

For cache-aware streaming inference, use NeMo's examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py with compute_dtype=float32, att_context_size=[56,13], and target_lang=auto.

License

This model inherits the NVIDIA Open Model License from its base model.

Downloads last month
5
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for CraneAILabs/crane-nemo-asr

Finetuned
(26)
this model