- swahili-gpt-71m
- Architecture
- Training data (~2.15B tokens, all real Swahili)
- Training loss trajectory
- Intended use
- Benchmarks
- Limitations
- Roadmap
- A manifesto: building Swahili AI, together
- Why build from scratch, not just adapt an existing model?
- The numbers behind the gap
- "But ChatGPT / Gemini / Claude already speak Swahili, so why bother?"
- Credits & acknowledgments
- License
swahili-gpt-71m
A 71M-parameter, GPT-style decoder-only Transformer for Kiswahili, trained from scratch (no pre-trained weights) on ~2.15B tokens of real Swahili text.
Try it in your browser, no setup (Colab link under Intended use).
⚠️ Work in progress. This snapshot is ~23% through training (val loss ≈ 3.86, iter 120k). It already produces fluent, grammatical Swahili, but it is a base text-completion model, not a chat/instruction model yet. Newer checkpoints will be pushed as training continues. See the roadmap below.
💬 Want to ask it questions / chat? This is a base completion model (it continues text, it doesn't answer prompts). For an instruction-following version, use Benjamin-png/swahili-gpt-71m-instruct: the same model, fine-tuned to follow instructions in Kiswahili.
Architecture
| Property | Value |
|---|---|
| Type | Decoder-only Transformer (GPT-2 style), autoregressive |
| Parameters | ~71.7M |
| Layers / heads | 12 / 8 (head dim 64) |
| Hidden / FFN | 512 / 2048 (GELU) |
| Norm / positions | Pre-norm LayerNorm / learned absolute (max 2048) |
| Tokenizer | SentencePiece BPE, 32k vocab, byte_fallback=True (0% <unk>) |
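As a sanity check on the table above, the parameter count can be reproduced by hand. The following is a minimal sketch, not the project's own code. It assumes a vocabulary of exactly 32,000 entries, bias terms on every linear layer, and the standard GPT-2 block layout; none of these details are confirmed by the card. Under those assumptions an untied output head gives about 71.6M parameters, close to the ~71.7M reported, while a tied head would give about 55.3M.

```python
def count_params(vocab=32_000, d_model=512, n_layers=12, d_ff=2048,
                 max_pos=2048, tied_head=False):
    """Rough parameter count for a GPT-2 style decoder (biases included).

    All defaults are assumptions read off the architecture table;
    the exact vocab size and bias/tying choices are not confirmed.
    """
    tok_emb = vocab * d_model                    # token embedding matrix
    pos_emb = max_pos * d_model                  # learned absolute positions
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * (2 * d_model)                    # two LayerNorms per block
    block = attn + mlp + norms
    final_norm = 2 * d_model
    head = 0 if tied_head else vocab * d_model   # separate output projection
    return tok_emb + pos_emb + n_layers * block + final_norm + head


untied = count_params()
tied = count_params(tied_head=True)
print(f"untied head: {untied / 1e6:.1f}M, tied head: {tied / 1e6:.1f}M")
```

Note that the embedding and output matrices account for roughly 32.8M of the untied total, which is typical for a small model with a 32k vocabulary.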
Training data (~2.15B tokens, all real Swahili)
- Inkuba-Mono-Swahili
- Mendeley Swahili Corpus
- A local cleaned Swahili corpus
The 32k tokenizer was trained on 30M cased sentences across these sources, so it handles capitalization, proper nouns, and arbitrary characters (via byte fallback).
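To illustrate why byte fallback yields 0% `<unk>`, here is a toy sketch. It uses greedy longest-match over a tiny hypothetical vocabulary rather than real BPE merges, so it is not how SentencePiece segments text; it only shows the fallback step, where any character missing from the vocabulary is emitted as its UTF-8 bytes in the `<0xNN>` form.

```python
def encode_with_byte_fallback(text, vocab):
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback.

    `vocab` is a hypothetical set of string pieces; a real SentencePiece
    BPE model applies learned merges instead of longest-match.
    """
    tokens, i = [], 0
    max_len = max(map(len, vocab))
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No piece matches: emit one token per UTF-8 byte, never <unk>.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


toy_vocab = {"Dar", " es", " Salaam", " "}
print(encode_with_byte_fallback("Dar es Salaam é", toy_vocab))
```

Because every string has a UTF-8 encoding, every input can be tokenized; the trade-off is that rare characters cost several tokens each.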
Training loss trajectory
Cross-entropy loss (all tokens) over pre-training iterations:
| Metric | Value |
|---|---|
| Final train loss | ~4.66 |
| Best val loss | 3.83 (≈ perplexity 46) @ iter 133k |
The loss falls steeply in the first ~20k iters, then settles into a long slow decline as it approaches this 71M model's capacity. (Val is estimated on a small held-out split, so it reads a touch lower than the running train loss.)
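The perplexity quoted with the validation loss follows directly from the cross-entropy. A minimal sketch, assuming the loss is measured in nats per token (the natural-log convention that PyTorch's cross-entropy uses by default):

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(cross_entropy_nats)


print(round(perplexity(3.83)))  # 46, matching the validation figure above
```

The same function can be applied to any other checkpoint's loss to compare runs on a common scale.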
Intended use
A base / text-completion model. Give it the start of some text and it continues in Swahili:
- ✅ "Habari za leo, leo nataka kuzungumza kuhusu" ("Greetings, today I want to talk about") → continues fluently
- ❌ "Nini mji mkuu wa Tanzania?" ("What is the capital city of Tanzania?") → it will not reliably answer (not instruction-tuned yet)

▶️ Try it now in Colab (no setup): https://colab.research.google.com/drive/1TFHpHoUiIDl93z9zaHHegwIizHpXjUIw?usp=sharing
Or locally:
pip install torch sentencepiece huggingface_hub
from huggingface_hub import hf_hub_download
REPO = "Benjamin-png/swahili-gpt-71m"
# Pull the model definition into the working dir so it can be imported
# (this is a custom architecture, not a built-in `transformers` model).
hf_hub_download(REPO, "modeling_kiswahili.py", local_dir=".")
import sentencepiece as spm
from modeling_kiswahili import load_model, generate
weights = hf_hub_download(REPO, "pytorch_model.pt")
hf_hub_download(REPO, "model_config.json") # lands next to the weights
tok = hf_hub_download(REPO, "swahili_tokenizer.model")
model, cfg = load_model(weights, device="cpu") # or "cuda"
sp = spm.SentencePieceProcessor(); sp.load(tok)
print(generate(model, sp, "Elimu ni muhimu kwa sababu", device="cpu"))
Or just run the included inference.py (it does the same).
Benchmarks
Coming soon. Evaluations are being prepared to rank this model's Swahili performance against other open-source models (perplexity on held-out Swahili, plus task benchmarks as instruction-tuned checkpoints land). Results will be published here and updated as training progresses, so that progress is measured, not claimed.
Limitations
Early checkpoint; ~71M params means little world knowledge and no reliable facts or reasoning. Base model: no instruction following, may repeat, drift, or hallucinate. Web-scraped data carries some noise and occasional non-Swahili text.
Roadmap
This 71M model is a first step, a proof that a useful Swahili language model can be built from scratch on modest, consumer hardware (it is trained on a single 4 GB GPU). The path forward:
- Finish base pre-training on the full ~2.15B-token corpus.
- Instruction fine-tuning: a genuinely conversational Swahili assistant (prompt → response, chat format).
- Domain adaptation: specialise for tasks such as education, health, agriculture, law, government services, and customer support.
- Bilingual Swahili-English scaling, and eventually multimodal voice (pairing with TTS / STT) for spoken interfaces.
- Scale with data + compute: a ~8B-parameter model trained on 5 to 10 trillion tokens would be a genuinely strong small LLM for the region. That needs community data and compute, not magic.
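The hardware side of this roadmap can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes fp32 training with an Adam-style optimizer (4 bytes of weights, 4 of gradients, and 8 of optimizer moments per parameter); the card does not state the precision or optimizer, so treat the 16 bytes per parameter as an assumption. Activations and framework overhead are excluded.

```python
def training_state_gb(n_params, bytes_per_param=16):
    """Approximate memory for weights + gradients + Adam moments, in GB.

    16 bytes/param assumes fp32 everywhere (4 + 4 + 8); this is an
    assumption, and activation memory is not included.
    """
    return n_params * bytes_per_param / 1e9


for name, n in [("71.7M", 71.7e6), ("8B", 8e9)]:
    print(f"{name}: ~{training_state_gb(n):.2f} GB of model and optimizer state")
```

Under these assumptions the current model's training state needs a little over 1 GB, which leaves room for activations on a 4 GB GPU, while an 8B-parameter model would need on the order of 128 GB for the same state alone. That is one way to see why the last roadmap item depends on shared compute.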
A manifesto: building Swahili AI, together
Over 200 million people speak Swahili, yet the technology that is reshaping the world barely speaks it back. Mainstream language models are built overwhelmingly on English and a handful of high-resource languages. African languages are treated as an afterthought: under-represented, under-resourced, and rarely built by the communities that speak them. The result is a widening gap. As AI accelerates, the languages it cannot serve fall further behind.
This project is a small refusal of that status quo.
It says: we can build our own. Not by waiting for a large lab to notice us, but by assembling open Swahili data, writing the training code in the open, and running it on the hardware we actually have. A 71M model on a 4 GB laptop GPU is not the destination; it is proof that the door is open, and an invitation to walk through it together.
Imagine what becomes possible when the language is first-class in technology:
- Education: tutors and explainers that teach in Kiswahili, for students who learn best in their own language.
- Health & agriculture: assistants that answer real questions for real people, in the words they use every day.
- Access: semantic search, summarization, and translation that finally index and understand Swahili content.
- Voice: efficient TTS and STT so people can speak and be heard by machines, not just type in a second language.
- Small, efficient models that run on phones and cheap hardware, because inclusion that requires a data center is not inclusion.
These are not far-off dreams; each is a concrete, buildable project: a semantic search index, a fast Swahili TTS voice, a robust STT model, a small instruction-tuned LLM. Every one is an opening for local innovation, local ownership, and local jobs.
A call to the community. If you are a speaker: share and label data. A developer: improve the code, the tokenizer, the evaluations. A researcher: study what works for agglutinative, low-resource languages. An institution with data or compute: this is where it matters most. A creator: build the apps that put this in people's hands.
Open-source is how we move faster than the gap grows. Take this model, break it, improve it, fork it, fine-tune it, ship something with it. Push the language forward. The future of Swahili in technology will be built by the people who speak it. Let's build it in the open.
Pamoja tunaweza. Tujenge teknolojia inayozungumza lugha yetu. (Together we can. Let us build technology that speaks our language.)
Why build from scratch, not just adapt an existing model?
A fair question: why not take an open foundation model (Llama, Mistral, …) and fine-tune it for Swahili? Because adaptation inherits the wrong foundations:
Vocabulary / tokenizer: meaning first, then efficiency. Their tokenizers are built for English and a few high-resource languages. Swahili is agglutinative: meaning is built inside single words from stacked morphemes, for example hawakuweza (ha-wa-ku-weza, "they were not able") and tunaomba (tu-na-omba, "we are asking"). An English-first tokenizer shatters these into arbitrary subword fragments that don't line up with the morphemes, so the model never sees the word as a unit of meaning and has to reassemble sense from broken pieces.
- The first cost is linguistic: the model fails to capture the essence and structure of the language correctly, the very thing that should be its focus.
- The second cost is mechanical: more tokens per sentence → longer sequences, more compute and memory, smaller effective context, higher inference cost. It is a permanent tax you pay on every token, forever.

A tokenizer built for Swahili (like this 32k one) keeps words intact and encodes the language both accurately and efficiently.
Focus / priors. A model pre-trained overwhelmingly on English has its representations, world model, and "instincts" shaped by English. Swahili is a thin layer bolted on top, competing for capacity it never owned. Built from scratch, the entire model is Swahili: every parameter works for the language, its morphology, idiom, and context.
Accuracy & cultural fit. Bolt-on Swahili tends to be calques of English patterns; nuance, code-switching (Kiswaenglish), and local context get lost. Native training learns the language as it is actually spoken and written.
Sovereignty & cost. A from-scratch, open model is one the community owns, understands, and can run cheaply, not a black box rented from elsewhere.
Adaptation is a useful shortcut for some tasks. But to genuinely serve the language efficiently, accurately, and on cheap hardware, the foundation itself has to be built for it.
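The mechanical cost of tokenization discussed in this section is usually measured as tokens per word (sometimes called fertility). The sketch below shows the metric only; `chunk_tokenizer` is a hypothetical stand-in that fragments words blindly, not any real model's tokenizer, and the two sample words are the ones used above, not a benchmark.

```python
def fertility(tokenize, sentences):
    """Average number of tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words


def chunk_tokenizer(text, size=3):
    # Hypothetical tokenizer: cuts each word into fixed-size fragments
    # that ignore morpheme boundaries.
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


sample = ["hawakuweza", "tunaomba"]
print(fertility(str.split, sample))        # whole-word baseline
print(fertility(chunk_tokenizer, sample))  # fragmenting tokenizer
```

To compare real tokenizers, one could pass a callable such as `lambda s: sp.encode(s, out_type=str)` for a loaded SentencePiece processor `sp`, and run both tokenizers over the same held-out Swahili text. A higher value means longer sequences for the same sentence, and therefore more compute per sentence.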
The numbers behind the gap
This isn't a vibe; it's in the training data of the models themselves:
- GPT-3 (OpenAI): ~93% English by word count; every other language on Earth shares the remaining ~7%. (Language Models are Few-Shot Learners)
- Llama 2 (Meta): 89.7% English. The next-largest languages are German (0.17%), French (0.16%), Swedish (0.15%)…, each well under 0.2%. Swahili doesn't appear at all: Meta only lists languages above 0.005%, and Swahili falls below that. Meta explicitly warned the model "may not be suitable" for non-English use. (Llama 2 paper, Table 10)
- GPT-4 / today's frontier models: training mixes are undisclosed, closed black boxes.

So a language spoken by 200M+ people is, in the foundations of modern AI, effectively a rounding error (≈0%). Capability that shows up as a side-effect of massive scale is not the same as capability the community owns and controls.
"But ChatGPT / Gemini / Claude already speak Swahili, so why bother?"
Fair, and true: today's frontier closed models (GPT-5.5, Gemini, Claude) are genuinely good at Swahili now. But "good at Swahili" is not the whole problem:
- They are closed black boxes: you can't inspect, audit, fine-tune, or own them.
- They can't be self-hosted or run offline: every query goes to someone else's server, which means online-only, metered, and subject to their price, terms, and availability.
- They offer no local deployment and no data sovereignty: no on-device option, no guarantee your data stays in your hands or your country.

Meanwhile small open LLMs are getting genuinely capable, and that is exactly where an open Swahili model matters most:
- Offline mobile apps that work with no connection and no data bundle.
- Self-hosted / on-prem deployments for institutions that cannot send data to a third party.
- Privacy-sensitive domains (health, finance, government) where data must never leave the device or the country.
- Developer experimentation: free to fork, fine-tune, and build on, with no API keys or rate limits.
- Offline assistants for low-connectivity regions, where "just call the cloud" isn't an option.
A closed model that happens to speak Swahili and an open model built for Swahili solve different problems. This project is for the second kind: owned, local, private, and free to build on.
Credits & acknowledgments
- Original project: zuck30/swahili-llm-scratch by Shadrack (Shadrackovsky), the source of the from-scratch Swahili LLM design, synthetic-data pipeline, tokenizer, and inference scripts this work builds on.
- PyTorch port + large-corpus, resumable training + this release: Benjamin-png.
- Data: Inkuba-Mono-Swahili (Alfaxad) and the Mendeley Swahili Corpus (CC BY 4.0).
- Lineage: the Transformer (Attention Is All You Need) and Andrej Karpathy's nanoGPT teaching/tooling.
License
CC-BY-4.0 (matching the source corpora). Please keep attribution and pay it forward.