Adapting a Pre-trained LLM for Kikuyu Language Output from English Text Input
Goal: fine-tune a pre-trained large language model to produce Kikuyu output from English text input, running entirely on local Apple Silicon hardware with no cloud dependency during training.
Declared future phases:
- Phase 1 (understanding): Research, Environment Setup, and Dataset Verification
- Phase 2 (text): Extend to bidirectional translation (kik→en) and domain generalisation
- Phase 3 (voice): Generate Kikuyu speech output from English input, combining the text model with a TTS layer (target resources: Meta MMS facebook/mms-tts-kik, AfriVoices-KE corpus)
- Phase 4 (deployment): Optimise for low-connectivity, low-literacy contexts at scale; design parameters are inherited from the broader product context of autonomous voice and digital agents powering high-volume customer interactions across African markets
- Research, Environment Setup, and Dataset Verification
- Data Preparation
- Data Preprocessing
Log
Research, Environment Setup, and Dataset Verification
1. Research Findings
A full landscape survey was conducted before any code was written.
Prior Artefacts (Most Relevant)
InterstellarCG/kikuyu-translator-final (Kelvin Ng'eno, 2026): Gemma 3 4B fine-tuned with LoRA on ~141K Kikuyu sentence pairs using a 4-stage curriculum. Final training loss: 0.746. No BLEU score published. Currently the strongest documented open-source Kikuyu LLM recipe.
Khaya AI (NLP Ghana + Harvard African Language School): Best published BLEU scores: >11 (en→kik), >16 (kik→en) on FLORES-200. Closed weights. Current community SOTA.
NLLB-200-600M-distilled (Meta, 2022): First open model with native Kikuyu (kik_Latn) pretraining. Zero-shot baseline: ~9 BLEU (en→kik), ~3 BLEU (kik→en) on FLORES-200.
Key Insight — Register Bias Problem
Approximately 80% of all available Kikuyu parallel data comes from Jehovah's Witnesses publications. Training naively on this data produces a model that translates into archaic, formal, liturgical Kikuyu. The solution is a 4-stage curriculum training schedule (documented in Entry 001, Section 4).
Dataset Confirmed
michsethowusu/english-gikuyu_sentence-pairs_mt560 — 94,242 English-Kikuyu sentence pairs, sourced from OPUS MT560 (predominantly JW publications). License: CC-BY-4.0. Column names: eng, kik.
BLEU Benchmarks — What Counts as Progress
BLEU scores are measured against FLORES-200 (en→kik direction):
- Raw LLM, no fine-tuning — <2 BLEU — Unusable
- NLLB-200 zero-shot baseline — ~9 BLEU — Gist only
- Target for this experiment, Stage 1 — ≥12 BLEU — Useful translation
- Khaya AI (current community best) — >11 BLEU — Solid
- Strong result — 15–20 BLEU — Comparable to low-resource SOTA
- SOTA Bantu MT — >20 BLEU — Requires large, diverse dataset
Note on metrics: BLEU must always be paired with chrF++ for Kikuyu. Kikuyu is agglutinative — one concept can produce multiple valid surface forms by adding prefixes and suffixes. BLEU penalises morphological variation that chrF++ correctly rewards. A BLEU of 10 with chrF++ of 40+ can represent genuinely useful translation.
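The gap between word-level and character-level scoring can be seen in a small sketch. The functions below are a simplified, sentence-level character n-gram F-score in the spirit of chrF and a plain exact-match word overlap; they are not the sacrebleu implementations. The hypothesis string is an illustrative variant made by shortening one word of a dataset sample, not a verified translation.

```python
from collections import Counter


def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_char_f(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average character n-gram precision and recall over n = 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    return f_beta(sum(precisions) / len(precisions),
                  sum(recalls) / len(recalls), beta)


def word_f(hyp: str, ref: str) -> float:
    """Exact-match word overlap: a word that differs by one affix scores zero."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    return f_beta(overlap / sum(h.values()), overlap / sum(r.values()), 1.0)


reference = "nĩ gũteithagia mũndũ kũiganĩra"
hypothesis = "nĩ gũteithia mũndũ kũiganĩra"  # illustrative variant, not verified Kikuyu

print("word-level F:", round(word_f(hypothesis, reference), 3))
print("char-level F:", round(simple_char_f(hypothesis, reference), 3))
```

The character-level score gives partial credit for the shared stem, while the word-level score treats the changed word as a complete miss. For reported numbers, sacrebleu remains the tool named in this log.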
2. Tool Stack Decisions
Training framework — MLX-LM 0.31.3: purpose-built for Apple Silicon unified memory; 15–30% faster than PyTorch-MPS.
Base model candidate 1 — NLLB-200-600M-distilled: encoder-decoder; native Kikuyu pretraining; full fine-tune fits 24 GB; strongest baseline.
Base model candidate 2 — Gemma 3 4B: decoder-only; no Kikuyu pretraining; QLoRA via MLX; instruction-capable; proven recipe (InterstellarCG).
Inference / comparison — Ollama (Gemma4:e4b): interactive testing against baseline; not used for training.
Evaluation — sacrebleu (spBLEU + chrF++): standard MT evaluation; FLORES-200 kik_Latn as held-out test set.
Dataset download — HuggingFace datasets 4.8.5: standard library for HF Hub access.
Tools Ruled Out
- Unsloth — CUDA-only; does not run on Apple Silicon (as of May 2026)
- Axolotl — CUDA-centric; MPS backend unreliable for full training workflows
- PyTorch — Not installed; replaced entirely by MLX
3. Environment Setup — Verified State
- Python: 3.14.4
- pip: 26.0.1
- Virtual env: .venv (activated, project-scoped)
- MLX: 0.31.2
- MLX-Metal: 0.31.2 (Apple GPU backend, confirmed active)
- MLX-LM: 0.31.3
- Transformers: 5.8.0
- Datasets: 4.8.5
- SentencePiece: 0.2.1
GPU verification result:
Device(gpu, 0)
MLX is routing computation to the M2 GPU via Metal. CPU fallback is not occurring.
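A check of this kind can be sketched as follows. It assumes the result above came from MLX's default-device query; the wrapper function name is hypothetical, and the import is guarded so the snippet also runs on machines without MLX.

```python
def mlx_default_device():
    """Return MLX's default device as a string, or None if MLX is not installed."""
    try:
        import mlx.core as mx
    except ImportError:
        return None
    return str(mx.default_device())


device = mlx_default_device()
print(device if device is not None else "MLX not installed")
```

On the machine described in this log the printed value would be expected to match the result shown above; a CPU device there would indicate a fallback.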
Note on PyTorch warning: transformers issues a "PyTorch not found" warning at import. This is expected and harmless. MLX-LM uses the MLX backend; PyTorch is not needed and should not be installed, because it would introduce a slower MPS training path.
4. Dataset Inspection — Findings
- Dataset: michsethowusu/english-gikuyu_sentence-pairs_mt560
- Rows: 94,242
- Splits: train only (no validation or test; these must be created manually)
- Columns: eng, kik
Sample Rows (Actual Data)
"Princess" ! → Mwarĩ wa Mũthamaki .
He Causes to Become → No Ahingie Kĩĩranĩro o Gĩothe Angĩĩranĩra
That is where "godly devotion" comes into the picture. → "Kwĩamũrĩra Ngai" nĩ gũteithagia mũndũ kũiganĩra.
...Revelation 21:3,4... → ...Kũguũrĩrio 21:3,4...
Observations
Observation 1 — Diacritics confirmed present. Characters ĩ and ũ appear in the raw data. Not corrupted at source.
Observation 2 — Register bias confirmed at first look. All of the first five rows are religious or JW-publication content. The 4-stage curriculum training is not optional; it is the direct response to this structure.
Observation 3 — Fragment pairs present. Row 0 ("Princess!" → "Mwarĩ wa Mũthamaki .") is a section heading, not a translatable sentence. Preprocessing must filter pairs below a minimum token length. Planned threshold: discard pairs where either side is fewer than 3 words.
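The planned threshold can be sketched as a small predicate. This is a minimal illustration that assumes "words" means whitespace-separated tokens and that pairs are held as dictionaries keyed by the dataset's column names; the function name is hypothetical.

```python
def is_fragment(eng: str, kik: str, min_words: int = 3) -> bool:
    """True when either side has fewer than min_words whitespace-separated words."""
    return len(eng.split()) < min_words or len(kik.split()) < min_words


pairs = [
    {"eng": "Princess!", "kik": "Mwarĩ wa Mũthamaki ."},
    {"eng": 'That is where "godly devotion" comes into the picture.',
     "kik": '"Kwĩamũrĩra Ngai" nĩ gũteithagia mũndũ kũiganĩra.'},
]

# Keep only pairs where both sides meet the minimum word count.
kept = [p for p in pairs if not is_fragment(p["eng"], p["kik"])]
print(len(pairs), "->", len(kept))
```

Applying the test to either side means a one-word heading is dropped even when its translation happens to be longer.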
Planned Train/Validation/Test Split
- Train: ~85,000 rows (90%)
- Validation: ~4,700 rows (5%)
- Test: ~4,700 rows (5%)
- FLORES-200 devtest (1,012 sentences) reserved as independent held-out evaluation set — never seen during training
5. Tokeniser Round-Trip Test — Results
Tokeniser tested: NLLB-200-distilled-600M (SentencePiece BPE, vocab size ~256,000)
- Mwarĩ wa Mũthamaki . : ✅ PASS
- Kwĩamũrĩra Ngai nĩ gũteithagia mũndũ kũiganĩra . : ✅ PASS
- nake nĩekũmagiria maithori mothe mathire maitho mao : ✅ PASS
All Kikuyu diacritics survived encoding and decoding without corruption. The tokeniser can be trusted for training data preparation.
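A round-trip test of this shape can be written independently of any particular tokeniser. The sketch below takes the encode and decode steps as parameters and, so that it runs without downloading a model, demonstrates them with a UTF-8 byte codec standing in for the tokeniser. Comparing after NFC normalisation and whitespace collapsing is an assumption about what counts as a pass, not a description of the test that was actually run.

```python
import unicodedata
from typing import Callable, Sequence


def _normalise(text: str) -> str:
    """NFC-normalise and collapse runs of whitespace before comparing."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def round_trip_ok(text: str,
                  encode: Callable[[str], Sequence[int]],
                  decode: Callable[[Sequence[int]], str]) -> bool:
    """Encode to IDs, decode back, and check the text survived."""
    return _normalise(decode(encode(text))) == _normalise(text)


# Stand-in codec: UTF-8 bytes play the role of token IDs.
# With a Hugging Face tokeniser one would instead pass something like
# tok.encode and lambda ids: tok.decode(ids, skip_special_tokens=True).
def byte_encode(text: str) -> list:
    return list(text.encode("utf-8"))


def byte_decode(ids: Sequence[int]) -> str:
    return bytes(ids).decode("utf-8")


samples = [
    "Mwarĩ wa Mũthamaki .",
    "Kwĩamũrĩra Ngai nĩ gũteithagia mũndũ kũiganĩra .",
    "nake nĩekũmagiria maithori mothe mathire maitho mao",
]
for s in samples:
    print("PASS" if round_trip_ok(s, byte_encode, byte_decode) else "FAIL", s)
```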
Observation on token granularity: "Mũthamaki" appears to map onto a single token ID (248,221), indicating NLLB-200 has this word in its vocabulary from Kikuyu pretraining. "Kwĩamũrĩra" maps to multiple tokens — split into subword pieces. Single-token words are learned more efficiently; multi-token words require the model to reconstruct meaning from parts. This difference is a direct consequence of how much Kikuyu data was seen during NLLB-200's original pretraining.
6. NLLB-200 Knowledge Distillation — Planned Integration
Scheduled for: Stage 2 of curriculum training (Diversification phase)
NLLB-200-600M-distilled will be used as a teacher model to generate synthetic training pairs. Specifically:
- Collect Kenyan English monolingual text: agricultural extension materials (e.g., CGIAR datasets), news content, health information
- Run NLLB-200 inference locally to translate these English sentences into Kikuyu
- Spot-check synthetic pairs for quality (human review of sample)
- Fold accepted synthetic pairs into Stage 2 training data alongside CGIAR/Bloom Library material
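This addresses the data scarcity problem without any architectural changes. NLLB becomes the teacher; Gemma (the student) inherits Kikuyu knowledge through examples. Training a student on a teacher's generated outputs in this way is a form of knowledge distillation (sequence-level distillation), as opposed to matching the teacher's output probabilities.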
Rationale: The Stage 2 diversification pass currently has limited non-religious data (~1,000–2,000 pairs from CGIAR + Bloom). Synthetic pairs from NLLB-200 inference can expand this to 5,000–10,000 modern-register examples, which meaningfully changes the model's default register.
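The spot-check step in the plan above could be organised along these lines. This is a sketch only: the record fields, the sample size, and the idea of an acceptance rate as the quality signal are all assumptions, and the Kikuyu side is shown as a placeholder because no synthetic pairs exist yet.

```python
import random


def sample_for_review(pairs, k, seed=0):
    """Draw a reproducible random sample of synthetic pairs for human review."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def acceptance_rate(reviewed):
    """Share of reviewed pairs that the reviewer marked as acceptable."""
    if not reviewed:
        return 0.0
    return sum(1 for r in reviewed if r["accepted"]) / len(reviewed)


# Placeholder records: "kik" would hold the teacher model's output.
synthetic = [{"id": i, "eng": f"English sentence {i}", "kik": "<teacher output>"}
             for i in range(1000)]

batch = sample_for_review(synthetic, k=50, seed=42)
print(len(batch), "pairs selected for review")
```

A fixed seed lets a second reviewer pull the same sample, which matters once a quality threshold is defined.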
7. Architecture Clarification Log
The following corrections were made during conceptual development (documented from handwritten exercises):
Correction 1 — Model Type Labels
- Original (incorrect): NLLB-200 = decoder-only; Gemma 3 4B = encoder-decoder
- Corrected: NLLB-200 = encoder-decoder; Gemma 3 4B = decoder-only
- Why it matters: NLLB's encoder-decoder design is purpose-built for translation (full source comprehension before generation begins). Gemma's decoder-only design learns translation as a text completion pattern.
Conceptual Question — Hybrid Architecture
Question: Can NLLB-200's encoder be attached to Gemma's decoder?
Answer: Architecturally yes (analogous to LLaVA's vision-encoder + language-decoder design), but requires adding cross-attention layers to Gemma — research-grade work, not feasible with MLX-LM. Practical equivalent: knowledge distillation (see Section 6 above).
Conceptual Question — What Token Arrays Represent
Token IDs (e.g., [72, 1847, 248221]) are lookup keys into the embedding matrix, not the neural weight matrices themselves. Each ID retrieves a dense vector of ~1,024 numbers. These vectors pass through transformer attention layers where contextual meaning is built. LoRA adapters are inserted alongside the weight matrices in those attention layers — not in the embedding lookup. Only the adapter matrices (A and B, rank 16) are updated during fine-tuning; the rest of the model is frozen.
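The adapter arithmetic can be made concrete with a toy example. The sketch below uses plain Python lists rather than an ML framework, assumes a square weight matrix, and uses the standard LoRA convention that B starts at zero; the `scale` argument stands in for the usual alpha/rank factor.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x). W is frozen; only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_out, d_in, rank):
    """Trainable values in one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank


# Toy sizes: hidden size 4, rank 2.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # frozen weights
A = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
B = [[0.0, 0.0] for _ in range(4)]  # zero-initialised: adapter starts as a no-op
x = [1.0, 2.0, 3.0, 4.0]

print(lora_forward(W, A, B, x))  # equals W x until B is updated

# At the sizes mentioned in the text (hidden size 1,024, rank 16):
print(lora_param_count(1024, 1024, 16), "trainable vs", 1024 * 1024, "frozen")
```

At those sizes the adapter holds 32,768 trainable values against 1,048,576 in the frozen matrix, about 3% of it.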
8. Open Questions Carried Forward
- Model choice: NLLB-200-600M-distilled vs Gemma 3 4B — decision deferred to the end of the preprocessing phase, when data volume and domain balance are clearer.
- HuggingFace authentication: Required to download Gemma 3 4B (gated model). Token setup scheduled for next session.
- Fragment threshold: 3-word minimum filter planned — exact number to be confirmed after running length distribution analysis on the full dataset.
- Synthetic data volume: How many NLLB-200-generated pairs are needed for Stage 2? Target range: 5,000–10,000. Quality threshold to be defined.
- Evaluation cadence: Run sacrebleu against FLORES-200 after every curriculum stage or only at the end? Leaning toward: after each stage, to catch regressions early.
Data preprocessing: analysis and cleaning
What was done
Ran a length distribution analysis on the raw 94,242-row dataset, then a cleaning pass that filtered fragment pairs, removed duplicates, and verified encoding integrity across the full dataset.
What was discovered
The raw dataset is cleaner than anticipated. No empty fields in either column. Only three exact duplicate pairs. Zero encoding errors across 93,859 surviving rows. Fragment pairs (380 rows, 0.4% of data) are section headings, glossary entries, and page numbers from JW publication structure, consistent with the single example identified in Phase 1.
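The log does not record how duplicates and encoding errors were detected, so the following is one plausible sketch rather than the method used. Duplicates are matched on the exact (eng, kik) pair, and encoding damage is detected heuristically via the Unicode replacement character and the byte patterns that ĩ and ũ turn into when UTF-8 is misread as Latin-1.

```python
def drop_exact_duplicates(pairs):
    """Keep the first occurrence of each (eng, kik) pair, preserving order."""
    seen, unique = set(), []
    for p in pairs:
        key = (p["eng"], p["kik"])
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


# U+FFFD, plus the mojibake forms of ĩ and ũ (UTF-8 bytes read as Latin-1).
DAMAGE_MARKERS = ("\ufffd", "Ä©", "Å©")


def has_encoding_damage(text):
    """Heuristic check for common signs of a bad decode."""
    return any(marker in text for marker in DAMAGE_MARKERS)


rows = [
    {"eng": "He Causes to Become", "kik": "No Ahingie Kĩĩranĩro o Gĩothe Angĩĩranĩra"},
    {"eng": "He Causes to Become", "kik": "No Ahingie Kĩĩranĩro o Gĩothe Angĩĩranĩra"},
]
clean = drop_exact_duplicates(rows)
damaged = [r for r in clean if has_encoding_damage(r["kik"])]
print(len(rows), "->", len(clean), "rows;", len(damaged), "with encoding damage")
```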
One edge case
Row [0] of the cleaned dataset is still '" Princess " !' → 'Mwarĩ wa Mũthamaki .'. It survived the ≥3-word filter because its whitespace-delimited word count is genuinely 4. The cause is specific to JW publication typesetting: the source text places literal spaces between the quotation marks and the word, producing the string " Princess " ! with spaces around each punctuation character. Python's .split() splits on those spaces and returns four tokens: ", Princess, ", !. This is not a quirk of .split(): the method has no awareness of punctuation and does not treat it specially. The spaces are in the source text. If the string had no spaces ("Princess"!), .split() would return one token and the filter would remove the pair correctly.
Accepted as a known edge case: the JW typesetting convention inflates word counts for certain pairs. More aggressive filtering (stripping punctuation before counting) would catch these, but it would introduce complexity disproportionate to the scale of the problem.
Correction made this session: An earlier explanation stated that .split() treats punctuation as separate tokens. This is incorrect. The correct explanation is above. Verified by independent research.
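The behaviour described for this edge case is easy to reproduce. The second function is a hypothetical punctuation-aware count of the kind the stricter filter would need; it was not part of the cleaning pass.

```python
spaced = '" Princess " !'   # spaces around the punctuation, as in the source text
compact = '"Princess"!'     # same content without the spaces

print(spaced.split())   # ['"', 'Princess', '"', '!']
print(compact.split())  # ['"Princess"!']


def word_count(text):
    """Whitespace-only count, as used by the current filter."""
    return len(text.split())


def word_count_ignoring_punct(text):
    """Count only tokens that contain at least one letter or digit."""
    return sum(1 for tok in text.split() if any(ch.isalnum() for ch in tok))


print(word_count(spaced), word_count_ignoring_punct(spaced))
```

Under the punctuation-aware count the English side drops to one word, so the three-word threshold would remove the pair.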
Diacritic integrity confirmed at scale: 91,389 of 93,859 rows (97.4%) contain ĩ or ũ. Rows without these specific diacritics are valid Kikuyu text using other vowel forms.
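A coverage figure like this can be computed with a few lines. The sketch assumes the check should be insensitive to case and to whether the tilde is stored as a precomposed character or as a combining mark, which is why it normalises to NFC first; whether the original count did so is not recorded.

```python
import unicodedata

I_TILDE = "\u0129"  # ĩ
U_TILDE = "\u0169"  # ũ


def has_kikuyu_diacritic(text):
    """True if the text contains ĩ or ũ, in either case or Unicode form."""
    nfc = unicodedata.normalize("NFC", text).lower()
    return I_TILDE in nfc or U_TILDE in nfc


def diacritic_coverage(rows):
    """Fraction of rows that contain at least one ĩ or ũ."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if has_kikuyu_diacritic(r)) / len(rows)


rows = ["Mwarĩ wa Mũthamaki .", "Ngai"]
print(diacritic_coverage(rows))

# The reported figure, recomputed from the counts in the text.
print(round(91389 / 93859 * 100, 1))
```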
Numbers
Fragment filter (< 3 words either side)
- Before: 94,242 — After: 93,862 — Removed: 380
Deduplication
- Before: 93,862 — After: 93,859 — Removed: 3
Total removed: 383 rows
Retention rate: 99.6%
Concepts understood this session
- .fillna('') replaces null values only; whitespace-only strings are left unchanged.
- | is a logical OR that produces a boolean mask. It is not a pipe as in a command-line shell.
- [:60] truncates the displayed text; it does not filter rows (corrected).
- & keeps rows where both conditions are true; | keeps rows where either is true.
- random_state pins the shuffle for reproducibility.
- .split() splits on whitespace only and has no awareness of punctuation.
Decisions made
Three-word threshold confirmed appropriate by examples. "Princess" edge case accepted as known noise with correct mechanism now recorded. Proceeding with 93,859 cleaned pairs.
Open questions
Whether the JW typesetting convention (spaces around punctuation) affects enough pairs to warrant a regex-based pre-filter that strips punctuation before counting. Scale unknown — not prioritised.
Next step
Step 5: train/val/test split (90% / 5% / 5%) on the 93,859 cleaned pairs.
Data Preprocessing: Splitting and JSONL conversion
What was done
Split the 93,859 cleaned pairs into train, validation, and test sets. Converted all three to JSONL format for MLX-LM training.
Split results
Train (90%)
- Rows: 84,473
Validation (5%)
- Rows: 4,693
Test (5%)
- Rows: 4,693
Total check: 84,473 + 4,693 + 4,693 = 93,859 ✓
Shuffled with random_state=42 before splitting. All future runs reproduce the same split.
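The split arithmetic can be reproduced with the standard library. This is a stand-in, not the code that was run: the `random_state` name suggests pandas or scikit-learn, and a stdlib shuffle with the same seed will not yield the same row order. The rounding rule (truncate the training share, then halve the remainder) is an assumption that happens to reproduce the reported counts.

```python
import random


def split_sizes(n, train_frac=0.90):
    """Truncate the training share, then divide the remainder between val and test."""
    n_train = int(n * train_frac)
    n_val = (n - n_train) // 2
    n_test = n - n_train - n_val
    return n_train, n_val, n_test


def shuffled_split(rows, seed=42, train_frac=0.90):
    """Shuffle a copy with a fixed seed, then slice into train / val / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train, n_val, _ = split_sizes(len(rows), train_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


print(split_sizes(93859))  # (84473, 4693, 4693)

train, val, test = shuffled_split(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed is what makes the held-out rows stable across runs; changing it would move rows between splits and leak test data into later training runs.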
JSONL format
Each entry follows the MLX-LM instruction format:
{"prompt": "Translate to Kikuyu: {eng}", "response": "{kik}"}
Diacritics verified intact across all three files. Written with ensure_ascii=False and encoding='utf-8'.
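Writing and re-reading a file in this format might look like the sketch below. The key names mirror the format line above; the key names MLX-LM accepts depend on its version and configuration, so they are worth checking against the installed release. The file name and helper names are hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import json
import os
import tempfile

PROMPT_TEMPLATE = "Translate to Kikuyu: {eng}"


def to_record(eng, kik):
    """Build one training record from a sentence pair."""
    return {"prompt": PROMPT_TEMPLATE.format(eng=eng), "response": kik}


def write_jsonl(path, pairs):
    """One JSON object per line; ensure_ascii=False keeps ĩ and ũ readable."""
    with open(path, "w", encoding="utf-8") as f:
        for eng, kik in pairs:
            f.write(json.dumps(to_record(eng, kik), ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Parse every non-empty line back into a dictionary."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


pairs = [('That is where "godly devotion" comes into the picture.',
          '"Kwĩamũrĩra Ngai" nĩ gũteithagia mũndũ kũiganĩra.')]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    write_jsonl(path, pairs)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()
    records = read_jsonl(path)

print(raw_text.strip())
```

Without `ensure_ascii=False` the file would still decode correctly, but every diacritic would be stored as a `\uXXXX` escape, which makes manual inspection of the training data much harder.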
Step 6 status
CGIAR agricultural pairs and Bloom Library download deferred. These datasets feed Stage 2 Diversification training, not Stage 1 Foundation. They do not block Checkpoint 2.
Training will be a continuous process, borrowing the agile principle of "every product is a versioned prototype."
Checkpoint 1 status
Complete for Stage 1 training purposes. The three JSONL files are the only inputs required for Checkpoint 2 (model setup) and Checkpoint 3 (Stage 1 training run).
Next step
Checkpoint 2: choose between NLLB-200-distilled-600M and Gemma 3 4B. Download the chosen base model. Verify tokeniser handles Kikuyu diacritics. Run a 10-step smoke test.