Fine-tuning Whisper for Laz: An End-to-End Journey

Posted on June 8, 2026

Taruen is a language technology studio that puts equal emphasis on supporting all languages, regardless of speaker numbers. Turkey is home to many languages besides Turkish, and one of them is Laz — a Kartvelian language spoken primarily along the southeastern Black Sea coast, listed by UNESCO as definitely endangered.

While based in Istanbul, we had the privilege of meeting people from the Laz Institute, an organization founded in 2013 dedicated to preserving, developing, and revitalizing the Laz language. They do extensive work: producing textbooks, organizing language courses (Laz is now an elective in some Turkish schools and universities), publishing books through the Lazika Yayın Kollektifi, and crucially for our purposes, contributing voices to Mozilla Common Voice.

The Institute also operates the LazuriTV YouTube channel, which hosts roughly 105 hours of spoken Laz content — about 4× the data available on Common Voice (28 hours). However, most of these videos lack transcriptions or subtitles.

Our previous involvement with speech recognition was a 2021 paper on low-resource ASR for Turkic languages. Three motivations brought us back: sharpening our own skills, catching up on the latest developments in speech-to-text, and helping the Laz Institute eventually transcribe their YouTube archive so it can feed back into Common Voice. So we decided to fine-tune a Whisper-small ASR model on the Common Voice Laz data.

This post documents how we did it.

Why another guide?

Great fine-tuning guides already exist.

So why write another? Besides showcasing our own work, we wanted to:

Show how to fine-tune Whisper on a language it didn't support at release — Laz isn't in OpenAI's original language list
Cover the workflow end-to-end with Mozilla Data Collective as the data provider, with all fine-tuning code in a single file you can tweak
Show how to run the pipeline on a proper GPU cloud when Colab/Kaggle hit their limits — which they will

The data: Common Voice Laz via Mozilla Data Collective

Common Voice Laz has 34,909 recorded clips totaling around 28 hours. We accessed it via the Mozilla Data Collective Python SDK, which returns a pandas DataFrame with audio file paths and transcriptions plus a split column with Mozilla's official train / dev / test / validated assignments.

A crucial detail we initially missed: Mozilla speaker-separates these splits. We verified this explicitly:

# Verify zero speaker overlap across official splits
train_speakers = set(df[df['split']=='train']['speaker_id'].dropna())
test_speakers  = set(df[df['split']=='test']['speaker_id'].dropna())
dev_speakers   = set(df[df['split']=='dev']['speaker_id'].dropna())

print(f"train vs test overlap: {len(train_speakers & test_speakers)}")
print(f"train vs dev overlap:  {len(train_speakers & dev_speakers)}")
print(f"dev vs test overlap:   {len(dev_speakers & test_speakers)}")
# All zero — Mozilla's speaker separation holds

For Laz, the official splits are: train (5,051 rows, 7 speakers), dev (3,551 rows, 12 speakers), test (3,495 rows, 93 speakers), and validated (21,151 rows, 112 speakers — quality-checked but not assigned to a specific split). We used train + validated for training, dev for evaluation during training, and held test out entirely for final evaluation.
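Because training draws on train + validated, it is worth extending the same overlap check to the validated rows as well. A minimal sketch, assuming the split and speaker_id columns described above; the helper name speaker_overlaps and the toy rows are hypothetical and not part of the Mozilla Data Collective SDK:

```python
from collections import defaultdict
from itertools import combinations


def speaker_overlaps(rows):
    """Count shared speakers for every pair of splits.

    `rows` is an iterable of dicts with 'split' and 'speaker_id' keys,
    e.g. df[['split', 'speaker_id']].to_dict('records') for the DataFrame above.
    """
    speakers = defaultdict(set)
    for row in rows:
        speaker_id = row.get("speaker_id")
        # Skip missing IDs (None or NaN), mirroring the dropna() in the check above
        if speaker_id is None or speaker_id != speaker_id:
            continue
        speakers[row["split"]].add(speaker_id)
    return {
        (a, b): len(speakers[a] & speakers[b])
        for a, b in combinations(sorted(speakers), 2)
    }


# Toy rows for illustration only, not real Common Voice data
toy_rows = [
    {"split": "train", "speaker_id": "s1"},
    {"split": "validated", "speaker_id": "s2"},
    {"split": "validated", "speaker_id": "s3"},
    {"split": "dev", "speaker_id": "s4"},
    {"split": "test", "speaker_id": "s3"},  # deliberately leaked speaker
    {"split": "test", "speaker_id": None},
]

overlaps = speaker_overlaps(toy_rows)
for pair, shared in overlaps.items():
    print(pair, shared)
```

Any non-zero count for a pair that mixes training rows (train, validated) with dev or test would mean the evaluation is not fully speaker-independent.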

Adding Laz to Whisper's vocabulary

Whisper doesn't know about Laz. The model uses special tokens like <|en|>, <|tr|>, <|ka|> to mark the target language. To add Laz, we added a custom <|lzz|> token and resized the model's embedding layer:

new_lang_token = "<|lzz|>"
processor.tokenizer.add_tokens([new_lang_token], special_tokens=True)
model.resize_token_embeddings(len(processor.tokenizer))

We then forced the decoder to start every transcription with our custom language token:

lang_id          = processor.tokenizer.convert_tokens_to_ids(new_lang_token)
task_id          = processor.tokenizer.convert_tokens_to_ids("<|transcribe|>")
no_timestamps_id = processor.tokenizer.convert_tokens_to_ids("<|notimestamps|>")

manual_forced_ids = [(1, lang_id), (2, task_id), (3, no_timestamps_id)]
model.config.forced_decoder_ids            = manual_forced_ids
model.generation_config.forced_decoder_ids = manual_forced_ids
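To make the position indices concrete: position 0 of the decoder input is always <|startoftranscript|>, and each (position, token_id) pair pins the token generated at that position. A minimal sketch of the resulting prompt; it uses token strings instead of real IDs for readability, and decoder_prefix is a hypothetical helper, not part of transformers:

```python
def decoder_prefix(start_token, forced):
    """Lay out the forced decoder prompt from (position, token) pairs."""
    prefix = [start_token]
    for position, token in sorted(forced):
        # This sketch only handles a contiguous prefix starting at position 1
        if position != len(prefix):
            raise ValueError(f"gap before forced position {position}")
        prefix.append(token)
    return prefix


forced = [(1, "<|lzz|>"), (2, "<|transcribe|>"), (3, "<|notimestamps|>")]
prefix = decoder_prefix("<|startoftranscript|>", forced)
print(" ".join(prefix))
```

Everything the model generates after this four-token prefix is the transcription itself.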

The Colab debacle

Our first attempt was on Google Colab Pro. The data preprocessing pipeline — which loads audio files, computes mel spectrograms, and tokenizes labels — hung indefinitely at 0% on A100 and A100 High-RAM runtimes. Same code worked on T4 instances. We initially blamed everything from PyArrow to HuggingFace datasets chunking before identifying the actual cause:

OpenMP thread pool deadlock in PyTorch's STFT operation. PyTorch's audio feature extraction uses OpenMP for parallelism, and on high-CPU instances the thread pool grew large enough to deadlock during fork operations. T4 instances escaped it because they have fewer cores and thus a smaller thread pool. The fix is one environment variable:

export OMP_NUM_THREADS=1

This is a known issue (PyTorch #17199) but not well documented in the context of Whisper feature extraction. It's also a good lesson: if your training script "just hangs" with no error on a beefier machine, suspect threading before suspecting your code.
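The same cap can be applied from inside Python, as long as it happens before the numerical libraries are imported, since the OpenMP runtime reads the variable when it initializes. A minimal sketch; capping MKL_NUM_THREADS and OPENBLAS_NUM_THREADS as well is an extra precaution assumed here, not something the fix above requires, and the helper name is hypothetical:

```python
import os


def cap_native_threads(n=1):
    """Limit native thread pools; call before importing torch or numpy."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n)


cap_native_threads(1)
print({k: v for k, v in os.environ.items() if k.endswith("_NUM_THREADS")})
```

Setting the variables in the script itself means the cap travels with the code rather than depending on whoever launches the job remembering the export.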

After the deadlock fix, Colab still had issues — runtime disconnections overnight, lost checkpoints because we forgot to save to Google Drive. We pivoted to a proper GPU cloud.

Running on RunPod

We moved to RunPod. An A100 80GB SXM instance at $1.49/hour ran the full pipeline in about 5 hours, for a total of ~$8 per training run. Compared to the gymnastics required to keep Colab alive overnight, an SSH-accessible VM with tmux and proper checkpoint persistence is a different world.

Setup:

# After SSHing into the RunPod instance:
pip install transformers accelerate evaluate jiwer tensorboard \
            "datasets[audio]" datacollective librosa pandas pyarrow

# Set the API key and cap OpenMP threads (the deadlock fix from above)
export MDC_API_KEY=your_key_here
export OMP_NUM_THREADS=1

# Run inside tmux so a dropped SSH connection doesn't kill the job
tmux new -s laz
python3 train_whisper_laz.py
# Ctrl+B then D to detach

The key sections of the training script are included at the end of this post.

First run: catastrophic overfitting

Our naive first run used a random 90/10 train/test split and no text normalization, weight decay, or audio augmentation. By step 1500, the training loss was approaching zero while validation WER had plateaued at 42%. Classic memorization.

Worse, the random 90/10 split meant the same speaker could appear in both train and test, contaminating the evaluation. The 42% number wasn't even an honest representation of generalization.
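To see why a row-level split leaks speakers, compare it with a split that assigns whole speakers to one side. A minimal sketch on synthetic data; the function names and the 20-speaker toy corpus are made up for illustration and are not how Mozilla builds its splits:

```python
import random


def random_row_split(rows, test_fraction=0.1, seed=0):
    """Shuffle individual clips, ignoring who recorded them."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def speaker_split(rows, test_fraction=0.1, seed=0):
    """Assign whole speakers to one side so no voice appears in both."""
    speakers = sorted({speaker for speaker, _ in rows})
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [r for r in rows if r[0] not in test_speakers]
    test = [r for r in rows if r[0] in test_speakers]
    return train, test


def shared_speakers(train, test):
    """Speakers that appear on both sides of a split."""
    return {s for s, _ in train} & {s for s, _ in test}


# Synthetic data: 20 speakers with 50 clips each (not the real corpus)
rows = [(f"spk{i:02d}", f"clip{i:02d}_{j:02d}") for i in range(20) for j in range(50)]

print(len(shared_speakers(*random_row_split(rows))))  # almost certainly > 0
print(len(shared_speakers(*speaker_split(rows))))     # 0 by construction
```

With many clips per speaker, a random row split puts nearly every test speaker in the training set too, so the test score partly measures how well the model memorized those voices.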

Second run: methodology improvements

For the second run we applied several improvements:

1. Official Mozilla splits: Train on train + validated, evaluate on dev, hold out test entirely. Speaker-separated by construction.

2. Text normalization: Whisper outputs are cased and punctuated. If your dataset has inconsistent punctuation, the model wastes capacity learning to guess it and gets penalized in WER whenever its guess differs from the reference, even when the words are right. We strip punctuation and lowercase before training and evaluation:

import re

def normalize_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)   # keep word chars, whitespace, and apostrophes (for Laz ejectives like k')
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

3. Regularization: Added weight_decay=0.01 and turned on Whisper's dropout (off by default):

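# Caveat: Whisper's layers copy these values when they are constructed, so assigning
# them after from_pretrained() may not change the already-built model; passing
# attention_dropout=0.1, activation_dropout=0.1 to from_pretrained() avoids the question.
model.config.attention_dropout  = 0.1
model.config.activation_dropout = 0.1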

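4. Speed perturbation augmentation: Half the time, stretch the audio to 0.9× or 1.1× speed (25% each). Note that librosa's time_stretch changes tempo while preserving pitch, so strictly speaking this is tempo perturbation rather than the resampling-based speed perturbation that also shifts pitch. It is free data multiplication and pushes the model to learn speaking-rate-invariant features: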

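import librosa
import numpy as np

SPEED_RATES = [0.9, 1.0, 1.0, 1.1]   # two of four entries are 1.0, so 50% no augment

# Pick a rate per clip, stretch at the native sample rate, then resample to Whisper's 16 kHz
rate = SPEED_RATES[np.random.randint(len(SPEED_RATES))]
audio_array, sr = librosa.load(audio_path, sr=None, mono=True)   # audio_path: path to one clip
if rate != 1.0:
    audio_array = librosa.effects.time_stretch(audio_array, rate=rate)
audio_array = librosa.resample(audio_array, orig_sr=sr, target_sr=16000)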

5. Earlier stopping: Capped at 2000 steps instead of 4000, since the first run made it clear nothing useful happens after ~step 1750.
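One caveat on the normalizer in step 2: its character class keeps only the straight apostrophe, so a typographic apostrophe (’) would be stripped and an ejective written as k’ would collapse into plain k. Whether the Common Voice Laz transcripts contain such variants is an open question here; the variant below is a hypothetical sketch that folds them into ' first, and both the list of variants and the test strings are illustrative:

```python
import re

# Apostrophe look-alikes to fold into the straight apostrophe (assumed list)
APOSTROPHE_VARIANTS = ("\u2019", "\u2018", "\u02bc")


def normalize_text_folded(text: str) -> str:
    """Same steps as normalize_text, with apostrophe variants unified first."""
    text = text.lower()
    for variant in APOSTROPHE_VARIANTS:
        text = text.replace(variant, "'")
    text = re.sub(r"[^\w\s']", "", text)      # drop punctuation, keep apostrophes
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text


print(normalize_text_folded("Dido  ini on."))
print(normalize_text_folded("k\u2019ai"))
```

Whichever variant is used, the same function has to be applied to both references and predictions, otherwise the WER comparison is skewed.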

Results

The improved methodology produced these numbers on the held-out Mozilla test set (3,495 examples, never seen during training):

Run	Dev WER	Test WER
Naive (random split, no normalization, no regularization)	42%	not measured (split was contaminated)
Improved (official splits + normalization + regularization + augmentation)	26.08%	28.48%

The dev/test gap of about 2.4 points is normal and indicates the model isn't overfit to the dev set either. We were honestly surprised the model worked this well for a language Whisper had never seen — a testament to Whisper's strong multilingual phonetic priors.

Looking at actual predictions on the test set, most errors are diacritic confusions or single-character swaps:

ğormotik gamaǩçǩvidan        | ngormotik gamamçkvidan      ← both words wrong (4 character edits)
dido ini on                  | dido ini on                 ← perfect
aya lemşik va duçvinasinon   | aya lemşik va duçvinasinon  ← perfect
nana baba do bere isa renan  | nana baba do bere isa renan ← perfect
pucis tzǩuni uğun            | puciş tzǩuni uğun           ← single char swap

The character error rate would likely be much lower than the word error rate — maybe ~10-12% — since most errors are partial-word.
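Word and character error rates are both edit distances divided by reference length, just at different granularities. The pip install earlier pulls in evaluate and jiwer, which are the usual tools for this; the pure-Python sketch below is only an illustration of the arithmetic, applied to the five lines listed above, and it assumes the left column is the reference (the listing does not say):

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_counts(pairs, unit):
    """Return (total edits, total reference length) at word or char level."""
    edits = length = 0
    for ref, hyp in pairs:
        # NFC so letters like ǩ count as one character however they are encoded
        ref = unicodedata.normalize("NFC", ref)
        hyp = unicodedata.normalize("NFC", hyp)
        if unit == "word":
            ref, hyp = ref.split(), hyp.split()
        edits += edit_distance(ref, hyp)
        length += len(ref)
    return edits, length


# The five pairs listed above; the left column is assumed to be the reference
pairs = [
    ("ğormotik gamaǩçǩvidan", "ngormotik gamamçkvidan"),
    ("dido ini on", "dido ini on"),
    ("aya lemşik va duçvinasinon", "aya lemşik va duçvinasinon"),
    ("nana baba do bere isa renan", "nana baba do bere isa renan"),
    ("pucis tzǩuni uğun", "puciş tzǩuni uğun"),
]

word_edits, word_total = error_counts(pairs, "word")
char_edits, char_total = error_counts(pairs, "char")
print(f"WER {word_edits}/{word_total} = {word_edits / word_total:.1%}")
print(f"CER {char_edits}/{char_total} = {char_edits / char_total:.1%}")
```

On these five lines that comes to 3 word edits out of 18 words and 5 character edits out of 102 characters. Three of the five lines are perfect, so the sample illustrates the gap between the two metrics but says little about the corpus-level CER.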

Trying it on real-world audio

We ran the model on a 6-minute Laz story from the LazuriTV YouTube channel — out-of-domain conversational speech, not isolated Common Voice sentences. The output was fluent Laz throughout, with consistent diacritic usage and natural word boundaries:

heva ǩa xalaşi mtsxuli metasi çiruğai va moxtaşa opute mtsxuli do oşkuri mogorum badi şuri çirdumtu...

A native speaker would still find errors, but the model produces something that's recognizably and consistently Laz — not English-tinged gibberish or Turkish-leaking output.

What's next

A few things we explicitly did not solve here:

Punctuation and case restoration. Since we normalized everything to lowercase without punctuation during training, this system is more accurately called an ASR (Automatic Speech Recognition) than a full STT (Speech-to-Text). True STT requires restoring sentence boundaries, capitalization, and punctuation — usually as a separate post-processing step using a small language model. We'll cover this in a future post.

Efficient deployment. Having a fine-tuned model is one thing; serving it efficiently for inference is another. Whisper-small runs comfortably on CPU but a production system would want batching, streaming, and possibly quantization. We'll cover that separately too.

Closing the data loop. The most impactful next step isn't a better model — it's more data. With a working (if imperfect) Laz ASR model in hand, the natural next step is to transcribe the LazuriTV YouTube archive and have Laz speakers correct the output. Correcting a 28% WER transcription is dramatically faster than transcribing from blank audio, so this human-in-the-loop workflow could realistically double or triple the training set. The Laz Institute is the right partner for this — they have both the linguistic expertise and the community connections to make it happen.

If you're a Laz speaker who would like to help, or an organization working with under-resourced languages and looking for similar work, get in touch.

The complete training script

The full train_whisper_laz.py is available on our GitHub. The key sections:

# train_whisper_laz.py
import os
os.environ["OMP_NUM_THREADS"] = "1"   # critical: prevents PyTorch STFT deadlock; set before torch is imported

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# ... [data loading, model setup, audio extraction, collator, metrics] ...

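training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-laz",
    per_device_train_batch_size=64,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=2000,                  # capped early; see "Earlier stopping" above
    fp16=True,
    eval_strategy="steps",
    eval_steps=250,                  # evaluate on dev every 250 steps
    save_steps=250,                  # must line up with eval_steps for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,         # lower WER is better
    weight_decay=0.01,
    predict_with_generate=True,      # assumption, not in the original excerpt: lets compute_metrics score generated text
)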

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["dev"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    processing_class=processor.feature_extractor,
)

trainer.train()
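As a sanity check on the step budget, the settings above can be translated into passes over the data. A small worked calculation using the row counts from the official splits; it assumes a single GPU, gradient_accumulation_steps left at its default of 1, and no clips dropped during preprocessing, none of which the excerpt states explicitly:

```python
# Row counts from the official splits listed earlier in the post
TRAIN_ROWS = 5_051
VALIDATED_ROWS = 21_151

# Values from the training arguments above
BATCH_SIZE = 64
MAX_STEPS = 2_000
WARMUP_STEPS = 500

# Assumptions: one GPU, gradient_accumulation_steps == 1, no rows filtered out
training_rows = TRAIN_ROWS + VALIDATED_ROWS
examples_seen = BATCH_SIZE * MAX_STEPS
steps_per_epoch = training_rows / BATCH_SIZE
epochs = examples_seen / training_rows
warmup_share = WARMUP_STEPS / MAX_STEPS

print(f"training rows:    {training_rows}")
print(f"steps per epoch:  {steps_per_epoch:.0f}")
print(f"epochs at cap:    {epochs:.1f}")
print(f"warmup share:     {warmup_share:.0%}")
```

Under these assumptions the 2,000-step cap corresponds to roughly five passes over the training rows, with the first quarter of the run spent on learning-rate warmup.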