FlipStar: A Survey of the Current State of Filipino ASR


TL;DR

The inaccessibility of quality, standard datasets for training and evaluating Filipino speech-to-text or automatic speech recognition (ASR) systems has made it difficult to advance Filipino voice AI in its home country, the Philippines. Through global efforts and the advent of open-source recipes and pretrained models, we can circle back and bridge the gap between the past and present state of Filipino speech recognition. Using standard labeled evaluation datasets in Filipino, I tabulated the out-of-the-box, zero-shot performance of existing ASR systems. Trained on weakly labeled YouTube videos, a next-gen Kaldi-based quantized (int8) zipformer transducer model outperforms Whisper’s large-v3 model when ground truths are normalized to lowercase with apostrophes as the only punctuation. Quick-and-dirty subjective qualitative tests once again show that evaluation metrics need to be carefully qualified and taken with a grain of salt. This is a work in progress.

Motivation

I have two main goals in this blog post. First, I’ll take you through my nearly 20-year journey working on a reliable Filipino automatic speech recognition (ASR) system. Along the way, I’ll explain the challenges I faced, particularly why some earlier efforts didn’t succeed and even aggravated my already debilitating perfectionism. Second, I’ll do a small performance rundown of some of the available off-the-shelf models. We’ll look at where we stand now with this technology and explore what the future holds.

My Initial Encounter with ASR

I was first introduced to ASR back in 2006 for my undergrad project requirement at the University of the Philippines in Diliman (UPD). At that time, it wasn’t as advanced as it is now—there was no Siri or Alexa, and the technology was still quite raw. My project advisor, who suggested I explore this area, had some experience with data collection but not much beyond that. We were all fascinated by the simple yet groundbreaking concept of a computer transcribing spoken words. However, since I was doing it merely as part of our curriculum, I was under tight time constraints. This didn’t allow any of us the luxury to experiment or thoroughly replicate existing research, which often omitted details critical to a successful implementation. Moreover, we had to adapt the technology to the Filipino language directly from theory. Adding to our challenges, the tools we now take for granted, like widespread versioning and open-source repositories, weren’t available—$\texttt{svn}$ (Subversion) was the go-to, and $\texttt{git}$ was barely a year old.

Dataset Challenges and Evolution

At the heart of every successful demonstration of automatic dictation using ASR is a good labeled training dataset. Sadly, my observation and experience tell me that we only took the quality aspect seriously when the end-to-end and Transformer architecture revolution came. Data collection is a huge undertaking, and a lot of careful planning is necessary to ensure that the requirements for ease of development and effective training are met.

Early Initiatives and the Tagalog-Filipino Conundrum

For Filipino ASR there’s also the issue of clearly differentiating between Tagalog and Filipino, which wasn’t much discussed in our group at the time. The initial attempt was to build a database called the Filipino Speech Corpus (FSC), which despite its name mostly (if not completely) used Tagalog for the reading prompts. This dataset also included recordings of syllable sequences intended for developing text-to-speech technologies. There was even a separate volume for spontaneous speech, which was never properly labeled or transcribed. Recordings were made in long continuous sessions, which meant that precise timestamping was crucial. The designers of the corpus focused so much on creating elaborate documentation that they overlooked the importance of high-quality labeling. For instance, they devised a file-naming system with placeholder variables intended for future classification but skipped the more important discussion of ensuring label quality. Unfortunately, the metadata and labels were poorly managed, leading to unreliable and, in the end, mostly unlabeled data. Although the prompts were designed to be phonetically balanced, they ended up lacking diverse phonetic articulation contexts. There was also little variation in the prompts, with many participants reading the same sentences, making it tough to create effective training, validation, and testing (train-dev-test) splits.

Progress and Partnerships (But Still Plenty of Problems!)

In 2009, I was fortunate enough to be invited to the Karlsruhe Institute of Technology (KIT) in Germany, where I experienced firsthand the development of cutting-edge voice-enabled systems. I was there with the team behind Jibbigo, the first offline speech translation app for mobile devices. Watching their live demos of lectures translated from German into English captions in real time—functional and live—opened my eyes to the vast potential of this technology, an impression that has stayed with me ever since. Prior to coming to KIT, in preparation for the training, we had to create a Filipino language dataset while I was still doing my Master’s in the Philippines. This, I believe, was called the CMU-KIT Filipino corpus, built through a collaboration with Carnegie Mellon University. As prompts, we were provided machine-translated English texts from standard NLP resources covering basic expressions, travel, and medical fields, plus additional prompts from news articles sourced from the internet. However, the machine translations often sounded like archaic sentence constructions in Tagalog, and some of the Filipino words were merely transliterations. Despite these issues, we were not allowed to make any changes to the prompts.

When I returned in 2010, I was part of a new program called the Interdisciplinary Signal Processing for Pinoys (ISIP; “Pinoy” is a colloquial term for a Filipino person). This program tackled various projects, including developing corpora and collecting spoken datasets for the ten most spoken languages in the Philippines. Focusing the discussion on Tagalog: the team designing the text prompts didn’t take into account the lessons from the earlier Filipino Speech Corpus. Consequently, the new corpus still lacked the variety and depth necessary to support the development of practical systems. I, on the other hand, brought in the lessons from my training at KIT and had to focus on an immediate application of Filipino ASR, settling on subtitling news broadcasts. I didn’t have much of a choice but to force the existing, mismatched datasets to make captioning possible. While there were certainly discernible improvements, the lack of rigor and of post-mortem analysis of past experiences clearly turned much of this into wasted effort.

Global Efforts and How Far ASR Has Come

In the US, the government’s IARPA agency has rolled out several projects that included the Filipino language. One notable initiative is the BABEL program, which produced the IARPA Babel Tagalog Language Pack, developed by Appen. This package, known as IARPA-babel106-v0.2g, contains around 213 hours of Tagalog (actually Filipino) conversations and scripted calls from 2012, complete with transcripts. The BABEL program aimed to develop speech recognition technologies for less commonly supported languages to improve keyword search in large speech datasets. Following this, between 2016 and 2017, the MATERIAL program was launched, which also included Filipino datasets, some derived from the BABEL program, plus additional snippets from TV news programs. A key point to note about the Filipino data from the BABEL program is that it was recorded over a poor-quality mobile (wireless) channel. Although I haven’t conducted a detailed analysis, it’s clear from initial observations that many parts of the speech were intermittently cut off and “choppy”, and the overall sound quality was quite similar to muffled, almost old-fashioned narrowband telephone calls.

As for recent developments, while there are some lesser-known (not to mention, not free) APIs for Filipino speech recognition, most people now prefer Whisper by OpenAI, introduced in 2022. Its published paper provides insights into how multiple languages were incorporated, including Filipino via FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech), which comes with an established train-dev-test split.
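For readers who want to reproduce this zero-shot setup, here is a minimal sketch using the open-source whisper package; note that Whisper files Filipino under the Tagalog language code “tl”, and the audio path below is a placeholder:

```python
# Minimal zero-shot Filipino transcription with OpenAI's open-source
# whisper package; Whisper exposes Filipino under the Tagalog code "tl".
# "sample.wav" is a placeholder path.
import whisper

model = whisper.load_model("large-v3")  # weights are downloaded on first use
result = model.transcribe("sample.wav", language="tl")
print(result["text"])
```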

Summary of Existing Labeled Filipino Speech Datasets

Here’s a quick summary of the datasets I’ve mentioned. The hours listed are rough estimates of the speech that has transcription labels, not counting any silent segments. Note that some datasets don’t have clearly defined splits; I’ve included all of these in the training category:

| Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Filipino Speech Corpus (FSC) (vol. 1-4) | Random sources | Narrated | 52.5 | 0 | 0 | Normalised | User Agreement |
| ISIP Corpus (Filipino subset) | Random sources | Narrated¹ | 33.5 | 0 | 0 | Normalised | User Agreement |
| CMU-KIT Filipino Speech Corpus | Machine-translated basic, travel, and medical texts; news from online sources | Narrated, spontaneous (aware) | 28 | 0 | 0 | Normalised | User Agreement |
| IARPA BABEL Tagalog | Crawled text, conversational | Narrated, spontaneous | 115 | 13 | 85(?)² | Punctuated & cased | User Agreement |
| IARPA MATERIAL (excluding BABEL data) | Broadcast news (YouTube?) | Spontaneous | 7 | 7 | 3.5 | Punctuated & cased | User Agreement |
| FLEURS (Filipino subset) | Random sources | Narrated | 8 | 2 | 5 | Normalised; punctuated & cased | CC-BY-4.0 |

This survey underscores a common issue across datasets: a lack of variability and of established splits, both of which are crucial for developing robust ASR systems. For example, FSC lacked comprehensive phonetic coverage, and many datasets were left without carefully established labels.

Aside from what’s already discussed, there are also other unlabeled Filipino speech datasets from other public releases, originally intended for other use cases:

| Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | License |
| --- | --- | --- | --- | --- | --- | --- |
| Filipino Speech Corpus (FSC) (vol. 5) | Random topics | Spontaneous (aware) | 6.5 | 0 | 0 | User Agreement |
| International Corpus of English (Filipino subset)³ | Random topics | Oratory, spontaneous (aware) | 29.5 | 0 | 0 | User Agreement |
| VoxLingua107 (Filipino subset) | YouTube | Possibly mixed | 93 | 0 | 0 | CC-BY-4.0-DEED |
| YouTube-8M (Filipino subset) | TV programs | Narrated, acted, spontaneous | 39 | 0 | 0 | CC-BY-4.0 |

There may be other datasets mentioned in various research papers that I haven’t included here, either because they aren’t openly available or I don’t have access to them. Researchers are welcome to reach out if they’d like their work to be included for a more comprehensive overview.

Developing Solutions and Benchmarking

FlipVox: A New Beginning

Driven by the shortcomings observed in earlier datasets and the need to keep up with advances in the technology, I initiated a personal project called “Project Flipside” in 2021. This mini project led to a decently performing Kaldi-based speech recognition model—though I admit, that’s just my personal assessment. I trained the model by painstakingly correcting and normalizing the existing labels from some of the datasets I’ve discussed and made it publicly available through AlphaCephei’s Vosk (thanks to Nickolay Shmyrev). Now, with the rise of Large Language Models (LLMs), I’ve come to realize the expanding potential of conversational and speech AI technologies. Motivated by this, I decided to go a step further and hit the reset button: rebrand what I’m working on under a DBA (“doing business as”) label while fundamentally reassessing the situation (as outlined in this report). Hence, FlipVox, which I like to abbreviate as “f7x,” was born. Hopefully, it will turn into something that contributes to the advancement of Filipino speech AI.
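For anyone curious to try a Vosk model like that one, here is a minimal decoding sketch with the vosk Python package; the model directory name and audio path are placeholders, and the audio is assumed to be 16-bit mono PCM WAV:

```python
# Minimal file decoding with the vosk Python package; the model directory
# ("vosk-model-tl-ph") and "sample.wav" are placeholders. Vosk expects
# 16-bit mono PCM audio.
import json
import wave

from vosk import Model, KaldiRecognizer

model = Model("vosk-model-tl-ph")
wf = wave.open("sample.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)  # feed the audio in small chunks,
    if len(data) == 0:          # mimicking a streaming input
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])
```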

Present Solution: Weak Labels to the Rescue

From the surveys I conducted, I soon realized that many projects working with low-resource languages have turned to YouTube to collect data. As I previously mentioned, early efforts to design a speech corpus fell short because they didn’t account for the extensive data needs of modern end-to-end systems. This insight led me to shift my focus toward compiling a more substantial dataset, mirroring recent successful initiatives. While downloading and extracting audio from YouTube is straightforward, labeling it is time-consuming and resource-intensive. It took me several months to gather enough data, generate weak labels with my existing models, and iteratively refine those labels; I’ve named the resulting collection the FlipVox dataset. I eventually used this larger (and better) dataset to upgrade my model to a more cutting-edge transducer using next-gen Kaldi (e.g. Zipformer), now showcased in demos on Hugging Face. In the comparative evaluation section of this report, I’ll refer to this system as the $\texttt{f7x-transducer}$.
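To make the collection loop concrete, here is a rough sketch of one download-and-weak-label cycle. The yt-dlp invocation is standard, but asr_decode() is a hypothetical stand-in for whichever current-best model generates the imperfect transcripts; this is the shape of the pipeline, not the exact implementation:

```python
# Sketch of the YouTube collection and weak-labeling loop described above.
# asr_decode() is hypothetical: substitute whatever current-best model
# produces the (imperfect) transcripts, then retrain and re-label.
import subprocess
from pathlib import Path


def fetch_audio(video_id: str, out_dir: Path) -> Path:
    """Download one video's audio track as 16 kHz mono WAV via yt-dlp."""
    subprocess.run(
        [
            "yt-dlp", "-x", "--audio-format", "wav",
            "--postprocessor-args", "-ar 16000 -ac 1",
            "-o", str(out_dir / "%(id)s.%(ext)s"),
            f"https://www.youtube.com/watch?v={video_id}",
        ],
        check=True,
    )
    return out_dir / f"{video_id}.wav"


def weak_label(wav_path: Path) -> str:
    """Produce an imperfect transcript to be refined in later passes."""
    return asr_decode(wav_path)  # hypothetical placeholder decoder
```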

Benchmarking Efforts

Inspired by the Open ASR Leaderboard on Hugging Face, though not quite as thorough and unbiased, I’ve set out to evaluate the effectiveness of current Filipino ASR systems through a running leaderboard of sorts, which, as part of the FlipVox suite, I’d like to name FlipStar. I’m starting with five systems: my last Kaldi model using vosk as the decoder ($\texttt{vosk-fil-v2}$⁴), my updated icefall-based $\texttt{f7x-transducer}$ via sherpa-onnx, a fine-tuned wav2vec2 model from Hugging Face ($\texttt{wav2vec2-khalsuu}$), the Filipino model from the Azure Cognitive Services API, and the Whisper models. Looking ahead, I plan to include more models as they become available and to refine the process for a more apples-to-apples comparison. For now, I will use the standard validation and test sets where I have access to ground truth labels. To clarify: none of the models I trained used any of these validation and test sets for training (i.e. updating parameters). The first round of inference results, in terms of word-level accuracy (100 minus the percent word error rate, %WER⁵), is in the table below (higher numbers are better):

| ASR system | $\texttt{FLEURS-valid}$ | $\texttt{FLEURS-test}$ | $\texttt{BABEL-valid}$ | $\texttt{MATERIAL-valid}$ |
| --- | --- | --- | --- | --- |
| $\texttt{azure-filipino}$ | 60.92 | 59.01 | 40.67 | 72.84 |
| $\texttt{f7x-transducer}$ (int8) | 89.90 | 89.78 | 69.03 | 80.35 |
| $\texttt{vosk-fil-v2}$ | 84.09 | 83.85 | 1.64 | 62.78 |
| $\texttt{wav2vec2-khalsuu}$ | 56.04 | 57.18 | 8.54 | 32.34 |
| $\texttt{whisper-medium}$ | 82.45 | 83.34 | 27.53 | 65.12 |
| $\texttt{whisper-large-v2}$ | 87.80 | 87.81 | 29.33 | 67.48 |
| $\texttt{whisper-large-v3}$ | 89.29 | 89.28 | 36.74 | 71.93 |

We can see that the $\texttt{f7x-transducer}$ system outperforms the other systems evaluated. However, we’ll need additional time and effort to qualify this result. I’ll break down the key details to keep in mind:

  1. Ground truths and inference outputs were both converted to lowercase and, except for apostrophes, all punctuation was removed. As much as possible, post-processing settings were chosen to optimize the performance of every system. (See the sketch after this list.)
  2. There are other factors not taken into consideration here. For example, the f7x-transducer model used here is comparable in size to whisper-base and is a quantized (int8) model, so the larger Whisper models actually have an advantage.
  3. It’s important to note that this comparison is not a reflection of the modeling capabilities of the ASR architectures being tested. It’s only a basic zero-shot test of how well each system performs right out of the box, with no fine-tuning and no external language model involved. For instance, a user who needs a quick transcription solution can immediately choose the system most likely to accurately transcribe a given recording without any context.
  4. We haven’t differentiated between systems operating in offline versus streaming modes. Notably, the vosk decoder system operates in streaming mode, which could be a slight handicap in our comparison.
  5. Creating an ideal evaluation method that accounts for actual substitution, insertion, and deletion errors is challenging. The diversity in dataset labeling standards and the tendency of ASR systems to prefer certain text formatting can influence results. For now, our goal is to gauge the baseline performance of each system across various datasets.
  6. Each of these systems was trained on different datasets and subjected to different regularization, perturbation, and augmentation techniques. This is why the latest icefall model does better on BABEL: it has seen the BABEL training data.
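To make point 1 concrete, here is a minimal sketch of the normalization and scoring, assuming the jiwer package for WER: lowercase everything, strip all punctuation except apostrophes, then report word-level accuracy as 100 minus %WER.

```python
# Sketch of the scoring in point 1 (assumes the jiwer package):
# lowercase, keep only word characters and apostrophes, then report
# word-level accuracy as 100 - %WER.
import re

import jiwer


def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation except apostrophes
    return " ".join(text.split())


def accuracy(reference: str, hypothesis: str) -> float:
    wer = jiwer.wer(normalize(reference), normalize(hypothesis))
    return 100.0 * (1.0 - wer)
```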

Quick Qualitative Comparisons ($\texttt{f7x-transducer}$ vs $\texttt{whisper-large-v3}$)

For anyone interested, I can provide the evaluation logs behind the performance table above (viewed side by side using Meld). However, to make the qualitative comparison more interesting, I decided to look at the longest audio samples from the unlabeled datasets and comment on how the two best systems (on average) perform at different levels of transcription difficulty, relative to their numbers on the benchmark table. In particular, from the nature of the benchmarking datasets, we can categorize them into three levels of difficulty:

  1. FLEURS is EASY to transcribe because the recordings are clean
  2. MATERIAL is NORMAL, as it comes from a variety of sources on YouTube
  3. BABEL is HARD because of the extreme channel conditions, and its spontaneous sentence constructions are very difficult to transcribe even for humans

Based on this, we can use FSC’s spontaneous subset for EASY, any random media from YouTube-8M or VoxLingua107 for NORMAL, and any low-recording-quality audio from the Filipino subset of ICE (with the added bonus of the corpus being in lossy MP3 format) for HARD.

EASY: FLEURS and FSC volume 5

This is from the FLEURS test set:

$\texttt{f7x-transducer}$:

subali’t ang mga planong ito ay naging lilipas na sa halos magdamag nang mahigit sa 800 na ang mga sundalo mula sa pulang hukbo ng unyong sobyet ay lumusob at bumuo ng mga larangan at ukrainian matapos salakayin ang mga silangang rehiyon ng poland na paglabag sa kasunduang pangkapayapaan sa riga ang soviet olakong kasunduan ng kawalang agresyon at iba pang mga kasunduang i internasyonal o multilateral

$\texttt{whisper-large-v3}$:

Subalit ang mga planong ito ay naging lipas na sa halos magdamag ng mahigit sa 800,000 na ang mga sundalo mula sa pulang hukbo ng Union Soviet ay lumusog at bumuo ng mga larangan Belarusian at Ukrainian matapos sa lakayin ang mga silangang rehyon ng Poland na paglabag sa kasunduang pangkapayapaan sa Riga, ang Soviet-Polakong kasunduan ng kawalang agresyon at iba pang mga kasunduang international, bilateral man o multilateral.

Right off the bat, we can see how Whisper’s post-formatting (inverse text normalization and punctuation) yields more appealing output than most ASR systems trained on normalized labels. Again, the benchmarking numbers are agnostic to this readability advantage. Given that the accuracy numbers of the two systems are close on the FLEURS sets, Whisper could easily beat f7x-transducer under this setting.

This is from the FSC volume 5 set:

$\texttt{f7x-transducer}$:

hello grabe labo nandito ako ngayon di ko alam kung bakit kahapon kasi championship mo basketball ng nara kalaban yung i c grabe sobra sobrang dumi talaga ng laro ng nara nakakainis wala na silang ginawa kundi magmura magreklamo na magreklamo sa referee yung iba yung isa nga binato pa yung bola sa referee nakakainis talaga parang di basta kakainis mong laro kakainis pa rin kaya ayun nakita ko si ano nakita ko si gift nakipag usap kay hulyo yun narinig ko sabi niya o bukas na lang bayad ha kung gusto mo kung kailangan mong pera dibigay ko sayo eh di tinanong ko si gift kung bakit ano ba yung meron kaya sabi niya sabi niya kay julio ayun meron palang sa tripley daw merong ano merong record record ng boses kaya sabi ko sama na ako dyan para ano kaya ayun niyaya ako ni gift kaya sabi niya bukas na lang eh nakita ko siya kanina ng paggising ko naglalaro sila ng basketball ewan ko ano ba yung kalaban nun batangan ata yun naglaro sila ng sa balangan kaya tinanong ko siya kung okay na ba kung pwede ako sumama eh di sabi niya sige oo mamaya pagkatapos muna ako ng laro ng laro nila yung mas okay pa kaysa sa laro kagabi grabe talaga kagabi sobra wala wala ng kasang dumi yung laro puro mga hiritan mabuti hindi nauwi sa sapakan kaya natuwa pa ako sa laro kanina kahit na mga ewan ko pangit mga hindi masyadong magagaling he joke lang basta magaling din yung batangan tsaka ano tsaka ano ba yung kalaban yun kalilayan kina gift kina deep tas iba pang mga player don kaya lang nakakainis kasi hindi naman mga taga quezon talaga yung mga player nila mabuting batangan puro mga batanggenyo yung sa kanila nandun si kalinaw hindi naman ata taga quezon yun tsaka si spokes hindi naman taga quezon yun tas kinukuha nila para para maglaro sa province nila parang daya din noh kakainis nakakahi mga ganon basta ganun tapos nun sabi eh nung natapos na yung laro sabi na ni gift o di punta na tayo tapos ay naglalaba pa ako sabi ko sige sunod na lang ako kaya magtapos kung may paglalaba ko kaya ano pagkatapos nun

$\texttt{whisper-large-v3}$:

Hello. O, grabe. Labo, nandito ako ngayon. Di ko alam kung bakit. Kahapon kasi, championship magbasketball ng Nara. Kalaban yung IC. Grabe, sobra. Sobrang dumi talaga ng laro ng Nara. Nakakainis. Wala na silang ginawa kundi magmura. Magreklamo na magreklamo sa referee. Yung iba, yung isa nga binato pa yung bola sa referee. kakainis talaga, parang di, basta, kakainis mong laro, kakainis pa rin. Kaya yun, nakita ko si, ano, nakita ko si Gif, nakipag-usap kay Julio. Narinig ko, sabi niya, oh, bukas na ang bayad ha, kung gusto mo, kung kailangan mong pera, dibigay ko sa’yo. Eh, di tinanong ko si Gif, kung bakit, ano ba yung meron? Kaya sabi niya, sabi niya kay Julio, ayun, meron palang sa Triple E daw, merong, ano, merong record-record ng bosses. Kaya sabi ko, Saman ako dyan para ano, kaya yun, niyayay ako ni Giff. Kaya sabi niya, bukas na lang. Nakita ko siya kanina nang pagising ko, naglalaro sila ng basketball. Ayaw ko ano ba yung kalaban noon, batangan ata. Naglaro sila sa batangan, kaya tinanong ko siya kung okay na ba, kung pwede akong sumama. Sabi niya, sige, oo, mamaya pagkatapos. Inulod muna ako ng laro ng laro nila. Mas okay pa kaysa sa laro kagabi. Grabe talaga. Kagabi sobra. Wala. Wala lang kasi dumi yung laro. Puro mga hiritahan. Mabuti hindi nagnawis sa sapakan. Natuwa pa ako sa laro kanina kahit na mga ewan ko pangit. Mga hindi masyadong magagaling. Joke lang. Basta, magaling din yung batangan tsaka ano, tsaka ano ba yung kalaban? Yung kalila yan, kinagif, kinadif. Sige pa bang mga player doon? Kaya lang nakakainis kasi hindi naman mga taga-Kesan talaga yung mga player nila. Mabuting batangan, puro mga batanggenyo. Isa kanila, nandun si Kalinaw, hindi naman ata taga-Kesan yun. Tsaka si Fox, hindi naman taga-Kesan yun. Kasi kinukuha nila para Para maglaro sa province nila, parang daya din, no? Kakainis, kakainis mga ganun. Basta, ganun. Tapos nun, sabi, eh, nung natapos na yung laro, sabi, sinabi na ni Giff, oh, dey, punta na tayo. Tapos, eh, naglalabab pa ako, sabi ko, sige, sunod na lang ako. Kaya magtaposin ko muna yung paglalabab ako. Kaya nun, pagkatapos nun,

Notwithstanding the post-formatting, both systems are head-to-head in this case.

NORMAL: MATERIAL and a YouTube sample from YouTube-8M / VoxLingua107

This is from the MATERIAL dev set:

$\texttt{f7x-transducer}$:

at si manny mismo ’no ’nong nakausap din natin siya <hes> few days ago sinasabi niya na i sc very very grateful ano thankful siya doon sa mga blessings at ang dami dami daw talagang blessings na kaniyang tinatamasa ngayon at hinding hindi niya ‘yan makakalimutan kaya ang ginagawa niya rin is really mag share is hindi lamang do’n sa mga pamilya niya ’no na mga kaanak niya kung hindi sa lahat ng pupuwedeng pag share niya ng kaniyang blessings at <hes> ma maiba lang tayo <hes> partner kanina supposedly ay mayroon pa siyang afternoon gym session session ano pero <hes> hindi na ito tumuloy ang ating pambansang kamao dahil na pinili na lamang niya na magpahinga dito sa kaniyang bahay at kakagising lamang niya no’ng simula kaninang after lunch no’ng uh pagkauwi niya do’n sa kaniyang <hes> training medyo nagkuwentuhan dito sa kaniyang bahay at hindi na nga ito nag <hes> nagpasya ito na talagang magpapahinga na lamang dahil may jet flag pa rin ang ating pambansang kamot

$\texttt{whisper-large-v3}$:

At si Manny mismo, nung nakausap din natin siya a few days ago, sinasabi niya na he’s very, very grateful, thankful siya dun sa mga blessings. At ang dami-dami daw talagang blessings na kanyang tinatamasa ngayon at hindi niya yan makakalimutan. Kaya ang ginagawa niya rin is really mag-share, hindi lamang dun sa mga pamilya niya, na mga kaanak niya, kung hindi sa lahat na pupwede yung pag-share niya ng kanyang blessings. At maiba lang tayo partner. Kanina, supposedly ay meron pa siyang afternoon gym session. Pero hindi na ito tumuloy ang ating pambansa kamo dahil pinili na lamang niya na magpahinga dito sa kanyang bahay. At kakagising lamang niya simula kanina after lunch nung pagkauwi niya doon sa kanyang training. Medyo nagpwentuhan dito sa kanyang bahay. At hindi na nga ito, nagpa siya ito na talaga magpapahinga na lamang dahil may jet lag pa rin ang ating pambansa kamo.

Again, we see more pleasing output from Whisper. The advantage of f7x-transducer is that it better reflects the actual nuances of the spoken words, including disfluencies and fillers. Clearly, since it was trained with BABEL labels, tags such as <hes> appear often.

The next audio is from the YouTube-8M set:

$\texttt{f7x-transducer}$:

let’s talk about a show you know people are still asking ano ba talaga yung yung palabas na your face sounds familiar in a language na maiintindihan namin ano ba talaga ito well ano sya it’s it’s it’s all celebrities lahat ng mga idolo natin walong celebrities makikita natin sila i impersonate nila ang mga iconic kumbaga yung mga kilala ng mga foreign or local celebrity anong mangyayari yan linggo linggo is it oo there’s a random bunutan as a na na ah okay so it’s assigned out sabi ni m j so ibig sabihin tuwing linggo bawat isa sa walong celebrities ay may gagayahin okay at ano ha walang matatanggal they build up they build up points each week kaya kung sakali kunyari si jen contestant kung medyo hindi maganda ang performance nya meron pa syang chance next na bumawi na bumawi sa point at a certain point ito’y aggregate score to accumulated because at a certain point you will have to choose the top four pero scores ito lahat ng mga performances nila every weeks it’s one season it thirteen weeks so may twelve weeks iaano ia adopt yung points tapos on the thirteenth week yun na may tanong ako halimbawa ah ako tapos ang nabunot core ang na assign sa akin eh let’s say si ah girl o lanie salucha i will have to do it yes you have to be ah kailangan gawin mo talaga you have to look like lanie talk and do i have a team do i have a team i mean i make up armand meron din naman so hindi ako pwedeng malinis galing naman oo pero malalaman lang talaga nila at the end of each show the press pack proof nung lalabas yung susunod na gagawin nila the next week i could i connector actually icon later halimbawa si carla estrada pwede siyang si daniel padilla ang kanyang next week pwede ba sabihin pero ang ang ano kasi is that you have to look and sound alike ang problema kasi is that yung mga contestants ay singers so may may mga may certain sound may certain sound na sila volante so i don’t wanna hear nyo yes i wanna i don’t wanna hear jolina i wanna hear jolina sounding you judges parang ikaw ang in charge ng bola boys

$\texttt{whisper-large-v3}$:

Let’s talk about us, sir. You know, a lot of people are still asking, ano ba talaga yung palabas na your face sounds familiar? In a language na maiintindihan namin, ano ba talaga ito? Well, ano siya, it’s all celebrities. Lahat ng mga iniidolo natin, walong celebrities, makikita natin sila, i-impersonate nila ang mga iconic, kumbaga yung mga kilala ng mga foreign or local celebrities. Paano mangyayari yan? Linggo-linggo? Linggo-linggo. Is it for a fair? There’s a random bunutan. Ah, okay. So, it’s a sign daw, sabi ni MJ. So, ibig sabihin, tuwing linggo, bawat isa sa walong celebrities ay may gagayahin. Yes. Ah, okay. At ano ah, walang matatanggal. Yun, yun na there. They build up points each week. Kaya kung kanyari si Jen, contestant, Kung medyo hindi maganda ang performance niya, meron pa siyang chance next week na bumawi. At a certain point, ito ay aggregate score ito, accumulated ito. Because at a certain point, you will have to choose the top four. Pero scores ito lahat ng mga performances nila. 30 weeks. It’s one season, it’s 30 weeks. About 13 weeks. So 12 weeks, i-add up yung points. Tapos on the 13th week, yun na. Okay. May tanong ko. Top 4. Halimbawa, ako. Tapos ang nabunot ko or ang na-assign sa akin, let’s say si… Girl. Lani Misalucha. I will have to do it. Yes. You’ll have to be… Kailangan gawin mo talaga. You’ll have to look like Lani, talk and sing. Do I have a team? Do I have a team? Yes. I think naman diba? Meron din naman. So hindi ako pwedeng mamili. And they’re coaches also, right? Yes. Galing naman. Oo, pero grabe. At saka malalaman lang talaga nila at the end of each show, they’ll press, pak, poof, doon lalabas yung susunod na. May tamag dyan eh. Iconizer. Iconator actually. Iconator. Halimbawa si Carla Estrada, pwede siyang si Daniel Padilla ang kanya next week. Oo, pwede. Grabe, no? Pero ang ano kasi is that you have to look and sound alike. Yun na eh. Ang problema kasi is that Okay. Yung mga contestants ay singers. So may mga… May certain sound na sila. May certain sound na sila. Nyoy Volante. Nyoy Volante. So I don’t wanna hear Nyoy. I don’t wanna hear Jolina. I wanna hear Jolina sounding like… You’re saying that because among the judges, parang ikaw ang in-charge ng voice. The voice, yeah.

Even in this case, I give the point to Whisper.

HARD: BABEL and ICE Phils

This is from the BABEL dev set:

$\texttt{f7x-transducer}$:

<sta> hindi wala naman ginagawa ko naman lahat ng gusto niya hindi ko talaga alam kung ano tinatanong ko nga kung anong problema <hes> wala naman tapos ‘yong kinausap ko ‘yong papa ko kung anong problema ng mama ko sabi niya baka ano lang daw ni regla ganito hindi ko alam tapos sinasabi naman ‘yon sabi ng papa ko baka may problema sabi ko tinanong ko kung may problema ang mama ko pero wala namang nasabi tapos ‘yong sabi pa ng isa kong kaibigan baka may may dini may dinidibdib daw si mama o ano ba tapos ‘yong gusto ko ngang makausap si tita

$\texttt{whisper-large-v3}$:

Wala naman. Ginagawa ko naman lahat ng gusto niya. Hindi ko talaga alam kung ano. Tinatanong ko ako anong problema. Wala naman. Tapos pinausap ko yung papa ko kung anong problema ng mama ko. Sabi niya baka ano lang daw, may regla, ganito. Hindi ko alam. Tapos sinasabi naman yung, sabi ng papa ko, baka may problema. Sabi ko tinanong ko kung may problema ang mama ko. Pero wala naman ang sabi. Tapos sinasabi pa ng isa kong kaibigan, baka may dinidigdig daw si mama o ano ba. Tapos yung gusto ko nga makausap si Tisa.

This is where I think f7x-transducer slightly outperforms Whisper, since more of the important words are correctly transcribed in the former.

This is from the ICE Philippines set:

$\texttt{f7x-transducer}$:

ah gumawa po kami ng ah ng ah updated pass at eleven o’clock ah all over the country ito po ang situation and namin ah region one ah there are no reports of ah major delays in the voting except one municipalities and also ilocos sur reports ah shortage of supplies in car cordillera no reports of major problems on the delays region two no reports of major delays everything is ah going on region three everyday the voting is going on also the supplies arrive on time in batangas we have a report in two way kasi po nagkaroon kami ng ah substitute ng ah register diyan so medyo ah nagkaroon po ng delay nyan setting up of the precinct in region five ah everything seems to be normal or ah regional director says there are some problems but then manding it region six sa no ah reports ah of delays in ah voting region seven there are no major ah problems there region eight no major problems of delays region nine explained already by commissioner leora region tere some problems but are being solved region eleven ah also ah minor problems only as we said marawi ah contrary to i responsible reporting earlier marawi in fact ah i started the voting ah now i want to comment also ah on an incident that i saw on television and p o del pilar where there was somebody who made a big thing about the teachers not putting the staffs in the ballot box yun po namang mga teacher inamin nila sinabi nilang nagkamali sila and even the one who was ah complaining said that it was only happening in one precinct in the whole school and ah fine meron po naman kaming minutes of voting yun pong nakakakita ng ganun they can file a documentary protest ilalagay po yun sa minutes of voting and counting ng teachers i believe those teach

$\texttt{whisper-large-v3}$:

ng updated pass at 11 o’clock all over the country. Ito po ang situation na namin. Region 1, there are no reports of major delays in the boating except one municipality, San Diego de Fonso, Hilo, Basur, reports a shortage of supplies. In Carr, Cordillera, no reports of major problems on delays. Region 2, no reports of major delays, everything is going on. Region 3, everything, the boating is going on, also the supplies arrived on time. In Batangas, we have a report in Tuy, kasi po nagkaroon kami ng substitute ng registrar dyan. So medyo nagkaroon po ng delay niya na setting up of the precinct. In Region 5, everything seems to be normal. Our regional director says there are some problems but they’re handling it. Region 6… No reports of delays in boating, Region 7. There are no major problems there. Region 8, no major problems of delays. Region 9, explained already by Commissioner Yora. Region 10, some problems but are being solved. Region 11… Also, minor problems only. As we said, Marawi, contrary to irresponsible reporting earlier, Marawi, in fact, has started voting. Now, I want to comment also on an incident that I saw on television in Pio del Pilar where there was somebody who made a big thing about the teachers not putting the stubs in the ballot box. Yung mga teacher, sinabi nila na kamali sila. And even the one who was complaining said that it was only happening in one precinct in the whole school. And fine, meron po naman kaming minutes of voting. Yung po nakakakita ng gano’n, they can file a documentary protest. Ilalagay po yun sa minutes of voting and counting ng teachers. I believe those teachers…

It becomes apparent that in terms of accuracy alone, not readability, it’s not straightforward to assess which system is better. This sends a clear message about the importance of objectivity when carefully planning an ASR leaderboard.

Final Notes and Future Directions

This report is merely a stepping stone in the ongoing journey of enhancing Filipino ASR technology. As a work-in-progress, it is open to contributions and improvements from the community. My plan is to continue refining the datasets, improving model accuracy, and expanding the range of speech styles and contexts covered. This effort not only aims to push the boundaries of Filipino ASR but also to foster a collaborative environment where researchers and developers can contribute and refine these tools.

Shameless Plug

I’m currently hosting some demos of the icefall models on Hugging Face spaces.

Future Endeavors

There is still a lot to do with this project, but my top priorities are to release the recipes and pretrained models on Hugging Face. I want to be able to cover every nook and cranny, but unfortunately it’s just too much work for one person. A more ambitious goal would be to formalize the dataset curation, similar to Common Voice and other data curation initiatives that integrate automated solutions with manual human corrections and feedback.

Invitation to Collaborate: If you have insights, datasets, or improvements to suggest, please let me know by email or by commenting below. Together, we can drive the future of Filipino speech recognition technology forward.


  1. ISIP sessions end with three questions answered spontaneously; these answers are not transcribed and are not included here ↩︎

  2. The author has not seen this: IARPA BABEL Tagalog’s test set is only available via OpenKWS ↩︎

  3. The ICE Corpus of Philippine English actually comes with transcriptions, but in rich-text format that is not easily parsed ↩︎

  4. This model is different from the one hosted on the Vosk website; version 2 is around 3-4% more accurate. ↩︎

  5. Word error rate is the standard metric for ASR: the ratio of word errors (substitutions, insertions, and deletions) to the total number of words spoken. ↩︎

Federico Ang
Voice AI Researcher

My research interests include signal processing, multimedia AI, and languages.