Enhancing Low-Resource Lampung Speech Recognition through Cross-Lingual XLSR-Wav2Vec 2.0 Pretraining

Hendra Kurniawan; Akmal Junaidi; Favorisen Rosyking Lumbanraja; Wamiliana Wamiliana

doi:10.47738/jads.v7i3.1388

Abstract

This study investigates the application of Wav2Vec 2.0 (W2V2) and Cross-Lingual Speech Representation (XLSR) models to Lampung language speech recognition. It introduces LampungNyow v1.0, a speech corpus designed to provide a baseline for training and evaluating Automatic Speech Recognition (ASR) for this low-resource regional language of Indonesia. The dataset enables supervised fine-tuning and standardized evaluation, addressing the lack of publicly available linguistic resources for Lampung. Several pre-trained W2V2 models are evaluated on Lampung speech recognition using Word Error Rate (WER) as the evaluation metric. The evaluated models include W2V2-Base, W2V2-Large, W2V2-Large-XLSR-Indonesian, W2V2-Large-XLSR-Sundanese, W2V2-Large-XLSR-53, and the multilingual W2V2-Large-XLSR-Indonesian-Javanese-Sundanese model. According to the experimental results, the monolingual models have the highest WER values: W2V2-Base achieved 36.23%, while W2V2-Large achieved 36.30%. The XLSR models, namely XLSR-53 (33.88%), Sundanese (33.99%), and Indonesian (33.70%), demonstrated modest improvements. The W2V2-Large-XLSR-Indonesian-Javanese-Sundanese model, which served as the foundation for the Lampung ASR system in this study, achieved a substantially lower WER of 17.39%. These findings suggest that multilingual pretraining on several Indonesian regional languages can produce acoustic and contextual speech representations that are better suited to the resource-constrained Lampung ASR task than either monolingual pretraining or broader multilingual pretraining. Compared with the baseline W2V2-Large model (36.30%), the obtained WER of 17.39% corresponds to a relative improvement of more than 50% (about 52%).
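
The WER metric and the relative improvement quoted in the abstract can be illustrated with a minimal Python sketch. It assumes the standard word-level edit-distance definition of WER, which the abstract does not spell out; the function names and the placeholder token strings are hypothetical and are not taken from the paper or from the LampungNyow corpus. Only the two percentages in the final lines come from the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length.

    Assumes the standard word-level Levenshtein definition; the paper's
    exact text normalization is not described in the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Placeholder tokens, not real Lampung transcripts.
    print(word_error_rate("a b c d", "a x c"))

    # WER values reported in the abstract (in percent).
    reduction = relative_wer_reduction(36.30, 17.39)
    print(f"relative WER reduction: {reduction:.1%}")
```

With the reported values, the reduction is (36.30 - 17.39) / 36.30, which is consistent with the "more than 50%" stated in the abstract.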

Keywords

Low-Resource Language, Cross-Lingual Learning, Automatic Speech Recognition, Wav2Vec 2.0, XLSR

Journal of Applied Data Sciences

ISSN: 2723-6471 (Online)
Publisher: Bright Publisher
Website: http://bright-journal.org/JADS
Email: taqwa@amikompurwokerto.ac.id (principal contact); support@bright-journal.org (technical issues)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
