Leveraging Untranscribed Data for End-to-End Speech and Callsign Recognition in Air-Traffic Communication

Paper ID

SIDs-2025-103

Conference

SESAR Innovation Days

Year

2025

Keywords

automatic speech recognition; semi-supervised learning; air traffic control

Authors

Petr Motlicek, Amrutha Prasad, Shashi Kumar, Driss Khalil and Christof Schuepbach

DOI

https://doi.org/10.61009/SID.2025.1.49

Abstract

Accurate Automatic Speech Recognition (ASR) and callsign recognition in Air Traffic Control (ATC) are vital for safety, yet conventional two-step systems rely on large amounts of manually transcribed data, which is costly to produce and limited in supply. This paper introduces a practical alternative using TokenVerse, a unified end-to-end model trained under a dual-task framework and enhanced through semi-supervised learning. Our main contribution is to show that the model can jointly learn callsign boundaries and speech recognition, improving performance on both tasks simultaneously. Additionally, by generating pseudo-labels for 500 hours of unlabeled audio, we substantially expand the effective training data. Experiments across multiple in-domain and out-of-domain ATC datasets demonstrate that the TokenVerse framework achieves state-of-the-art performance in both ASR and callsign detection, surpassing cascaded pipelines built on modern architectures (including Kaldi, XLSR/wav2vec 2.0, Zipformer, and Whisper). This work provides a robust and scalable foundation for deploying and continuously refining high-accuracy ATC systems in real-world settings where labeled data is inherently scarce. The end-to-end architecture is also relatively compact (approximately 317M parameters), making it well suited for real-time, low-latency deployment.
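
As an illustration of the dual-task idea, the sketch below shows how a single output sequence could carry both the transcript and the callsign boundaries. It is a minimal sketch under stated assumptions, not the authors' implementation: the boundary tokens `<cs>` and `</cs>` and the function `split_tagged_transcript` are hypothetical names chosen for this example, since the abstract does not specify how boundaries are encoded.

```python
from typing import List, Tuple

# Hypothetical boundary tokens; the abstract does not name the actual symbols.
CS_OPEN = "<cs>"
CS_CLOSE = "</cs>"


def split_tagged_transcript(tagged: str) -> Tuple[str, List[str]]:
    """Split a tagged hypothesis into (plain transcript, callsign strings).

    A callsign span that is opened but never closed is dropped from the
    callsign list, although its words remain in the plain transcript.
    """
    words: List[str] = []      # all spoken words, tags removed
    callsigns: List[str] = []  # one string per closed callsign span
    current: List[str] = []    # words of the span being read
    inside = False

    for token in tagged.split():
        if token == CS_OPEN:
            inside = True
            current = []
        elif token == CS_CLOSE:
            if inside and current:
                callsigns.append(" ".join(current))
            inside = False
            current = []
        else:
            words.append(token)
            if inside:
                current.append(token)

    return " ".join(words), callsigns


if __name__ == "__main__":
    hypothesis = "<cs> lufthansa three two one </cs> descend flight level eight zero"
    text, found = split_tagged_transcript(hypothesis)
    print(text)
    print(found)
```

With an encoding of this kind, one decoding pass yields both the ASR output and the callsign, which is what removes the separate second stage of a cascaded pipeline.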