Speech Emotion Recognition through Acoustic Data Augmentation and Attention-Driven CRNN-BiGRU

Authors

  • Fitra Kacamarga, Bina Nusantara University
  • Kresna Andika Aprianto

DOI:

https://doi.org/10.31937/ti.v18i1.4250

Abstract

Speech emotion recognition (SER) systems have transformed human-computer interaction by enabling machines to identify emotional cues in speech, but SER models remain prone to overfitting and limited generalization. This study addresses these limitations by combining robust data augmentation techniques with an advanced neural architecture. The proposed methodology employs four data augmentation strategies: background noise injection, time stretching (up and down, counted as two strategies), and pitch shifting. The augmented dataset is fed into a novel Convolutional Recurrent Neural Network (CRNN) architecture integrated with a Bidirectional Gated Recurrent Unit (BiGRU) and an attention mechanism, designed to capture both local and temporal emotional features. The model takes log-Mel spectrograms as input, enabling precise detection of emotional speech patterns. Experimental validation on the RAVDESS database demonstrated the superiority of this combined approach, which achieved state-of-the-art performance with a weighted accuracy (WA) of 90.53% and an unweighted accuracy (UA) of 90.19%, an 11% improvement over the CNN with Multi-Head method. These results validate the effectiveness of integrating data augmentation with advanced neural architectures for SER applications.
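The abstract reports both WA and UA without defining them. The sketch below assumes the definitions common in the SER literature: WA is the overall fraction of correctly classified utterances, and UA is the mean of the per-class recalls, so every emotion counts equally regardless of how many samples it has. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all predictions.

    Assumed definition of WA; frequent classes contribute more.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed definition of UA).

    Each class present in y_true contributes equally.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = defaultdict(int)  # number of true samples per class
    hits = defaultdict(int)    # number of correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (illustration only, not RAVDESS results).
y_true = ["happy", "happy", "happy", "happy", "sad", "sad"]
y_pred = ["happy", "happy", "happy", "sad", "sad", "angry"]

wa = weighted_accuracy(y_true, y_pred)    # 4 of 6 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 3/4 and 1/2
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

Under these assumed definitions, the two metrics diverge when classes are imbalanced, which is why SER studies typically report both.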

Published

2026-06-30

How to Cite

Kacamarga, F., & Aprianto, K. A. (2026). Speech Emotion Recognition through Acoustic Data Augmentation and Attention-Driven CRNN-BiGRU. Ultimatics : Jurnal Teknik Informatika, 18(1), 9–16. https://doi.org/10.31937/ti.v18i1.4250