Preliminary Study on Indonesian Word Recognition for Elder Companion Robot

Abstract: Word recognition using deep learning is a simple approach to speech recognition in general. From this word-level recognition, an emotional expression recognition model can be built. The emotion recognition model can be used to describe the level of importance of an action in the planned future hardware implementation. This research used MFCC as the feature extraction method for the audio data and a CNN-LSTM approach for the emotional expression classifier. The model will be implemented in a humanoid robot that serves as a companion robot for the elderly. The model achieves 67% accuracy for emotion recognition and 97% accuracy for word recognition. However, it attained only 20% accuracy in real-life testing on the humanoid robot, because the model tends to overfit as a result of the limited data used in training.


I. INTRODUCTION
Speech recognition has recently become a popular research subject in machine learning and artificial intelligence. Speech recognition studies have also expanded into emotion recognition derived from speech data [1], because emotion is one of the fundamentals of day-to-day communication between humans. Speech emotion recognition studies are growing because researchers expect that machines could learn to distinguish emotions from audio and visual data [2], [3]. A machine's capability to recognize emotion in audio or visual data could augment the user experience in several areas, especially in services such as automatic call centers and virtual reality games [4]. Previous studies in speech emotion recognition have tried different approaches to building a learning network for emotion recognition. Wan [5] uses DTW in emotion recognition. More recently, deep neural networks using CNN and LSTM or RNN have been gaining popularity for recognizing human emotion [6], [7], [8], and compared with established methods like SVM, LSTM and CNN achieve greater accuracy [9]. Researchers have also tried to recognize speech and emotion in their respective natural languages, as shown in the studies of Wan [5], Guiming [10], and Wang [1]. Language-specific emotion recognition, in this case for Indonesian, was also studied by Lasiman [9]. Wunarso et al. [11] even built another dataset for the Indonesian language in their study. As stated by Park et al. [12], the output of a deep neural network depends heavily on how feature selection is done, and pooling and padding are important in improving speech recognition. They also stated that stacking many convolutional layers, as they did in their work to create very deep neural networks, does not have a great impact on recognition.
Word and emotion recognition technology can be used to build a companion robot for elders. Although the prevalence of loneliness among the elderly in Indonesia is not too high, owing to the strong eastern culture, there are already signs of seniors from the middle to upper economic classes living alone without the company of their families. One solution to this problem is a companion robot, which is already widely used in Japan and the US. Companion robots are used by seniors who are at the middle to upper economic level and have an appropriate understanding of technology. The function of a companion robot emphasizes the response that the robot can give based on voice input, so word recognition and speech recognition are important. An additional important feature is the robot's ability to detect danger or emergencies based on variations in intonation, which will be the emotion recognition feature of this robot.
The purpose of this research is to conduct a preliminary study on emotion recognition in segmented words. Emotion classes were chosen based on the planned implementation on the robotic system. In previous work, some authors of this paper successfully experimented with a word recognition system using MFCC for feature extraction and implementing the model with a CNN. Hence, this work discusses a speaker-independent word-emotion recognition system, focusing on emotional classification regardless of the speaker. This paper only discusses the method chosen for training and validating the deep learning model using CNN and LSTM.

II. DATA COLLECTION
In their study, Wunarso et al. [11] built an Indonesian speech-emotion database called I-SpeED and used SVM as the classifier, while Lasiman [9] studied emotion recognition for the Indonesian language using a feed-forward neural network and LSTM.
In this research, the data set was built from multiple audio files in .wav format. The audio files were recorded in Bahasa Indonesia and are based on three emotional expressions: "happy," "sad," and "angry." Speakers were asked to read ten words with each of these expressions. The words were chosen from the robot implementation scenario planned for future work. All words used are listed in Table 1.
The word-emotion database, recorded from 100 different speakers, consists of 9000 audio files. Each speaker spoke each word 3 times for each emotional expression. The speakers consist of 50 males and 50 females. This approach is necessary because, in the previous study, the training and testing results were biased by serious overfitting caused by a lack of diversity in the database [13]. In addition, the general expressions of "sad" and "angry" differ in energy and frequency between males and females.
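The database size follows directly from the recording protocol described above. A minimal arithmetic check is sketched below; the variable names are illustrative and not taken from the original study.

```python
# Recording protocol as described in the text.
speakers = 100      # 50 male + 50 female
words = 10          # words read by each speaker
emotions = 3        # "happy", "sad", "angry"
repetitions = 3     # each word spoken 3 times per emotion

files_per_speaker = words * emotions * repetitions
total_files = speakers * files_per_speaker

print(files_per_speaker)  # 90
print(total_files)        # 9000
```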
Participants were first asked to fill out a simple questionnaire regarding their mood at the time. If a participant was in a very sad or angry condition, the "happy" state was not recorded in that session and was rescheduled for another day. To induce the "happy", "sad" and "angry" emotional states, participants were asked to view a video for each emotional state. The videos were chosen solely by the data sample collector. To minimize bias before recording, participants were asked once again about their emotional state after watching the video.

III. METHODOLOGY
The features of the audio data were extracted using the MFCC method before being fed into the CNN-LSTM network, which classifies the audio samples into three emotion classes. Each audio file is preprocessed to have the same length of approximately 1000 ms in .wav format. Shorter clips are padded with zeros, and longer clips are cut to fit the 1000 ms time frame. This work uses MFCC as the feature extraction method because several studies have shown that MFCC gives greater output accuracy than other methods like DTW; hence MFCC has become one of the main methods for processing audio samples for use in deep learning [7], [14]. CNN-LSTM is chosen as the deep learning method for emotion recognition. A detailed explanation of the processes in this work is given below.
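The length normalization can be illustrated with a short sketch. It assumes the 16 kHz sampling rate reported for the audio files and assumes that zeros are appended at the end of the clip, since the text does not say where the padding goes. The function name `fix_length` is hypothetical, and plain Python lists are used only to keep the example free of dependencies.

```python
SAMPLE_RATE_HZ = 16_000   # sampling rate stated for the audio files
TARGET_MS = 1000          # target clip length from the text


def fix_length(samples, sample_rate=SAMPLE_RATE_HZ, target_ms=TARGET_MS):
    """Return a copy of `samples` that is exactly `target_ms` long.

    Shorter clips are zero-padded at the end (an assumption);
    longer clips are cut.
    """
    target_len = sample_rate * target_ms // 1000
    clipped = list(samples[:target_len])
    return clipped + [0.0] * (target_len - len(clipped))


short_clip = [0.1] * 12_000   # 750 ms of dummy audio
long_clip = [0.1] * 20_000    # 1250 ms of dummy audio

print(len(fix_length(short_clip)))  # 16000
print(len(fix_length(long_clip)))   # 16000
```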

A. MFCC
The preprocessed data is further processed into MFCCs to obtain a 2D representation of the spectrogram of the audio data. This is necessary because the convolutional process in a CNN requires the data to be represented as an image, in this case a 2D spectrogram image. Using constant-value padding, all MFCC outputs were fixed to the same size. The audio files used in the MFCC process have a 16 kHz sampling rate and mono encoding. The output of the MFCC extraction is a 20x11 matrix.
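A minimal sketch of the size-fixing step is given below. It assumes that the 20 rows are MFCC coefficients and the 11 columns are time frames, that the constant used for padding is zero, and that padding is applied on the right; none of these details is stated explicitly in the text. The function name `fix_frames` is hypothetical, and a dummy matrix stands in for a real MFCC extraction.

```python
N_MFCC = 20      # rows, assumed to be MFCC coefficients (20x11 output)
MAX_FRAMES = 11  # columns, assumed to be time frames (20x11 output)


def fix_frames(mfcc, max_frames=MAX_FRAMES, pad_value=0.0):
    """Pad or cut each coefficient row so the matrix has `max_frames` columns."""
    fixed = []
    for row in mfcc:
        row = list(row[:max_frames])
        fixed.append(row + [pad_value] * (max_frames - len(row)))
    return fixed


# Dummy MFCC matrix with only 8 frames, standing in for a real extraction.
raw = [[1.0] * 8 for _ in range(N_MFCC)]
fixed = fix_frames(raw)

print(len(fixed), len(fixed[0]))  # 20 11
```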

B. CNN-LSTM
In this study, CNN-LSTM is used to train, validate, and test the learning and recognition model built in this research, using the Keras library in Python. The MFCC features are fed into the CNN. The convolutional layers of the CNN extract features from the MFCC input by convolving the sample, producing progressively smaller feature maps. After the convolution and pooling steps, the fully connected layer of the CNN is connected to the LSTM layers. The process is explained step by step below.
Convolution - a convolutional layer with 64 filters and three-by-three kernels extracts features from the sample, with ReLU activation.
MaxPooling - to downsample the feature maps, we used a two-by-two pool size with "same" padding.
Dropout - dropout is used as a regularisation method to reduce the probability of overfitting, with a dropout probability of 0.25.
Convolution - a second convolutional layer with 128 filters and two-by-two kernels further extracts features from the preceding layers.
MaxPooling - another two-by-two pool size is used.
Dropout - the same 0.25 probability is used in this layer.
Flatten - the output is flattened so that the output of the CNN can be connected to the LSTM layer.
LSTM - we used two LSTM layers with "ReLU" activation to acquire information from the output of the CNN.
Dense - the model output is reduced to 3 classes with "Softmax" activation.
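To make the effect of the convolution and pooling steps concrete, the sketch below traces the tensor shape from the 20x11 MFCC input up to the Flatten step. It reads "64" and "128" as the number of filters per convolutional layer and assumes stride-1 convolutions without padding (the Keras default) and pooling with a stride equal to the pool size. The helper functions are hypothetical and only reproduce the shape arithmetic, not the actual model; Dropout is omitted because it does not change the shape.

```python
import math


def conv2d_shape(h, w, kernel, filters):
    """Output shape of a stride-1 convolution without padding ("valid")."""
    return h - kernel + 1, w - kernel + 1, filters


def maxpool_same_shape(h, w, channels, pool):
    """Output shape of max pooling with stride == pool size and "same" padding."""
    return math.ceil(h / pool), math.ceil(w / pool), channels


conv1 = conv2d_shape(20, 11, kernel=3, filters=64)    # (18, 9, 64)
pool1 = maxpool_same_shape(*conv1, pool=2)            # (9, 5, 64)
conv2 = conv2d_shape(pool1[0], pool1[1], kernel=2, filters=128)  # (8, 4, 128)
pool2 = maxpool_same_shape(*conv2, pool=2)            # (4, 2, 128)
flat = pool2[0] * pool2[1] * pool2[2]                 # values passed to Flatten

print(pool2, flat)  # (4, 2, 128) 1024
```

Under these assumptions the second pooling step sees even dimensions (8x4), so it yields the same 4x2 result whether "same" or "valid" padding is used there.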

C. Robot Hardware Implementation
The robot hardware used in this research is a humanoid robot from UBTech. The robot will be dismantled, and the main CPU in the system will be replaced with a Raspberry Pi board. The Raspberry Pi runs the machine learning model and is where a webcam is connected to the system. The microphone of the webcam records voice commands, which are then processed with MFCC and classified by the CNN-LSTM layers. The result of the classification is a command for the robot, which is then used as input for the Arduino Uno, because the robot's motors are driven by the Arduino. An illustration of the robot is shown in Figure 3, and the wiring diagram of the robot is shown in Figure 5.
The servos used in this robot are operated by sending a byte array to each servo. The servos are daisy-chained for each limb, so the Tx pin for moving the robot is split into 4 channels. The byte array consists of 10 bytes, as shown in Table III. The servo ID is predetermined by the manufacturer. Op Mode works like the "servo.attach" command in the Arduino Servo library: the servo is initialized and energized when "attached" and de-energized when "detached". Mode 1 is the "attach" function and Mode 2 is the "detach" function. Degree is the desired servo angle, ranging from 0 to 180. Duration determines how fast the servo must reach the desired angle.
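The command structure can be illustrated with a short sketch. Only the packet length (10 bytes), the two Op Mode values, and the 0 to 180 degree range come from the text. The byte positions of each field, the single-byte duration, and the zero filler bytes are placeholders chosen for illustration, because the actual layout is defined in Table III; the function name `build_packet` is likewise hypothetical.

```python
PACKET_LEN = 10                 # packet length stated in the text
OP_ATTACH, OP_DETACH = 1, 2     # Mode 1 = attach, Mode 2 = detach

# HYPOTHETICAL byte offsets; the real positions are defined in Table III.
IDX_SERVO_ID, IDX_OP_MODE, IDX_DEGREE, IDX_DURATION = 0, 1, 2, 3


def build_packet(servo_id, op_mode, degree=0, duration=0):
    """Build a 10-byte servo command using the placeholder layout above."""
    if op_mode not in (OP_ATTACH, OP_DETACH):
        raise ValueError("op_mode must be 1 (attach) or 2 (detach)")
    if not 0 <= degree <= 180:
        raise ValueError("degree must be in the range 0..180")
    packet = bytearray(PACKET_LEN)   # unused bytes stay 0x00 (assumption)
    packet[IDX_SERVO_ID] = servo_id
    packet[IDX_OP_MODE] = op_mode
    packet[IDX_DEGREE] = degree
    packet[IDX_DURATION] = duration  # assumed to fit in one byte
    return bytes(packet)


packet = build_packet(servo_id=3, op_mode=OP_ATTACH, degree=90, duration=20)
print(len(packet))  # 10
```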

A. Word Recognition
For word recognition and classification, the CNN-LSTM network is used to train and test the machine learning model, using the Keras library in Python. On the CNN side, 128 convolution filters were used, with a 2x2 kernel and ReLU activation. The resulting feature maps are then pooled using 2D MaxPooling with a 2x2 pool size.
As this work uses limited data on the non-linear hidden layers of a deep neural network, the tendency to overfit is high [15]. The Dropout method is therefore used to prevent overfitting, with a dropout coefficient of 0.25.
The LSTM layers in the model are engaged by defining the CNN layers within the "TimeDistributed" wrapper from Keras. After building these layers, the model is completed by using the "Dense" function to build a Fully Connected Layer (FC Layer). The final model is compiled using the Adadelta optimizer. After compilation, the model is trained for 150 epochs using 80% of the audio samples from the data set for training and 20% for testing. The training and testing accuracy for all emotion datasets is depicted in Fig. 3. The training accuracy, on 80% of the data, peaked at 90.36%, and the testing accuracy, on the remaining 20%, peaked at 67%.
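The 80:20 partition can be sketched as follows. The text does not state how the split was drawn, so the random shuffle, the fixed seed, and the per-file (rather than per-speaker) split are assumptions; the function name `split_80_20` is hypothetical, and integer indices stand in for the 9000 recordings.

```python
import random


def split_80_20(samples, train_fraction=0.8, seed=0):
    """Shuffle `samples` and split them into training and testing lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# File indices stand in for the 9000 recordings in the database.
train, test = split_80_20(range(9000))
print(len(train), len(test))  # 7200 1800
```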

B. Emotion Recognition
Experiments in this research were conducted in 3 scenarios to fully validate the classification accuracy. All scenarios used an 80:20 data split. The CNN layers in all scenarios use ReLU activation, and for the LSTM, Softmax activation is used for the final output. All scenarios were run for 150 epochs, with the model compiled using the "Adadelta" optimizer. Scenario 3 tests the accuracy of the "angry" expression on the model. Scenarios 1 to 3 try to recognize which audio samples are classified as an emotional expression and check the accuracy compared to the null model, or undefined expression. The averaged result for "happy," "sad," and "angry" is shown in Table V. The training and testing accuracy for all emotion datasets is depicted in Fig. 7. The training accuracy, on 80% of the data, peaked at 90.36%, and the testing accuracy, on the remaining 20%, peaked at 67%.

C. Robot Movement
The movement of the robot depends on the classification results. The classification results obtained on the Raspberry Pi are sent to the microcontroller, and the robot is driven through serial communication between the microcontroller and the robot's servo motors. Four types of movement were tested, namely "Tegak", "Duduk", "Kanan", and "Kiri". These movements were chosen because they can be carried out without analyzing the balance of the robot's movement. The first test was carried out using the data used in the machine learning model as input. The result of the first experiment is shown in Table VII.
For the final testing scenario, the input data for the robot was a new voice input recorded using the webcam microphone. The user must speak within 10-20 cm of the webcam, because the webcam microphone was not good enough to capture the user's voice from farther away. The result of the final experiment is shown in Table VIII.
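The step from classifier output to robot command can be sketched as a simple lookup. The order of the classes and the function name `pick_movement` are assumptions, since the label-to-index mapping is not given in the text, and the probabilities below are dummy values rather than measured results.

```python
# ASSUMED class order; the label-to-index mapping is not given in the text.
MOVEMENTS = ["Tegak", "Duduk", "Kanan", "Kiri"]


def pick_movement(probabilities):
    """Return the movement whose softmax probability is highest."""
    if len(probabilities) != len(MOVEMENTS):
        raise ValueError("expected one probability per movement class")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return MOVEMENTS[best]


# Dummy classifier output, not a measured result.
print(pick_movement([0.05, 0.80, 0.10, 0.05]))  # Duduk
```

The selected label would then be translated into servo commands on the microcontroller side.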

V. CONCLUSION
The results of our preliminary study on word-level emotion recognition show that this model has an acceptable performance. The confusion matrix shows that the model accuracy is around 65%. These results should be improved in future work by adding more data and by using and comparing different numbers of CNN and LSTM layers, to determine how deep the network should be for emotion recognition based on a self-built database like this one. For the robot implementation, the pre-recorded scenario shows that the model can classify the movements with 100% accuracy, but when the model is given new input data it performs poorly: most inputs are classified as "Duduk", and the accuracy is only 20% on 10 samples. This result is strong evidence that the model has a tendency to overfit. This is the main issue that must be solved in future work.