Speech Emotion Recognition Using Convolutional Neural Network and Data Augmentation Technique
The purpose of speech emotion recognition (SER) systems is to create an emotional connection between humans and machines, since recognizing human emotions and goals helps improve human-machine interaction. Recognizing emotions from speech has been a challenge for researchers over the past decade, but advances in artificial intelligence have eased this challenge. In this study, we take steps to improve the efficiency of these systems using deep learning methods. In the first step, three-dimensional convolutional neural networks (3D CNNs) are used to learn the spectral-temporal features of speech. In the second step, to strengthen the proposed model, we use a new pyramidal concatenated 3D CNN, which is a multi-scale architecture of 3D CNNs applied over the input dimensions. In the third step, to learn the spectral-temporal features extracted by the pyramidal concatenated 3D CNN, we use a temporal capsule network, so that both the spatial and the temporal relationships in the data are taken into account. We call the proposed structure, which is a powerful structure for spectral-temporal features, the MSID 3DCNN + Temporal Capsule. The final model was applied to a combination of the speech and song databases from the RAVDESS database. Comparing the results of the proposed model with those of conventional models shows the better performance of our approach. The proposed SER model achieved an accuracy of 81.77% for six emotional classes by gender.