Speech Emotion Recognition using CNN

Machine learning model that detects human emotions and affective states from speech using convolutional neural networks.

Tech Stack

Python · TensorFlow · CNN · Audio Processing · Jupyter Notebooks

About

A deep learning project focused on detecting human emotions and affective states from speech patterns. The system builds on the observation that the tone and pitch of a voice often reflect the speaker's underlying emotion. This has practical applications across multiple industries, including call center analytics, driver safety monitoring, and customer satisfaction assessment. The project combines four publicly available emotional speech datasets (CREMA-D, RAVDESS, SAVEE, TESS) to train a CNN-based classifier capable of recognizing diverse emotional expressions.

Technical Approach

Developed with TensorFlow for building and training the CNN architecture. The implementation includes two Jupyter notebooks: one for comprehensive model training and hyperparameter tuning, and another for testing on custom audio inputs. The pipeline involves:

  • Audio feature extraction (spectrograms, MFCCs)
  • A CNN architecture optimized for audio signal processing
  • Multi-dataset training for robust emotion classification
  • A Python audio recording module for real-world testing
  • Integration with custom speech samples
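The repository's exact code is not reproduced here, but the extraction-plus-model steps above map to a short sketch: MFCC extraction with librosa feeding a small Keras CNN. The sample rate, MFCC count, frame length, layer sizes, and the eight-class output below are illustrative assumptions, not values taken from the project.

```python
# Minimal sketch of the feature-extraction + CNN pipeline.
# Constants are assumptions for illustration, not project settings.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers

SAMPLE_RATE = 16000   # assumed resampling rate
N_MFCC = 40           # assumed number of MFCC coefficients
MAX_FRAMES = 128      # clips padded/truncated to a fixed length

def extract_mfcc(path):
    """Load one clip and return a fixed-size MFCC 'image' for the CNN."""
    signal, sr = librosa.load(path, sr=SAMPLE_RATE)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC)
    # Pad or truncate the time axis so every example has the same shape.
    if mfcc.shape[1] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, 0), (0, MAX_FRAMES - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :MAX_FRAMES]
    return mfcc[..., np.newaxis]  # add channel dim: (40, 128, 1)

def build_model(num_classes=8):
    """Small 2D CNN that treats the MFCC matrix as a one-channel image."""
    model = tf.keras.Sequential([
        layers.Input(shape=(N_MFCC, MAX_FRAMES, 1)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Treating the MFCC matrix as a 2D image is one common way to apply convolutions to audio; 1D convolutions over raw features are an equally valid alternative.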

Key Achievements

  • Trained a CNN on a combined dataset spanning four major emotional speech databases (CREMA-D, RAVDESS, SAVEE, TESS)
  • Classifies multiple emotion types from speech patterns
  • Implemented a custom audio recording module for real-time testing (a minimal sketch follows this list)
  • Demonstrated practical applications in call center and vehicle safety scenarios
  • Created reproducible training and testing notebooks
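The recording module itself is not shown in this README; the following is a hedged sketch of what such a helper could look like, assuming the sounddevice and soundfile libraries (the project's actual dependencies may differ).

```python
# Illustrative recording helper; sounddevice/soundfile are assumptions,
# not confirmed project dependencies.
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000  # assumed to match the training sample rate

def record_clip(path, seconds=3):
    """Record `seconds` of mono audio from the default mic and save it."""
    audio = sd.rec(int(seconds * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1)
    sd.wait()  # block until the recording finishes
    sf.write(path, audio, SAMPLE_RATE)
    return path
```

A clip captured this way can be passed through the same MFCC extraction used at training time before being fed to the model for prediction.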

Challenges & Learning

  • Managing class imbalance across datasets with varying emotion distributions (one common mitigation is sketched after this list)
  • Handling diverse audio quality and recording conditions across datasets
  • Optimizing feature extraction for the CNN architecture on audio data
  • Generalizing the model across different speaker accents and languages
  • Managing the computational cost of processing large audio datasets
  • Validating emotion detection accuracy across diverse speaker populations
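As an illustration of how the class-imbalance challenge above is commonly addressed (not necessarily how this project does it), inverse-frequency class weights can be computed with scikit-learn and passed to Keras during training.

```python
# Common mitigation for class imbalance: weight each class inversely
# to its frequency. Illustrative, not taken from the project.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def make_class_weights(labels):
    """Map each class index to a weight inversely proportional to its count."""
    classes = np.unique(labels)
    weights = compute_class_weight("balanced", classes=classes, y=labels)
    return dict(zip(classes, weights))

# Usage with the CNN sketched earlier (X_train, y_train are hypothetical):
# model.fit(X_train, y_train,
#           class_weight=make_class_weights(y_train), epochs=30)
```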