The majority of studies on human emotion recognition are limited to a single modality, namely facial expressions or speech. This book introduced a multimodal approach that combines audio and visual data to improve the accuracy of emotion recognition. Furthermore, a CNN model was proposed to automatically extract the facial features that uniquely differentiate facial expressions. This method was applied to recognize the cognitive states of learners in e-learning environments, mapping learners' facial expressions to cognitive states such as boredom, confusion, engagement, and frustration.

The objectives are as follows:

- Multimodal feature extraction from face images and speech: geometric-based and SURF features are extracted from face images, and spectral and prosodic features from speech.
- Score-level fusion: the scores obtained from the individual models are combined using the proposed linear weighted fusion approach (a minimal sketch follows this list).
- Cognitive state recognition: a Hybrid CNN model is proposed to recognize learners' cognitive states in e-learning environments (an illustrative sketch is given after the fusion example).
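
Score-level linear weighted fusion combines the per-class scores of the audio and visual models with weights that sum to one. The sketch below is a minimal illustration in Python/NumPy; the class labels, score values, and weight settings are assumptions for illustration, not the values reported in this book.

```python
import numpy as np

# Hypothetical class labels, for illustration only.
CLASSES = ["angry", "happy", "neutral", "sad", "surprise"]

def linear_weighted_fusion(audio_scores, visual_scores, w_audio=0.4, w_visual=0.6):
    """Combine per-class scores from the audio and visual models.

    audio_scores, visual_scores : 1-D arrays of per-class scores or probabilities.
    w_audio, w_visual           : modality weights, assumed to sum to 1.
    """
    audio_scores = np.asarray(audio_scores, dtype=float)
    visual_scores = np.asarray(visual_scores, dtype=float)
    assert abs(w_audio + w_visual - 1.0) < 1e-9, "weights should sum to 1"
    return w_audio * audio_scores + w_visual * visual_scores

# Usage with made-up scores from the two unimodal classifiers.
audio = [0.10, 0.55, 0.20, 0.10, 0.05]
visual = [0.05, 0.70, 0.15, 0.05, 0.05]
fused = linear_weighted_fusion(audio, visual)
print(CLASSES[int(np.argmax(fused))])  # -> "happy" for these illustrative scores
```

In practice the modality weights would be tuned, for example by selecting the combination that maximizes accuracy on a validation set.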
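
The book's Hybrid CNN architecture is not reproduced here; the following is only a minimal Keras sketch of a CNN that maps face crops to the four cognitive states named above. The input size, layer widths, and optimizer are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_STATES = 4            # boredom, confusion, engagement, frustration
INPUT_SHAPE = (48, 48, 1)  # assumed grayscale face-crop size, for illustration only

def build_cnn(input_shape=INPUT_SHAPE, num_classes=NUM_STATES):
    """Minimal CNN classifier sketch; not the book's Hybrid CNN architecture."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
```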