Akhilesh Rawat, Siddharth Chordia, Saksham Jain and Khush Mangla, Bennett University, India
Abstract—This paper introduces a real-time American Sign Language (ASL) recognition system built on a hybrid deep learning model that integrates 1D Convolutional Neural Networks (Conv1D), Bidirectional Long Short-Term Memory (BiLSTM) layers, and a custom soft attention mechanism. The model processes 60-frame sequences of 3D pose and hand landmarks (258 features per frame) extracted with Mediapipe, allowing the system to learn both spatial and temporal gesture patterns without relying on raw image input. A custom dataset of 6,800 gesture sequences across 17 classes (16 ASL signs and an idle state) was curated using spatial data augmentation techniques such as mirroring and rotational transformations to ensure robustness across diverse users, lighting conditions, and camera angles. The model achieved a testing accuracy of 95.88%, with 96.87% precision and 95.88% recall, and outperformed GRU, RNN, and CNN-RNN baselines in comparative evaluations. Despite its architectural complexity, the system runs efficiently on low-power hardware such as a MacBook Air M2, enabling real-time inference suitable for integration into live video conferencing platforms. The proposed solution offers a scalable, lightweight, and accessible approach to gesture-based communication, advancing the practicality of deep learning in inclusive assistive technologies.

Index Terms—LSTM, GRU, RNN, ASL Gesture, BiLSTM, Mediapipe.
Keywords—ASL Gesture Recognition, Hybrid Deep Learning, BiLSTM with Attention, Mediapipe Landmarks, Real-Time Inference.
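To make the data flow concrete, the sketch below shows a soft temporal attention step of the kind the abstract describes, applied to the paper's input shape: a 60-frame sequence with 258 landmark features per frame. This is a minimal NumPy illustration under stated assumptions; the scoring vector `w`, the single-vector pooling, and the function name `soft_attention` are illustrative stand-ins, not the authors' exact layer (which sits on top of Conv1D and BiLSTM outputs rather than raw landmarks).

```python
import numpy as np

SEQ_LEN, FEATURES = 60, 258  # per the paper: 60 frames x 258 Mediapipe features

def soft_attention(hidden, w):
    """Collapse a (T, D) sequence into one (D,) context vector by
    weighting each time step with a softmax alignment score.

    hidden: (T, D) sequence of per-frame feature vectors
    w:      (D,) learnable scoring vector (randomly set here for illustration)
    """
    scores = hidden @ w                       # (T,) raw alignment scores
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()                  # weights over time sum to 1
    return weights @ hidden                   # (D,) attention-pooled context

rng = np.random.default_rng(0)
seq = rng.normal(size=(SEQ_LEN, FEATURES))    # stand-in for recurrent outputs
w = rng.normal(size=FEATURES)
context = soft_attention(seq, w)
print(context.shape)  # (258,)
```

In a full model, `context` would feed a dense softmax layer over the 17 gesture classes; the attention weights also offer a simple way to inspect which frames of a gesture the classifier relies on.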