Visual Question Answering (VQA) is a challenging problem in Artificial Intelligence (AI) that requires an understanding of natural language and computer vision to answer questions about the visual content of images. Research on VQA has gained immense traction due to its wide range of applications, such as aiding visually impaired individuals, enhancing human-computer interaction, and enabling content-based image retrieval systems. While there has been extensive research on VQA, most of it has focused predominantly on English, often overlooking the complexities of low-resource languages such as Bengali. To facilitate research in this area, we have developed a large-scale Bengali Visual Question Answering (BVQA) dataset by harnessing the in-context learning abilities of Large Language Models (LLMs). Our BVQA dataset comprises around 17,800 diverse open-ended QA pairs generated from the human-annotated captions of ≈3,500 images. Replicating existing VQA systems for a low-resource language poses significant challenges due to the complexity of their architectures and their adaptations to particular languages. To overcome this challenge, we propose the Multimodal CRoss-Attention Network (MCRAN), a novel framework that leverages pretrained transformer architectures to encode visual and textual information. Furthermore, our method uses a multi-head attention mechanism to generate three distinct vision-language representations and fuses them with a gating mechanism to answer the query about an image. Extensive experiments on the BVQA dataset show that the proposed method outperformed the existing baseline across various answer categories.
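To make the cross-attention-plus-gated-fusion idea concrete, the following is a minimal PyTorch sketch, not the released MCRAN implementation: the module names, hidden dimension, choice of the three attention directions, and the softmax gate are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionGatedFusion(nn.Module):
    """Illustrative sketch: three cross-attended vision-language views fused by a gate."""

    def __init__(self, dim=768, num_heads=8, num_answers=1000):
        super().__init__()
        # Three multi-head attention blocks producing distinct vision-language views (assumed design)
        self.txt_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.joint_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate that weights the three pooled views before classification
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, dim) from a pretrained vision transformer
        # txt_feats: (B, N_txt, dim) from a pretrained language model
        v1, _ = self.txt_attends_img(txt_feats, img_feats, img_feats)  # text queries image
        v2, _ = self.img_attends_txt(img_feats, txt_feats, txt_feats)  # image queries text
        joint = torch.cat([img_feats, txt_feats], dim=1)
        v3, _ = self.joint_self_attn(joint, joint, joint)              # joint self-attention
        # Pool each view to a single vector and fuse with learned gate weights
        p1, p2, p3 = v1.mean(1), v2.mean(1), v3.mean(1)
        w = self.gate(torch.cat([p1, p2, p3], dim=-1))                 # (B, 3)
        fused = w[:, 0:1] * p1 + w[:, 1:2] * p2 + w[:, 2:3] * p3
        return self.classifier(fused)                                  # answer logits
```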
Core Research Areas
- Multimodal Deep Learning
- Large Language Models (LLMs)
- AI for Cybersecurity
- Text Classification
- Computer Vision
Selected Publications
The importance of Bangla Sign Language (BdSL) is growing as the number of users continues to increase, yet research in this field remains limited. This study emphasizes the significance of BdSL and proposes an
automated interpreter system for detecting and recognizing signs. To address the scarcity of established
datasets, we developed a custom video dataset comprising 784 sample videos to facilitate empirical research and
model training. These videos capture dynamic signs, enabling more effective detection. We experimented with
various deep learning models, incorporating diverse pre-processing and frame extraction techniques. Our proposed
3D Convolutional Neural Network (3D CNN) architecture outperformed other models, achieving an accuracy of 82.6%.
The model was further evaluated using multiple performance metrics and tested on several recorded videos. Overall,
the classification results were robust across all sign classes, demonstrating the effectiveness of our approach.
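The abstract only names a 3D CNN at a high level, so the following is a hedged, minimal sketch of a 3D convolutional video classifier of the kind described; the layer widths, clip size, and class count are placeholders, not the published architecture.

```python
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    """Illustrative 3D CNN for short sign-video clips (not the exact published model)."""

    def __init__(self, num_classes=30):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),  # (B, 3, T, H, W) -> (B, 32, T, H, W)
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                     # downsample spatially first
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),                             # downsample over time and space
            nn.AdaptiveAvgPool3d(1),                     # global average pooling
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clips):
        # clips: (batch, channels, frames, height, width), e.g. (8, 3, 16, 112, 112)
        x = self.features(clips).flatten(1)
        return self.classifier(x)
```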
Undergraduate Thesis
Answering questions about visual content, known as Visual Question Answering (VQA), is a substantial challenge
in the field of Artificial Intelligence (AI). This involves an AI agent responding to questions based on the
visual information contained in given images. While there has been extensive research on VQA in English, exploration of this area in the Bengali language has been limited by the scarcity of available datasets. To overcome this challenge, we have developed a Bengali VQA dataset, named BengaliVisQA, using a novel algorithm centered on prompt engineering. Images for this dataset were
gathered from the BanglaLekhaImageCaptions dataset. This initiative aims to pave the way for advancements in
artificial intelligence within the Bengali language context. Recent models often struggle to focus on the relevant image region mentioned in the question, leading to inaccurate answers. To overcome
this limitation, we introduce a novel approach called BanVQA-Net, a multi-head attention-based multimodal fusion
network designed to enhance answer prediction. Our model leverages a CNN, specifically ResNet50, to encode input images, while Bangla BERT is employed for question encoding and feature extraction. Finally, we evaluate our
BanVQA-Net model by comparing it to two existing models that consider detailed relationships at various levels:
word, region, and interaction. Additionally, our model can focus on different visual and textual elements
separately, which is crucial for producing accurate answers. The significance of this study lies in its relevance to Bengali speakers who use VQA-based smart systems, such as virtual doctors, navigation systems, and smart glasses for visually impaired individuals, improving the usability and comprehensibility of these applications for those who may not be proficient in foreign languages such as English.
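As a rough illustration of the ResNet50 + Bangla BERT fusion pipeline described above, here is a minimal sketch, not the actual BanVQA-Net code: the Bangla BERT checkpoint name, hidden size, number of attention heads, answer-vocabulary size, and pooling strategy are all assumptions made for the example.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import AutoModel

class MultimodalFusionVQA(nn.Module):
    """Illustrative ResNet50 + Bangla BERT fusion model (a sketch, not BanVQA-Net itself)."""

    def __init__(self, bert_name="sagorsarker/bangla-bert-base",  # assumed checkpoint
                 dim=768, num_answers=1000):
        super().__init__()
        # Image encoder: ResNet50 without its pooling and classification head
        cnn = resnet50(weights="IMAGENET1K_V2")
        self.image_encoder = nn.Sequential(*list(cnn.children())[:-2])  # (B, 2048, 7, 7)
        self.img_proj = nn.Linear(2048, dim)
        # Question encoder: a pretrained Bangla BERT model
        self.text_encoder = AutoModel.from_pretrained(bert_name)
        # Multi-head attention: question tokens attend to image regions
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, images, input_ids, attention_mask):
        # Encode the image into a grid of region features: (B, 49, dim)
        feats = self.image_encoder(images)
        regions = self.img_proj(feats.flatten(2).transpose(1, 2))
        # Encode the question tokens: (B, L, dim)
        tokens = self.text_encoder(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
        # Question attends to image regions, then pool and classify
        attended, _ = self.cross_attn(tokens, regions, regions)
        return self.classifier(attended.mean(dim=1))
```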