Journal of Data Science and Artificial Intelligence

Protein secondary structure prediction based on LSTM neural network approach

2026-04-30T12:10:24-04:00

Proteins are crucial for maintaining cellular, organ, and tissue structure and function in the body, yet predicting protein structure and function from sequence remains challenging. Computational methods are essential for predicting protein properties. In Vietnam, protein technology and bioinformatics have gained attention, especially during the Covid-19 pandemic. This study presents a deep learning approach using a four-layer LSTM model and protein datasets to predict protein secondary structure. The proposed method incorporates RPCA to reduce dimensionality, eliminate errors and outliers, and enhance machine learning model effectiveness. The LSTM model achieves an accuracy of 88.73%, surpassing modern methods. While limitations and challenges exist, this research contributes to organizing knowledge and building an experimental program for protein function prediction. The proposed method provides a valuable tool for accurately predicting protein function based on secondary structure. Future studies that combine deep learning with projected protein sequences and structures hold promise for further advances in protein function prediction.

CycleGAN-Based Drunk Synthesis and Attention-Enhanced MobileNetV2 for Driver State Recognition

2025-12-03T02:37:40-05:00

Driver monitoring plays a critical role in intelligent transportation systems, yet detecting fatigue, alcohol impairment, and distraction remains highly challenging due to their subtle and overlapping cues. A major obstacle is the scarcity of annotated alcohol-related data, which limits the training of robust classifiers. This work presents a practical framework that combines adversarial data augmentation with lightweight deep learning for unified driver state recognition. A CycleGAN-based pipeline synthesizes alcohol-impaired facial images from fatigue samples, introducing physiologically motivated effects such as skin flushing, gaze irregularities, and periocular redness. For classification, a MobileNetV2 backbone enhanced with Squeeze-and-Excitation (SE) attention adaptively emphasizes critical channels while remaining computationally efficient. Evaluated on a curated seven-class dataset, the model achieves 97.67% accuracy with a test loss of 0.0655, confirming the viability of adversarial augmentation coupled with lightweight CNNs for real-time driver monitoring.
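The channel re-weighting performed by a Squeeze-and-Excitation block can be illustrated with a minimal pure-Python sketch. It follows the generic SE recipe (global average pooling, a two-layer bottleneck, sigmoid gating) and is not the authors' implementation; the function name `se_reweight`, the weight matrices `w1` and `w2`, and the omission of bias terms are assumptions made for illustration.

```python
import math


def se_reweight(feature_maps, w1, w2):
    """Apply a generic Squeeze-and-Excitation gate to a stack of channels.

    feature_maps: list of C channels, each an H x W list of lists.
    w1: (C/r) x C reduction weights (hypothetical values, biases omitted).
    w2: C x (C/r) expansion weights (hypothetical values, biases omitted).
    Returns the re-scaled channels and the per-channel gates.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_maps, gates)
    ]
    return scaled, gates


# Toy example: two 2x2 channels and a bottleneck of width one.
maps = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
scaled_maps, channel_gates = se_reweight(maps, w1=[[1.0, 0.0]], w2=[[0.0], [1.0]])
```

In a MobileNetV2 backbone the gates would be learned jointly with the convolutional weights; here they are fixed only to show the data flow.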

Comprehensive Business Ranking in Vietnam Utilizing Adjusted Reciprocal Rank Fusion

2026-03-05T14:25:19-05:00

Vietnam is currently one of the fastest-growing economies in the world, with companies racing to become the best. However, the lack of a comprehensive ranking of corporate performance in Vietnam makes it difficult to measure business success as a whole, and relying on individual ranking systems may not capture all aspects. This study addresses the limitations of individual ranking systems by creating a proof-of-concept comprehensive ranking framework for 243 Vietnamese companies using an adjusted Reciprocal Rank Fusion (RRF) algorithm that integrates twelve domestic and international business rankings. The Adjusted RRF uses AI-generated reputation scores from the large language model GPT-o4, revised by humans, to determine source weights, and applies time-decay factors to account for the temporal relevance of ranking data. The Adjusted RRF produces significant ranking shifts, and the adjusted model improves the alignment between the ranking score and real-life firm-level attributes compared with the original RRF. Statistical analysis using multiple linear regression and predictive power score testing identified total assets, net revenue, and number of employees as significant predictors of ranking performance, with predictive power scores exceeding 0.25 for all variables tested. The adjusted ranking model has the potential to provide a more accurate representation of current business performance in the Vietnamese market, offering a valuable reference for job seekers, policymakers, and recruiters in identifying reliable companies in Vietnam.
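A weighted, time-decayed variant of Reciprocal Rank Fusion can be sketched as follows. The abstract does not give the exact formula, so this is an assumed form: each source contributes `weight * decay / (k + rank)`, with the conventional RRF constant `k = 60` and an exponential half-life decay. The function name `adjusted_rrf`, the source names, the weights, and the half-life are all hypothetical.

```python
def adjusted_rrf(rankings, weights, ages_years, k=60, half_life_years=2.0):
    """Fuse several ranked lists into one score per company.

    rankings: {source: [company, ...]} ordered best first.
    weights: {source: float} source reputation weights (hypothetical values).
    ages_years: {source: float} age of each ranking, used for time decay.
    The exponential half-life decay is an assumption, not the paper's formula.
    """
    scores = {}
    for source, ordered in rankings.items():
        # Older rankings contribute less: the factor halves every half-life.
        decay = 0.5 ** (ages_years[source] / half_life_years)
        for rank, company in enumerate(ordered, start=1):
            contribution = weights[source] * decay / (k + rank)
            scores[company] = scores.get(company, 0.0) + contribution
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy example with two hypothetical sources.
fused = adjusted_rrf(
    rankings={"list_a": ["A", "B", "C"], "list_b": ["B", "A"]},
    weights={"list_a": 1.0, "list_b": 0.5},
    ages_years={"list_a": 0.0, "list_b": 2.0},
)
```

With equal weights and no decay, companies A and B would tie in this toy example; the down-weighted and older `list_b` breaks the tie in favour of A, which is the kind of ranking shift the adjustment is meant to produce.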

An Adaptive ROI-Based Framework for Traffic Congestion Detection in Mixed Traffic Environments Using Deep Learning

2026-04-27T10:07:44-04:00

Traffic congestion is a critical issue in urban areas, leading to increased travel time, fuel consumption, and environmental pollution. This paper proposes a real-time traffic congestion detection system based on deep learning and multi-object tracking. The system utilizes the YOLOv8 model for vehicle detection and ByteTrack for tracking vehicles across video frames. A region of interest (ROI) is defined to focus on relevant traffic areas, and congestion is determined using two key metrics: vehicle density and a Motorcycle Equivalent Unit (MEU)-based traffic representation. Experimental results demonstrate that the proposed system achieves competitive detection accuracy, with a mean Average Precision (mAP@0.5) of 0.911 for vehicle detection. The system is able to distinguish between congested and non-congested traffic conditions in real-world scenarios. These results indicate that the proposed approach is suitable for intelligent traffic monitoring and smart city applications.
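The ROI filtering and MEU-based density described above can be sketched in a few lines. This is an illustrative outline only: the MEU factors, the density threshold, and the function names are hypothetical and are not taken from the paper, and detections are assumed to arrive as class labels with box-centre coordinates from an upstream detector and tracker.

```python
# Hypothetical MEU factors; the values are illustrative, not from the paper.
MEU_FACTORS = {"motorcycle": 1.0, "car": 4.0, "bus": 10.0, "truck": 10.0}


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_area(polygon):
    """Shoelace formula for the area of a simple polygon."""
    n = len(polygon)
    total = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def congestion_state(detections, roi, meu_density_threshold):
    """detections: list of (class_name, cx, cy) box centres for one frame."""
    in_roi = [d for d in detections if point_in_polygon(d[1], d[2], roi)]
    meu_total = sum(MEU_FACTORS.get(name, 1.0) for name, _, _ in in_roi)
    meu_density = meu_total / polygon_area(roi)
    return {
        "vehicles": len(in_roi),
        "meu": meu_total,
        "meu_density": meu_density,
        "congested": meu_density >= meu_density_threshold,
    }


# Toy frame: a square ROI with one car and one motorcycle inside, one bus outside.
roi = [(0, 0), (10, 0), (10, 10), (0, 10)]
frame = [("car", 5, 5), ("motorcycle", 2, 3), ("bus", 20, 5)]
state = congestion_state(frame, roi, meu_density_threshold=0.04)
```

Weighting by MEU rather than counting boxes lets a frame with a few large vehicles and a frame with many motorcycles be compared on one scale, which matters in mixed traffic.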

Modeling and Application of Explainable Artificial Intelligence for Stroke Prediction

2026-05-20T04:38:46-04:00

Stroke remains a major global health concern, contributing significantly to mortality and long-term disability. Early and accurate prediction can improve patient outcomes; however, traditional machine learning (ML) models often lack transparency. In this study, we develop a stroke prediction framework that combines machine learning algorithms with Explainable Artificial Intelligence (XAI) techniques to improve both predictive performance and model interpretability. By implementing and comparing three ML algorithms (Random Forest, Support Vector Machine, and Logistic Regression) alongside two XAI methods, SHAP and LIME, this study offers a pathway toward interpretable and trustworthy AI in medical contexts. Experimental results on the held-out test set showed that Random Forest achieved the best performance, with 92.8% accuracy, 95.3% recall, 90.7% precision, 92.9% F1-score, and 92.8% ROC AUC. Furthermore, SHAP analysis identified age, average glucose level, and BMI as the most influential features, while LIME provided instance-level insights into individual predictions. The findings suggest that combining machine learning with explainability techniques can support more transparent stroke risk prediction and may assist clinical decision-making when further validated on larger and more diverse clinical datasets.

Machine Learning-based RFID Reader for Power Recommendation to assist Attendance

2025-06-08T10:06:32-04:00

Ultra-high-frequency radio frequency identification (UHF-RFID) technology provides a promising and cost-effective solution for tracking and positioning applications. However, its performance is often affected by signal attenuation, environmental variability, and radio interference. Among the influencing factors, transmission power plays a critical role in the successful detection of RFID tags. Insufficient transmission power can lead to missed reads, while excessive power may result in signal collisions or interference with nearby systems, compromising system reliability and scalability. To address these detection challenges, this paper proposes a custom-designed UHF reader for a detection framework that (i) features circular polarization to improve flexibility in tag orientation and reduce polarization mismatch, and (ii) integrates machine learning (ML) so that the system can autonomously classify environmental conditions and adjust power levels accordingly. This paper also proposes an ML-supported UHF-RFID detection framework that improves the accuracy of tag classification and automates the adjustment of transmission power in various indoor and outdoor scenarios. By leveraging the RSSI dataset and introducing an extended dataset enriched with new features and conditions, we employ ML to optimize power consumption and improve detection efficiency. Furthermore, to ensure the confidentiality of collected data during transmission, encryption and decryption of the CSV (comma-separated values) files are proposed. Lastly, performance analyses of supervised learning, unsupervised learning, and deep learning are used to make recommendations on model and device selection in RFID environments, clustering of RFID tags, and usage in indoor and outdoor environments, respectively. These recommendations serve as practical guidelines for deploying RFID-based classification systems with improved accuracy and environmental adaptability.
Our experiments confirm that attendance rates are notably high, with median values consistently ranging between 80% and 95%, reflecting a strong level of student engagement and participation.

Design and Manufacturing of a Control System for a Pneumatic Robotic Arm for Training Purposes

2025-09-21T23:49:18-04:00

This study focuses on designing and manufacturing a new control system for an industrial pneumatic robotic arm, specifically to meet the practical training needs of students in the field of mechatronics. The primary objective of this research is to build an "open" control system, allowing learners to freely connect input and output signals to program the robotic arm according to specific operational requirements. This design enhances students' autonomy and understanding of control systems. It supports flexible and easy adjustments to the robot’s operation, helping students gain a stronger grasp of robotic arm control and operation in a real industrial setting. The research methodology consists of three main steps. First, the study applies a programmable logic controller (PLC) as the system’s foundation. The PLC was chosen for its flexibility and practical relevance, making it easier for students to engage with industrial applications. Second, the study explores the structure of the pneumatic drive system, control factors, and necessary conditions for stable operation in the robotic arm’s pneumatic system. Finally, the research identifies the essential actuators and piston position sensors to ensure effective operation of the robotic arm. The findings demonstrate that the new control system, developed using PLC technology, can flexibly meet practical training requirements and holds great potential for application in technical education. This system not only fulfills basic control needs but also provides extended functionality, allowing students to easily modify the system to practice programming skills for controlling pneumatic robotic arms. As a result, it lays a solid foundation for training in the field of mechatronics.

Skin Cancer Detection on Smartphone Images through Knowledge Distillation of Multimodal Deep Learning Models

2025-06-18T22:31:35-04:00

Skin cancer remains a critical global health challenge, particularly in regions with limited access to specialized diagnostic tools. This study presents an innovative approach for skin lesion classification using non-dermoscopic smartphone images, leveraging knowledge distillation to enhance model efficiency and accuracy. We utilize the PAD-UFES-20 dataset, which comprises 2,298 smartphone-captured images of six distinct skin lesion types, accompanied by comprehensive patient metadata. Our methodology involves a teacher–student framework, where a ConvNeXt-based teacher model integrated with Convolutional Block Attention Modules (CBAM) and a metadata encoder transfers its learned representations to a more compact EfficientNet-B0 student model. The distillation process incorporates logit matching, feature similarity, and attention transfer, enabling the student model to achieve performance parity with the teacher while significantly reducing computational overhead. Experimental results demonstrate that the student model attains an accuracy of 80.43% and a weighted F1-score of 80.16%, closely mirroring the teacher's performance. Additionally, the integration of metadata and attention mechanisms substantially improves classification robustness, particularly for underrepresented lesion categories. The proposed framework effectively addresses class imbalance through the application of focal loss, enhancing the model's ability to detect clinically significant but less frequent skin lesions. This approach offers a viable solution for deploying accurate skin cancer diagnostic tools on resource-constrained mobile devices, thereby expanding access to essential healthcare services in underserved communities.
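The interaction between focal loss and logit matching can be illustrated with a small pure-Python sketch for a single sample. It covers only those two terms; the feature-similarity and attention-transfer terms mentioned above are omitted. The temperature, `gamma`, and mixing weight `alpha` are hypothetical defaults, and the way the two terms are combined is an assumption rather than the authors' objective.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(student_logits, target, gamma=2.0):
    """Focal loss: cross-entropy down-weighted for well-classified samples."""
    p_target = softmax(student_logits)[target]
    return -((1.0 - p_target) ** gamma) * math.log(p_target)


def logit_distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


def student_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Assumed combination: weighted sum of the hard-label and soft-label terms."""
    hard = focal_loss(student_logits, target)
    soft = logit_distillation_loss(student_logits, teacher_logits)
    return alpha * hard + (1.0 - alpha) * soft


# Toy example: a six-class sample where the student disagrees with the teacher.
teacher = [4.0, 1.0, 0.5, 0.2, 0.1, 0.0]
student = [1.0, 2.0, 0.5, 0.2, 0.1, 0.0]
loss = student_loss(student, teacher, target=0)
```

The focal term concentrates the hard-label signal on samples the student still gets wrong, which is why it is commonly paired with imbalanced datasets, while the softened KL term passes on the teacher's relative confidence across classes.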

Assessing HIV Mortality Trends in Relation to Prevalence and Diagnosis Rates: A Unified Approach with Machine Learning and Optimal Design

2025-10-05T22:47:42-04:00

HIV/AIDS remains a significant public health challenge in the world, despite advancements in treatment and prevention over the past few decades. Owing to this fact, we study the relationship between HIV mortality trends, prevalence, and diagnosis rates by employing machine learning methods and optimal design theory. Using advanced machine learning models such as Multiple Linear Regression (MLR), Random Forest (RF), Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGBoost), and Support Vector Regression (SVR), we predicted HIV-related death rates. The analysis incorporates I-optimal design, a powerful methodology that minimizes prediction variance to enhance model optimization and reliability. Data preprocessing ensured high-quality inputs by addressing missing values, standardizing variables, and handling outliers. The findings reveal that SVR outperformed other models with the lowest mean squared error and the highest R². Moreover, integrating I-optimal design improved linear model performance. These results highlight the importance of aligning data design methodologies with model complexity to inform public health interventions. The study underscores the value of optimal design and machine learning in guiding evidence-based resource allocation and improving health outcomes for HIV-affected populations.

A Deterministic Extractive Framework for Evidence-Grounded Scientific Claim Verification

2026-01-25T21:59:22-05:00

Large language models (LLMs) are increasingly applied to scientific question answering, yet their outputs often contain statements lacking explicit evidence. In high-stakes domains such as biomedicine, ensuring traceability to source documents is essential for interpretability and reliability. While retrieval-augmented generation (RAG) systems leverage external documents, most pipelines do not strictly enforce evidence dependence during answer construction.

We introduce Citation-Driven Extractive Claim Verification (CD-ECV), a deterministic, non-generative extractive framework for scientific claim verification; it is not a hallucination-reduction system for generative LLMs. CD-ECV retrieves biomedical literature via sparse lexical ranking (BM25), applies sentence-level lexical and semantic filtering, and constructs responses exclusively from unmodified evidence spans. Importantly, CD-ECV guarantees source traceability, not factual truth: if retrieved evidence is incorrect or outdated, outputs remain traceable but may not reflect scientific ground truth.
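The retrieve-then-extract flow described above can be sketched with a small, deterministic pure-Python example. It uses a common BM25 formulation with typical defaults (`k1 = 1.5`, `b = 0.75`) and a simple token-overlap filter; the semantic filtering step is omitted. The function names, the `min_overlap` threshold, and the abstention rule are assumptions for illustration, not the CD-ECV implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1.0 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def extract_evidence(claim, documents, min_overlap=0.5):
    """Return verbatim sentences from the top document, or None to abstain."""
    claim_terms = set(tokenize(claim))
    scores = bm25_scores(claim, documents)
    if not claim_terms or not scores:
        return None
    best = max(range(len(documents)), key=scores.__getitem__)
    if scores[best] == 0.0:
        return None  # nothing lexically related was retrieved
    sentences = re.split(r"(?<=[.!?])\s+", documents[best])
    kept = [
        s for s in sentences
        if len(claim_terms & set(tokenize(s))) / len(claim_terms) >= min_overlap
    ]
    return kept or None


# Toy corpus of two hypothetical passages.
corpus = [
    "Aspirin reduces the risk of heart attack. It was first sold in 1899.",
    "Vitamin C does not prevent the common cold.",
]
evidence = extract_evidence("Aspirin reduces heart attack risk", corpus)
abstained = extract_evidence("Zinc cures influenza", corpus)
```

Because every returned string is a slice of a retrieved passage, the output is traceable to its source by construction, and the same inputs always yield the same output. Whether that evidence is correct is a separate question, as the paragraph above notes.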

We evaluate CD-ECV on the SciFact benchmark (300 claims labelled SUPPORT, CONTRADICT, or NOT_ENOUGH_INFO), using a corpus of 5183 biomedical passages. Metrics include retrieval recall, evidence selection precision, label accuracy, and abstention rate. All non-abstaining outputs consist entirely of verbatim retrieved evidence spans. These results establish CD-ECV as a deterministic, non-generative extractive baseline for citation-grounded scientific claim verification, providing a transparent and reproducible reference point and enabling future integration with neural validation or generative components.