Self-consistency and Graph-based Filtering to Enhance Synthetic Arabic SMS Generation for Smishing Detection

Amal Alotaibi; Miada Almasre; Hadeel Surougi; Mona Alkhozae; Nouf Alghanmi

doi:10.47738/jads.v7i1.1033

Abstract

Smishing (SMS phishing) is a growing threat in mobile security, and Arabic-speaking regions are particularly vulnerable because large, labeled datasets are lacking. This study aims to develop a scalable pipeline that generates and classifies Arabic SMS messages, both to address this data scarcity and to improve detection performance. The contributions are threefold: (i) constructing a balanced dataset of 6,903 messages by combining 903 synthetic samples with 6,000 real Arabic SMS messages; (ii) introducing a hybrid generation framework that integrates a fine-tuned GPT-3.5-turbo language model with Conditional WGAN embeddings, refined using self-consistency sampling and graph-based redundancy filtering; and (iii) evaluating the dataset with machine learning models (Logistic Regression, Random Forest, SVM) and deep learning models (CNN, BERT). The pipeline unifies adversarial embedding generation, large language model fine-tuning, and cosine similarity filtering. Experimental results show consistently strong performance: Logistic Regression and Random Forest both achieved an accuracy of 0.9949 and an F1-score of 0.9950, while SVM outperformed all other models with an accuracy of 0.9957 and an F1-score of 0.9957. Among the deep learning models, CNN reached an accuracy of 0.9942 and an F1-score of 0.9942, and BERT achieved 0.9900 across all metrics. These findings indicate that SVM is the most effective model for this dataset, while CNN and BERT add robustness by capturing semantic subtleties. Visual analyses, including confusion matrices and t-SNE projections, validated the overlap between real and synthetic embeddings, while comparative tables positioned this study within the context of recent Arabic smishing research. The novelty of this work lies in combining self-consistency and graph-based filtering within a hybrid generation-classification pipeline tailored for Arabic SMS, providing a reproducible framework that can be extended to low-resource, multilingual, and cross-platform environments such as WhatsApp and Telegram.
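
The abstract does not specify how the graph-based redundancy filtering is implemented. As an illustration only, the following minimal sketch shows one common way to combine cosine similarity with a graph: messages whose embeddings exceed a similarity threshold are linked by an edge, and a single representative is kept from each connected component. The function names, the threshold value, and the toy two-dimensional vectors are assumptions made for this example and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_redundant(embeddings: Sequence[Sequence[float]],
                     threshold: float = 0.9) -> List[int]:
    """Return indices of messages to keep, one per near-duplicate cluster.

    Hypothetical sketch: an edge joins two messages when their cosine
    similarity is at least `threshold`; each connected component of that
    graph is reduced to its lowest-index member.
    """
    n = len(embeddings)
    parent = list(range(n))  # union-find forest over message indices

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Attach the larger root to the smaller one so the root
                    # of every component is its lowest index.
                    parent[max(ri, rj)] = min(ri, rj)

    return sorted({find(i) for i in range(n)})


# Toy embeddings (assumed values): items 0 and 1 are near-duplicates,
# as are items 2 and 3.
toy = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.02, 1.0]]
print(filter_redundant(toy, threshold=0.95))  # prints [0, 2]
```

The pairwise comparison is quadratic in the number of messages, which is acceptable for a few hundred synthetic samples; a larger corpus would typically call for approximate nearest-neighbour search instead.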

Keywords

Smishing Detection; Synthetic Data Augmentation; Large Language Models (LLMs); Arabic SMS Generation; Conditional WGAN

Journal of Applied Data Sciences

ISSN: 2723-6471 (Online)
Collaborated with: Computer Science and Systems Information Technology, King Abdulaziz University, Kingdom of Saudi Arabia
Publisher: Bright Publisher
Website: http://bright-journal.org/JADS
Email: taqwa@amikompurwokerto.ac.id (principal contact); support@bright-journal.org (technical issues)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
