A Multimodal Voice Phishing Detection System Integrating Text and Audio Analysis (open access)
- Authors
- Kim, Jiwon; Gu, Seuli; Kim, Youngbeom; Lee, Sukwon; Kang, Changgu
- Issue Date
- Oct-2025
- Publisher
- MDPI
- Keywords
- voice phishing detection; multimodal learning; audio forgery analysis; transformer-based text classification
- Citation
- Applied Sciences-basel, v.15, no.20
- Indexed
- SCIE; SCOPUS
- Journal Title
- Applied Sciences-basel
- Volume
- 15
- Number
- 20
- URI
- https://scholarworks.gnu.ac.kr/handle/sw.gnu/80791
- DOI
- 10.3390/app152011170
- ISSN
- 2076-3417
- Abstract
- Voice phishing has emerged as a critical security threat, exploiting both linguistic manipulation and advances in synthetic speech technologies. Traditional keyword-based approaches often fail to capture contextual patterns or detect forged audio, limiting their effectiveness in real-world scenarios. To address this gap, we propose a multimodal voice phishing detection system that integrates text and audio analysis. The text module employs a KoBERT-based transformer classifier with self-attention interpretation, while the audio module leverages MFCC features and a CNN-BiLSTM classifier to identify synthetic speech. A fusion mechanism combines the outputs of both modalities. Experiments were conducted on real-world call transcripts, phishing datasets, and synthetic voice corpora. The results demonstrate that the proposed system consistently achieves high accuracy, precision, recall, and F1-score on validation data while maintaining robust performance in noisy and diverse real-call scenarios. Furthermore, attention-based interpretability enhances trustworthiness by revealing cross-token and discourse-level interaction patterns specific to phishing contexts. These findings highlight the potential of the proposed system as a reliable, explainable, and deployable solution for preventing the financial and social damage caused by voice phishing. Unlike prior studies limited to single-modality or shallow fusion, our work presents a fully integrated text-audio detection pipeline optimized for Korean real-world datasets and robust to noisy, multi-speaker conditions.
- Appears in Collections
- ETC > Journal Articles
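
The abstract states that a fusion mechanism combines the outputs of the text and audio modules but does not give the rule. As a rough illustration only, the sketch below assumes the simplest form of late fusion: a weighted average of two per-modality probabilities followed by a threshold. The names `fuse_scores`, `FusionResult`, `text_weight`, and `threshold` are hypothetical and are not taken from the paper, and the default values are placeholders rather than reported settings.

```python
from dataclasses import dataclass


@dataclass
class FusionResult:
    score: float        # fused phishing risk in [0, 1]
    is_phishing: bool   # True when the fused score reaches the threshold


def fuse_scores(text_prob: float, audio_prob: float,
                text_weight: float = 0.5,
                threshold: float = 0.5) -> FusionResult:
    """Weighted-average late fusion of two per-modality probabilities.

    Assumption: text_prob is the text classifier's phishing probability and
    audio_prob is the audio classifier's synthetic-speech probability.
    The paper's actual fusion rule may differ.
    """
    checks = (("text_prob", text_prob),
              ("audio_prob", audio_prob),
              ("text_weight", text_weight))
    for name, value in checks:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # Convex combination: the audio weight is the complement of the text weight.
    score = text_weight * text_prob + (1.0 - text_weight) * audio_prob
    return FusionResult(score=score, is_phishing=score >= threshold)


if __name__ == "__main__":
    result = fuse_scores(text_prob=0.9, audio_prob=0.7)
    print(result.is_phishing, round(result.score, 3))
```

In a sketch like this, `text_weight` and `threshold` would normally be tuned on validation data; a learned fusion layer over the two modules' outputs is another common option that the abstract does not rule out.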