Document Type
Thesis
Degree Name
Master of Applied Computing
Department
Physics and Computer Science
Program Name/Specialization
Applied Computing
Faculty/School
Faculty of Science
First Advisor
Dr. Jiashu Zhao
Advisor Role
Supervisor
Abstract
The rapid advancement of generative artificial intelligence, particularly multilingual Large Language Models (LLMs) such as GPT-4, has significantly blurred the distinction between human-authored and machine-generated content. This technological evolution introduces critical challenges for verifying the authenticity of text and attributing its authorship, exacerbating societal issues such as the proliferation of misinformation and compromising academic and professional integrity. Traditional detection methodologies, predominantly monolingual and heuristic-based, have demonstrated inadequate generalizability and efficacy against the sophisticated, multilingual capabilities of contemporary generative models.
This thesis addresses two major problems arising from these advancements. Firstly, it introduces novel multilingual detection methodologies explicitly designed to differentiate human-written text from machine-generated content across diverse languages. We present two innovative approaches: a transformer-based hybrid learning framework leveraging multilingual pretrained language models (PLMs), and a stylometric classifier specifically designed for interpretability and low-resource environments. Extensive experiments conducted on the multilingual MULTITuDE dataset, which encompasses eleven languages, demonstrate that our PLM-based hybrid classifier achieves superior detection accuracy and robustness compared to state-of-the-art methods. Concurrently, the stylometric classifier offers valuable forensic and interpretative insights, performing effectively under computational constraints and resource limitations.
Secondly, the thesis addresses the urgent challenge of combating AI-generated misinformation and fake news through a multitask learning framework that simultaneously classifies textual authenticity (real vs. fake) and authorship (human vs. AI). The proposed Shared-Private Synergy Model (SPSM), alongside hierarchical and prompt-based classifiers, significantly outperforms traditional single-task methods on the newly introduced FAANR dataset. Comprehensive experimentation, including ablation studies and interpretability analyses employing SHAP and LIME techniques, underscores the effectiveness and transparency of these methodologies, supporting stakeholder trust and informed decision-making.
Overall, the thesis substantially contributes to the fields of multilingual machine-generated text detection and AI-driven misinformation classification by providing novel datasets, methodological innovations, and extensive interpretability analyses. While limitations exist concerning computational efficiency, adversarial robustness, and real-world applicability, the research outcomes establish a robust foundation for future studies, ensuring improved societal resilience against misinformation, enhanced academic and professional integrity, and greater transparency in digital communications.
Recommended Citation
Chhatwal, Gurunameh Singh, "Multi-Lingual and Cross-Domain Frontiers in Machine-Generated Content Detection" (2025). Theses and Dissertations (Comprehensive). 2770.
https://scholars.wlu.ca/etd/2770
Convocation Year
2025
Convocation Season
Spring