Document Type

Thesis

Degree Name

Master of Applied Computing

Department

Physics and Computer Science

Program Name/Specialization

Applied Computing

Faculty/School

Faculty of Science

First Advisor

Dr. Jiashu Zhao

Advisor Role

Supervisor

Abstract

The rapid advancement of generative artificial intelligence, particularly Large Language Models (LLMs) such as GPT-4 and their multilingual capabilities, has significantly blurred the distinction between human-authored and machine-generated content. This technological evolution introduces critical challenges concerning the detection and attribution of textual authenticity and authorship, exacerbating societal issues like misinformation proliferation and compromising academic and professional integrity. Traditional detection methodologies, predominantly monolingual and heuristic-based, have demonstrated inadequate generalizability and efficacy against the sophisticated, multilingual capabilities of contemporary generative models.

This thesis addresses two major problems arising from these advancements. Firstly, it introduces novel multilingual detection methodologies explicitly designed to differentiate human-written text from machine-generated content across diverse languages. We present two innovative approaches: a transformer-based hybrid learning framework leveraging multilingual pretrained language models (PLMs), and a stylometric-based classifier specifically designed for interpretability and low-resource environments. Extensive experiments conducted on the multilingual MULTITuDE dataset encompassing eleven languages demonstrate superior detection accuracy and robustness of our PLM-based hybrid classifier compared to state-of-the-art methods. Concurrently, the stylometric classifier offers valuable forensic and interpretative insights, performing effectively under computational constraints and resource limitations.

Secondly, the thesis addresses the urgent challenge of combating AI-generated misinformation and fake news through a multitask learning framework that simultaneously classifies textual authenticity (real vs. fake) and authorship (human vs. AI). The proposed Shared-Private Synergy Model (SPSM), alongside hierarchical and prompt-based classifiers, significantly outperforms traditional single-task methods on the newly introduced FAANR dataset. Comprehensive experimentation, including ablation studies and interpretability analyses employing SHAP and LIME techniques, underscores the effectiveness and transparency of these methodologies, ensuring stakeholder trust and facilitating informed decision-making.

Overall, the thesis substantially contributes to the fields of multilingual machine-generated text detection and AI-driven misinformation classification by providing novel datasets, methodological innovations, and extensive interpretability analyses. While limitations exist concerning computational efficiency, adversarial robustness, and real-world applicability, the research outcomes establish a robust foundation for future studies, ensuring improved societal resilience against misinformation, enhanced academic and professional integrity, and greater transparency in digital communications.

Recommended Citation

Chhatwal, Gurunameh Singh, "Multi-Lingual and Cross-Domain Frontiers in Machine-Generated Content Detection" (2025). Theses and Dissertations (Comprehensive). 2770.
https://scholars.wlu.ca/etd/2770

Convocation Year

2025

Convocation Season

Spring

Included in

Artificial Intelligence and Robotics Commons, Data Science Commons
