Document Type

Thesis

Degree Name

Master of Applied Computing

Department

Physics and Computer Science

Faculty/School

Faculty of Science

First Advisor

Yang Liu

Advisor Role

Provided supervision and feedback on the thesis and scheduled the thesis defense.

Abstract

In recent years, the natural language processing (NLP) community has seen remarkable progress in the development of pre-trained language models (PLMs). The pre-training paradigm of PLMs does not require labeled data, which allows training scale to be pushed further by exploiting the vast amount of freely available online text for self-supervised learning. Language models (LMs) such as GPT, BERT and T5 have achieved strong performance on a wide range of NLP tasks. Meanwhile, research on zero-shot and few-shot text classification has received increasing attention. Since labeling can be costly and time-consuming, performing data augmentation (DA) and enhancing existing frameworks in a more effective and automatic manner remains challenging. The performance of existing zero-shot and few-shot text classification methods is far from satisfactory, and human intervention is often required. Recently, introducing PLMs to address these issues has become a growing trend.

In this thesis, we investigate modern techniques for zero-shot and few-shot text classification and propose a series of novel methods to improve classification performance. For zero-shot text classification, we propose a framework that enriches the domain-specific information required by PLMs. To unleash the power of PLMs pre-trained on massive general-purpose corpora, the framework unifies two LMs for different purposes: 1) expanding the categorical labels required by PLMs by creating coherent, representative samples with GPT2, a language model known for generating fluent text, and 2) augmenting documents with T5, which can synthesize high-quality new samples similar to the original text. The proposed framework can be easily integrated into different general testbeds. For few-shot text classification, we focus on designing data augmentation methods to enlarge the training set. We introduce text-to-text (seq2seq) language models into a DA framework that consists of two phases: a fine-tuning phase, in which PLMs such as T5 and BART are fine-tuned on an unlabeled corpus under two novel schemes, and a generation phase, in which the fine-tuned text-to-text models synthesize new samples that lift classifier performance. We systematically design the two new fine-tuning schemes to be tailored for DA. Following this idea, other downstream NLP tasks can also benefit from the framework.
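For readers unfamiliar with text-to-text data augmentation, the following is a minimal illustrative sketch of the generation phase described above, not the thesis implementation: a text-to-text model (here T5 via the Hugging Face Transformers library) produces synthetic variants of a labeled seed text, which are then added to the few-shot training set. The checkpoint name, the "paraphrase:" prompt format, and the sampling settings are assumptions for illustration; the thesis uses its own fine-tuned models and schemes.

from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "t5-base"  # assumption: the thesis would use a fine-tuned checkpoint here
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def augment(text: str, n_samples: int = 3) -> list[str]:
    """Generate n_samples synthetic variants of `text` (hypothetical prompt format)."""
    inputs = tokenizer("paraphrase: " + text, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        do_sample=True,           # sampling yields diverse synthetic examples
        top_p=0.95,
        num_return_sequences=n_samples,
        max_new_tokens=64,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# The synthetic texts inherit the seed text's label and enlarge the training set.
augmented = augment("The battery life of this laptop is excellent.")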

The experimental results demonstrate the effectiveness of the proposed methods. In zero-shot learning, we thoroughly compare the unified framework against three benchmarks and carefully examine each individual module by replacing it with an alternative. We also conduct a detailed analysis of multiple factors that could affect performance. In DA for few-shot learning, we show that our approach consistently outperforms state-of-the-art DA baselines on both topic and sentiment text classification.

Convocation Year

2022

Convocation Season

Fall
