Document Type


Degree Name

Master of Science (MSc)



Program Name/Specialization

Integrative Biology


Faculty of Science

First Advisor

Gabriel Moreno-Hagelsieb

Advisor Role

Thesis Supervisor


The vast increase in the number of sequenced genomes has irreversibly changed the landscape of the biological sciences and has spawned the current post-genomic era of research. Genomic data have illuminated many of the adaptation and survival strategies that link species to their habitats. Moreover, the analysis of prokaryotic genomic sequences is indispensable for understanding the mechanisms of bacterial pathogens and for subsequently developing effective diagnostics, drugs, and vaccines. Computational strategies for the annotation of genomic sequences are driven by the inference of function from reference genomes. However, the effectiveness of such methods is bounded by the limited diversity of known genomes. Although metagenomes can ease this limitation by offering access to previously inaccessible organisms, harnessing metagenomic data comes with its own collection of challenges. Because the sequenced environmental fragments of metagenomes do not equate to discrete and fully intact genomes, the orthologous relationships required for functional inference cannot be established by conventional means. Furthermore, the current surge in metagenomic data sets requires compression strategies that can effectively accommodate large data sets composed of multiple sequences and a greater proportion of auxiliary data, such as sequence headers. While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of nucleotide sequence data remains important because it reduces disk traffic and thereby speeds up search and retrieval operations. To address the issues of inference and orthology, a novel protocol was developed for the prediction of functional interactions that supports data sources that lack information about orthologous relationships.
To address the issue of database inundation, a compression protocol was designed that can differentiate between sequence data and auxiliary data, thereby reconciling sequence-specific and general-purpose compression strategies. Resolving these and other challenges extends the potential utility of the emerging field of metagenomics.

Convocation Year


Included in

Genomics Commons