Document Type

Master Thesis

Degree Name

Master of Science


Department of Biological Sciences

First Advisor

Aisling O'Driscoll


Multiple Sequence Alignment (MSA) of DNA and protein sequences is one of the most essential techniques in molecular biology, computational biology and bioinformatics. It aids the identification and prediction of three-dimensional structures, primary functions and evolutionary relatedness amongst groups of species, organisms and genes. Since the completion of the Human Genome Project, and with the advent of sequencing initiatives such as the Genome 10K project, the rate of genome sequencing has increased exponentially, producing vast amounts of DNA and protein sequence data. MSA algorithms, when applied to such data, can identify common homology, structure and function, aiding disease recognition and drug discovery and improving overall knowledge of genomes and proteins. MSA analysis of such large sequence data sets is therefore fundamental to facilitating future medical discoveries.

Clustal Omega is viewed by many as a leading multiple sequence alignment algorithm, aligning more than 190,000 sequences on a single processor in a few hours, with high quality and faster processing times than its predecessors. However, despite these capabilities, the scale of sequence data now being produced requires an algorithm that can align vast quantities of sequences, spanning multiple disks, in a cost-effective and timely manner. “Big data” techniques suited to this task, while very much in their infancy, are currently employed in the technology sector to process and analyse large data sets generated from social media software.

This thesis proposes a distributed and parallelised solution for Clustal Omega using the Hadoop/MapReduce “big data” paradigm. A detailed design for the pairwise alignment component is provided. Given that Hadoop is currently able to process a petabyte (PB) of data in 16.25 hours and a terabyte (TB) in 62 seconds, the proposed system will be able to realise cost-effective and timely large-scale alignment of PB-sized sequence data sets.
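
To illustrate why pairwise alignment lends itself to the MapReduce model, the following minimal Python sketch treats every sequence pair as an independent map task and gathers the scores for each sequence in a reduce step. It is an illustration only, not the design set out in the thesis: a plain Needleman-Wunsch global alignment score stands in for whichever pairwise measure the proposed system uses, the map, shuffle and reduce phases run in a single process rather than on Hadoop, and the function names (nw_score, map_pair, reduce_row, pairwise_scores) and scoring parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping two rows in memory.

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def map_pair(pair):
    """Map step: score one sequence pair and emit (key, value) records."""
    (id_a, seq_a), (id_b, seq_b) = pair
    score = nw_score(seq_a, seq_b)
    # Emit under both ids so that each matrix row can be reduced separately.
    yield id_a, (id_b, score)
    yield id_b, (id_a, score)


def reduce_row(seq_id, values):
    """Reduce step: collect all scores for one sequence into a matrix row."""
    return seq_id, dict(values)


def pairwise_scores(sequences):
    """Run map, shuffle and reduce in-process to mimic a MapReduce job."""
    shuffled = defaultdict(list)
    for pair in combinations(sorted(sequences.items()), 2):
        for key, value in map_pair(pair):
            shuffled[key].append(value)  # shuffle: group values by key
    return dict(reduce_row(key, values) for key, values in shuffled.items())


if __name__ == "__main__":
    demo = {"s1": "ACGT", "s2": "ACGT", "s3": "AGT"}
    for seq_id, row in sorted(pairwise_scores(demo).items()):
        print(seq_id, row)
```

Because no pair depends on the result of any other, the map tasks can be spread across as many nodes as are available. The number of pairs grows quadratically with the number of sequences, which is why this stage is a natural candidate for distribution.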


Project Thesis in partial fulfilment for the degree of Masters in Computational Biology 2012
