Georgios Galanos, "Hardware acceleration of genome assembly algorithms", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2021
https://doi.org/10.26233/heallink.tuc.90191
Genome assembly is a field of bioinformatics concerned with taking small fragments of genetic material and putting them back together, by various methods, in order to reconstruct the original sequence from which the DNA originated. Because DNA input datasets vary in size and are in most cases very large, it is important to implement functions and algorithms that speed up these processes and significantly reduce their time and space complexity. The Reads Matching Filter (RMF), which I implemented and present in this diploma thesis, is one such process, and it plays a preprocessing role in the overall genome assembly process. The RMF takes the input dataset, which contains the genetic material separated into reads, one per line, and performs a matching process among the reads in order to find unused redundancy. Once the matching process has completed successfully, the unused redundancy is discarded from the dataset, and the reads that remain as the output of the algorithm are called intermediate contigs. The final output file containing these intermediate contigs has fewer reads than the input dataset, each at least as long as the input reads, but without the unused redundancy, so the overall dataset size is smaller. Exploiting this result, the genome assembly process takes a smaller dataset as input and consequently gains a benefit in execution time. The above algorithm was implemented both as a software-only design and as a software-hardware design on a Field Programmable Gate Array (FPGA) in order to accelerate execution. The outputs of my design and the original input dataset are given as input to the Velvet genome assembler, which is based on the manipulation of de Bruijn graphs, via the removal of errors and the simplification of repeated regions, in order to perform the assembly and produce the output sequences.
The overall design, including the genome assembly processing, achieved a speedup on the order of 2x to 6x, with results of good quality across the two methods.
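As an illustration of the kind of preprocessing the abstract describes, the following is a minimal software sketch of a read-merging filter. It is not the thesis implementation: the abstract does not specify the matching rule, so the sketch assumes exact suffix-prefix overlaps of at least `min_overlap` bases and removal of reads fully contained in other reads. The names `overlap_len`, `reads_matching_filter`, and `min_overlap` are hypothetical.

```python
def overlap_len(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no overlap of at least `min_overlap` characters exists.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def reads_matching_filter(reads, min_overlap=3):
    """Greedy sketch (assumed rule, not the thesis algorithm).

    Repeatedly drops duplicate and contained reads, then merges the pair
    with the longest exact overlap, until no pair overlaps any more.
    The surviving strings play the role of "intermediate contigs".
    """
    contigs = list(reads)
    merged = True
    while merged:
        merged = False
        # Remove exact duplicates while keeping first-seen order.
        contigs = list(dict.fromkeys(contigs))
        # Remove reads that occur entirely inside another read.
        contigs = [
            c for i, c in enumerate(contigs)
            if not any(i != j and c in d for j, d in enumerate(contigs))
        ]
        # Find the ordered pair (i, j) with the longest overlap.
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_len(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k:
            # Merge the pair, writing the shared bases only once.
            joined = contigs[best_i] + contigs[best_j][best_k:]
            contigs = [
                c for idx, c in enumerate(contigs)
                if idx not in (best_i, best_j)
            ]
            contigs.append(joined)
            merged = True
    return contigs


if __name__ == "__main__":
    # Toy input: one read per entry, mirroring the one-read-per-line file.
    sample = ["ACGTAC", "TACGGA", "GGATTT", "ACGTAC", "CGTA"]
    print(reads_matching_filter(sample, min_overlap=3))
```

The quadratic pairwise comparison keeps the sketch short; a real dataset would need an index over prefixes or suffixes, and the all-pairs comparison is the kind of step that a hardware design could run in parallel.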