Bachelor and master thesis projects

Bachelor theses

The practical course "300224 PP Genome analysis of prokaryots - applied bioinformatics for the analysis of a genome sequence" is well suited for a bachelor project in the Biology curriculum.
For bachelor theses in other curricula please contact Harald Marx or Thomas Rattei.

Master theses

Our research projects continually generate new topics for master theses. We are happy to adapt the topic of a thesis to your experience and interests. Please contact Harald Marx or Thomas Rattei for more information.

THESIS PROJECT EXAMPLE 1: Extraction of prokaryotic phenotypes from literature

The accessibility of almost complete genome sequences of uncultivable microbial species from metagenomes and from direct sequencing of clinical isolates necessitates computational methods that predict microbial phenotypes solely from genomic data. Phenotypic traits of microbes can be very diverse. They range from morphological and physiological traits to specific molecular or metabolic capabilities. The broad evolutionary diversity of microbial traits is a substantial challenge for computational methods. Our group has developed a machine learning approach (PICA; Feldbauer et al., 2015) based on comparative genomics, which has so far been trained successfully for the prediction of about 20 traits.

The task of this thesis is to provide more and better training data for machine learning of microbial phenotypes. We need to automatically extract links between microbial species names and phenotype descriptions from open-access scientific literature. This problem is challenging due to the ambiguity of scientific terms and species names in free text. We will therefore utilize and adapt established approaches for term disambiguation and entity recognition in text.
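
As a rough illustration of the extraction task, the sketch below links species names to phenotype terms by sentence-level co-occurrence. The two small dictionaries, the handling of abbreviated genus names and the function names are assumptions made for this example and not part of an existing pipeline; a real solution would rely on curated resources and established entity recognition tools.

```python
import re

# Hypothetical mini-dictionaries; a real project would use curated resources.
SPECIES = ["Escherichia coli", "Bacillus subtilis"]
PHENOTYPES = ["gram-negative", "gram-positive", "motile", "anaerobic"]


def species_patterns(names):
    """Map each canonical name to a regex matching full and abbreviated forms."""
    patterns = {}
    for name in names:
        genus, epithet = name.split()
        # Matches "Escherichia coli" as well as "E. coli".
        regex = rf"\b(?:{re.escape(genus)}|{genus[0]}\.)\s+{re.escape(epithet)}\b"
        patterns[name] = re.compile(regex)
    return patterns


def extract_links(text):
    """Return sorted (species, phenotype) pairs that co-occur in a sentence."""
    patterns = species_patterns(SPECIES)
    links = set()
    # Naive sentence split: punctuation, whitespace, then a capital letter.
    for sentence in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text):
        found = [name for name, rx in patterns.items() if rx.search(sentence)]
        lowered = sentence.lower()
        for term in PHENOTYPES:
            # Lookarounds keep "motile" from matching inside "non-motile".
            if re.search(rf"(?<![\w-]){re.escape(term)}(?![\w-])", lowered):
                links.update((name, term) for name in found)
    return sorted(links)
```

Calling `extract_links` on a short abstract returns canonical species names paired with the dictionary terms found in the same sentence. Co-occurrence alone cannot handle negation or ambiguous names, which is exactly where disambiguation methods would have to take over.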

Solving this problem in a master thesis will allow you to make a major contribution to one of the hottest research areas in microbial computational genomics. You should have a good background in computational science, bioinformatics and life science. You should be interested in programming, text mining and machine learning, as well as in microbiology and microbial ecology. The thesis project will provide you with substantial training in these fields and allow you to develop your own ideas and concepts within the frame of the project.

Contact: Thomas Rattei

THESIS PROJECT EXAMPLE 2: Supervised machine learning for phenotype prediction

About 100 years after the first use of antibiotic drugs, infectious diseases have not yet been eradicated. Bacterial infections still rank high in recent World Health Organization statistics on the global burden of disease and on mortality. Although most bacterial infections are easily cured in modern healthcare, epidemic outbreaks are still possible and multi-resistant pathogens are an increasing problem, especially in hospitals. Besides developing novel options for treatment and prevention, we also need to improve the monitoring and diagnostics of human pathogens. Owing to rapidly improving techniques and decreasing costs for DNA sequencing, genome-based bacterial diagnostics is currently changing from science fiction to a real option. Rapid and precise sequencing of bacterial genomes would not only allow for better diagnostics and risk assessment, but would also enable the development of personalized treatments.

Genome-based bacterial diagnostics fundamentally challenges the current generation of bioinformatic methods for genome analysis. New concepts need to be established for the prediction of complex phenotypic traits, such as virulence. For our web-based tool EffectiveDB we have developed genome-based models for the prediction of intact and functional virulence factors: the Type III, Type IV and Type VI protein secretion systems. These models work well for most bacteria, but make unreasonable predictions for several species. In this project we will extend genome-based models for the prediction of virulence factors and incorporate expert knowledge into the models to improve their predictive performance.

Approaching this problem in a master thesis will give you practical insight and experience in comparative genomics of important human and plant pathogens. You should have a good background in computational science, bioinformatics and life science. You should be interested in programming and machine learning, as well as in microbiology, molecular biology and microbial ecology. The thesis project will provide you with substantial training in these fields and allow you to develop your own ideas and concepts within the frame of the project.

Contact: Thomas Rattei

THESIS PROJECT EXAMPLE 3: Rapid phenotype prediction for metagenomes

The investigation of microbial communities, organismal communities inhabiting all ecological niches on Earth, has in recent years been strongly facilitated by the rapid development of experimental, sequencing and data analysis methods. Novel experimental approaches and binning methods in metagenomics make the semi-automatic reconstruction of near-complete genomes of uncultivable bacteria possible. Such genome-centric metagenomics approaches are now used in different areas of life science, e.g. in medicine, microbiology and microbial ecology. User-friendly, efficient and powerful computational tools are needed for the analysis of metagenomic data.

In this project we will implement a novel, web-based platform for the automatic analysis of draft genomes from metagenomes. It should allow users to quickly analyze thousands of genomes, including the prediction of phenotypic traits. Besides the analysis of user-defined data we will also provide pre-calculated predictions for all publicly available microbial genomes. Components of the web platform already exist in our group, such as phenotype models (PICA) and an internal database of publicly available complete genome sequences. The project will therefore focus on the conceptual design of the web platform, a prototype implementation and performance testing.

Implementing the web platform in this master thesis will allow you to create a highly important and so far missing tool for microbial computational genomics. You should have a good background in computational science, bioinformatics and life science. You should be interested in programming, web frameworks and databases, as well as in microbiology and microbial ecology. The thesis project will provide you with substantial training in these fields and allow you to develop your own ideas and concepts within the frame of the project.

Contact: Thomas Rattei

THESIS PROJECT EXAMPLE 4: Prediction of peptides in viral polyproteins

Different lineages of viruses encode polyproteins in their genomes. These are long proteins that consist of different functional units. To become active, the polyprotein is cleaved by host or viral proteases into segments of biochemically active peptides. The computational prediction of peptides in polyproteins is so far very limited. Only one prediction tool, covering a few lineages of human viruses, has been developed (VIPR; unpublished). A broadly applicable prediction tool would, however, be extremely valuable for comparative genomics of viruses, such as in our Virus Orthologous Groups (VOGDB).

The aim of this thesis project is to utilize the rapidly growing number of completely sequenced virus genomes. We want to analyze large datasets of polyproteins to identify cleavage patterns and other characteristics of polyproteins. This information should then be used in a machine learning approach that predicts cleavage sites in viral polyproteins. Mass spectrometry datasets will be used as independent test data for evaluation and validation of the new approach.
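
One possible first step is to turn annotated cleavage sites into fixed-length sequence windows that a learning method can consume. The sketch below assumes that each cleavage position is given as the 0-based index of the residue directly after the cut; the window size, the padding character and the function names are illustrative choices, not a prescribed design.

```python
from collections import Counter


def site_windows(sequence, cut_positions, flank=4, pad="-"):
    """Return one window of 2*flank residues centred on each cleavage site.

    A cut position p denotes the bond between sequence[p-1] and sequence[p].
    """
    # Padding lets sites near the termini yield full-length windows.
    padded = pad * flank + sequence + pad * flank
    windows = []
    for p in cut_positions:
        centre = p + flank  # index of the first residue after the cut
        windows.append(padded[centre - flank:centre + flank])
    return windows


def position_frequencies(windows):
    """Relative residue frequency for every window column."""
    if not windows:
        return []
    n = len(windows)
    return [
        {aa: count / n for aa, count in Counter(column).items()}
        for column in zip(*windows)
    ]
```

Applied to many polyproteins, the per-column frequencies give a simple position-specific profile of the residues around known cuts. The same windows, taken at non-cleaved positions as negatives, could serve as input to a supervised classifier.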

Approaching this problem will give you practical insight and experience in genomics of viruses. You should have a good background in computational science, bioinformatics and life science. You should be interested in programming and machine learning, as well as in microbiology, molecular biology and microbial ecology. The thesis project will provide you with substantial training in these fields and allow you to develop your own ideas and concepts within the frame of the project.

Contact: Thomas Rattei

THESIS PROJECT EXAMPLE 5: Proteogenomic search space construction to explore microbial peptidomes

Antibiotic-resistant bacteria, so-called superbugs, threaten to kill almost 10 million people worldwide in 2050. In infectious disease, transmissible superbugs make use of various strategies to invade and colonize a niche in one of the host’s inherent microbiomes. To discover novel drugs, it is paramount to fully understand those invasive mechanisms, but also putative microbiome defenses that ward off pathogens. Key but poorly understood players in these processes are bioactive peptides, which display antimicrobial, signaling, and regulatory properties.
Recent efforts in metagenomics provide a first glimpse into the genetic composition and complexity of the microbial peptidome. To advance in-depth characterization, mass spectrometry (MS)-based proteomics offers orthogonal evidence in the hunt for elusive genes encoding bioactive peptides. Thus, proteogenomic approaches leverage genomic and proteomic data to improve the ongoing structural genome annotation.

However, searching MS data against genome-scale databases poses a computational challenge to common search engines, greatly reducing identification specificity. To alleviate this issue, the project goal is to build a database from large-scale ‘ome sources to identify most entities in a biological sample, striking a balance between search space completeness and complexity. This entails i) implementing efficient search data structures, ii) inferring a probabilistic model for peptide detectability, and iii) developing an MS-centric clustering algorithm.
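
To make the idea of a search data structure more concrete, the following sketch stores peptides from an in-silico tryptic digest sorted by monoisotopic mass, so that candidates within a precursor tolerance can be found by binary search. The cleavage rule (after K or R, ignoring the proline exception and missed cleavages), the class and function names and the default tolerance are simplifying assumptions, not the design of the project.

```python
import bisect
import re

# Monoisotopic residue masses in Dalton.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein, min_len=5):
    """Cleave after K or R (simplified rule, no missed cleavages)."""
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+$", protein)
    return [p for p in pieces if len(p) >= min_len]


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


class MassIndex:
    """Distinct peptides kept sorted by mass for range queries."""

    def __init__(self, proteins):
        peptides = {p for prot in proteins for p in tryptic_peptides(prot)}
        self._entries = sorted((peptide_mass(p), p) for p in peptides)
        self._masses = [mass for mass, _ in self._entries]

    def query(self, mass, tol_ppm=10.0):
        """Return peptides whose mass lies within +/- tol_ppm of mass."""
        delta = mass * tol_ppm / 1e6
        lo = bisect.bisect_left(self._masses, mass - delta)
        hi = bisect.bisect_right(self._masses, mass + delta)
        return [pep for _, pep in self._entries[lo:hi]]
```

A query costs two binary searches regardless of database size, but the index itself grows with every added source. Isobaric peptides, for instance those differing only in leucine and isoleucine, are all returned by a mass query and have to be resolved by fragment-level evidence.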

The ideal candidate is highly motivated and has a background in computational science or bioinformatics. Strong programming skills in Java are required to succeed in this project. We provide a dynamic work environment in an exciting and emerging research area. Please send your curriculum vitae and a brief statement of future career goals to Harald Marx.

THESIS PROJECT EXAMPLE 6: A community-driven database to annotate and curate the peptidome

The central dogma in molecular biology comprises the core ‘omes, namely the genome, transcriptome and proteome. As of late, the new kid on the block is the peptidome, akin to the proteome, capturing peptide sequences shorter than 100 amino acids. In the peptidome, bioactive peptides play a critical role with varying functions across the tree of life, e.g. in the gastrointestinal, cardiovascular, endocrine, nervous, and immune systems. However, genes encoding bioactive peptides do not exhibit the common gene structure signals that are crucial to gene predictors and consequently to protein sequence database construction. Hence, most protein sequence databases contain only anecdotal information on bioactive peptides, reflecting a minuscule part of the peptidome. Nascent efforts in proteogenomics help to improve gene predictions, but database consortia have not yet implemented appropriate pipelines to incorporate mass spectrometry (MS) data. Another limiting factor is the specialist-driven curation process needed to produce high-quality databases, which delays the release of research results into the public domain.

To address these issues, the project goal is to construct an easy-to-use, community-driven peptidome database. The database backend will be a pipeline that manages data quality, storage, and retrieval. A stringent quality control system will ensure incorporation of bona fide proteogenomic search results. The web interface will allow simple user registration, data curation, search, and submission.

The ideal candidate is highly motivated, versatile and has a background in computational science or bioinformatics. The project requires Java programming skills and working knowledge of PHP, MySQL, and web development (Bootstrap). We provide a dynamic work environment in an exciting and emerging research area. Please send your curriculum vitae and a brief statement of future career goals to Harald Marx.

THESIS PROJECT EXAMPLE 7: A classifier to assess spectrum quality in mass spectrometry-based proteomics

Mass spectrometry (MS)-based proteomics is a powerful method to analyze the proteome and peptidome. A typical large-scale, high-throughput MS experiment results in millions of spectra that vary greatly in information content due to manual and automatic experimental parameters, such as ion fill time, fragmentation method, collision energy, and transient time, among others. This makes analyte identification a computational challenge. To ease spectrum interpretation, pre-processing steps such as charge state deconvolution, deisotoping, and signal-to-noise filtering simplify the spectrum representation. In the subsequent identification step, a search algorithm matches theoretical spectra from a protein sequence database to the experimental spectra.

Even though common search engines implement approaches to control false positive identifications, most do not control spectrum quality in the above steps. In this project we will build a binary classifier to assess spectrum quality prior to sequence assignment, ultimately improving search results. The ready availability of data sets of large synthetic peptide libraries from various mass spectrometry platforms allows us to explore ubiquitous spectrum features that correlate with quality.
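
The idea of a binary quality classifier can be sketched with a few generic spectrum descriptors and a logistic regression trained by gradient descent. The three features, the learning settings and all names below are assumptions chosen for illustration; which features actually correlate with quality is the open question of the project, and in practice features would be standardized before training.

```python
import math


def spectrum_features(peaks):
    """Describe a spectrum given as a list of (m/z, intensity) pairs."""
    intensities = sorted((i for _, i in peaks), reverse=True)
    total = sum(intensities)
    if total == 0:
        return [0.0, 0.0, 0.0]
    # Share of the signal carried by the five most intense peaks.
    top5_fraction = sum(intensities[:5]) / total
    return [math.log1p(len(intensities)), math.log1p(total), top5_fraction]


def predict_proba(weights, x):
    """Probability of the positive class; the bias is the last weight."""
    z = sum(w * v for w, v in zip(weights, x)) + weights[-1]
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, rate=0.1, epochs=2000):
    """Fit weights (bias last) by batch gradient descent on the log-loss."""
    dim = len(features[0])
    weights = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in zip(features, labels):
            error = predict_proba(weights, x) - y
            for j, value in enumerate(x):
                grad[j] += error * value
            grad[dim] += error
        weights = [w - rate * g / len(features) for w, g in zip(weights, grad)]
    return weights
```

Training labels could, for example, be derived from whether a spectrum of a synthetic peptide was confidently assigned to its known sequence. A spectrum would then be passed to the search engine only if `predict_proba` exceeds a chosen threshold.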

The ideal candidate is highly motivated and has a background in computational science or bioinformatics. The project requires Java programming skills and a basic understanding of mass spectrometry. This is a great opportunity to learn the intricate analytical details of MS-based proteomics. We provide a dynamic work environment in an exciting and emerging research area. Please send your curriculum vitae and a brief statement of future career goals to Harald Marx.