Combinatorial methods for analysis and compression of biological sequences (CMACBioSeq)

La ricercatrice studierà nuovi algoritmi e strutture, dati che permetteranno di gestire le grandi quantità di dati provenienti dalle moderne tecniche di sequenziamento del DNA, facendo uso della combinatoria delle parole e della teoria dell'informazione.
Il principale obiettivo sarà quello di confrontare collezioni di dati genetici e l'individuazione di anomalie genetiche nelle sequenze di DNA, al fine di determinare, per esempio, la presenza di eventuali malattie.

PI: Giovanna Rosone

Dipartimento: Dipartimento di Informatica
Data inizio: 23/09/2015
Data fine: 23/09/2018 (prorogato di 26 mesi: 23/11/2020)
Durata: 36 mesi
Costo del progetto: 440.310 €
Finanziamento ministeriale: UNIPA 48.763,68 € - UNIPI 371.546,32 €

 

Abstract

Storing, processing, querying and analyzing a large amount of data is a significant task that holds many obstacles and challenges. Programs for compressing, indexing and analyzing of texts are available. However, the problem of compression and computational analysis for large datasets, such as those generated by current and future sequencing technologies where data in the terabyte range is conceivable, is still a difficult issue: the analysis of the large amount of generated data is a bottleneck for many investigators.

The aim of my proposed project is to develop research in the area of Data Compression and Sequence Analysis (both for the theoretical models and applications in specific fields) by using methods from Combinatorics on Words and by developing algorithms and data structures for the treatment of strings, mainly datasets from Next-generation Sequencing (NGS) technologies or third generation sequencing technologies.
The close relationship between these two areas will add and will provide a fundamental contribution in the advancement of knowledge in both research areas.
We plan to introduce new approaches for the compression of short nucleotide sequences from the NGS technologies, together with additional metadata, such as the quality scores required to assess the reliability of the sequencing process, that guarantee the preservation of the salient dataset characteristics when a lossy approach is used.
We also plan to introduce new strategies for the analysis of biological sequences, also in succinct form by allowing the random access over compressed data, whose applications can be, but not limited to, de novo sequence assembly, comparison between genome sequences and reference genomic sequences, determining the presence or absence of genomic variant mutations, genome phylogeny problems, etc.

We are going to apply our methodologies to the sequences in Genome Trio Project. In this project, the idea is to compare the genome of a child to the genome its parents to search for the relatively small number of “de novo mutations” that have newly arisen in the child’s DNA and are not present in either parents. This is starting to be used to diagnose genetic disorders in newborn infants.

Moreover, we are going to apply them for comparing a set of DNA sequences derived from healthy cells with that from DNA sequences derived from no-healthy cells from the same individual in order to find genetic disorders. In particular, we are interested to cancer data from the Cancer Genome Atlas dataset.

Our goal will be achieved by using tools from combinatorics on words and from information theory. Particular attention will be devoted to an extension of the Burrows-Wheeler Transform to a multiset of texts. Our purposes will take further advantage of deep understanding of the combinatorial properties underlying of these transformations, but also of the development of succinct data structures and space-time-efficient implementations, such as implementations that use the external memory and the parallelization. This goal will be obtained by directing our research towards compression techniques and compressed indexes.

This research project takes also advantage of our collaboration with a team of researchers from the Computational Biology Group at Illumina Cambridge Ltd (UK), led by A. J. Cox. Illumina is the market leaders in NGS technology.

We believe that the methods here introduced here in the biological field can be of considerable interest especially for its possible applications in several fields such as linguistics, network security, etc. For instance, the intrusion detection problem in network security or the text categorization problems in computational linguistics.

Ultima modifica: Mar 23 Mar 2021 - 18:11

Questo sito utilizza solo cookie tecnici, propri e di terze parti, per il corretto funzionamento delle pagine web e per il miglioramento dei servizi. Se vuoi saperne di più, consulta l'informativa