PhD in Information Technology
Research Area: Cross-Sectoral Course

GENOMIC COMPUTING

SALA SEMINARI, DEIB, March 26 - April 10, 2019

Marco Masseroli
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano



INTRODUCTION

This course provides an introduction to genomic computing, i.e. the application of computer engineering and mathematics to genomics, with an emphasis on concrete technologies and hands-on practice. It is intended for first-year PhD students who will focus their research on bioinformatics, but it can also be followed by PhD students and researchers of DEIB, MAT and other departments who wish to broaden their knowledge.

The course will start by familiarizing students with the foundational molecular biology concepts (encoding of information in DNA sequences and its functional use at the genomic, epigenomic, transcriptomic and proteomic levels), as well as with DNA/RNA sequencing techniques, including Next Generation Sequencing (NGS). Then, current sequencing data standards and tools for their (pre)processing and visualization will be presented, with emphasis on the computational and mathematical issues related to their management and analysis. The last part of the course is devoted to the (epi)genomic analysis of multiple heterogeneous NGS data using different instruments, including the GenoMetric Query Language (GMQL), recently developed in our group for the analysis of NGS big data.

The course evaluation will be of a practical nature: students will be asked to replicate offline, on a small sample of genomic data, some of the work performed in class, and to present the results of their work to the course instructors.


Polimi course website

How to reach DEIB Seminar room



Schedule


Lecturers

(in order of appearance)
  • Marco Masseroli, Ph.D.

    Genomic Computing Group - Dipartimento di Elettronica, Informazione e Bioingegneria - Politecnico di Milano

  • Mattia Pelizzola, Ph.D.

    Center for Genomic Science - Istituto Italiano di Tecnologia (IIT), Milano



LECTURES AND PRACTICES

  • Tuesday March 26, morning (4 h and 15 m) - Seminar Room


    9:00-9:15
    Marco Masseroli: Course introduction, motivations, outline.


    9:15-11:15
    Marco Masseroli: Genomic research concepts and application perspective (2 h lecture)
    Overview of biomolecular genomic concepts and their bio-technological applications with emphasis on aspects relevant for genomic data management and analysis.


    11:30-13:30
    Mattia Pelizzola: NGS technologies (2 h lecture)
    Introduction to the principles behind Next Generation Sequencing (NGS) technologies, description of the different types of data produced, and illustration of some NGS applications. An overview of several international projects that use NGS technologies is also provided.



  • Wednesday March 27, morning (2 h) - PT2 Room


    10:00-12:00
    Marco Masseroli: Genomic Data Model (GDM) and GenoMetric Query Language (GMQL) as research enabler to discover genome properties (2 h lecture)
    Next Generation Sequencing (NGS) technologies are producing data at increasing speed and decreasing cost; managing NGS data is therefore quickly becoming the biggest "big data" problem of mankind. In this context, GMQL provides a next-generation query language for querying NGS big data. It operates on aligned genomic data in a variety of data formats, and it provides parallel computation in the cloud, thereby supporting queries over thousands of samples, such as the ones provided by the ENCODE and TCGA consortia. The language's name indicates its ability to compute massive operations on genomic regions, which take into account the relative positions of regions and the distances between them. In this lecture, principles and concepts of GMQL will be illustrated, together with some examples of its use.
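
    To give a feel for what a distance-aware ("genometric") operation on regions means, the following is a minimal plain-Python sketch. It is not GMQL and does not use any GMQL syntax or API: the `Region` type, the `distance` and `join_within` helpers, and the sample coordinates are all hypothetical names and values chosen for illustration, assuming 0-based, end-exclusive coordinates.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    """A genomic interval (hypothetical type; 0-based start, exclusive end)."""
    chrom: str
    start: int
    end: int


def distance(a: Region, b: Region) -> int:
    """Gap in bases between two regions on the same chromosome.

    Returns 0 when the regions overlap or are adjacent.
    """
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def join_within(
    anchors: List[Region], others: List[Region], max_dist: int
) -> List[Tuple[Region, Region]]:
    """Pair each anchor with every region on the same chromosome
    whose gap from the anchor is at most max_dist bases.

    Quadratic scan for clarity; a real engine would index or sort regions.
    """
    pairs = []
    for a in anchors:
        for b in others:
            if a.chrom == b.chrom and distance(a, b) <= max_dist:
                pairs.append((a, b))
    return pairs


# Made-up example data: two "gene" regions and three "peak" regions.
genes = [Region("chr1", 1000, 2000), Region("chr2", 500, 900)]
peaks = [
    Region("chr1", 2100, 2200),
    Region("chr1", 9000, 9100),
    Region("chr2", 100, 200),
]
pairs = join_within(genes, peaks, max_dist=300)
for gene, peak in pairs:
    print(gene, peak, distance(gene, peak))
```

    The nested loop is deliberately naive. It only illustrates the idea that the join condition depends on region positions and distances rather than on equality of attribute values.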



  • Wednesday March 27, afternoon (1 h and 30 m) - PT2 Room


    15:00-16:30
    Marco Masseroli: GDM and GMQL examples (1 h and 30 m practice)
    This practice will be devoted to learning GMQL by applying it to find answers to some practical use cases.



  • Thursday March 28, morning (3 h) - Seminar Room


    9:30-10:30
    Mattia Pelizzola: NGS raw data processing and formats (1 h lecture)
    The data processing workflow typically followed after the production of FASTQ files by Next Generation Sequencing machines will be described. The FASTQ file format and a collection of tools to perform quality control and to preprocess FASTQ files of short reads will be illustrated. The most common alignment programs used to align short reads to reference genomes will be introduced, and the resulting SAM/BAM file formats will be described. Finally, other data formats used in NGS data analysis, such as BED files, and tools to manipulate them (SAMtools and BEDtools) will be discussed.
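
    As a small illustration of the FASTQ format and of a quality-control step, the sketch below parses four-line FASTQ records and filters reads by mean base quality. It assumes the common Phred+33 quality encoding and unwrapped (single-line) sequences; the function names, the threshold and the two sample reads are hypothetical and only serve the example. It is not a replacement for dedicated quality-control tools.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream.

    Stops at end of file (or at the first empty header line).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Decode a quality string into per-base Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


def mean_quality(qual: str) -> float:
    """Average Phred score of a read; 0.0 for an empty read."""
    scores = phred_scores(qual)
    return sum(scores) / len(scores) if scores else 0.0


# Two made-up reads: 'I' encodes Phred 40, '!' encodes Phred 0.
sample = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
records = list(read_fastq(io.StringIO(sample)))

# Keep only reads whose mean quality reaches a (hypothetical) threshold of 20.
kept = [read_id for read_id, _, qual in records if mean_quality(qual) >= 20]
print(kept)
```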


    10:30-12:30
    Mattia Pelizzola: ChIP-seq data analysis pipeline (2 h lecture)
    ChIP-seq is a methodology commonly used to identify binding locations (peaks) for DNA-binding proteins, typically transcription factors. This lecture will cover the most common steps in the analysis of these data, including read pre-processing and alignment, peak identification and, possibly, analysis of the DNA motifs in the peak regions. Finally, the resulting alignment and peak data will be visualized and inspected using the UCSC Genome Browser.
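
    The notion of a "peak" can be illustrated with a deliberately simplified sketch: compute per-base read coverage from aligned read intervals, then report maximal runs of positions whose coverage reaches a threshold. This is only a toy under stated assumptions (a single chromosome, 0-based end-exclusive intervals, a fixed hypothetical threshold, made-up reads); actual peak callers rely on statistical models and control samples, which are not reproduced here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, end exclusive


def coverage(reads: List[Interval], length: int) -> List[int]:
    """Per-base read depth over [0, length), computed with a difference array."""
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    depth = []
    running = 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def call_peaks(depth: List[int], min_depth: int) -> List[Interval]:
    """Return maximal runs of positions with depth >= min_depth."""
    peaks = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos  # a run begins
        elif d < min_depth and start is not None:
            peaks.append((start, pos))  # the run ended just before pos
            start = None
    if start is not None:
        peaks.append((start, len(depth)))
    return peaks


# Made-up aligned reads on a 20-base toy chromosome.
reads = [(2, 8), (4, 10), (5, 9), (15, 18)]
depth = coverage(reads, length=20)
peaks = call_peaks(depth, min_depth=2)
print(peaks)
```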


  • Monday April 1, morning (3 h) - PT2 Room


    9:30-12:30
    Marco Masseroli: DNA viewing and qualitative analysis in a genomic context using the UCSC Genome Browser and the Integrated Genome Browser (1 h lecture + 2 h practice)
    The UCSC Genome Browser and the Integrated Genome Browser (IGB) provide a system to navigate along the genomic DNA of many organisms, with extensive annotation ‘tracks’ for various data types, including known genes, predicted genes, SNPs, comparative multi-species analysis data and deep-sequencing experimental results. This introductory session focuses on the foundations of the organization and display of genomic data. It will comprise both a theoretical lesson and a practical tutorial, and it will offer an overview of specific tasks.



  • Tuesday April 2, morning (3 h) - Seminar Room


    9:30-12:30
    Mattia Pelizzola: R/Bioconductor framework for data analysis (1 h lecture + 2 h practice)
    The R/Bioconductor framework and scripting language will be illustrated through practical examples, showing their potential for genomic data processing and analysis.


  • Wednesday April 3, morning (2 h) - Seminar Room


    10:00-12:00
    Marco Masseroli: Tools for the integrative analysis of heterogeneous (epi)genomic data features: GMQL I (1 h lecture + 1 h practice)
    In this practical session, the use of some main tools for the analysis of region-based heterogeneous (epi)genomic features will be illustrated. In particular, it will be shown how to leverage the recently developed GMQL to extract information from big data repositories of experimental genomic results. Hands-on practice will allow direct experience in the use of GMQL on ENCODE and TCGA data. Specifically, the students will be challenged to define genomic regulatory regions and to implement rules that define their interactions in different biological conditions.



  • Monday April 8, morning (2 h and 30 m) - PT2 Room


    9:30-12:00
    Marco Masseroli: Tools for the integrative analysis of heterogeneous (epi)genomic data features: GMQL II (2 h and 30 m practice)
    Hands-on practice will allow direct experience in the use of GMQL on ENCODE and TCGA data. Specifically, the students will be challenged to define genomic regulatory regions and to implement rules that define their interactions in different biological conditions.



  • Tuesday April 9, morning (3 h) - Seminar Room


    9:30-12:30
    Mattia Pelizzola: Integrative analysis of heterogeneous epigenomics data with R/Bioconductor (1 h lecture + 2 h practice)
    The methylPipe and compEpiTools Bioconductor packages will be used for the integrative analysis of epigenomics data. The students will be guided on (i) how to identify a set of differentially methylated regions, and (ii) how to explore patterns of additional data types within those regions, through the generation of integrative heatmaps combining annotation and epigenomics data tracks. The analyses will also briefly touch upon enhancer identification and the annotation of genomic regions. NGS data will be derived from the Roadmap Epigenomics and ENCODE projects, two major American projects focused on profiling multiple omics in a few widely used cell types.


  • Wednesday April 10, morning (3 h and 15 m) - PT2 Room


    9:30-10:30
    Marco Masseroli: Searching similar (epi)genomics feature patterns in multiple genome browser tracks: SimSearch (30 m lecture + 30 m practice)
    Genome browsers (e.g., IGB) are tools to visually compare and browse through multiple genomic feature samples aligned to the same genome reference and laid out on different genome browser tracks. They allow the visual inspection and identification of interesting “patterns” on multiple tracks, i.e. sets of genomic regions/peaks at given distances from each other in different genome browser tracks. Nevertheless, once such patterns are visually identified in a genome section, searching for their occurrences along the whole genome is a complex computational task currently only supported by SimSearch. SimSearch is an IGB plugin implementing an optimized “similarity”-based pattern-search algorithm able to efficiently find, within a large set of genomic data, genomic region sets which are similar to a given pattern. The plugin allows intuitive user interaction both in the visual selection of an interesting pattern on the loaded IGB tracks and in the visualization, within IGB, of the region sets along the entire genome that the SimSearch algorithm finds to be similar to the selected pattern. The concepts on which the algorithm is based are presented, together with some results that demonstrate the efficiency and accuracy of the method, as well as the utility of the tool.
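
    The kind of search described above can be pictured with a toy sketch. It is not the SimSearch algorithm, whose details are not given here; it is a simple tolerance-based match under strong assumptions: each track is reduced to a sorted list of peak positions on one chromosome, a pattern is a list of (track, offset) pairs relative to its first element, and an occurrence is accepted when every element has a peak within `tol` bases of its expected position. Track names, positions and the tolerance are made up.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Pattern = List[Tuple[str, int]]  # (track name, offset from the pattern origin)


def has_peak_near(positions: List[int], target: int, tol: int) -> bool:
    """True if the sorted list has a position within tol of target."""
    i = bisect_left(positions, target - tol)
    return i < len(positions) and positions[i] <= target + tol


def find_occurrences(
    tracks: Dict[str, List[int]], pattern: Pattern, tol: int
) -> List[int]:
    """Return the origins at which every pattern element is matched.

    The first pattern element acts as the anchor: each of its peaks is
    tried as a candidate occurrence and the remaining elements are checked.
    """
    anchor_track, anchor_offset = pattern[0]
    hits = []
    for pos in tracks[anchor_track]:
        origin = pos - anchor_offset
        if all(
            has_peak_near(tracks[name], origin + offset, tol)
            for name, offset in pattern[1:]
        ):
            hits.append(origin)
    return hits


# Hypothetical tracks (sorted peak positions) and a two-element pattern:
# a peak on track "A" followed by a peak on track "B" about 250 bases later.
tracks = {"A": [100, 5000, 9000], "B": [350, 5260, 9900]}
pattern = [("A", 0), ("B", 250)]
hits = find_occurrences(tracks, pattern, tol=20)
print(hits)
```

    Anchoring on one track keeps the search linear in the number of anchor peaks, with a binary search per remaining element. A similarity-based method would instead score partial or approximate matches rather than accept or reject them outright.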


    10:30-12:30
    Marco Masseroli: Tools for the integrative analysis of heterogeneous (epi)genomic data features: GMQL III (2 h practice)
    Hands-on practice will allow direct experience in the use of GMQL on ENCODE and TCGA data. Specifically, the students will be challenged to define genomic regulatory regions and to implement rules that define their interactions in different biological conditions.


    12:30-12:45
    Marco Masseroli: Course conclusion and evaluation test description




© Marco Masseroli, Ph.D. E-mail marco DOT masseroli AT polimi DOT it