Dott. Eugenio Mazzone

PhD thesis

Development of bioinformatic tools for canine cancer biomarker discovery and validation

Scientific Background

Cancer is a genetic disease caused by multiple, sequential mutations in critical genes of neoplastic cells. These mutations alter the pathways that regulate cell growth and communication with the surrounding tissue, leading to uncontrolled cellular proliferation. Correctly interpreting this large volume of DNA information is crucial for understanding and treating cancer. Next-Generation Sequencing (NGS) has become the gold standard for oncology research because it describes the genomic makeup of cancers accurately, generating millions of bases in a short time and at a reasonable cost. Cancer is the leading cause of death in dogs aged over 10 years, affecting 50% of them, with one in four ultimately succumbing to the disease. Canine and human tumors share clinical, histological, and epidemiological characteristics as well as molecular and genetic profiles, which makes dogs an excellent model for studying human disease and vice versa. The spontaneous occurrence of tumors in dogs, driven by genomic changes affecting corresponding biological pathways, further strengthens this model. It is therefore vital to identify new biomarkers and therapeutic targets that can improve the clinical management of the most frequent tumors. This research benefits dogs and also provides valuable insights into comparative oncology for human medicine.


Goals

The primary objective of this project is to construct a comprehensive genetic description of canine tumors by creating a canine atlas of somatic mutations. The results will be categorized into three main types of genetic aberrations: small mutations (SNVs and indels), copy number variations (large amplifications and deletions), and other structural variants (such as kataegis, translocations, etc.). This collection of genetic aberrations will be the basis for a variety of statistical analyses aimed at identifying mutation co-occurrence and cancer type-specific genetic markers, establishing a repository of wild-type references for further studies, and developing a canine cancer target gene panel for routine clinical use. Moreover, a new version of canine dbSNP will be developed, incorporating the germline mutations identified in the large cohort of normal canine tissues. Because both resulting datasets (the mutation atlas and the germline repository) are built with careful attention to standardization, future research efforts will be able to extend them. The aim is for both to remain state-of-the-art data collections that grow over time, gaining additional histotypes and increasing confidence in the identified mutations.


Methods

Data Preparation: The analysis starts by gathering raw sequencing reads from the Sequence Read Archive (SRA). Only tumors that underwent whole exome sequencing (WES) and have matched normal counterparts are included. The collected data will then be pre-processed using GATK (Genome Analysis Toolkit). This step reduces the chance of reporting sequencing artifacts in the final data and recalibrates the base quality scores of the sequenced bases, primarily considering guanine-cytosine content. After collection and preprocessing, both tumor and matched normal samples undergo a stringent quality control (QC) check of coverage.
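The coverage QC step can be illustrated with a minimal Python sketch. It assumes per-base depths are already available as plain lists (for example, exported from a depth-reporting tool); the function names and the thresholds (mean depth of 30x, 80% of bases at 20x or more) are hypothetical placeholders, not the project's actual criteria.

```python
from statistics import mean
from typing import Dict, Sequence


def coverage_summary(depths: Sequence[int], min_depth: int = 20) -> Dict[str, float]:
    """Summarize per-base depths: mean depth and fraction of bases >= min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return {"mean_depth": mean(depths), "frac_covered": covered / len(depths)}


def passes_coverage_qc(
    tumor: Sequence[int],
    normal: Sequence[int],
    min_mean: float = 30.0,   # hypothetical threshold
    min_frac: float = 0.8,    # hypothetical threshold
    min_depth: int = 20,      # hypothetical threshold
) -> bool:
    """A tumor/normal pair passes only if BOTH samples meet the thresholds."""
    for depths in (tumor, normal):
        summary = coverage_summary(depths, min_depth)
        if summary["mean_depth"] < min_mean or summary["frac_covered"] < min_frac:
            return False
    return True


# Toy data: ten target bases per sample.
tumor_depths = [40] * 9 + [5]
normal_depths = [35] * 10
shallow_normal = [10] * 10

pair_ok = passes_coverage_qc(tumor_depths, normal_depths)
pair_bad = passes_coverage_qc(tumor_depths, shallow_normal)
print(pair_ok, pair_bad)
```

Requiring both samples of a pair to pass reflects the matched tumor/normal design: a poorly covered normal sample would weaken somatic calling even if the tumor is well covered.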

SNVs and Indels: Single nucleotide variants (SNVs) and insertions and deletions (indels) will be identified using three variant callers (Mutect2, Strelka2, and VarScan 2). A majority voting procedure will then retain only the mutations called by at least two of the three callers. This step aims to improve the quality of the retrieved mutations and to mitigate biases introduced by each caller's statistical model and implementation choices.
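The two-out-of-three majority vote can be sketched as follows. This is a minimal illustration that assumes each caller's output has already been parsed and normalized into (chromosome, position, reference allele, alternate allele) tuples; the toy variants and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# A variant is identified by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def majority_vote(
    calls: Dict[str, Iterable[Variant]], min_callers: int = 2
) -> Set[Variant]:
    """Keep variants reported by at least `min_callers` distinct callers."""
    counts: Counter = Counter()
    for variants in calls.values():
        # set() ensures a caller votes at most once per variant.
        counts.update(set(variants))
    return {variant for variant, n in counts.items() if n >= min_callers}


# Toy call sets (illustrative only, not real data).
calls = {
    "Mutect2": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
    "Strelka2": [("chr1", 100, "A", "T"), ("chr3", 300, "C", "CT")],
    "VarScan2": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
}

consensus = majority_vote(calls)
print(sorted(consensus))
```

In practice the callers may represent the same indel differently, so variants would need to be left-aligned and normalized before the comparison; that step is outside this sketch.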

Copy Number Variations (CNV): To increase the scope and utility of the atlas, we will also incorporate large-scale mutations. These aberrations involve amplifications, deletions, or translocations of large genomic regions (at least 1 kb). They are known to lead to gene overexpression, complete gene deletions, or loss of heterozygosity (e.g., loss of one copy of a chromosome), which leaves the remaining copy more susceptible to the effects of mutations.
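As a rough illustration of how copy number segments might be labeled, the sketch below classifies segments by their log2 tumor/normal copy ratio and applies the 1 kb minimum length mentioned above. The `Segment` type, the function name, and the cut-offs (one copy gained or lost in a diploid genome, with no correction for tumor purity) are assumptions made for this example, not the project's method.

```python
import math
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int         # 0-based start coordinate
    end: int           # end coordinate (exclusive)
    log2_ratio: float  # log2(tumor copy number / normal copy number)


# Assumed cut-offs for a diploid genome: 3 copies vs 2, and 1 copy vs 2.
GAIN_THRESHOLD = math.log2(3 / 2)
LOSS_THRESHOLD = math.log2(1 / 2)
MIN_LENGTH = 1000  # 1 kb, as stated in the text


def classify_segment(seg: Segment) -> str:
    """Label a segment as amplification, deletion, neutral, or too_short."""
    if seg.end - seg.start < MIN_LENGTH:
        return "too_short"
    if seg.log2_ratio >= GAIN_THRESHOLD:
        return "amplification"
    if seg.log2_ratio <= LOSS_THRESHOLD:
        return "deletion"
    return "neutral"


# Toy segments (illustrative only).
segments = [
    Segment("chr1", 0, 5000, 0.9),
    Segment("chr1", 5000, 10000, -1.2),
    Segment("chr2", 100, 600, 2.0),
    Segment("chr3", 0, 2000, 0.1),
]

labels = [classify_segment(s) for s in segments]
print(labels)
```

Real tumor samples contain normal cells, which pulls observed ratios toward zero, so fixed thresholds like these would typically be adjusted for purity and ploidy.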

Clinical Genomic Landscape: To extract clinically relevant insights from the mutational data, statistical analyses will be conducted. These will include ANOVA, Spearman correlation, and univariate and multivariate methods. Moreover, machine learning (ML) models will be developed, including decision trees (CART), shallow neural networks (NN), and unsupervised learning techniques such as k-means clustering. Together, these methods will support a comprehensive analysis of the mutational data and help identify potential biomarkers or therapeutic targets for the disease under investigation.
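Of the statistics listed above, Spearman correlation is simple enough to show in full. The sketch below computes it in plain Python as the Pearson correlation of average ranks, which handles ties; the example variables (mutation count and age) are hypothetical and chosen only to illustrate a monotonic relationship.

```python
import math
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: mutation count per tumor vs. age in years.
mutation_counts = [12, 30, 45, 80, 150]
ages = [4, 6, 8, 10, 13]

rho = spearman(mutation_counts, ages)
print(rho)
```

Because it works on ranks, the coefficient captures any monotonic association rather than only linear ones, which suits skewed quantities such as mutation counts.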


References

Amin, S. B. et al. Comparative Molecular Life History of Spontaneous Canine and Human Gliomas. Cancer Cell 37, 243-257.e7 (2020).

Megquier, K. et al. Comparative Genomics Reveals Shared Mutational Landscape in Canine Hemangiosarcoma and Human Angiosarcoma. Molecular Cancer Research 17, 2410–2421 (2019).

Bailey, M. H. et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 173, 371-385.e18 (2018).

Greener, J. G., Kandathil, S. M., Moffat, L. & Jones, D. T. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 23, 40–55 (2022).

McKenna, A. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://doi.org/10.48550/arXiv.1303.3997 (2013).

Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).

Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31, 213–219 (2013).

Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods 15, 591–594 (2018).

Koboldt, D. C. et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).

Bai, B. et al. DoGSD: the dog and wolf genome SNP database. Nucleic Acids Res 43, D777–D783 (2015).

Research activities

Co-supervisor

Piero Fariselli

Publications:

  • Mazzone E, Moreau Y, Fariselli P, Raimondi D. Nonlinear data fusion over Entity-Relation graphs for Drug-Target Interaction prediction. Bioinformatics. 2023 Jun 1;39(6):btad348. doi: 10.1093/bioinformatics/btad348. PMID: 37255310; PMCID: PMC10265447.

Professional Training:

  • Occam Workshop NG/2.
  • Next-Generation Sequencing Data Analysis: A Practical Introduction, ECSEQ, Munich (DE).
  • Festival of Genomics & Biodata, London (GB), 2024.
  • Mandatory Courses.

Oral Presentation:

  • Another Brick in the Genomic Wall of Canine Lymphoma, Game of Research, Grugliasco, 2023.

Last update: 13/02/2024 16:58