
Irene Azzali

PhD thesis

Enhancing Machine Learning Approaches for BioData Mining in Veterinary Sciences

Vector-borne diseases (VBDs) are illnesses caused by parasites, bacteria or viruses that are transmitted by vectors such as fleas, ticks and mosquitoes. These diseases have a considerable impact on the economy and on public health, so the development of effective prevention measures is essential. Besides treating and controlling the disease itself, VBDs can be managed through vector control. Effective vector control measures allow for an immediate interruption of disease transmission and help in disease eradication. Vector abundance modelling is one way to inform vector control: predicting vector occurrence and understanding the link between vectors and their habitat are crucial for the early warning of pathogen circulation and for guiding vector control strategies.

All modelling techniques learn from data, but two main approaches can be distinguished by their purpose. Statistical modelling aims to discover and interpret the relationships among the included variables. Machine learning (ML) modelling is instead a set of techniques that learn the model that best fits the data in order to make predictions on unseen data. Here the focus is on the accuracy of the forecast, at the expense of the readability of the interactions between variables. Research is moving towards ML models, which can better capture the complex interplay between environment, climate and vectors. Compared with statistical methods, ML relies far less on a priori assumptions about the data.

From an application perspective, models are most useful when they can be used predictively and not simply as a means of exploring putative relationships among variables in the data. This consideration encourages further exploration of ML methods for vector abundance prediction. However, the main drawback of ML methods is their lack of readability and interpretability. This drawback is overcome by an ML technique called Genetic Programming (GP). GP automatically discovers solutions to problems by applying search principles analogous to those of natural evolution. Compared with classical ML methods, GP has the advantage of producing models that the user can read, which allows for interpretation as long as the model formula is not too complex.
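To make the evolutionary search concrete, the following is a minimal sketch of GP applied to a toy symbolic regression task. It is not the implementation used in the thesis: the tuple encoding of trees, the function set, the terminal set, the depth cap and all parameter values are assumptions chosen only for illustration. It shows how candidate formulas are generated at random, selected by fitness, recombined and mutated, and how the best individual remains a readable formula.

```python
import operator
import random

# Illustrative function and terminal sets (assumptions, not from the thesis).
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x", 1.0, 2.0]
MAX_DEPTH = 6  # hypothetical cap on tree depth, used to limit bloat


def random_tree(rng, levels):
    """Grow a random expression tree encoded as nested tuples."""
    if levels == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    symbol = rng.choice(sorted(FUNCTIONS))
    return (symbol, random_tree(rng, levels - 1), random_tree(rng, levels - 1))


def tree_depth(tree):
    if isinstance(tree, tuple):
        return 1 + max(tree_depth(tree[1]), tree_depth(tree[2]))
    return 0


def evaluate(tree, x):
    """Evaluate a tree at input x."""
    if isinstance(tree, tuple):
        symbol, left, right = tree
        return FUNCTIONS[symbol](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree


def to_formula(tree):
    """Render a tree as a human-readable formula."""
    if isinstance(tree, tuple):
        return f"({to_formula(tree[1])} {tree[0]} {to_formula(tree[2])})"
    return str(tree)


def fitness(tree, data):
    """Mean squared error on (x, y) pairs; lower is better."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)


def random_subtree(rng, tree):
    while isinstance(tree, tuple) and rng.random() < 0.7:
        tree = rng.choice(tree[1:])
    return tree


def crossover(rng, a, b):
    """Replace a random point of a with a random subtree of b."""
    if not isinstance(a, tuple) or rng.random() < 0.3:
        return random_subtree(rng, b)
    symbol, left, right = a
    if rng.random() < 0.5:
        return (symbol, crossover(rng, left, b), right)
    return (symbol, left, crossover(rng, right, b))


def tournament(rng, population, data, k=3):
    """Pick the fittest of k randomly drawn individuals."""
    return min(rng.sample(population, k), key=lambda t: fitness(t, data))


def evolve(data, generations=20, pop_size=50, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    best = min(population, key=lambda t: fitness(t, data))
    for _ in range(generations):
        next_population = [best]  # elitism: the best individual always survives
        while len(next_population) < pop_size:
            parent_a = tournament(rng, population, data)
            parent_b = tournament(rng, population, data)
            child = crossover(rng, parent_a, parent_b)
            if rng.random() < 0.2:  # subtree mutation with a fresh random tree
                child = crossover(rng, child, random_tree(rng, 2))
            if tree_depth(child) > MAX_DEPTH:
                child = parent_a  # reject oversized offspring
            next_population.append(child)
        population = next_population
        best = min(population, key=lambda t: fitness(t, data))
    return best


if __name__ == "__main__":
    # Toy target y = x*x + x on a small grid (made-up data).
    data = [(x / 2, (x / 2) ** 2 + x / 2) for x in range(-4, 5)]
    best = evolve(data)
    print(to_formula(best), fitness(best, data))
```

Because the returned individual is an expression tree, it can be printed as a formula and inspected directly, which is the readability property discussed above. In practice a dedicated GP library would normally replace this hand-written loop.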

In this work we focus on the most studied disease vectors: mosquitoes. The overall goal is to investigate the innovative use of GP in the field of vector abundance prediction. In particular, we start from the available counts of Culex pipiens collected as part of the surveillance programme promoted by IPLA [3] in the Piedmont region. We expect GP to improve at least on the predictive accuracy of statistical modelling, which limits the interactions between predictors. The structure of GP also has the advantage of being easy to modify to satisfy the requirements of a problem. This feature is of great importance, since we can enhance the classical GP approach to deal properly with the different data formats found in vector abundance datasets. Time series, in fact, are frequently present in these datasets: the seasonal dynamics of a vector population is likely to be associated with the fluctuation of climatic and weather variables over time, which are recorded as time series. At present, classical statistical and ML modelling methods are mostly unable to handle a time series as an ordered sequence of values. The development of an innovative GP approach in the context of epidemiological modelling will also be the springboard for a new ML technique that deals with vectors. Through this work we want to make the veterinary community aware of GP as a modelling approach that combines three important goals of ecological modelling: readability of the model, the ability to capture any functional form describing the data, and the ability to adapt to the data without distorting their nature.
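The point about treating a time series as an ordered sequence can be illustrated with a small sketch. The primitives `window_mean` and `elementwise`, the `candidate_model` formula and all the numbers below are hypothetical and are not taken from the thesis; they only show why a single summary value loses information that an order-aware primitive keeps.

```python
from statistics import mean


def window_mean(series, start, end):
    """Mean of the slice series[start:end]; the result depends on value order."""
    return mean(series[start:end])


def elementwise(op, a, b):
    """Combine two equal-length series position by position."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return [op(x, y) for x, y in zip(a, b)]


def candidate_model(temperature, rainfall):
    """Hypothetical readable formula built from vector primitives:
    mean over the last three time steps of temperature * rainfall."""
    product = elementwise(lambda t, r: t * r, temperature, rainfall)
    return window_mean(product, -3, None)


# Made-up weekly temperatures: same values, opposite order in time.
warming = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
cooling = list(reversed(warming))
rainfall = [0.0, 1.0, 0.0, 2.0, 1.0, 0.0]

# A global summary cannot tell the two series apart...
same_summary = mean(warming) == mean(cooling)
# ...whereas a primitive that respects the time order can.
recent_warming = window_mean(warming, -2, None)
recent_cooling = window_mean(cooling, -2, None)
```

Here the two temperature series have the same overall mean, yet their most recent values differ, so a model that receives the whole series as a terminal can distinguish a warming period from a cooling one.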

Research activities


The aim of this project is to investigate Machine Learning (ML) algorithms on different veterinary science problems, given their ability to handle huge amounts of complex and heterogeneous data. In particular, we focus on Genetic Programming (GP), which is able to work on raw data and to provide readable models. We are going to develop a new GP approach that can properly handle time series data and thus serve as a powerful tool for monitoring vector dynamics.

Last update: 16/02/2022 13:30
