Statistics thesis topics

We provide a dynamic and encouraging environment for postgraduate research in Statistics. Our focus on innovative and impactful research promotes the exploration of ideas that significantly enhance understanding across various disciplines, from theoretical foundations to real-world applications. Students are encouraged to engage with complex statistical challenges, fostering skills that are essential for addressing contemporary issues in science, industry, and beyond.

Research Groups and Collaborations

Our department hosts various research groups that foster collaboration and interdisciplinary work. Collaborations with industry partners and other academic institutions also enhance the relevance and application of our research.

Sample thesis topics

The following are sample topics available for prospective postgraduate research students in our department. These examples showcase the diversity of research areas you can explore, but they are not an exhaustive list. Most supervisors are eager to discuss potential projects that may not be featured here. Funded projects come with specific project funding, while financial support for other research initiatives is typically awarded on a competitive basis.

Centres for Doctoral Training (CDTs)

ExaGEO

ExaGEO (Exascale computing for Earth, Environmental, and Sustainability Solutions) trains the next generation of Earth and environmental scientists to harness the power of exascale computing. The application process and potential projects will be linked from here soon.

The Leverhulme Programme for Doctoral Training in Ecological Data Science

Our Leverhulme Programme for Doctoral Training in Ecological Data Science will train a new generation of data scientists. Students will be equipped with the skills to tackle the most pressing environmental challenges of our time and be trained in the latest data science techniques. Application information can be found on their Apply page.

IAPETUS2

Named after the ancient ocean whose closure brought England and Scotland together, IAPETUS2 is a partnership that joins the leading research universities of Durham, Heriot-Watt, Glasgow, Newcastle, St Andrews and Stirling, together with the British Antarctic Survey, the British Geological Survey and the Centre for Ecology & Hydrology, in a united approach to doctoral research and to training the next generation of leaders in the science of the natural environment. Application information can be found on their Apply page.

DiveIn (EPSRC CDT in Diversity-Led, Mission-Driven Research)

The DiveIn CDT prioritises diversity, creating an inclusive space for varied talents to produce transformative interdisciplinary research in Net Zero, AI and Big Data, Technology Touching Life, Future Telecoms, Quantum Technologies and more. Application information can be found on their Apply page.

NETGAIN

NETGAIN (developing the science and practice of nature markets for a net positive future) is a collaborative CDT held between the Universities of St Andrews, Aberdeen, Durham and Glasgow. NETGAIN will train a new generation of multidisciplinary scientist-practitioners to transform the landscape of nature markets, ensuring effective, evidence-based solutions to the world’s most urgent environmental challenges. The application process and potential project will be linked from here soon.

Innovation in Analysis and Inference

Applied Probability and Stochastic Processes - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Evaluating probabilistic forecasts in high-dimensional settings (PhD)

Supervisors: Jethro Browell
Relevant research groups: Modelling in Space and Time, Computational Statistics, Applied Probability and Stochastic Processes

Many decisions are informed by forecasts, and almost all forecasts are uncertain to some degree. Probabilistic forecasts quantify uncertainty to help improve decision-making and are playing an important role in fields including weather forecasting, economics, energy, and public policy. Evaluating the quality of past forecasts is essential to give forecasters and forecast users confidence in their current predictions, and to compare the performance of forecasting systems.

While the principles of probabilistic forecast evaluation have been established over the past 15 years, most notably that of “sharpness subject to calibration/reliability”, we lack a complete toolkit for applying these principles in many situations, especially those that arise in high-dimensional settings. Furthermore, forecast evaluation must be interpretable by forecast users as well as expert forecasters, and assigning value to marginal improvements in forecast quality remains a challenge in many sectors.

This PhD will develop new statistical methods for probabilistic forecast evaluation considering some of the following issues:

  • Verifying probabilistic calibration conditional on relevant covariates
  • Skill scores for multivariate probabilistic forecasts where “ideal” performance is unknowable
  • Assigning value to marginal forecast improvement through the convolution of utility functions and Murphy diagrams
  • Development of the concept of “anticipated verification” and “predicting the uncertainty of future forecasts”
  • Decomposing forecast misspecification (e.g. into spatial and temporal components)
  • Evaluation of Conformal Predictions

Good knowledge of multivariate statistics is essential; prior knowledge of probabilistic forecasting and forecast evaluation would be an advantage.
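
As a minimal, self-contained illustration of the calibration principle mentioned above (not part of the project itself), the probability integral transform (PIT) of a probabilistic forecast should be uniform when the forecast is calibrated. The sketch below uses simulated data and illustrative parameter values to compare a calibrated Gaussian forecaster with an overconfident one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical setup: observations from N(0, 1); a calibrated forecaster
# issues N(0, 1), an overconfident one issues N(0, 0.5^2).
y = rng.normal(0.0, 1.0, size=5000)

# PIT: u = F(y) should be Uniform(0, 1) if the predictive
# distribution F is probabilistically calibrated.
pit_calibrated = stats.norm.cdf(y, loc=0.0, scale=1.0)
pit_overconfident = stats.norm.cdf(y, loc=0.0, scale=0.5)

# Kolmogorov-Smirnov distance from uniformity as a simple summary.
ks_cal = stats.kstest(pit_calibrated, "uniform").statistic
ks_over = stats.kstest(pit_overconfident, "uniform").statistic
print(f"KS vs uniform, calibrated:    {ks_cal:.3f}")
print(f"KS vs uniform, overconfident: {ks_over:.3f}")
```

A miscalibrated forecaster shows up immediately as a non-uniform PIT; the open problems listed above concern extending checks of this kind to conditional and multivariate settings.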

Adaptive probabilistic forecasting (PhD)

Supervisors: Jethro Browell
Relevant research groups: Modelling in Space and Time, Computational Statistics, Applied Probability and Stochastic Processes

Data-driven predictive models depend on the representativeness of the data used in model selection and estimation. However, many processes change over time, meaning that recent data are more representative than old data. In this situation, predictive models should track these changes, which is the aim of “online” or “adaptive” algorithms. Furthermore, many users of forecasts require probabilistic forecasts, which quantify uncertainty, to inform their decision-making. Existing adaptive methods such as Recursive Least Squares and the Kalman Filter have been very successful for adaptive point forecasting, but adaptive probabilistic forecasting has received little attention. This PhD will develop methods for adaptive probabilistic forecasting from a theoretical perspective, with a view to applying these methods to problems in at least one application area to be determined.

In the context of adaptive probabilistic forecasting, this PhD may consider:

  • Online estimation of Generalised Additive Models for Location Scale and Shape
  • Online/adaptive (multivariate) time series prediction
  • Online aggregation (of experts, or hierarchies)

A good knowledge of methods for time series analysis and regression is essential; familiarity with flexible regression (GAMs) and distributional regression (GAMLSS/quantile regression) would be an advantage.
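
To make the gap between point and probabilistic adaptation concrete, here is a small illustrative sketch (simulated data; the learning rate, drift and quantile level are arbitrary choices, not part of the project) that tracks a single quantile of a drifting process by stochastic subgradient steps on the pinball loss - about the simplest adaptive probabilistic estimator one can write down:

```python
import numpy as np

rng = np.random.default_rng(7)

def online_quantile(stream, tau, lr=0.05):
    """Track the tau-quantile of a (possibly drifting) stream by
    stochastic subgradient steps on the pinball (quantile) loss."""
    q, path = 0.0, []
    for y in stream:
        q += lr * (tau - (y < q))   # subgradient of the pinball loss
        path.append(q)
    return np.array(path)

# Illustrative drifting process: the mean jumps from 0 to 5 half-way.
n = 4000
y = np.concatenate([rng.normal(0, 1, n // 2), rng.normal(5, 1, n // 2)])
q90 = online_quantile(y, tau=0.9)

seg1 = q90[n // 2 - 200 : n // 2].mean()   # settled value, regime 1
seg2 = q90[-200:].mean()                   # settled value, regime 2
print(f"tracked 90% quantile, regime 1: {seg1:.2f} (target ~1.28)")
print(f"tracked 90% quantile, regime 2: {seg2:.2f} (target ~6.28)")
```

The tracker re-converges after the regime change without refitting anything; the research questions above concern doing this coherently for whole predictive distributions rather than a single quantile.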

Modality of mixtures of distributions (PhD)

Supervisors: Surajit Ray
Relevant research groups: Nonparametric and Semi-parametric Statistics, Applied Probability and Stochastic Processes, Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications

Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used to perform modelling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organising mixture components into homogeneous groups, but the results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results regarding the upper bound on the number of modes, but these too are limited to normal mixtures. In this project, we wish to explore the modality of non-normal mixture distributions and their application to real-life problems.
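
The flavour of these modality questions can be seen even in the well-understood normal case. The sketch below is purely illustrative: it counts modes numerically on a fine grid and recovers the classical result that an equal-weight mixture of two unit-variance normals is bimodal only when the means are more than two standard deviations apart. The project would ask analogous questions for non-normal components, where such clean thresholds are not available.

```python
import numpy as np
from scipy import stats

def count_modes(pdf, grid):
    """Count strict local maxima of a density evaluated on a fine grid."""
    d = pdf(grid)
    return int(((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])).sum())

def mix(sep):
    # Equal-weight mixture of two unit-variance normals, means +/- sep/2.
    return lambda x: 0.5 * (stats.norm.pdf(x, -sep / 2)
                            + stats.norm.pdf(x, sep / 2))

grid = np.linspace(-10, 10, 20001)
m_close = count_modes(mix(1.0), grid)   # means 1 sd apart: unimodal
m_far = count_modes(mix(4.0), grid)     # means 4 sd apart: bimodal
print(f"modes with separation 1 sd: {m_close}")
print(f"modes with separation 4 sd: {m_far}")
```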

Bayesian Modelling and Inference - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Modelling genetic variation (MSc/PhD)

Supervisors: Vincent Macaulay
Relevant research groups: Bayesian Modelling and Inference, Statistical Modelling for Biology, Genetics and *omics

Variation in the distribution of different DNA sequences across individuals has been shaped by many processes that can be modelled probabilistically, such as demographic factors (for example, prehistoric population movements) and natural selection. This project involves developing new techniques for teasing out information on those processes from the wealth of raw data now being generated by high-throughput genetic assays, and is likely to involve computationally intensive sampling techniques to approximate the posterior distribution of parameters of interest. The characterisation of the amount of population structure on different geographical scales will influence the design of experiments to identify the genetic variants that increase the risk of complex diseases, such as diabetes or heart disease.
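
As a toy example of the computationally intensive sampling techniques referred to above (entirely simulated data, and a deliberately minimal model that is not the project's), a random-walk Metropolis sampler can approximate the posterior of an allele frequency from carrier counts, and its output can be checked against the conjugate Beta answer:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: k carriers of a variant among n sampled individuals.
n, k = 200, 37

def log_post(p):
    # Binomial log-likelihood with a flat prior on the allele frequency p.
    if not 0.0 < p < 1.0:
        return -np.inf
    return k * np.log(p) + (n - k) * np.log(1.0 - p)

# Random-walk Metropolis: the simplest possible sampling-based
# approximation to the posterior distribution.
p, draws = 0.5, []
for _ in range(20000):
    prop = p + rng.normal(0, 0.05)
    if np.log(rng.uniform()) < log_post(prop) - log_post(p):
        p = prop
    draws.append(p)

posterior_mean = np.mean(draws[2000:])   # discard burn-in
exact = (k + 1) / (n + 2)                # Beta(k+1, n-k+1) posterior mean
print(f"MCMC posterior mean:  {posterior_mean:.3f}")
print(f"conjugate exact mean: {exact:.3f}")
```

Real applications replace the one-parameter binomial model with rich demographic models for which no conjugate answer exists, which is where sampling methods earn their keep.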

The evolution of shape (PhD)

Supervisors: Vincent Macaulay
Relevant research groups: Bayesian Modelling and Inference, Modelling in Space and Time, Statistical Modelling for Biology, Genetics and *omics

Shapes of objects change in time. Organisms evolve and in the process change form: humans and chimpanzees derive from some common ancestor presumably different from either in shape. Designed objects are no different: an Art Deco teapot from the 1920s might share some features with one from Ikea in 2010, but they are different. Mathematical models of evolution for certain data types, like the strings of As, Gs, Cs and Ts in our evolving DNA, are quite mature and allow us to learn about the relationships of the objects (their phylogeny or family tree), about the changes that happen to them in time (the evolutionary process) and about the ways objects were configured in the past (the ancestral states), by statistical techniques like phylogenetic analysis. Such techniques for shape data are still in their infancy. This project will develop novel statistical inference approaches (in a Bayesian context) for complex data objects, like functions, surfaces and shapes, using Gaussian-process models, with potential application in fields as diverse as language evolution, morphometrics and industrial design.
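
As a minimal sketch of the Gaussian-process building block mentioned above (all parameter values illustrative), one can draw smooth random functions from a squared-exponential covariance; priors of this general kind could be placed on an evolving outline or shape coordinate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Squared-exponential covariance on [0, 1]; the lengthscale and
# variance here are arbitrary illustrative choices.
t = np.linspace(0, 1, 100)
lengthscale, variance = 0.2, 1.0
K = variance * np.exp(-0.5 * ((t[:, None] - t[None, :]) / lengthscale) ** 2)
K += 1e-8 * np.eye(len(t))   # jitter keeps the covariance positive definite

# Three prior draws: smooth random functions.
draws = rng.multivariate_normal(np.zeros(len(t)), K, size=3)
print(draws.shape)
```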

New methods for analysis of migratory navigation (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Computational Statistics, Environmental, Ecological Sciences and Sustainability

Joint project with Dr Urška Demšar (University of St Andrews)

Migratory birds travel annually across vast expanses of oceans and continents to reach their destinations with incredible accuracy. How they are able to do this using only locally available cues is still not fully understood. Migratory navigation consists of two processes: birds either identify the direction in which to fly (compass orientation) or the location where they are at a specific moment in time (geographic positioning). One of the possible ways they do this is to use information from the Earth’s magnetic field, in so-called geomagnetic navigation (Mouritsen 2018). While there is substantial evidence (both physiological and behavioural) that they do sense the magnetic field (Deutschlander and Beason 2014), we still do not know exactly which components of the field they use for orientation or positioning. We also do not understand how rapid changes in the field affect movement behaviour.

There is a possibility that birds can sense these rapid, large changes and that this may affect their navigational process. To study this, we need to link accurate data on the Earth’s magnetic field with animal tracking data. This has only become possible very recently through new spatial data science advances: we developed the MagGeo tool, which links contemporaneous geomagnetic data from the Swarm satellites of the European Space Agency with animal tracking data (Benitez Paez et al. 2021).

Linking geomagnetic data to animal tracking data, however, creates a high-dimensional data set, which is difficult to explore. Typical analyses of contextual environmental information in ecology represent contextual variables as covariates in relatively simple statistical models (Brum Bastos et al. 2021), but this is not sufficient for studying detailed navigational behaviour. This project will analyse complex spatio-temporal data using computationally efficient statistical model-fitting approaches in a Bayesian context.

This project is fully based on open data to support reproducibility and open science. We will test our new methods by annotating publicly available bird tracking data (e.g. from repositories such as Movebank.org), using the open MagGeo tool and implementing our new methods as Free and Open Source Software (R/Python).

References

Benitez Paez F, Brum Bastos VdS, Beggan CD, Long JA and Demšar U, 2021. Fusion of wildlife tracking and satellite geomagnetic data for the study of animal migration. Movement Ecology, 9:31. https://doi.org/10.1186/s40462-021-00268-4

Brum Bastos VdS, Łos M, Long JA, Nelson T and Demšar U, 2021. Context-aware movement analysis in ecology: a systematic review. International Journal of Geographical Information Science. https://doi.org/10.1080/13658816.2021.1962528

Deutschlander ME and Beason RC, 2014. Avian navigation and geographic positioning. Journal of Field Ornithology, 85(2):111–133. https://doi.org/10.1111/jofo.12055

Integrated spatio-temporal modelling for environmental data (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Computational Statistics, Environmental, Ecological Sciences and Sustainability

(Jointly supervised by Peter Henrys, CEH)

The last decade has seen a proliferation of environmental data, with vast quantities of information available from various sources. This has been due to a number of different factors, including the advent of sensor technologies, the provision of remotely sensed data from both drones and satellites, and the explosion in citizen science initiatives. These data represent a step change in the resolution of available data across space and time: sensors can stream data at a resolution of seconds, whereas citizen science observations can number in the hundreds of thousands.

Over the same period, the resources available for traditional field surveys have decreased dramatically whilst logistical issues (such as access to sites) have increased. This has severely impacted the ability of field survey campaigns to collect data at high spatial and temporal resolutions. It is exactly this sort of information that is required to fit models that can quantify and predict the spread of invasive species, for example.

Whilst we have seen an explosion of data across various sources, there is no single source that provides both the spatial and temporal intensity that may be required when fitting complex spatio-temporal models (cf. the invasive species example); each has its own advantages in terms of information content. There is therefore potentially huge benefit in bringing together data from these different sources within a consistent framework, to exploit the benefits each offers and to understand processes at resolutions and scales that would otherwise be impossible to monitor.

Current approaches to combining data in this way are typically very bespoke and involve complex model structures that are not reusable outside of the particular application area. What is needed is an overarching generic methodological framework, and associated software solutions, to implement such analyses. Not only would such a framework provide the methodological basis for researchers to benefit from this big data revolution, it would also make it possible to turn such analyses from stand-alone research projects in their own right into more operational, standard analytical routines.

Finally, such dynamic, integrated analyses could feed back into data collection initiatives to ensure optimal allocation of effort for traditional surveys or optimal power management for sensor networks. The major step change is that this optimal allocation of effort is conditional on the other data that are available. So, for example, given the coverage and intensity of the citizen science data, where should we optimally send our paid surveyors? The idea is that information is collected at times and locations that provide the greatest benefit in understanding the underpinning stochastic processes. These two major issues - integrated analyses and adaptive sampling - ensure that environmental monitoring is fit for purpose and that scientists, policymakers and industry can benefit from the big data revolution.

This project will develop an integrated statistical modelling strategy that provides a single modelling framework for quantifying ecosystem goods and services while accounting for the fundamental differences between data streams. Data collected at different spatial resolutions can be used within the same model by projecting them into continuous space and back onto the landscape level of interest. As a result, decisions can be made at the relevant spatial scale, and uncertainty is propagated through the analysis, facilitating appropriate decision-making.

Statistical methodology for assessing the impacts of offshore renewable developments on marine wildlife (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Computational Statistics, Environmental, Ecological Sciences and Sustainability

(jointly supervised by Esther Jones and Adam Butler, BIOSS)

Assessing the impacts of offshore renewable developments on marine wildlife is a critical component of the consenting process. A NERC-funded project, ECOWINGS, will provide a step-change in analysing predator-prey dynamics in the marine environment, collecting data across trophic levels against a backdrop of developing wind farms and climate change. Aerial survey and GPS data from multiple species of seabirds will be collected contemporaneously alongside prey data available over the whole water column from an automated surface vehicle and underwater drone.

These methods of data collection will generate 3D space and time profiles of predators and prey, creating a rich source of information and enormous potential for modelling and interrogation. The data present a unique opportunity for experimental design across a dynamic and changing marine ecosystem, which is heavily influenced by local and global anthropogenic activities. However, these data have complex intrinsic spatio-temporal properties, which are challenging to analyse. Significant statistical methods development could be achieved using this system as a case study, contributing to the scientific knowledge base not only in offshore renewables but more generally in the many circumstances where patchy ecological spatio-temporal data are available. 

This PhD project will develop spatio-temporal modelling methodology that will allow users to analyse these exciting - and complex - data sets and help inform our understanding of the impact of offshore renewables on wildlife.

Bayesian variable selection for genetic and genomic studies (PhD)

Supervisors: Mayetri Gupta
Relevant research groups: Bayesian Modelling and Inference, Computational Statistics, Statistical Modelling for Biology, Genetics and *omics

An important issue in high-dimensional regression problems is the accurate and efficient estimation of models when, compared to the number of data points, a substantially larger number of potential predictors are present. Further complications arise with correlated predictors, leading to the breakdown of standard statistical models for inference, and with the uncertain definition of the outcome variable, which is often a varying composition of several different observable traits. Examples of such problems arise in many scenarios in genomics: in determining expression patterns of genes that may be responsible for a type of cancer, and in determining which genetic mutations lead to a higher risk of disease. This project involves developing broad and improved Bayesian methodologies for efficient inference in high-dimensional regression-type problems with complex multivariate outcomes, with a focus on genetic data applications.
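
A toy version of the variable-selection problem (simulated data; BIC weights used as a rough stand-in for a proper Bayesian treatment) illustrates the target quantity, the posterior inclusion probability of each predictor. Exhaustive enumeration is only feasible for a handful of predictors, which is precisely why high-dimensional settings demand new methodology:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: p candidate predictors, only two truly active.
n, p = 150, 6
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

def bic(subset):
    """BIC of an OLS fit with an intercept plus the given predictors."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ beta) ** 2)
    return n * np.log(rss / n) + Z.shape[1] * np.log(n)

# Enumerate all 2^p subsets; exp(-BIC/2) is a crude approximation to the
# marginal likelihood, giving rough posterior model weights under a
# uniform prior over models.
models = [m for k in range(p + 1) for m in itertools.combinations(range(p), k)]
b = np.array([bic(m) for m in models])
w = np.exp(-0.5 * (b - b.min()))
w /= w.sum()

# Posterior inclusion probability of each predictor.
incl = np.array([sum(wi for wi, m in zip(w, models) if j in m)
                 for j in range(p)])
print(np.round(incl, 3))
```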

The successful candidate should have a strong background in methodological and applied Statistics, expert skills in relevant statistical software or programming languages (such as R, C/C++/Python), and also have a deep interest in developing knowledge in cross-disciplinary topics in genomics. The candidate will be expected to consolidate and master an extensive range of topics in modern Statistical theory and applications during their PhD, including advanced Bayesian modelling and computation, latent variable models, machine learning, and methods for Big Data. The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.

Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis (PhD)

Supervisors: Mayetri Gupta
Relevant research groups: Bayesian Modelling and Inference, Computational Statistics, Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications

In recent years, many different computational methods have been established to analyse biological data, spanning DNA (genomics), RNA (transcriptomics), proteins (proteomics) and metabolites (metabolomics), the last of which captures more dynamic events. These methods have been refined by the advent of single-cell technology, where it is now possible to capture the transcriptomic profile of single cells, and spatial arrangements of cells from flow methods or imaging methods such as functional magnetic resonance imaging. At the same time, these OMICS data can be complemented with clinical data - measurements on patients such as age, smoking status, disease phenotype or drug treatment. It is an interesting and important open statistical question how to combine data from different “modalities” (such as transcriptome with clinical or imaging data) in a statistically valid way, to compare different datasets and make justifiable statistical inferences. In this PhD project, jointly supervised by Dr Thomas Otto and Prof. Stefan Siebert from the Institute of Infection, Immunity & Inflammation, you will explore how to combine different datasets using Bayesian latent variable modelling, focusing on clinical datasets from Rheumatoid Arthritis.

Funding Notes

The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.

Bayesian Mixture Models for Spatio-Temporal Data (PhD)

Supervisors: Craig Anderson
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Biostatistics, Epidemiology and Health Applications

The prevalence of disease is typically not constant across space; instead, the risk tends to vary from one region to another. Some of this variability may be down to environmental conditions, but much of it is driven by socio-economic differences between regions, with poorer regions tending to have worse health than wealthier regions. For example, within the Greater Glasgow and Clyde region, the World Health Organisation noted that life expectancy ranges from 54 in Calton to 82 in Lenzie, despite these areas being less than 10 miles apart. There is substantial value to health professionals and policymakers in identifying some of the causes behind these localised health inequalities.

Disease mapping is a field of statistical epidemiology which focuses on estimating the patterns of disease risk across a geographical region. The main goal of such mapping is typically to identify regions of high disease risk so that relevant public health interventions can be made. This project involves the development of statistical models which will enhance our understanding of regional differences in the risk of suffering from major diseases, by focusing on these localised health inequalities.

Standard Bayesian hierarchical models with a conditional autoregressive (CAR) prior are frequently used for risk estimation in this context, but these models assume a smooth risk surface, which is often not appropriate in practice. In reality, different regions may have vastly different risk profiles and require different data-generating functions as a result.

In this work we propose a mixture model based approach which allows different sub-populations to be represented by different underlying statistical distributions within a single modelling framework. By integrating CAR models into mixture models, researchers can simultaneously account for spatial dependencies and identify distinct disease patterns within subpopulations.
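
The mixture idea, stripped of its spatial component, can be sketched in a few lines. The example below is entirely simulated, and the relative risks, group sizes and absence of CAR dependence are illustrative simplifications: it fits a two-component Poisson mixture to areal disease counts by EM, recovering two distinct risk sub-populations of the kind the proposed models would identify while also accounting for spatial structure.

```python
import numpy as np
from scipy import stats
from scipy.special import expit

rng = np.random.default_rng(5)

# Hypothetical areal data: disease counts in 400 small areas with
# expected counts E, and two latent risk groups (relative risk 0.8
# for most areas, 2.0 for a deprived minority).
n = 400
E = rng.uniform(20, 80, n)
high = rng.uniform(size=n) < 0.25
y = rng.poisson(E * np.where(high, 2.0, 0.8))

# EM algorithm for a two-component Poisson mixture on relative risk.
pi, r = 0.5, np.array([0.5, 1.5])     # initial mixing weight and risks
for _ in range(200):
    # E-step: responsibility of component 1 for each area.
    l0 = stats.poisson.logpmf(y, E * r[0])
    l1 = stats.poisson.logpmf(y, E * r[1])
    g = expit(l1 + np.log(pi) - l0 - np.log(1 - pi))
    # M-step: reweighted estimates of the mixing weight and risks.
    pi = g.mean()
    r = np.array([np.sum((1 - g) * y) / np.sum((1 - g) * E),
                  np.sum(g * y) / np.sum(g * E)])

print(f"estimated mixing weight: {pi:.2f}")
print(f"estimated relative risks: {np.round(np.sort(r), 2)}")
```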

 

Computational Statistics - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Evaluating probabilistic forecasts in high-dimensional settings (PhD)

Supervisors: Jethro Browell
Relevant research groups: Modelling in Space and Time, Computational Statistics, Applied Probability and Stochastic Processes

Many decisions are informed by forecasts, and almost all forecasts are uncertain to some degree. Probabilistic forecasts quantify uncertainty to help improve decision-making and are playing an important role in fields including weather forecasting, economics, energy, and public policy. Evaluating the quality of past forecasts is essential to give forecasters and forecast users confidence in their current predictions, and to compare the performance of forecasting systems.

While the principles of probabilistic forecast evaluation have been established over the past 15 years, most notably that of “sharpness subject to calibration/reliability”, we lack a complete toolkit for applying these principles in many situations, especially those that arise in high-dimensional settings. Furthermore, forecast evaluation must be interpretable by forecast users as well as expert forecasters, and assigning value to marginal improvements in forecast quality remains a challenge in many sectors.

This PhD will develop new statistical methods for probabilistic forecast evaluation considering some of the following issues:

  • Verifying probabilistic calibration conditional on relevant covariates
  • Skill scores for multivariate probabilistic forecasts where “ideal” performance is unknowable
  • Assigning value to marginal forecast improvement through the convolution of utility functions and Murphy diagrams
  • Development of the concept of “anticipated verification” and “predicting the uncertainty of future forecasts”
  • Decomposing forecast misspecification (e.g. into spatial and temporal components)
  • Evaluation of Conformal Predictions

Good knowledge of multivariate statistics is essential; prior knowledge of probabilistic forecasting and forecast evaluation would be an advantage.

Adaptive probabilistic forecasting (PhD)

Supervisors: Jethro Browell
Relevant research groups: Modelling in Space and Time, Computational Statistics, Applied Probability and Stochastic Processes

Data-driven predictive models depend on the representativeness of the data used in model selection and estimation. However, many processes change over time, meaning that recent data are more representative than old data. In this situation, predictive models should track these changes, which is the aim of “online” or “adaptive” algorithms. Furthermore, many users of forecasts require probabilistic forecasts, which quantify uncertainty, to inform their decision-making. Existing adaptive methods such as Recursive Least Squares and the Kalman Filter have been very successful for adaptive point forecasting, but adaptive probabilistic forecasting has received little attention. This PhD will develop methods for adaptive probabilistic forecasting from a theoretical perspective, with a view to applying these methods to problems in at least one application area to be determined.

In the context of adaptive probabilistic forecasting, this PhD may consider:

  • Online estimation of Generalised Additive Models for Location Scale and Shape
  • Online/adaptive (multivariate) time series prediction
  • Online aggregation (of experts, or hierarchies)

A good knowledge of methods for time series analysis and regression is essential; familiarity with flexible regression (GAMs) and distributional regression (GAMLSS/quantile regression) would be an advantage.

New methods for analysis of migratory navigation (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Computational Statistics, Environmental, Ecological Sciences and Sustainability

Joint project with Dr Urška Demšar (University of St Andrews)

Migratory birds travel annually across vast expanses of oceans and continents to reach their destinations with incredible accuracy. How they are able to do this using only locally available cues is still not fully understood. Migratory navigation consists of two processes: birds either identify the direction in which to fly (compass orientation) or the location where they are at a specific moment in time (geographic positioning). One of the possible ways they do this is to use information from the Earth’s magnetic field, in so-called geomagnetic navigation (Mouritsen 2018). While there is substantial evidence (both physiological and behavioural) that they do sense the magnetic field (Deutschlander and Beason 2014), we still do not know exactly which components of the field they use for orientation or positioning. We also do not understand how rapid changes in the field affect movement behaviour.

There is a possibility that birds can sense these rapid, large changes and that this may affect their navigational process. To study this, we need to link accurate data on the Earth’s magnetic field with animal tracking data. This has only become possible very recently through new spatial data science advances: we developed the MagGeo tool, which links contemporaneous geomagnetic data from the Swarm satellites of the European Space Agency with animal tracking data (Benitez Paez et al. 2021).

Linking geomagnetic data to animal tracking data, however, creates a high-dimensional data set, which is difficult to explore. Typical analyses of contextual environmental information in ecology represent contextual variables as covariates in relatively simple statistical models (Brum Bastos et al. 2021), but this is not sufficient for studying detailed navigational behaviour. This project will analyse complex spatio-temporal data using computationally efficient statistical model-fitting approaches in a Bayesian context.

This project is fully based on open data to support reproducibility and open science. We will test our new methods by annotating publicly available bird tracking data (e.g. from repositories such as Movebank.org), using the open MagGeo tool and implementing our new methods as Free and Open Source Software (R/Python).

References

Benitez Paez F, Brum Bastos VdS, Beggan CD, Long JA and Demšar U, 2021. Fusion of wildlife tracking and satellite geomagnetic data for the study of animal migration. Movement Ecology, 9:31. https://doi.org/10.1186/s40462-021-00268-4

Brum Bastos VdS, Łos M, Long JA, Nelson T and Demšar U, 2021. Context-aware movement analysis in ecology: a systematic review. International Journal of Geographical Information Science. https://doi.org/10.1080/13658816.2021.1962528

Deutschlander ME and Beason RC, 2014. Avian navigation and geographic positioning. Journal of Field Ornithology, 85(2):111–133. https://doi.org/10.1111/jofo.12055

Mouritsen H, 2018. Long-distance navigation and magnetoreception in migratory animals. Nature, 558:50–59. https://doi.org/10.1038/s41586-018-0176-1

Integrated spatio-temporal modelling for environmental data (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Computational Statistics, Environmental, Ecological Sciences and Sustainability

(Jointly supervised by Peter Henrys, CEH)

The last decade has seen a proliferation of environmental data with vast quantities of information available from various sources. This has been due to a number of different factors including: the advent of sensor technologies; the provision of remotely sensed data from both drones and satellites; and the explosion in citizen science initiatives. These data represent a step change in the resolution of available data across space and time - sensors can be streaming data at a resolution of seconds whereas citizen science observations can be in the hundreds of thousands.  

Over the same period, the resources available for traditional field surveys have decreased dramatically whilst logistical issues (such as access to sites) have increased. This has severely impacted the ability of field survey campaigns to collect data at high spatial and temporal resolutions. It is exactly this sort of information that is required to fit models that can quantify and predict the spread of invasive species, for example.

Whilst we have seen an explosion of data across various sources, there is no single source that provides both the spatial and temporal intensity that may be required when fitting complex spatio-temporal models (cf. the invasive species example) - each has its own advantages and benefits in terms of information content. There is therefore potentially huge benefit in bringing together data from these different sources within a consistent framework, to exploit the benefits each offers and to understand processes at unprecedented resolutions/scales that would otherwise be impossible to monitor.

Current approaches to combining data in this way are typically very bespoke and involve complex model structures that are not reusable outside of the particular application area. What is needed is an overarching, generic methodological framework and associated software solutions to implement such analyses. Not only would such a framework provide the methodological basis for researchers to benefit from this big data revolution, but it would also enable such analyses to change from being stand-alone research projects in their own right to more operational, standard analytical routines.

Finally, such dynamic, integrated analyses could feed back into data collection initiatives to ensure optimal allocation of effort for traditional surveys or optimal power management for sensor networks. The major step change is that this optimal allocation of effort is conditional on the other data that are available. So, for example, given the coverage and intensity of the citizen science data, where should we optimally send our paid surveyors? The idea is that information is collected at the times and locations that provide the greatest benefit in understanding the underpinning stochastic processes. These two major issues - integrated analyses and adaptive sampling - ensure that environmental monitoring is fit for purpose and that scientists, policy-makers and industry can benefit from the big data revolution.

This project will develop an integrated statistical modelling strategy that provides a single modelling framework for quantifying ecosystem goods and services while accounting for the fundamental differences between data streams. Data collected at different spatial resolutions can be used within the same model by projecting them into continuous space and then back onto the landscape level of interest. As a result, decisions can be made at the relevant spatial scale and uncertainty is propagated throughout, facilitating appropriate decision-making.

Statistical methodology for assessing the impacts of offshore renewable developments on marine wildlife (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Computational Statistics, Environmental, Ecological Sciences and Sustainability

(jointly supervised by Esther Jones and Adam Butler, BIOSS)

Assessing the impacts of offshore renewable developments on marine wildlife is a critical component of the consenting process. A NERC-funded project, ECOWINGS, will provide a step-change in analysing predator-prey dynamics in the marine environment, collecting data across trophic levels against a backdrop of developing wind farms and climate change. Aerial survey and GPS data from multiple species of seabirds will be collected contemporaneously alongside prey data available over the whole water column from an automated surface vehicle and underwater drone.

These methods of data collection will generate 3D space and time profiles of predators and prey, creating a rich source of information and enormous potential for modelling and interrogation. The data present a unique opportunity for experimental design across a dynamic and changing marine ecosystem, which is heavily influenced by local and global anthropogenic activities. However, these data have complex intrinsic spatio-temporal properties, which are challenging to analyse. Significant statistical methods development could be achieved using this system as a case study, contributing to the scientific knowledge base not only in offshore renewables but more generally in the many circumstances where patchy ecological spatio-temporal data are available. 

This PhD project will develop spatio-temporal modelling methodology that will allow users to analyse these exciting - and complex - data sets and help inform our knowledge of the impacts of offshore renewables on wildlife.

Bayesian variable selection for genetic and genomic studies (PhD)

Supervisors: Mayetri Gupta
Relevant research groups: Bayesian Modelling and Inference, Computational Statistics, Statistical Modelling for Biology, Genetics and *omics

An important issue in high-dimensional regression problems is the accurate and efficient estimation of models when a substantially larger number of potential predictors than data points is present. Further complications arise with correlated predictors, leading to the breakdown of standard statistical models for inference, and with the uncertain definition of the outcome variable, which is often a varying composition of several different observable traits. Examples of such problems arise in many scenarios in genomics: in determining expression patterns of genes that may be responsible for a type of cancer, and in determining which genetic mutations lead to higher risks of occurrence of a disease. This project involves developing broad and improved Bayesian methodologies for efficient inference in high-dimensional regression-type problems with complex multivariate outcomes, with a focus on genetic data applications.
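As a toy illustration of the kind of problem described above - far more candidate predictors than data points, only a few truly relevant - the sketch below runs a minimal spike-and-slab Gibbs sampler on simulated data. This is a generic textbook construction, not the methodology the project would develop; all settings (prior inclusion probability, slab variance, simulated signal) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated high-dimensional regression: n = 60 observations, p = 200
# predictors, only the first 3 of which truly affect the outcome.
n, p = 60, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.5]
y = X @ beta_true + rng.standard_normal(n)

# Spike-and-slab prior: each coefficient is either "out" (point mass at 0)
# or "in" (Gaussian slab with variance tau2), with prior inclusion prob. q.
tau2, sigma2, q = 1.0, 1.0, 0.05
z = np.zeros(p, dtype=bool)          # inclusion indicators
beta = np.zeros(p)
incl_count = np.zeros(p)
n_iter, burn = 400, 100

for it in range(n_iter):
    resid = y - X @ beta
    for j in range(p):
        resid += X[:, j] * beta[j]   # partial residual excluding predictor j
        xj = X[:, j]
        v = 1.0 / (xj @ xj / sigma2 + 1.0 / tau2)
        m = v * (xj @ resid) / sigma2
        # log Bayes factor for including predictor j vs. excluding it
        log_bf = 0.5 * (np.log(v / tau2) + m * m / v)
        prob = q / (q + (1 - q) * np.exp(-log_bf))
        z[j] = rng.uniform() < prob
        beta[j] = rng.normal(m, np.sqrt(v)) if z[j] else 0.0
        resid -= X[:, j] * beta[j]   # restore predictor j's contribution
    if it >= burn:
        incl_count += z

incl_prob = incl_count / (n_iter - burn)
selected = np.where(incl_prob > 0.5)[0]
print("predictors with posterior inclusion probability > 0.5:", selected)
```

With a strong simulated signal the sampler recovers the three active predictors; the project's interest lies in what happens when predictors are correlated and the outcome is itself composite, where such vanilla samplers break down.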

The successful candidate should have a strong background in methodological and applied Statistics, expert skills in relevant statistical software or programming languages (such as R, C/C++/Python), and also have a deep interest in developing knowledge in cross-disciplinary topics in genomics. The candidate will be expected to consolidate and master an extensive range of topics in modern Statistical theory and applications during their PhD, including advanced Bayesian modelling and computation, latent variable models, machine learning, and methods for Big Data. The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.

Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis (PhD)

Supervisors: Mayetri Gupta
Relevant research groups: Bayesian Modelling and Inference, Computational Statistics, Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications

In recent years, many different computational methods to analyse biological data have been established, including those for DNA (genomics), RNA (transcriptomics), proteins (proteomics) and metabolomics, which captures more dynamic events. These methods have been refined by the advent of single-cell technology, where it is now possible to capture the transcriptomic profile of single cells, and of spatial arrangements of cells from flow methods or imaging methods like functional magnetic resonance imaging. At the same time, these OMICS data can be complemented with clinical data - measurements of patients, such as age, smoking status, disease phenotype or drug treatment. It is an interesting and important open statistical question how to combine data from different “modalities” (such as transcriptome with clinical or imaging data) in a statistically valid way, in order to compare different datasets and make justifiable statistical inferences. In this PhD project, jointly supervised by Dr Thomas Otto and Prof. Stefan Siebert from the Institute of Infection, Immunity & Inflammation, you will explore how to combine different datasets using Bayesian latent variable modelling, focusing on clinical datasets from Rheumatoid Arthritis.

Funding Notes

The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.

Analysis of spatially correlated functional data objects (PhD)

Supervisors: Surajit Ray
Relevant research groups: Modelling in Space and Time, Computational Statistics, Nonparametric and Semi-parametric Statistics, Imaging, Image Processing and Image Analysis

Historically, functional data analysis (FDA) techniques have been widely used to analyze traditional time series data, albeit from a different perspective. Of late, FDA techniques are increasingly being used in domains such as environmental science, where the data are spatio-temporal in nature, and hence it is typical to consider such data as functional data where the functions are correlated in time or space. An example where modeling the dependencies is crucial is in analyzing remotely sensed data observed over a number of years across the surface of the earth, where each year forms a single functional data object. One might be interested in decomposing the overall variation across space and time and attributing it to covariates of interest. Another interesting class of data with dependence structure consists of weather data on several variables collected from balloons, where the domain of the functions is a vertical strip in the atmosphere and the data are spatially correlated. One of the challenges in this type of data is the problem of missingness, to address which one needs to develop appropriate spatial smoothing techniques for spatially dependent functional data. There are also interesting design-of-experiment issues, as well as questions of data calibration to account for variability in the sensing instruments. In spite of the research initiatives in analyzing dependent functional data, several problems remain unresolved, which the student will work on:

  • robust statistical models for incorporating temporal and spatial dependencies in functional data
  • developing reliable prediction and interpolation techniques for dependent functional data
  • developing inferential framework for testing hypotheses related to simplified dependent structures
  • analysing sparsely observed functional data by borrowing information from neighbours
  • visualisation of data summaries associated with dependent functional data
  • clustering of functional data
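One of these directions - borrowing information from neighbours for sparsely observed functional data - can be illustrated with a toy sketch: curves at nearby sites share structure, so the curve at a held-out site can be predicted from spatially weighted basis coefficients of its neighbours. Everything here (the Fourier basis, the Gaussian kernel weights, the simulated spatial field) is an illustrative assumption, not a proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spatially correlated functional data: at each site s the curve is
# amp(s) * sin(2*pi*t), with amp varying smoothly over space, plus noise.
n_sites, n_t = 25, 40
coords = rng.uniform(0, 1, (n_sites, 2))
t = np.linspace(0, 1, n_t)
amp = 1.0 + coords[:, 0] + coords[:, 1]        # smooth spatial surface
curves = amp[:, None] * np.sin(2 * np.pi * t) \
         + 0.2 * rng.standard_normal((n_sites, n_t))

# Fit each curve by least squares on a small Fourier basis.
B = np.column_stack([np.ones(n_t), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
coef = np.linalg.lstsq(B, curves.T, rcond=None)[0].T   # shape (n_sites, 3)

# Predict the whole curve at site 0 WITHOUT using its own data, borrowing
# information from neighbours via Gaussian-kernel spatial weights.
d = np.linalg.norm(coords[1:] - coords[0], axis=1)
w = np.exp(-(d / 0.3) ** 2)
w /= w.sum()
coef_hat = w @ coef[1:]                        # spatially smoothed coefficients
curve_hat = B @ coef_hat

rmse = np.sqrt(np.mean((curve_hat - amp[0] * np.sin(2 * np.pi * t)) ** 2))
print(f"RMSE of borrowed-strength prediction at the held-out site: {rmse:.3f}")
```

The research questions above replace these ad-hoc kernel weights with principled spatial dependence models and add inference, robustness and visualisation on top.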

Emulation and Uncertainty Quantification - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Structured ML for physical systems (PhD or MSc)

Supervisors: Lawrence Bull
Relevant research groups: Machine Learning and AI, Emulation and Uncertainty Quantification

When using Machine Learning (ML) for science and engineering, an alternative mindset is required to build sensible representations from data. Unlike other applications (e.g. large language models), the datasets are relatively small and curated - i.e. they are collected via experiments rather than scraped from the internet. The limited variance of training data typically renders learning by ‘brute force’ infeasible. Instead, we must encode domain-specific knowledge within ML algorithms to enforce structure and constrain the space of possible models.
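As a toy illustration of encoding domain knowledge into an ML model, the hypothetical sketch below compares a plain kernel ridge regression - whose RBF kernel already encodes a smoothness constraint - with a prediction symmetrised over a known invariance of the physical system (here, that the target function is even). None of this is specific to the project; it only illustrates the idea of constraining the space of possible models.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy physical signal known, from domain knowledge, to be EVEN: f(x) = f(-x).
f = lambda x: np.cos(3 * x) + 0.5 * x**2
x_train = rng.uniform(-2, 2, 20)
y_train = f(x_train) + 0.1 * rng.standard_normal(20)

def krr_fit(x_tr, y_tr, ls=0.5, lam=1e-2):
    """Kernel ridge regression with an RBF kernel (a smoothness constraint)."""
    K = np.exp(-(x_tr[:, None] - x_tr[None, :]) ** 2 / (2 * ls**2))
    alpha = np.linalg.solve(K + lam * np.eye(len(x_tr)), y_tr)
    return lambda x: np.exp(-(x[:, None] - x_tr[None, :]) ** 2 / (2 * ls**2)) @ alpha

predict = krr_fit(x_train, y_train)
x_test = np.linspace(-2, 2, 101)

# Encode the known invariance by averaging over the group {x -> x, x -> -x}:
# the symmetrised predictor satisfies f(x) = f(-x) exactly, by construction.
pred_plain = predict(x_test)
pred_sym = 0.5 * (predict(x_test) + predict(-x_test))

err_plain = np.max(np.abs(pred_plain - f(x_test)))
err_sym = np.max(np.abs(pred_sym - f(x_test)))
print(f"max error - plain: {err_plain:.3f}, symmetry-constrained: {err_sym:.3f}")
```

Group-averaging is only one of many ways to impose structure; the project would explore richer constraints (smoothness levels, physical laws, invariances) within the learning algorithm itself.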

This project covers ML for physical systems (Karniadakis, 2021) and looks to integrate machine learning with applied mathematics - fusing scientific knowledge with insights from data. Methods will investigate various levels of constraints on ML predictions - including smoothness, invariances, etc. Relevant topics include:

Generating deep fake left ventricles: a step towards personalised heart treatments (PhD)

Supervisors: Andrew Elliott, Vinny Davies, Hao Gao
Relevant research groups: Machine Learning and AI, Emulation and Uncertainty Quantification, Biostatistics, Epidemiology and Health Applications, Imaging, Image Processing and Image Analysis

Personalised medicine is an exciting avenue in the field of cardiac healthcare, where an understanding of patient-specific mechanisms can lead to improved treatments (Gao et al., 2017). The use of mathematical models to link the underlying properties of the heart with cardiac imaging offers the possibility of obtaining important parameters of heart function non-invasively (Gao et al., 2015). Unfortunately, current estimation methods rely on complex mathematical forward simulations, with a solution taking hours - a time frame not suitable for real-time treatment decisions. To increase the applicability of these methods, statistical emulation methods have been proposed as an efficient way of estimating the parameters (Davies et al., 2019; Noè et al., 2019). In this approach, simulations of the mathematical model are run in advance, and machine learning based methods are then used to estimate the relationship between the cardiac imaging and the parameters of interest. These methods are, however, limited by our ability to understand how cardiac geometry varies across patients, which is in turn limited by the amount of data available (Romaszko et al., 2019). In this project we will look at AI-based methods for generating fake cardiac geometries which can be used to increase the amount of data (Qiao et al., 2023). We will explore different types of AI generation, including Generative Adversarial Networks and Variational Autoencoders, to understand how we can generate better 3D and 4D models of fake left ventricles and create an improved emulation strategy that can make use of them.

Machine Learning and AI - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Structured ML for physical systems (PhD or MSc)

Supervisors: Lawrence Bull
Relevant research groups: Machine Learning and AI, Emulation and Uncertainty Quantification

When using Machine Learning (ML) for science and engineering, an alternative mindset is required to build sensible representations from data. Unlike other applications (e.g. large language models), the datasets are relatively small and curated - i.e. they are collected via experiments rather than scraped from the internet. The limited variance of training data typically renders learning by ‘brute force’ infeasible. Instead, we must encode domain-specific knowledge within ML algorithms to enforce structure and constrain the space of possible models.

This project covers ML for physical systems (Karniadakis, 2021) and looks to integrate machine learning with applied mathematics - fusing scientific knowledge with insights from data. Methods will investigate various levels of constraints on ML predictions - including smoothness, invariances, etc. Relevant topics include:

Medical image segmentation and uncertainty quantification (PhD)

Supervisors: Surajit Ray
Relevant research groups: Machine Learning and AI, Imaging, Image Processing and Image Analysis

This project focuses on the application of medical imaging and uncertainty quantification for the detection of tumours. The project aims to provide clinicians with accurate, non-invasive methods for detecting and classifying the presence of malignant and benign tumours. It seeks to combine advanced medical imaging technologies such as ultrasound, computed tomography (CT) and magnetic resonance imaging (MRI) with the latest artificial intelligence algorithms. These methods will automate the detection process and may be used for determining malignancy with a high degree of accuracy. Uncertainty quantification (UQ) techniques will help generate a more precise prediction for tumour malignancy by providing a characterisation of the degree of uncertainty associated with the diagnosis. The combination of medical imaging and UQ will significantly decrease the requirement for performing invasive medical procedures such as biopsies. This will improve the accuracy of the tumour detection process and reduce the duration of diagnosis. The project will also benefit from the development of novel image processing algorithms (e.g. deep learning) and machine learning models. These algorithms and models will help improve the accuracy of the tumour detection process and assist clinicians in making the best treatment decisions.
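As a minimal illustration of the role uncertainty quantification plays here, the sketch below computes per-pixel predictive entropy from an ensemble of segmentation outputs: high entropy flags ambiguous regions, such as a tumour boundary, for clinical review. The "ensemble members" are random stand-ins, not a real segmentation network.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for an ensemble of segmentation models (e.g. MC-dropout passes):
# each member outputs a per-pixel tumour-probability map for one image.
H, W, n_members = 64, 64, 20
yy, xx = np.mgrid[0:H, 0:W]
dist = np.sqrt((yy - 32) ** 2 + (xx - 32) ** 2)
base = 1.0 / (1.0 + np.exp(dist - 12))          # "tumour" = disc of radius ~12
probs = np.clip(base + 0.15 * rng.standard_normal((n_members, H, W)),
                1e-6, 1 - 1e-6)

mean_p = probs.mean(axis=0)                      # ensemble-mean probability
seg = mean_p > 0.5                               # final segmentation mask

# Per-pixel predictive entropy: highest near the ambiguous tumour boundary.
entropy = -(mean_p * np.log(mean_p) + (1 - mean_p) * np.log(1 - mean_p))
uncertain = entropy > 0.5                        # pixels to flag for review

boundary = np.abs(dist - 12) < 2
print("mean entropy on the boundary:", entropy[boundary].mean())
print("mean entropy elsewhere:     ", entropy[~boundary].mean())
```

In practice the ensemble would come from MC dropout, deep ensembles or a Bayesian segmentation model, and the entropy map would accompany the mask as a measure of diagnostic confidence.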

Generating deep fake left ventricles: a step towards personalised heart treatments (PhD)

Supervisors: Andrew Elliott, Vinny Davies, Hao Gao
Relevant research groups: Machine Learning and AI, Emulation and Uncertainty Quantification, Biostatistics, Epidemiology and Health Applications, Imaging, Image Processing and Image Analysis

Personalised medicine is an exciting avenue in the field of cardiac healthcare, where an understanding of patient-specific mechanisms can lead to improved treatments (Gao et al., 2017). The use of mathematical models to link the underlying properties of the heart with cardiac imaging offers the possibility of obtaining important parameters of heart function non-invasively (Gao et al., 2015). Unfortunately, current estimation methods rely on complex mathematical forward simulations, with a solution taking hours - a time frame not suitable for real-time treatment decisions. To increase the applicability of these methods, statistical emulation methods have been proposed as an efficient way of estimating the parameters (Davies et al., 2019; Noè et al., 2019). In this approach, simulations of the mathematical model are run in advance, and machine learning based methods are then used to estimate the relationship between the cardiac imaging and the parameters of interest. These methods are, however, limited by our ability to understand how cardiac geometry varies across patients, which is in turn limited by the amount of data available (Romaszko et al., 2019). In this project we will look at AI-based methods for generating fake cardiac geometries which can be used to increase the amount of data (Qiao et al., 2023). We will explore different types of AI generation, including Generative Adversarial Networks and Variational Autoencoders, to understand how we can generate better 3D and 4D models of fake left ventricles and create an improved emulation strategy that can make use of them.

Modelling in Space and Time - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Sampling strategies for environmental monitoring networks

Supervisors: Claire Miller, Craig Alexander, Craig Wilkie
Relevant research groups: Modelling in Space and Time, Environmental, Ecological Sciences and Sustainability

In recent years, much work has been done on investigating how to monitor environmental variables in the most efficient way. Environmental variables, such as pollutants in water, can be monitored through, for example, in-situ sampling, automatic in-situ sensors or remote sensing. However, each sampling approach has a different level of accuracy and is available at different spatial and temporal resolutions.

Environmental regulators and industry have a responsibility and commitment to monitoring environmental standards and mitigating the potential for increased levels of pollutants. At a time of worldwide budgetary pressures, the most efficient monitoring schemes are required. However, the mechanisms of monitoring can themselves be detrimental to the environment, e.g. through more site visits, or through lab/computer processing that creates a higher environmental footprint.

The aim of the PhD is to extend work already carried out on the optimal design of monitoring networks for spatiotemporal models. Specifically, it is to identify spatiotemporal sampling designs that can balance budgetary requirements and environmental impact, with a view to developing and enhancing online tools (e.g. GWSDAT) that provide automatic guidance to practitioners, who can then integrate this guidance into their assessment and development of the optimal monitoring network. This will require statistical methodological development, work on computationally efficient implementations, and software development.
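A toy version of the underlying design problem is sketched below: given a Gaussian-process model of the monitored field, greedily choose sensor locations that most reduce the total posterior variance across all candidate sites. The kernel, lengthscale and greedy criterion are illustrative assumptions, not the project's intended methodology (which must also weigh cost and environmental footprint).

```python
import numpy as np

rng = np.random.default_rng(3)

# Candidate monitoring sites on a grid; spatial dependence via an RBF kernel.
n_side = 10
g = np.linspace(0, 1, n_side)
sites = np.array([(x, y) for x in g for y in g])       # 100 candidate sites
d2 = ((sites[:, None, :] - sites[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * 0.25**2)) + 1e-8 * np.eye(len(sites))

def total_posterior_variance(chosen, noise=0.05):
    """Sum of GP posterior variances at all sites given sensors at `chosen`."""
    if not chosen:
        return np.trace(K)
    Kcc = K[np.ix_(chosen, chosen)] + noise * np.eye(len(chosen))
    Kac = K[:, chosen]
    reduction = np.einsum('ij,jk,ik->i', Kac, np.linalg.inv(Kcc), Kac)
    return (np.diag(K) - reduction).sum()

# Greedy design: repeatedly add the sensor that most reduces total variance.
chosen = []
for _ in range(5):
    best = min((s for s in range(len(sites)) if s not in chosen),
               key=lambda s: total_posterior_variance(chosen + [s]))
    chosen.append(best)

print("greedy sensor locations:", [tuple(np.round(sites[s], 2)) for s in chosen])
print("remaining total variance:", round(total_posterior_variance(chosen), 2))
```

Variance reduction is only one possible design criterion; the PhD would balance it against budget and environmental impact, and embed the result in practitioner-facing tools.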

The PhD will be jointly supervised by partners from industry and hence the successful candidate will additionally engage in knowledge exchange/transfer, training and networking in this sector.

Evaluating probabilistic forecasts in high-dimensional settings (PhD)

Supervisors: Jethro Browell
Relevant research groups: Modelling in Space and Time, Computational Statistics, Applied Probability and Stochastic Processes

Many decisions are informed by forecasts, and almost all forecasts are uncertain to some degree. Probabilistic forecasts quantify uncertainty to help improve decision-making and are playing an important role in fields including weather forecasting, economics, energy, and public policy. Evaluating the quality of past forecasts is essential to give forecasters and forecast users confidence in their current predictions, and to compare the performance of forecasting systems.

While the principles of probabilistic forecast evaluation have been established over the past 15 years, most notably that of “sharpness subject to calibration/reliability”, we lack a complete toolkit for applying these principles in many situations, especially those that arise in high-dimensional settings. Furthermore, forecast evaluation must be interpretable by forecast users as well as expert forecasters, and assigning value to marginal improvements in forecast quality remains a challenge in many sectors.

This PhD will develop new statistical methods for probabilistic forecast evaluation considering some of the following issues:

  • Verifying probabilistic calibration conditional on relevant covariates
  • Skill scores for multivariate probabilistic forecasts where “ideal” performance is unknowable
  • Assigning value to marginal forecast improvement through the convolution of utility functions and Murphy diagrams
  • Development of the concepts of “anticipated verification” and “predicting the uncertainty of future forecasts”
  • Decomposing forecast misspecification (e.g. into spatial and temporal components)
  • Evaluation of Conformal Predictions
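Two of the basic building blocks of such evaluation - the pinball (quantile) loss and empirical interval coverage - can be sketched as follows for a calibrated and a deliberately overconfident forecaster. The example is a univariate toy with static forecasts, whereas the project targets the high-dimensional and conditional settings listed above.

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy setting: outcomes are N(0,1). Two forecasting systems issue predictive
# quantiles - one calibrated (sigma = 1) and one overconfident (sigma = 0.5).
n = 5000
y = rng.standard_normal(n)
taus = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
z = np.array([-1.645, -0.674, 0.0, 0.674, 1.645])  # standard normal quantiles

def pinball(y, q, tau):
    """Pinball (quantile) loss averaged over observations; proper for quantiles."""
    u = y - q
    return np.mean(np.maximum(tau * u, (tau - 1) * u))

results = {}
for name, sigma in [("calibrated", 1.0), ("overconfident", 0.5)]:
    q = sigma * z                                   # static quantile forecasts
    score = np.mean([pinball(y, q[i], t) for i, t in enumerate(taus)])
    # Empirical coverage of the central 90% interval [q_0.05, q_0.95]
    cover = np.mean((y >= q[0]) & (y <= q[-1]))
    results[name] = (score, cover)
    print(f"{name:13s}: mean pinball = {score:.4f}, 90% coverage = {cover:.3f}")
```

Because the pinball loss is a proper scoring rule, the calibrated system scores better on average; the open questions above concern what replaces such scores when forecasts are multivariate and "ideal" performance is unknowable.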

Good knowledge of multivariate statistics is essential; prior knowledge of probabilistic forecasting and forecast evaluation would be an advantage.

Adaptive probabilistic forecasting (PhD)

Supervisors: Jethro Browell
Relevant research groups: Modelling in Space and Time, Computational Statistics, Applied Probability and Stochastic Processes

Data-driven predictive models depend on the representativeness of the data used in model selection and estimation. However, many processes change over time, meaning that recent data are more representative than old data. In this situation, predictive models should track these changes, which is the aim of “online” or “adaptive” algorithms. Furthermore, many users of forecasts require probabilistic forecasts, which quantify uncertainty, to inform their decision-making. Existing adaptive methods such as Recursive Least Squares and the Kalman Filter have been very successful for adaptive point forecasting, but adaptive probabilistic forecasting has received little attention. This PhD will develop methods for adaptive probabilistic forecasting from a theoretical perspective, with a view to applying these methods to problems in at least one application area to be determined.

In the context of adaptive probabilistic forecasting, this PhD may consider:

  • Online estimation of Generalised Additive Models for Location, Scale and Shape
  • Online/adaptive (multivariate) time series prediction
  • Online aggregation (of experts, or hierarchies)
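A minimal example of the adaptive idea - tracking a predictive quantile of a drifting process by stochastic gradient steps on the pinball loss - might look as follows. The learning rate and drift are illustrative, and real adaptive methods (e.g. online GAMLSS, as above) are far richer.

```python
import numpy as np

rng = np.random.default_rng(5)

# A nonstationary process: the mean of the observations drifts upwards.
T = 4000
drift = np.linspace(0, 3, T)
y = drift + rng.standard_normal(T)

# Track the 90% predictive quantile online by stochastic gradient steps on
# the pinball loss: move up when under-covering, down when over-covering.
tau, lr = 0.9, 0.05
q = 0.0
hits = []
for t in range(T):
    hits.append(float(y[t] <= q))    # did the outcome fall below our quantile?
    q += lr * (tau - (y[t] <= q))    # pinball-loss gradient step

coverage_late = np.mean(hits[T // 2 :])
print(f"final quantile estimate: {q:.2f}")
print(f"empirical coverage over the second half: {coverage_late:.3f}")
```

A static quantile fitted to the full history would badly under-cover by the end of the series; the online update keeps coverage near its nominal level despite the drift, at the price of a small tracking lag.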

A good knowledge of methods for time series analysis and regression is essential, familiarity with flexible regression (GAMs) and distributional regression (GAMLSS/quantile regression) would be an advantage.

The evolution of shape (PhD)

Supervisors: Vincent Macaulay
Relevant research groups: Bayesian Modelling and Inference, Modelling in Space and Time, Statistical Modelling for Biology, Genetics and *omics

Shapes of objects change in time. Organisms evolve and in the process change form: humans and chimpanzees derive from some common ancestor, presumably different from either in shape. Designed objects are no different: an Art Deco teapot from the 1920s might share some features with one from Ikea in 2010, but they are different. Mathematical models of evolution for certain data types, like the strings of As, Gs, Cs and Ts in our evolving DNA, are quite mature and allow us to learn about the relationships of the objects (their phylogeny or family tree), about the changes that happen to them in time (the evolutionary process) and about the ways objects were configured in the past (the ancestral states), through statistical techniques like phylogenetic analysis. Such techniques for shape data are still in their infancy. This project will develop novel statistical inference approaches (in a Bayesian context) for complex data objects, like functions, surfaces and shapes, using Gaussian-process models, with potential application in fields as diverse as language evolution, morphometrics and industrial design.
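The basic machinery such approaches build on can be sketched in a few lines: under Brownian-motion trait evolution on a tree, the covariance between two tips equals the branch length they share from the root. The three-taxon tree and branch lengths below are invented for illustration; shape data replace the scalar traits with whole functions or surfaces.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three-taxon tree ((A,B),C): the root is 1.0 time-units deep and A and B
# diverged 0.6 units ago, so A and B share 0.4 units of path from the root.
# Under Brownian-motion evolution, shared path length = covariance.
shared = np.array([[1.0, 0.4, 0.0],
                   [0.4, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])   # C shares no path with A or B

# Simulate many independent "shape coordinates" evolving on this tree.
n_traits = 20000
L = np.linalg.cholesky(shared)
traits = L @ rng.standard_normal((3, n_traits))

emp_cov = np.cov(traits)
print("empirical tip covariance:\n", np.round(emp_cov, 2))
```

Phylogenetic inference runs this logic in reverse: given observed tip covariances, it learns the tree, the evolutionary process and the ancestral states; the project would do so for function- and surface-valued data via Gaussian-process models.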

New methods for analysis of migratory navigation (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Computational Statistics, Environmental, Ecological Sciences and Sustainability

Joint project with Dr Urška Demšar (University of St Andrews)

Migratory birds travel annually across vast expanses of oceans and continents to reach their destinations with incredible accuracy. How they are able to do this using only locally available cues is still not fully understood. Migratory navigation consists of two processes: birds either identify the direction in which to fly (compass orientation) or the location where they are at a specific moment in time (geographic positioning). One possible way they do this is by using information from the Earth’s magnetic field, in so-called geomagnetic navigation (Mouritsen 2018). While there is substantial evidence (both physiological and behavioural) that they do sense the magnetic field (Deutschlander and Beason 2014), we still do not know exactly which components of the field they use for orientation or positioning. Nor do we understand how rapid changes in the field affect movement behaviour.

There is a possibility that birds can sense these rapid, large changes and that this may affect their navigational process. To study this, we need to link accurate data on the Earth’s magnetic field with animal tracking data. This has only become possible very recently through new spatial data science advances: we developed the MagGeo tool, which links contemporaneous geomagnetic data from the European Space Agency’s Swarm satellites with animal tracking data (Benitez Paez et al. 2021).
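The core data-fusion step that MagGeo automates can be caricatured as a nearest-in-time join between a tracking table and a satellite table. The sketch below uses pandas with hypothetical column names and values; the real tool additionally interpolates spatially and handles the Swarm data products properly.

```python
import pandas as pd

# Illustrative stand-ins for the two data streams (column names hypothetical).
tracks = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-05-01 10:02", "2021-05-01 10:17",
                                 "2021-05-01 10:31"]),
    "bird_id": ["A1", "A1", "A1"],
    "lat": [56.34, 56.41, 56.50],
    "lon": [-2.79, -2.70, -2.61],
})
swarm = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-05-01 10:00", "2021-05-01 10:15",
                                 "2021-05-01 10:30", "2021-05-01 10:45"]),
    "F_nT": [49820.5, 49819.8, 49821.2, 49818.9],   # total field intensity
})

# Nearest-in-time join: annotate each GPS fix with the closest geomagnetic
# measurement, within a 10-minute tolerance.
annotated = pd.merge_asof(tracks.sort_values("timestamp"),
                          swarm.sort_values("timestamp"),
                          on="timestamp", direction="nearest",
                          tolerance=pd.Timedelta("10min"))
print(annotated[["timestamp", "bird_id", "F_nT"]])
```

Each annotation adds one or more geomagnetic components per fix, which is exactly what makes the combined data set high-dimensional and motivates the modelling work below.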

Linking geomagnetic data to animal tracking data, however, creates a high-dimensional data set, which is difficult to explore. Typical analyses of contextual environmental information in ecology represent contextual variables as covariates in relatively simple statistical models (Brum Bastos et al. 2021), but this is not sufficient for studying detailed navigational behaviour. This project will analyse complex spatio-temporal data using computationally efficient statistical model-fitting approaches in a Bayesian context.

This project is fully based on open data to support reproducibility and open science. We will test our new methods by annotating publicly available bird tracking data (e.g. from repositories such as Movebank.org), using the open MagGeo tool and implementing our new methods as Free and Open Source Software (R/Python).

References

Benitez Paez F, Brum Bastos VdS, Beggan CD, Long JA and Demšar U, 2021. Fusion of wildlife tracking and satellite geomagnetic data for the study of animal migration. Movement Ecology, 9:31. https://doi.org/10.1186/s40462-021-00268-4

Brum Bastos VdS, Łos M, Long JA, Nelson T and Demšar U, 2021. Context-aware movement analysis in ecology: a systematic review. International Journal of Geographical Information Science. https://doi.org/10.1080/13658816.2021.1962528

Deutschlander ME and Beason RC, 2014. Avian navigation and geographic positioning. Journal of Field Ornithology, 85(2):111–133. https://doi.org/10.1111/jofo.12055

Mouritsen H, 2018. Long-distance navigation and magnetoreception in migratory animals. Nature, 558:50–59. https://doi.org/10.1038/s41586-018-0176-1

Integrated spatio-temporal modelling for environmental data (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Computational Statistics, Environmental, Ecological Sciences and Sustainability

(Jointly supervised by Peter Henrys, CEH)

The last decade has seen a proliferation of environmental data with vast quantities of information available from various sources. This has been due to a number of different factors including: the advent of sensor technologies; the provision of remotely sensed data from both drones and satellites; and the explosion in citizen science initiatives. These data represent a step change in the resolution of available data across space and time - sensors can be streaming data at a resolution of seconds whereas citizen science observations can be in the hundreds of thousands.  

Over the same period, the resources available for traditional field surveys have decreased dramatically whilst logistical issues (such as access to sites) have increased. This has severely impacted the ability of field survey campaigns to collect data at high spatial and temporal resolutions. It is exactly this sort of information that is required to fit models that can quantify and predict the spread of invasive species, for example.

Whilst we have seen an explosion of data across various sources, there is no single source that provides both the spatial and temporal intensity that may be required when fitting complex spatio-temporal models (cf. the invasive species example) - each has its own advantages in terms of information content. There is therefore potentially huge benefit in bringing together data from these different sources within a consistent framework, to exploit the benefits each offers and to understand processes at unprecedented resolutions/scales that would otherwise be impossible to monitor.

Current approaches to combining data in this way are typically very bespoke and involve complex model structures that are not reusable outside of the particular application area. What is needed is an overarching generic methodological framework, and associated software solutions, to implement such analyses. Not only would such a framework provide the methodological basis to enable researchers to benefit from this big data revolution, but it would also provide the capability to change such analyses from being stand-alone research projects in their own right to more operational, standard analytical routines.

Finally, such dynamic, integrated analyses could feed back into data collection initiatives to ensure optimal allocation of effort for traditional surveys or optimal power management for sensor networks. The major step change is that this optimal allocation of effort is conditional on the other data that are available. So, for example, given the coverage and intensity of the citizen science data, where should we optimally send our paid surveyors? The idea is that information is collected at times and locations that provide the greatest benefit in understanding the underpinning stochastic processes. These two major issues - integrated analyses and adaptive sampling - ensure that environmental monitoring is fit for purpose and that scientists, policy and industry can benefit from the big data revolution.

This project will develop an integrated statistical modelling strategy that provides a single modelling framework for quantifying ecosystem goods and services while accounting for the fundamental differences between data streams. Data collected at different spatial resolutions can be used within the same model by projecting them into continuous space and back onto the landscape level of interest. As a result, decisions can be made at the relevant spatial scale and uncertainty is propagated throughout, facilitating appropriate decision making.

Statistical methodology for assessing the impacts of offshore renewable developments on marine wildlife (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Computational Statistics, Environmental, Ecological Sciences and Sustainability

(jointly supervised by Esther Jones and Adam Butler, BIOSS)

Assessing the impacts of offshore renewable developments on marine wildlife is a critical component of the consenting process. A NERC-funded project, ECOWINGS, will provide a step-change in analysing predator-prey dynamics in the marine environment, collecting data across trophic levels against a backdrop of developing wind farms and climate change. Aerial survey and GPS data from multiple species of seabirds will be collected contemporaneously alongside prey data available over the whole water column from an automated surface vehicle and underwater drone.

These methods of data collection will generate 3D space and time profiles of predators and prey, creating a rich source of information and enormous potential for modelling and interrogation. The data present a unique opportunity for experimental design across a dynamic and changing marine ecosystem, which is heavily influenced by local and global anthropogenic activities. However, these data have complex intrinsic spatio-temporal properties, which are challenging to analyse. Significant statistical methods development could be achieved using this system as a case study, contributing to the scientific knowledge base not only in offshore renewables but more generally in the many circumstances where patchy ecological spatio-temporal data are available. 

This PhD project will develop spatio-temporal modelling methodology that will allow users to analyse these exciting - and complex - data sets and help inform our knowledge of the impact of offshore renewables on wildlife.

Analysis of spatially correlated functional data objects (PhD)

Supervisors: Surajit Ray
Relevant research groups: Modelling in Space and Time, Computational Statistics, Nonparametric and Semi-parametric Statistics, Imaging, Image Processing and Image Analysis

Historically, functional data analysis techniques have been widely used to analyze traditional time series data, albeit from a different perspective. Of late, FDA techniques are increasingly being used in domains such as environmental science, where the data are spatio-temporal in nature, and hence it is typical to consider such data as functional data where the functions are correlated in time or space. An example where modeling the dependencies is crucial is in analyzing remotely sensed data observed over a number of years across the surface of the earth, where each year forms a single functional data object. One might be interested in decomposing the overall variation across space and time and attributing it to covariates of interest. Another interesting class of data with dependence structure consists of weather data on several variables collected from balloons, where the domain of the functions is a vertical strip in the atmosphere and the data are spatially correlated. One of the challenges in such data is missingness, to address which one needs to develop appropriate spatial smoothing techniques for spatially dependent functional data. There are also interesting design-of-experiment issues, as well as questions of data calibration to account for the variability in sensing instruments. In spite of the research interest in analyzing dependent functional data, there are several unresolved problems, which the student will work on:

  • robust statistical models for incorporating temporal and spatial dependencies in functional data
  • developing reliable prediction and interpolation techniques for dependent functional data
  • developing inferential framework for testing hypotheses related to simplified dependent structures
  • analysing sparsely observed functional data by borrowing information from neighbours
  • visualisation of data summaries associated with dependent functional data
  • clustering of functional data
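As a toy illustration of one of the themes above (borrowing information from neighbours for sparsely observed curves), the sketch below imputes a missing value of one curve using a Gaussian-kernel, distance-weighted average of neighbouring curves at the same time point. The data, site coordinates and bandwidth are all illustrative assumptions, not part of the project.

```python
import math

def impute_from_neighbours(curves, coords, missing, bandwidth=1.0):
    """Fill a missing value of curve i at time index t with a
    Gaussian-kernel distance-weighted average of the other curves'
    values at the same time index."""
    filled = {i: list(c) for i, c in curves.items()}
    for (i, t) in missing:
        num = den = 0.0
        for j, c in curves.items():
            if j == i or c[t] is None:
                continue
            d = math.dist(coords[i], coords[j])
            w = math.exp(-((d / bandwidth) ** 2))
            num += w * c[t]
            den += w
        filled[i][t] = num / den if den > 0 else None
    return filled

# Three sites; site 0 is missing its middle observation
coords = {0: (0.0, 0.0), 1: (0.0, 1.0), 2: (1.0, 0.0)}
curves = {0: [1.0, None, 3.0], 1: [1.1, 2.1, 3.1], 2: [0.9, 1.9, 2.9]}
filled = impute_from_neighbours(curves, coords, [(0, 1)])
print(round(filled[0][1], 6))  # equidistant neighbours -> plain average 2.0
```

A full treatment would replace the fixed kernel with a smoother estimated from the spatial dependence structure of the curves.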

Estimating the effects of air pollution on human health (PhD)

Supervisors: Duncan Lee
Relevant research groups: Modelling in Space and Time, Biostatistics, Epidemiology and Health Applications

Exposure to air pollution is thought to reduce average life expectancy by six months, with an estimated equivalent health cost of £19 billion each year (DEFRA). These effects have been estimated using statistical models, which quantify the impact on human health of exposure in both the short and the long term. However, the estimation of such effects is challenging, because individual-level measures of health and pollution exposure are not available. Therefore, the majority of studies are conducted at the population level, and the resulting inference can only be made about the effects of pollution on overall population health. However, the data used in such studies are spatially misaligned, as the health data relate to extended areas such as cities or electoral wards, while the pollution concentrations are measured at individual locations. Furthermore, pollution monitors are typically located where concentrations are thought to be highest, known as preferential sampling, which is likely to result in overly high measurements being recorded. This project aims to develop statistical methodology to address these problems, and thus provide a less biased estimate of the effects of pollution on health than is currently produced.

Mapping disease risk in space and time (PhD)

Supervisors: Duncan Lee
Relevant research groups: Modelling in Space and Time, Biostatistics, Epidemiology and Health Applications

Disease risk varies over space and time, due to similar variation in environmental exposures such as air pollution and risk-inducing behaviours such as smoking. Modelling the spatio-temporal pattern in disease risk is known as disease mapping, and the aims are to: quantify the spatial pattern in disease risk to determine the extent of health inequalities; determine whether there has been any increase or reduction in the risk over time; identify the locations of clusters of areas at elevated risk; and quantify the impact of exposures, such as air pollution, on disease risk. I am working on all these related problems at present, and I have PhD projects in all these areas.

Bayesian Mixture Models for Spatio-Temporal Data (PhD)

Supervisors: Craig Anderson
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Biostatistics, Epidemiology and Health Applications

The prevalence of disease is typically not constant across space – instead the risk tends to vary from one region to another. Some of this variability may be down to environmental conditions, but much of it is driven by socio-economic differences between regions, with poorer regions tending to have worse health than wealthier regions. For example, within the Greater Glasgow and Clyde region, the World Health Organisation noted that life expectancy ranges from 54 in Calton to 82 in Lenzie, despite these areas being less than 10 miles apart. There is substantial value to health professionals and policymakers in identifying some of the causes behind these localised health inequalities.

Disease mapping is a field of statistical epidemiology which focuses on estimating the patterns of disease risk across a geographical region. The main goal of such mapping is typically to identify regions of high disease risk so that relevant public health interventions can be made. This project involves the development of statistical models which will enhance our understanding of regional differences in the risk of suffering from major diseases by focusing on these localised health inequalities.

Standard Bayesian hierarchical models with a conditional autoregressive prior are frequently used for risk estimation in this context, but these models assume a smooth risk surface which is often not appropriate in practice. In reality, it will often be the case that different regions have vastly different risk profiles and require different data generating functions as a result.

In this work we propose a mixture model based approach which allows different sub-populations to be represented by different underlying statistical distributions within a single modelling framework. By integrating CAR models into mixture models, researchers can simultaneously account for spatial dependencies and identify distinct disease patterns within subpopulations.
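As a minimal sketch of the mixture idea (deliberately omitting the CAR spatial prior and using EM rather than Bayesian computation), a two-component Poisson mixture can separate low-risk from high-risk areas. All counts and rates below are synthetic and purely illustrative.

```python
import math, random

def rpois(lam, rng):
    """Knuth's algorithm for a Poisson(lam) draw."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def poisson_logpmf(y, lam):
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def em_poisson_mixture(counts, n_iter=200):
    """EM for a two-component Poisson mixture: returns the weight of the
    first (low-rate) component and the two fitted rates."""
    lam = [min(counts) + 1.0, max(counts) + 1.0]  # crude starting values
    pi1 = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each area's count
        resp = []
        for y in counts:
            a = math.log(pi1) + poisson_logpmf(y, lam[0])
            b = math.log(1.0 - pi1) + poisson_logpmf(y, lam[1])
            m = max(a, b)
            resp.append(math.exp(a - m) / (math.exp(a - m) + math.exp(b - m)))
        # M-step: update the mixing weight and the two rates
        pi1 = sum(resp) / len(counts)
        lam[0] = sum(r * y for r, y in zip(resp, counts)) / sum(resp)
        lam[1] = sum((1 - r) * y for r, y in zip(resp, counts)) / sum(1 - r for r in resp)
    return pi1, lam

# Synthetic area counts: 30 low-risk areas and 20 high-risk areas
rng = random.Random(42)
counts = [rpois(5, rng) for _ in range(30)] + [rpois(20, rng) for _ in range(20)]
pi1, lam = em_poisson_mixture(counts)
print(round(pi1, 2), [round(l, 1) for l in lam])
```

The project would replace the independent component memberships with spatially structured (CAR) allocations and expected counts as offsets.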

 

Nonparametric and Semi-parametric Statistics - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Causal inference in noisy social networks (PhD)

Supervisors: Vanessa McNealis
Relevant research groups: Nonparametric and Semi-parametric Statistics, Biostatistics, Epidemiology and Health Applications, Social and Urban Studies

One core task of science is causal inference, yet distinguishing causality from spurious associations in observational data can be challenging. Statistical causal inference provides a framework to define causal effects, specify assumptions for identifying causal effects, and assess sensitivity of causal estimators to these assumptions.

Recent interest has focused on causal inference under interference (or spillover), where one individual’s treatment affects the outcomes of others. Social network data are particularly valuable for this purpose, as they offer information about connections between individuals, revealing potential pathways for interference. For instance, in the National Longitudinal Study of Adolescent Health (Add Health), peer influences among adolescents provide an ideal case for studying spillover, especially as they relate to behavioural and academic outcomes. However, Add Health features a very high level of missing edge data and censoring, which poses a challenge since many methods for evaluating spillover effects assume fully observed networks.

This PhD will develop statistical methods for causal inference under network interference with noise, considering the following issues/approaches:

  • Bias characterization in the presence of missing or uncertain edge information
  • Semi-parametric inference
  • Propensity score methods
  • Multiple imputation for network data

A good knowledge of methods for survey sampling and regression is essential; familiarity with causal inference, statistical methods for coarse data, and semi-parametric inference would be an advantage.
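As a small illustration of why missing edges matter, the sketch below computes a common exposure mapping under interference (the fraction of treated neighbours) on a simulated network, then recomputes it after randomly dropping edge reports. The graph, treatment assignment and drop rate are illustrative assumptions, not Add Health data.

```python
import random

def spillover_exposure(neighbours, treated):
    """Exposure mapping under interference: the fraction of each
    individual's neighbours who are treated."""
    return {i: (sum(treated[j] for j in nbrs) / len(nbrs) if nbrs else 0.0)
            for i, nbrs in neighbours.items()}

def drop_edges(neighbours, keep_prob, rng):
    """Independently retain each undirected edge with probability
    keep_prob, mimicking missing edge reports."""
    kept = {i: set() for i in neighbours}
    for i in neighbours:
        for j in neighbours[i]:
            if i < j and rng.random() < keep_prob:
                kept[i].add(j)
                kept[j].add(i)
    return kept

# Simulated network and treatment assignment
rng = random.Random(7)
n = 200
neighbours = {i: set() for i in range(n)}
for _ in range(600):
    i, j = rng.randrange(n), rng.randrange(n)
    if i != j:
        neighbours[i].add(j)
        neighbours[j].add(i)
treated = {i: rng.random() < 0.5 for i in range(n)}

full = spillover_exposure(neighbours, treated)
noisy = spillover_exposure(drop_edges(neighbours, 0.6, rng), treated)
err = sum(abs(full[i] - noisy[i]) for i in range(n)) / n
print(round(err, 3))  # average distortion of the exposure mapping
```

The nonzero distortion is the starting point for the bias characterization and imputation methods listed above.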

Analysis of spatially correlated functional data objects (PhD)

Supervisors: Surajit Ray
Relevant research groups: Modelling in Space and Time, Computational Statistics, Nonparametric and Semi-parametric Statistics, Imaging, Image Processing and Image Analysis

Historically, functional data analysis techniques have been widely used to analyze traditional time series data, albeit from a different perspective. Of late, FDA techniques are increasingly being used in domains such as environmental science, where the data are spatio-temporal in nature, and hence it is typical to consider such data as functional data where the functions are correlated in time or space. An example where modeling the dependencies is crucial is in analyzing remotely sensed data observed over a number of years across the surface of the earth, where each year forms a single functional data object. One might be interested in decomposing the overall variation across space and time and attributing it to covariates of interest. Another interesting class of data with dependence structure consists of weather data on several variables collected from balloons, where the domain of the functions is a vertical strip in the atmosphere and the data are spatially correlated. One of the challenges in such data is missingness, to address which one needs to develop appropriate spatial smoothing techniques for spatially dependent functional data. There are also interesting design-of-experiment issues, as well as questions of data calibration to account for the variability in sensing instruments. In spite of the research interest in analyzing dependent functional data, there are several unresolved problems, which the student will work on:

  • robust statistical models for incorporating temporal and spatial dependencies in functional data
  • developing reliable prediction and interpolation techniques for dependent functional data
  • developing inferential framework for testing hypotheses related to simplified dependent structures
  • analysing sparsely observed functional data by borrowing information from neighbours
  • visualisation of data summaries associated with dependent functional data
  • clustering of functional data

Modality of mixtures of distributions (PhD)

Supervisors: Surajit Ray
Relevant research groups: Nonparametric and Semi-parametric Statistics, Applied Probability and Stochastic Processes, Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications

Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used to perform modeling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organizing mixture components into homogeneous groups, but the results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results regarding the upper bound of the number of modes, but they too are limited to normal mixtures. In this project, we wish to explore the modality of non-normal distributions and their application to real-life problems.
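A minimal numerical sketch of the central question: counting the modes of a univariate normal mixture by locating local maxima of its density on a fine grid. The grid range and component parameters are illustrative; well-separated components give two modes, while close components merge into one.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """Density of a finite normal mixture at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def count_modes(weights, mus, sigmas, lo=-10.0, hi=10.0, n=20001):
    """Count strict local maxima of the mixture density on a fine grid."""
    xs = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    ys = [mixture_pdf(x, weights, mus, sigmas) for x in xs]
    return sum(1 for i in range(1, n - 1) if ys[i] > ys[i - 1] and ys[i] > ys[i + 1])

# Two well-separated components -> bimodal
print(count_modes([0.5, 0.5], [-3.0, 3.0], [1.0, 1.0]))  # 2
# Same weights but close means -> the mixture is unimodal
print(count_modes([0.5, 0.5], [-0.5, 0.5], [1.0, 1.0]))  # 1
```

The project would ask the analogous question analytically, and for non-normal component densities where grid searches no longer scale.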


Biostatistics, Epidemiology and Health Applications - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Causal inference in noisy social networks (PhD)

Supervisors: Vanessa McNealis
Relevant research groups: Nonparametric and Semi-parametric Statistics, Biostatistics, Epidemiology and Health Applications, Social and Urban Studies

One core task of science is causal inference, yet distinguishing causality from spurious associations in observational data can be challenging. Statistical causal inference provides a framework to define causal effects, specify assumptions for identifying causal effects, and assess sensitivity of causal estimators to these assumptions.

Recent interest has focused on causal inference under interference (or spillover), where one individual’s treatment affects the outcomes of others. Social network data are particularly valuable for this purpose, as they offer information about connections between individuals, revealing potential pathways for interference. For instance, in the National Longitudinal Study of Adolescent Health (Add Health), peer influences among adolescents provide an ideal case for studying spillover, especially as they relate to behavioural and academic outcomes. However, Add Health features a very high level of missing edge data and censoring, which poses a challenge since many methods for evaluating spillover effects assume fully observed networks.

This PhD will develop statistical methods for causal inference under network interference with noise, considering the following issues/approaches:

  • Bias characterization in the presence of missing or uncertain edge information
  • Semi-parametric inference
  • Propensity score methods
  • Multiple imputation for network data

A good knowledge of methods for survey sampling and regression is essential; familiarity with causal inference, statistical methods for coarse data, and semi-parametric inference would be an advantage.

Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis (PhD)

Supervisors: Mayetri Gupta
Relevant research groups: Bayesian Modelling and Inference, Computational Statistics, Biostatistics, Epidemiology and Health Applications

In recent years, many different computational methods to analyse biological data have been established, covering DNA (genomics), RNA (transcriptomics), proteins (proteomics) and metabolomics, which captures more dynamic events. These methods have been refined by the advent of single-cell technology, where it is now possible to capture the transcriptomic profile of single cells, and of spatial arrangements of cells from flow methods or imaging methods like functional magnetic resonance imaging. At the same time, these OMICS data can be complemented with clinical data – measurements on patients, such as age, smoking status, disease phenotype or drug treatment. It is an interesting and important open statistical question how to combine data from different “modalities” (like transcriptome with clinical data or imaging data) in a statistically valid way, to compare different datasets and make justifiable statistical inferences. In this PhD project, jointly supervised by Dr. Thomas Otto and Prof. Stefan Siebert from the Institute of Infection, Immunity & Inflammation, you will explore how to combine different datasets using Bayesian latent variable modelling, focusing on clinical datasets from Rheumatoid Arthritis.

Modality of mixtures of distributions (PhD)

Supervisors: Surajit Ray
Relevant research groups: Nonparametric and Semi-parametric Statistics, Applied Probability and Stochastic Processes, Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications

Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used to perform modeling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organizing mixture components into homogeneous groups, but the results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results regarding the upper bound of the number of modes, but they too are limited to normal mixtures. In this project, we wish to explore the modality of non-normal distributions and their application to real-life problems.

Estimating the effects of air pollution on human health (PhD)

Supervisors: Duncan Lee
Relevant research groups: Modelling in Space and Time, Biostatistics, Epidemiology and Health Applications

Exposure to air pollution is thought to reduce average life expectancy by six months, with an estimated equivalent health cost of £19 billion each year (DEFRA). These effects have been estimated using statistical models, which quantify the impact on human health of exposure in both the short and the long term. However, the estimation of such effects is challenging, because individual-level measures of health and pollution exposure are not available. Therefore, the majority of studies are conducted at the population level, and the resulting inference can only be made about the effects of pollution on overall population health. However, the data used in such studies are spatially misaligned, as the health data relate to extended areas such as cities or electoral wards, while the pollution concentrations are measured at individual locations. Furthermore, pollution monitors are typically located where concentrations are thought to be highest, known as preferential sampling, which is likely to result in overly high measurements being recorded. This project aims to develop statistical methodology to address these problems, and thus provide a less biased estimate of the effects of pollution on health than is currently produced.

Mapping disease risk in space and time (PhD)

Supervisors: Duncan Lee
Relevant research groups: Modelling in Space and Time, Biostatistics, Epidemiology and Health Applications

Disease risk varies over space and time, due to similar variation in environmental exposures such as air pollution and risk-inducing behaviours such as smoking. Modelling the spatio-temporal pattern in disease risk is known as disease mapping, and the aims are to: quantify the spatial pattern in disease risk to determine the extent of health inequalities; determine whether there has been any increase or reduction in the risk over time; identify the locations of clusters of areas at elevated risk; and quantify the impact of exposures, such as air pollution, on disease risk. I am working on all these related problems at present, and I have PhD projects in all these areas.

Generating deep fake left ventricles: a step towards personalised heart treatments (PhD)

Supervisors: Andrew Elliott, Vinny Davies, Hao Gao
Relevant research groups: Machine Learning and AI, Emulation and Uncertainty Quantification, Biostatistics, Epidemiology and Health Applications, Statistical Modelling for Biology, Genetics and *omics

Personalised medicine is an exciting avenue in the field of cardiac healthcare where an understanding of patient-specific mechanisms can lead to improved treatments (Gao et al., 2017). The use of mathematical models to link the underlying properties of the heart with cardiac imaging offers the possibility of obtaining important parameters of heart function non-invasively (Gao et al., 2015). Unfortunately, current estimation methods rely on complex mathematical forward simulations, resulting in a solution taking hours, a time frame not suitable for real-time treatment decisions. To increase the applicability of these methods, statistical emulation methods have been proposed as an efficient way of estimating the parameters (Davies et al., 2019; Noè et al., 2019). In this approach, simulations of the mathematical model are run in advance, and machine learning based methods are then used to estimate the relationship between the cardiac imaging and the parameters of interest. These methods are, however, limited by our ability to understand how cardiac geometry varies across patients, which is in turn limited by the amount of data available (Romaszko et al., 2019). In this project we will look at AI-based methods for generating fake cardiac geometries which can be used to increase the amount of data (Qiao et al., 2023). We will explore different types of AI generation, including Generative Adversarial Networks and Variational Autoencoders, to understand how we can generate better 3D and 4D models of fake left ventricles and create an improved emulation strategy that can make use of them.
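As a much-simplified sketch of the emulation idea (precompute expensive simulations offline, then predict cheaply online), the code below replaces the cardiac forward model with a toy one-parameter function and emulates it by inverse-distance weighting over stored runs. Real emulators in this project would use Gaussian processes or neural networks over 3D/4D geometry; everything here is an illustrative stand-in.

```python
import math

def toy_simulator(theta):
    """Stand-in for an expensive forward model (e.g. a cardiac
    mechanics simulation): maps a parameter to an output."""
    return math.sin(theta) + 0.5 * theta

# Offline stage: run the simulator over a design of parameter values
design = [i * 0.1 for i in range(51)]
runs = [(t, toy_simulator(t)) for t in design]

def emulate(theta, runs, p=2.0):
    """Cheap inverse-distance-weighted prediction from the stored runs."""
    num = den = 0.0
    for t, y in runs:
        d = abs(theta - t)
        if d < 1e-12:          # exact hit on a design point
            return y
        w = 1.0 / d ** p
        num += w * y
        den += w
    return num / den

# Online stage: predict at a new parameter value without re-simulating
print(round(emulate(1.23, runs), 3))
```

Generating synthetic geometries, as proposed above, would enlarge the `runs` set and thereby tighten the emulator without extra real data.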

Bayesian Mixture Models for Spatio-Temporal Data (PhD)

Supervisors: Craig Anderson
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Biostatistics, Epidemiology and Health Applications

The prevalence of disease is typically not constant across space – instead the risk tends to vary from one region to another. Some of this variability may be down to environmental conditions, but much of it is driven by socio-economic differences between regions, with poorer regions tending to have worse health than wealthier regions. For example, within the Greater Glasgow and Clyde region, the World Health Organisation noted that life expectancy ranges from 54 in Calton to 82 in Lenzie, despite these areas being less than 10 miles apart. There is substantial value to health professionals and policymakers in identifying some of the causes behind these localised health inequalities.

Disease mapping is a field of statistical epidemiology which focuses on estimating the patterns of disease risk across a geographical region. The main goal of such mapping is typically to identify regions of high disease risk so that relevant public health interventions can be made. This project involves the development of statistical models which will enhance our understanding of regional differences in the risk of suffering from major diseases by focusing on these localised health inequalities.

Standard Bayesian hierarchical models with a conditional autoregressive prior are frequently used for risk estimation in this context, but these models assume a smooth risk surface which is often not appropriate in practice. In reality, it will often be the case that different regions have vastly different risk profiles and require different data generating functions as a result.

In this work we propose a mixture model based approach which allows different sub-populations to be represented by different underlying statistical distributions within a single modelling framework. By integrating CAR models into mixture models, researchers can simultaneously account for spatial dependencies and identify distinct disease patterns within subpopulations.

Implementing a biology-empowered statistical framework to detect rare variant risk factors for complex diseases in whole genome sequence cohorts (PhD)

Supervisors: Vincent Macaulay, Luísa Pereira (Geneticist, i3s)
Relevant research groups: Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications

Traditional genome-wide association studies (GWAS) to detect candidate genetic risk factors for complex diseases/phenotypes rely largely on microarray technology, genotyping at once thousands or millions of variants regularly spaced across the genome. These microarrays include mostly common variants (minor allele frequency, MAF>5%), missing candidate rare variants, which are the more likely to be deleterious [1]. Currently, the best strategy to genotype low-frequency (1%<MAF<5%) and rare (MAF<1%) variants is through next-generation sequencing, and the increasing availability of whole genome sequences (WGS) places us on the brink of detecting rare variants associated with complex diseases [2]. Statistically, this detection constitutes a challenge, as the massive number of rare variants in genomes (for example, 64.7M in 150 Iberian WGSs) would imply genotyping millions/billions of individuals to attain statistical power. In the last couple of years, several statistical methods have been tested in the context of association of rare variants with complex traits [2,3,4], largely testing strategies to aggregate the rare variants. These works have not yet tested the statistical power that can be gained by incorporating reliable biological evidence on the aggregation of rare variants in the most probable functional regions, such as non-coding regulatory regions that control the expression of genes [4]. In fact, it has been demonstrated that even for common candidate variants, most (around 88%; [5]) are located in non-coding regions. If this is true for the common variants detected by traditional GWAS, it is highly probable to be true also for rare variants.

In this work, we will implement a biology-empowered statistical framework to detect rare variant risk factors for complex diseases in WGS cohorts. We will use the 200,000 WGSs from the UK Biobank database [6], which will be available to scientists before the end of 2023. Access to clinical information on these UK residents aged over 40 is also provided. We will build our framework around type-2 diabetes (T2D), a common complex disease for which thousands of common variant candidates have been found [7]. In addition, the mapping of regulatory elements is well known for the pancreatic beta cells that play a leading role in T2D [8]. We will use this mapping to guide the rare variants’ aggregation and test it against a random aggregation across the genome. Of course, the framework rationale will be applicable to any other complex disease. We will survey the literature for aggregation methods available at the beginning of this work, but we have already selected the method SKAT (sequence kernel association test; [3]) to be tested. SKAT fits a random-effects model to the set of variants within a genomic interval or biologically meaningful region (such as a coding or regulatory region) and computes variant-set level p-values, while permitting correction for covariates (such as principal components that can account for population stratification between cases and controls).
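SKAT itself is a variance-component score test; as a simpler illustration of region-based rare-variant aggregation, the sketch below collapses the rare variants of a region (MAF below a cutoff) into a per-individual burden score, which could then be regressed on the phenotype with covariate adjustment. The genotypes, MAFs and cutoff are simulated assumptions, not UK Biobank data.

```python
import random

def burden_scores(genotypes, mafs, maf_cutoff=0.01):
    """Collapse the rare variants of a region (MAF below the cutoff)
    into a single per-individual burden score: the total count of
    rare alleles carried."""
    rare = [k for k, f in enumerate(mafs) if f < maf_cutoff]
    return [sum(g[k] for k in rare) for g in genotypes]

# Simulated region: 50 variants, a mix of rare (0.5%) and common (20%)
rng = random.Random(3)
n, m = 100, 50
mafs = [rng.choice([0.005, 0.2]) for _ in range(m)]

def draw_genotype(f):
    """Allele count in {0, 1, 2} for a variant with allele frequency f."""
    return int(rng.random() < f) + int(rng.random() < f)

genotypes = [[draw_genotype(f) for f in mafs] for _ in range(n)]
scores = burden_scores(genotypes, mafs)
print(len(scores), min(scores), max(scores))
```

The biology-empowered version would restrict `rare` to variants falling in regulatory regions of interest rather than using a frequency cutoff alone.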

 

Environmental, Ecological Sciences and Sustainability - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

There is also the possibility of applying to The Leverhulme Programme for Doctoral Training in Ecological Data Science which is hosted in our school. Information on how to apply can be found on the programme's application page.

Sampling strategies for environmental monitoring networks

Supervisors: Claire Miller, Craig Alexander, Craig Wilkie
Relevant research groups: Modelling in Space and Time; Environmental, Ecological Sciences and Sustainability

In recent years, there has been a lot of work done on investigating how to monitor environmental variables in the most efficient way.  Environmental variables, such as pollutants in water, can be monitored through, for example, in-situ sampling, automatic in-situ sensors or remote sensing.  However, each sampling approach has different levels of accuracy and is available at different spatial and temporal resolutions. 

Environmental regulators and industry all have a responsibility and commitment to monitoring environmental standards and mitigating the potential for increases to levels of pollutants.  At a time of world-wide budgetary pressures, the most efficient monitoring schemes are required.  However, the mechanisms of monitoring can also be detrimental to the environment e.g. through more visits to a site or lab/computer processing creating a higher environmental footprint.

The aim of the PhD is to extend work already carried out on the optimal design of monitoring networks for spatiotemporal models: specifically, to identify spatiotemporal sampling designs that balance budgetary requirements and environmental impact, with a view to developing and enhancing online tools (e.g. GWSDAT) that provide automatic guidance to practitioners, who can then integrate this guidance into their assessment and development of an optimal monitoring network. This will require statistical methodological development, computationally efficient implementations and software development.
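One simple formalisation of the design problem, sketched below under strong assumptions (a known exponential covariance, simple kriging, hypothetical coordinates, and no cost terms), is to choose monitoring sites greedily so as to minimise the average prediction variance over a grid of locations of interest:

```python
import numpy as np

rng = np.random.default_rng(1)

# Candidate monitoring sites and prediction locations (hypothetical).
candidates = rng.uniform(0, 10, size=(40, 2))
grid = rng.uniform(0, 10, size=(200, 2))

def cov(a, b, range_=3.0, sill=1.0):
    """Exponential covariance, a common choice for spatial pollutant fields."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return sill * np.exp(-d / range_)

def mean_kriging_variance(sites):
    """Average simple-kriging variance over the prediction grid (sill = 1)."""
    K = cov(sites, sites) + 1e-6 * np.eye(len(sites))  # small nugget for stability
    kx = cov(sites, grid)                               # (n_sites, n_grid)
    sol = np.linalg.solve(K, kx)
    # var(s0) = sill - k(s0)' K^{-1} k(s0) at each grid point
    return float(np.mean(1.0 - np.sum(kx * sol, axis=0)))

# Greedy design: repeatedly add the candidate that most reduces the
# mean prediction variance.
design = []
for _ in range(5):
    best = min((i for i in range(len(candidates)) if i not in design),
               key=lambda i: mean_kriging_variance(candidates[design + [i]]))
    design.append(best)
```

In the project itself the objective would be richer (budget, visit frequency, environmental footprint of sampling), but the same sequential-selection structure applies.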

The PhD will be jointly supervised by partners from industry and hence the successful candidate will additionally engage in knowledge exchange/transfer, training and networking in this sector.

New methods for analysis of migratory navigation (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time; Bayesian Modelling and Inference; Computational Statistics; Environmental, Ecological Sciences and Sustainability

Joint project with Dr Urška Demšar (University of St Andrews)

Migratory birds travel annually across vast expanses of oceans and continents to reach their destination with incredible accuracy. How they are able to do this using only locally available cues is still not fully understood. Migratory navigation consists of two processes: birds either identify the direction in which to fly (compass orientation) or the location where they are at a specific moment in time (geographic positioning). One possible way they do this is by using information from the Earth's magnetic field, in so-called geomagnetic navigation (Mouritsen 2018). While there is substantial evidence (both physiological and behavioural) that they do sense the magnetic field (Deutschlander and Beason 2014), we still do not know exactly which components of the field they use for orientation or positioning. Nor do we understand how rapid changes in the field affect movement behaviour.

There is a possibility that birds can sense these rapid large changes and that this may affect their navigational process. To study this, we need to link accurate data on Earth’s magnetic field with animal tracking data. This has only become possible very recently through new spatial data science advances:  we developed the MagGeo tool, which links contemporaneous geomagnetic data from Swarm satellites of the European Space Agency with animal tracking data (Benitez Paez et al. 2021).

Linking geomagnetic data to animal tracking data, however, creates a high-dimensional data set, which is difficult to explore. Typical analyses of contextual environmental information in ecology represent contextual variables as covariates in relatively simple statistical models (Brum Bastos et al. 2021), but this is not sufficient for studying detailed navigational behaviour. This project will analyse complex spatio-temporal data using computationally efficient statistical model-fitting approaches in a Bayesian context.

This project is fully based on open data to support reproducibility and open science. We will test our new methods by annotating publicly available bird tracking data (e.g. from repositories such as Movebank.org), using the open MagGeo tool and implementing our new methods as Free and Open Source Software (R/Python).
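MagGeo performs a careful spatio-temporal fusion of Swarm satellite data with tracking data, but the core annotation idea, attaching to each GPS fix the geomagnetic measurement nearest in time, can be sketched in miniature. All timestamps and field values below are invented for illustration:

```python
import numpy as np

# Hypothetical timestamps (seconds) of bird GPS fixes and of satellite
# geomagnetic measurements.
track_t = np.array([10.0, 95.0, 210.0, 340.0])
mag_t = np.arange(0.0, 400.0, 30.0)             # regular satellite sampling
mag_f = 48000.0 + 5.0 * np.sin(mag_t / 50.0)    # fake field intensity (nT)

# For each GPS fix, find the index of the nearest geomagnetic measurement.
idx = np.searchsorted(mag_t, track_t)
idx = np.clip(idx, 1, len(mag_t) - 1)
left_closer = (track_t - mag_t[idx - 1]) < (mag_t[idx] - track_t)
nearest = np.where(left_closer, idx - 1, idx)

annotated = mag_f[nearest]   # field intensity attached to each GPS fix
```

The real tool interpolates in space as well as time and handles data gaps; this sketch only shows why the joined data set inherits one covariate column per field component, quickly becoming high-dimensional.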

References

Benitez Paez F, Brum Bastos VdS, Beggan CD, Long JA and Demšar U, 2021. Fusion of wildlife tracking and satellite geomagnetic data for the study of animal migration. Movement Ecology, 9:31. https://doi.org/10.1186/s40462-021-00268-4

Brum Bastos VdS, Łos M, Long JA, Nelson T and Demšar U, 2021. Context-aware movement analysis in ecology: a systematic review. International Journal of Geographical Information Science. https://doi.org/10.1080/13658816.2021.1962528

Deutschlander ME and Beason RC, 2014. Avian navigation and geographic positioning. Journal of Field Ornithology, 85(2):111–133. https://doi.org/10.1111/jofo.12055

Mouritsen H, 2018. Long-distance navigation and magnetoreception in migratory animals. Nature, 558:50–59.

Integrated spatio-temporal modelling for environmental data (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time; Bayesian Modelling and Inference; Computational Statistics; Environmental, Ecological Sciences and Sustainability

(Jointly supervised by Peter Henrys, CEH)

The last decade has seen a proliferation of environmental data, with vast quantities of information available from various sources. This is due to a number of factors, including the advent of sensor technologies, the provision of remotely sensed data from both drones and satellites, and the explosion in citizen science initiatives. These data represent a step change in the resolution of available data across space and time: sensors can stream data at a resolution of seconds, whereas citizen science observations can number in the hundreds of thousands.

Over the same period, the resources available for traditional field surveys have decreased dramatically, whilst logistical issues (such as access to sites) have increased. This has severely impacted the ability of field survey campaigns to collect data at high spatial and temporal resolutions. Yet it is exactly this sort of information that is required to fit models that can quantify and predict the spread of invasive species, for example.

Whilst we have seen an explosion of data across various sources, no single source provides both the spatial and temporal intensity that may be required when fitting complex spatio-temporal models (cf. the invasive species example); each has its own advantages in terms of information content. There is therefore potentially huge benefit in bringing together data from these different sources within a consistent framework, to exploit the benefits each offers and to understand processes at unprecedented resolutions and scales that would otherwise be impossible to monitor.

Current approaches to combining data in this way are typically very bespoke and involve complex model structures that are not reusable outside the particular application area. What is needed is an overarching, generic methodological framework and associated software to implement such analyses. Such a framework would not only provide the methodological basis for researchers to benefit from this big-data revolution, but also the capability to turn such analyses from stand-alone research projects in their own right into more operational, standard analytical routines.

Finally, such dynamic, integrated analyses could feed back into data collection initiatives to ensure optimal allocation of effort for traditional surveys or optimal power management for sensor networks, the major step change being that this allocation of effort is conditional on the other data that are available. For example, given the coverage and intensity of the citizen science data, where should we optimally send our paid surveyors? The idea is that information is collected at the times and locations that provide the greatest benefit in understanding the underpinning stochastic processes. These two major issues, integrated analyses and adaptive sampling, ensure that environmental monitoring is fit for purpose and that scientists, policy-makers and industry can benefit from the big data revolution.

This project will develop an integrated statistical modelling strategy that provides a single modelling framework for quantifying ecosystem goods and services while accounting for the fundamental differences between data streams. Data collected at different spatial resolutions can be used within the same model by projecting them into continuous space and back to the landscape level of interest. As a result, decisions can be made at the relevant spatial scale, with uncertainty propagated throughout, facilitating appropriate decision making.
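A stylised version of this projection idea can be written down directly. Below, a hypothetical one-dimensional latent field is observed through both point sensors (single cells) and areal averages (blocks of cells), and a simple second-difference smoothness penalty stands in for a full spatial model; all values are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent environmental field on a fine 1-D grid (hypothetical).
G = 20
truth = np.sin(np.linspace(0, np.pi, G))

# Point observations: sensors at individual grid cells.
pt_cells = np.array([2, 7, 13, 18])
A_pt = np.zeros((len(pt_cells), G))
A_pt[np.arange(len(pt_cells)), pt_cells] = 1.0

# Areal observations: averages over blocks of 5 cells (e.g. survey regions).
A_ar = np.zeros((4, G))
for b in range(4):
    A_ar[b, 5 * b:5 * (b + 1)] = 1.0 / 5.0

# Both data streams enter one linear system through their own projections.
A = np.vstack([A_pt, A_ar])
y = A @ truth + rng.normal(0.0, 0.01, size=A.shape[0])

# A second-difference smoothness penalty makes the joint system well-posed.
D = np.diff(np.eye(G), n=2, axis=0)
lam = 0.1
est = np.linalg.solve(A.T @ A + lam * D.T @ D, A.T @ y)
```

The same structure, with the latent field defined in continuous space and the penalty replaced by a proper spatial prior, underlies the Bayesian data-fusion models the project would develop.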

Statistical methodology for assessing the impacts of offshore renewable developments on marine wildlife (PhD)

Supervisors: Janine Illian
Relevant research groups: Modelling in Space and Time; Bayesian Modelling and Inference; Computational Statistics; Environmental, Ecological Sciences and Sustainability

(jointly supervised by Esther Jones and Adam Butler, BIOSS)

Assessing the impacts of offshore renewable developments on marine wildlife is a critical component of the consenting process. A NERC-funded project, ECOWINGS, will provide a step-change in analysing predator-prey dynamics in the marine environment, collecting data across trophic levels against a backdrop of developing wind farms and climate change. Aerial survey and GPS data from multiple species of seabirds will be collected contemporaneously alongside prey data available over the whole water column from an automated surface vehicle and underwater drone.

These methods of data collection will generate 3D space and time profiles of predators and prey, creating a rich source of information and enormous potential for modelling and interrogation. The data present a unique opportunity for experimental design across a dynamic and changing marine ecosystem, which is heavily influenced by local and global anthropogenic activities. However, these data have complex intrinsic spatio-temporal properties, which are challenging to analyse. Significant statistical methods development could be achieved using this system as a case study, contributing to the scientific knowledge base not only in offshore renewables but more generally in the many circumstances where patchy ecological spatio-temporal data are available. 

This PhD project will develop spatio-temporal modelling methodology that will allow users to analyse these exciting, and complex, data sets and help inform our knowledge of the impact of offshore renewables on wildlife.

Imaging, Image Processing and Image Analysis - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Medical image segmentation and uncertainty quantification (PhD)

Supervisors: Surajit Ray
Relevant research groups: Machine Learning and AI; Imaging, Image Processing and Image Analysis

This project focuses on the application of medical imaging and uncertainty quantification for the detection of tumours. The project aims to provide clinicians with accurate, non-invasive methods for detecting and classifying the presence of malignant and benign tumours. It seeks to combine advanced medical imaging technologies such as ultrasound, computed tomography (CT) and magnetic resonance imaging (MRI) with the latest artificial intelligence algorithms. These methods will automate the detection process and may be used for determining malignancy with a high degree of accuracy. Uncertainty quantification (UQ) techniques will help generate a more precise prediction for tumour malignancy by providing a characterisation of the degree of uncertainty associated with the diagnosis. The combination of medical imaging and UQ will significantly decrease the requirement for performing invasive medical procedures such as biopsies. This will improve the accuracy of the tumour detection process and reduce the duration of diagnosis. The project will also benefit from the development of novel image processing algorithms (e.g. deep learning) and machine learning models. These algorithms and models will help improve the accuracy of the tumour detection process and assist clinicians in making the best treatment decisions.
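As a minimal sketch of the uncertainty-quantification idea, assume an ensemble of segmentation probability maps is already available (e.g. from Monte Carlo dropout or repeatedly trained models; here they are random stand-ins). Per-pixel predictive entropy then gives an uncertainty map alongside the consensus segmentation:

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for an ensemble of T tumour-probability maps over an H x W image.
T, H, W = 20, 8, 8
probs = np.clip(rng.normal(0.7, 0.15, size=(T, H, W)), 0.0, 1.0)

mean_prob = probs.mean(axis=0)    # consensus tumour probability per pixel
seg = mean_prob > 0.5             # final binary segmentation

# Per-pixel predictive entropy as an uncertainty map: largest where the
# consensus probability is near 0.5, near zero where the ensemble agrees.
eps = 1e-12
entropy = -(mean_prob * np.log(mean_prob + eps)
            + (1 - mean_prob) * np.log(1 - mean_prob + eps))
```

A clinician-facing system would display the entropy map next to the segmentation, flagging regions where a biopsy or second reading is most informative.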

Analysis of spatially correlated functional data objects (PhD)

Supervisors: Surajit Ray
Relevant research groups: Modelling in Space and Time; Computational Statistics; Nonparametric and Semi-parametric Statistics; Imaging, Image Processing and Image Analysis

Historically, functional data analysis (FDA) techniques have been widely used to analyse traditional time series data, albeit from a different perspective. Of late, FDA techniques are increasingly being used in domains such as environmental science, where the data are spatio-temporal in nature, and hence it is typical to consider such data as functional data in which the functions are correlated in time or space. An example where modelling the dependencies is crucial is the analysis of remotely sensed data observed over a number of years across the surface of the earth, where each year forms a single functional data object. One might be interested in decomposing the overall variation across space and time and attributing it to covariates of interest. Another interesting class of data with a dependence structure consists of weather data on several variables collected from balloons, where the domain of the functions is a vertical strip in the atmosphere and the data are spatially correlated. One of the challenges in such data is missingness, to address which one needs to develop appropriate spatial smoothing techniques for spatially dependent functional data. There are also interesting design-of-experiment issues, as well as questions of data calibration to account for variability in the sensing instruments. In spite of the research initiatives in analysing dependent functional data, several problems remain unresolved, on which the student will work:

  • robust statistical models for incorporating temporal and spatial dependencies in functional data
  • developing reliable prediction and interpolation techniques for dependent functional data
  • developing inferential framework for testing hypotheses related to simplified dependent structures
  • analysing sparsely observed functional data by borrowing information from neighbours
  • visualisation of data summaries associated with dependent functional data
  • clustering of functional data
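The fourth item can be illustrated with a deliberately simple scheme: on a common time grid, missing values of a sparsely observed curve are filled in with an inverse-distance-weighted average of neighbouring sites' curves. This is a crude stand-in for proper spatial smoothing of dependent functional data, and all curves and site locations below are simulated:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: curves on a common grid at 5 spatial sites, with a
# small site-dependent shift so that nearby sites have similar curves.
t = np.linspace(0, 1, 50)
sites = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0]])
curves = np.array([np.sin(2 * np.pi * t) + 0.1 * s[0] for s in sites])

# Site 0 is only sparsely observed: most of its curve is missing.
sparse = curves[0].copy()
missing = rng.random(t.size) < 0.8
sparse[missing] = np.nan

# Impute missing values with an inverse-distance-weighted average of the
# other sites' curves: nearby sites contribute more, as spatial
# dependence suggests they should.
d = np.linalg.norm(sites[1:] - sites[0], axis=1)
w = 1.0 / d
w /= w.sum()
borrowed = w @ curves[1:]
filled = np.where(np.isnan(sparse), borrowed, sparse)
```

A principled version would replace the fixed weights with a fitted spatial covariance for the functional data, which is exactly the methodological gap the project targets.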

Generating deep fake left ventricles: a step towards personalised heart treatments (PhD)

Supervisors: Andrew Elliott, Vinny Davies, Hao Gao
Relevant research groups: Machine Learning and AI; Emulation and Uncertainty Quantification; Biostatistics, Epidemiology and Health Applications; Imaging, Image Processing and Image Analysis

Personalised medicine is an exciting avenue in the field of cardiac healthcare, where an understanding of patient-specific mechanisms can lead to improved treatments (Gao et al., 2017). The use of mathematical models to link the underlying properties of the heart with cardiac imaging offers the possibility of obtaining important parameters of heart function non-invasively (Gao et al., 2015). Unfortunately, current estimation methods rely on complex mathematical forward simulations, so that obtaining a solution takes hours, a time frame not suitable for real-time treatment decisions. To increase the applicability of these methods, statistical emulation methods have been proposed as an efficient way of estimating the parameters (Davies et al., 2019; Noè et al., 2019). In this approach, simulations of the mathematical model are run in advance, and machine learning based methods are then used to estimate the relationship between the cardiac imaging and the parameters of interest. These methods are, however, limited by our ability to understand how cardiac geometry varies across patients, which is in turn limited by the amount of data available (Romaszko et al., 2019). In this project we will look at AI-based methods for generating fake cardiac geometries which can be used to increase the amount of data (Qiao et al., 2023). We will explore different types of generative AI, including Generative Adversarial Networks and Variational Autoencoders, to understand how we can generate better 3D and 4D models of fake left ventricles and create an improved emulation strategy that can make use of them.
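A linear statistical shape model, i.e. PCA on stacked vertex coordinates, is a common baseline against which such generative models are compared. The sketch below samples synthetic geometries from one, using random stand-in "meshes" rather than real segmented ventricles:

```python
import numpy as np

rng = np.random.default_rng(11)

# Stand-in training set: 30 'ventricle meshes', each 100 vertices in 3-D,
# flattened to vectors (real data would come from segmented cardiac MRI).
n_shapes, n_vertices = 30, 100
shapes = rng.normal(size=(n_shapes, n_vertices * 3))

mean_shape = shapes.mean(axis=0)
X = shapes - mean_shape

# PCA via SVD: rows of Vt are shape modes, s gives their spread.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5                                        # keep the leading 5 modes
std = s[:k] / np.sqrt(n_shapes - 1)

# Generate a synthetic geometry by sampling mode coefficients.
coeffs = rng.normal(scale=std)
fake = (mean_shape + Vt[:k].T @ coeffs).reshape(n_vertices, 3)
```

VAEs and GANs aim to improve on exactly this baseline by capturing non-linear variation in geometry that a linear mode decomposition cannot.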

Statistics in Chemistry/Physics - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Projects will appear below here when they become available.

Statistical Modelling for Biology, Genetics and *omics - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Modelling genetic variation (MSc/PhD)

Supervisors: Vincent Macaulay
Relevant research groups: Bayesian Modelling and Inference; Statistical Modelling for Biology, Genetics and *omics

Variation in the distribution of different DNA sequences across individuals has been shaped by many processes which can be modelled probabilistically, processes such as demographic factors like prehistoric population movements, or natural selection. This project involves developing new techniques for teasing out the contributions of these different processes from modern genetic data.

Modality of mixtures of distributions (PhD)

Supervisors: Surajit Ray
Relevant research groups: Nonparametric and Semi-parametric Statistics; Applied Probability and Stochastic Processes; Statistical Modelling for Biology, Genetics and *omics; Biostatistics, Epidemiology and Health Applications

Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used for modelling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organising mixture components into homogeneous groups, but the results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results regarding the upper bound on the number of modes, but these too are limited to normal mixtures. In this project, we wish to explore the modality of non-normal mixtures and their application to real-life problems.
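The normal-mixture baseline is easy to probe numerically. The sketch below counts the modes of a univariate two-component Gaussian mixture on a fine grid, recovering the classical fact that an equal-weight, equal-variance mixture is unimodal when the component means are within two standard deviations of each other; replacing `norm.pdf` with another density is how the project's non-normal questions could first be explored empirically:

```python
import numpy as np
from scipy.stats import norm

def count_modes(means, sds, weights, grid=np.linspace(-10, 10, 20001)):
    """Count local maxima of a univariate Gaussian-mixture density on a grid."""
    dens = sum(w * norm.pdf(grid, m, s)
               for m, s, w in zip(means, sds, weights))
    interior = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
    return int(np.count_nonzero(interior))

# Equal-weight, unit-variance components: one mode when the means are
# close, two modes once they are more than 2 standard deviations apart.
print(count_modes([0.0, 1.0], [1.0, 1.0], [0.5, 0.5]))  # 1
print(count_modes([0.0, 4.0], [1.0, 1.0], [0.5, 0.5]))  # 2
```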

Implementing a biology-empowered statistical framework to detect rare variant risk factors for complex diseases in whole genome sequence cohorts (PhD)

Supervisors: Vincent Macaulay, Luísa Pereira (Geneticist, i3s)
Relevant research groups: Statistical Modelling for Biology, Genetics and *omics; Biostatistics, Epidemiology and Health Applications

Traditional genome-wide association studies (GWAS), which aim to detect candidate genetic risk factors for complex diseases/phenotypes, rely largely on microarray technology, genotyping thousands or millions of variants regularly spaced across the genome in a single assay. These microarrays mostly include common variants (minor allele frequency, MAF > 5%), missing candidate rare variants, which are the more likely to be deleterious [1]. Currently, the best strategy for genotyping low-frequency (1% < MAF < 5%) and rare (MAF < 1%) variants is next-generation sequencing, and the increasing availability of whole genome sequences (WGS) places us on the brink of detecting rare variants associated with complex diseases [2]. Statistically, this detection constitutes a challenge, as the massive number of rare variants in genomes (for example, 64.7M in 150 Iberian WGSs) would imply genotyping millions or billions of individuals to attain statistical power. In the last couple of years, several statistical methods have been tested for association of rare variants with complex traits [2, 3, 4], largely strategies that aggregate the rare variants. These works have not yet tested the statistical power that can be gained by using reliable biological evidence to aggregate rare variants in the regions most likely to be functional, such as the non-coding regulatory regions that control the expression of genes [4]. In fact, it has been demonstrated that even among common candidate variants, most (around 88%; [5]) are located in non-coding regions. If this is true for the common variants detected by traditional GWAS, it is highly probable to be true for rare variants as well.

In this work, we will implement a biology-empowered statistical framework to detect rare-variant risk factors for complex diseases in WGS cohorts. We will use the 200,000 WGSs from the UK Biobank database [6], which will be available to scientists before the end of 2023; access to the clinical information of these UK residents, all aged over 40, is also provided. We will build our framework around type-2 diabetes (T2D), a common complex disease for which thousands of common variant candidates have been found [7], and for which the mapping of regulatory elements is well characterised in the pancreatic beta cells that play a leading role in T2D [8]. We will use this mapping to guide the aggregation of rare variants and test it against a random aggregation across the genome. Of course, the framework's rationale will be applicable to any other complex disease. We will survey the literature for aggregation methods available at the start of the project, but we have already selected the method SKAT (sequence kernel association test; [3]) for testing. SKAT fits a random-effects model to the set of variants within a genomic interval or biologically meaningful region (such as a coding or regulatory region) and computes variant-set-level p-values, while permitting correction for covariates (such as principal components, which can account for population stratification between cases and controls).

Generating deep fake left ventricles: a step towards personalised heart treatments (PhD)

Supervisors: Andrew Elliott, Vinny Davies, Hao Gao
Relevant research groups: Machine Learning and AI; Emulation and Uncertainty Quantification; Biostatistics, Epidemiology and Health Applications; Imaging, Image Processing and Image Analysis

Personalised medicine is an exciting avenue in the field of cardiac healthcare, where an understanding of patient-specific mechanisms can lead to improved treatments (Gao et al., 2017). The use of mathematical models to link the underlying properties of the heart with cardiac imaging offers the possibility of obtaining important parameters of heart function non-invasively (Gao et al., 2015). Unfortunately, current estimation methods rely on complex mathematical forward simulations, so that obtaining a solution takes hours, a time frame not suitable for real-time treatment decisions. To increase the applicability of these methods, statistical emulation methods have been proposed as an efficient way of estimating the parameters (Davies et al., 2019; Noè et al., 2019). In this approach, simulations of the mathematical model are run in advance, and machine learning based methods are then used to estimate the relationship between the cardiac imaging and the parameters of interest. These methods are, however, limited by our ability to understand how cardiac geometry varies across patients, which is in turn limited by the amount of data available (Romaszko et al., 2019). In this project we will look at AI-based methods for generating fake cardiac geometries which can be used to increase the amount of data (Qiao et al., 2023). We will explore different types of generative AI, including Generative Adversarial Networks and Variational Autoencoders, to understand how we can generate better 3D and 4D models of fake left ventricles and create an improved emulation strategy that can make use of them.

Social and Urban Studies - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Causal inference in noisy social networks (PhD)

Supervisors: Vanessa McNealis
Relevant research groups: Nonparametric and Semi-parametric Statistics; Biostatistics, Epidemiology and Health Applications; Social and Urban Studies

One core task of science is causal inference, yet distinguishing causality from spurious associations in observational data can be challenging. Statistical causal inference provides a framework to define causal effects, specify assumptions for identifying causal effects, and assess sensitivity of causal estimators to these assumptions.

Recent interest has focused on causal inference under interference (or spillover), where one individual's treatment affects the outcomes of others. Social network data are particularly valuable for this purpose, as they offer information about connections between individuals, revealing potential pathways for interference. For instance, in the National Longitudinal Study of Adolescent Health (Add Health), peer influences among adolescents provide an ideal case for studying spillover, especially as they relate to behavioural and academic outcomes. However, Add Health exhibits a very high level of missingness and censoring in the edge variables, which poses challenges, since many methods for evaluating spillover effects assume fully observed networks.

This PhD will develop statistical methods for causal inference under network interference with noise, considering the following issues/approaches:

  • Bias characterization in the presence of missing or uncertain edge information
  • Semi-parametric inference
  • Propensity score methods
  • Multiple imputation for network data
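The first item can be made concrete in a toy simulation: regressing outcomes on own treatment and the fraction of treated neighbours recovers a spillover effect on the fully observed network, while randomly deleting edge reports attenuates the estimate through classical measurement error in the exposure. All networks, models and effect sizes below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical friendship network, randomised treatment, and outcomes with
# a spillover term through the fraction of treated friends.
n = 1000
upper = np.triu(rng.random((n, n)) < 0.01, k=1)
A = (upper | upper.T).astype(float)          # symmetric adjacency, no loops
treat = rng.binomial(1, 0.5, size=n).astype(float)

deg = A.sum(axis=1)
frac = np.divide(A @ treat, deg, out=np.zeros(n), where=deg > 0)
y = 1.0 + 0.5 * treat + 0.8 * frac + rng.normal(0.0, 0.1, size=n)

def spillover_ols(f):
    """OLS coefficient on the fraction of treated neighbours."""
    X = np.column_stack([np.ones(n), treat, f])
    return float(np.linalg.lstsq(X, y, rcond=None)[0][2])

full = spillover_ols(frac)                   # close to the true 0.8

# Randomly delete half of the edge reports: the mismeasured exposure
# attenuates the estimated spillover effect -- the bias the project
# would characterise and correct.
keep = np.triu(rng.random((n, n)) < 0.5, k=1)
keep = (keep | keep.T).astype(float)
A_obs = A * keep
deg_o = A_obs.sum(axis=1)
frac_o = np.divide(A_obs @ treat, deg_o, out=np.zeros(n), where=deg_o > 0)
noisy = spillover_ols(frac_o)                # attenuated towards zero
```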

A good knowledge of methods for survey sampling and regression is essential; familiarity with causal inference, statistical methods for coarse data, and semi-parametric inference would be an advantage.

Innovation in Learning and Teaching

Statistics and Data Analytics Education - Example Research Projects

Our group has an active PhD student community, and every year we admit new PhD students. We welcome applications from across the world. Further information can be found here.