Statistics

Applications for 2025-2026 open on 1 July 2025

Beyond Prediction: Data Science summer projects

Project code: SCI057

Supervisor(s):

Assoc Prof Lara Greaves (Ngāpuhi)

Prof Mark Gahegan (Pākehā)

Eric Marshall (Ngāpuhi)

Tori Diamond (Ngāpuhi)

Assoc Prof Phil Wilcox (Ngāti Kahungunu, Rongomaiwahine, Ngāti Rakaipaaka)

Otago University, Assoc Prof Andrew Sporle (Ngāti Apa, Rangitāne, Te Rarawa)

Discipline(s):

Ngā Motu Whakahī

School of Computer Science

Statistics

Project

Up to five students

This scholarship will be hosted by Ngā Motu Whakahī and is specifically targeted toward students who whakapapa Māori.

We are offering up to five summer projects for tauira Māori, through Beyond Prediction, an MBIE-funded data science platform.

The role

Students will be matched with a supervisor and topic based on their skills and interests. Past students have explored population data, epidemiology and health data, statistics and maths education, Māori data sovereignty, AI, data visualisation, and applying tikanga and mātauranga to data science.

Ideal student

We are particularly interested in students with coding skills. Please get in touch with Lara to discuss topics/supervision.

The cascading impact of poor-quality Census data on official health and social statistics for Pacific populations in Aotearoa

Project code: SCI058

Supervisor(s):

Assoc Prof Lara Greaves (Ngāpuhi)

Otago University, Assoc Prof Andrew Sporle (Ngāti Apa, Rangitāne, Te Rarawa)

Discipline(s):

Ngā Motu Whakahī

Statistics

Project

This is a Ngā Motu Whakahī scholarship specifically targeted toward students who are of Pacific heritage.

The role

This project explores the impact of poor Census coverage of the various Pacific populations on the accuracy of Pacific health and social statistics. A combination of Census data, health data and research publications will be used to explore the possible extent of the uncertainty in key official population measures at the national and regional level.

Ideal student

The student should have some familiarity with official statistics and some understanding of Pacific population statistics.

Connecting Community and Retention in STEM through the Tuākana Program

Project code: SCI063

Supervisor(s):

Kannan Ridings

Benjamin Pollard

Susan Wingfield

Malia Puloka

Discipline(s):

Ngā Motu Whakahī

Physics

Statistics

Project

This is a Ngā Motu Whakahī scholarship specifically targeted toward students who whakapapa Māori or are of Pacific heritage.

The role

We will develop a survey for participants in the Tuākana program in Physics and Stats (possibly also Maths) at the University of Auckland Waipapa Taumata Rau. The survey, consisting of mostly open-ended items, will help to understand previous research that suggested a connection between retention in physics and participation in Tuākana. The survey will invite descriptions of whakawhanaungatanga as it relates to persisting through a STEM degree. The survey will be sent to Tuākana participants after this summer project.

How much homework is too much?

Project code: SCI126

Supervisor(s):

Beatrix Jones

Tanya Evans

Discipline(s):

Statistics

Project

The TIMSS study collects information about school math and science achievement around the world, alongside a wealth of information about students, schools, and classroom practices. A recent analysis of the Irish TIMSS data suggests short daily homework is best for mathematics achievement.

The role

We will repeat this analysis for the NZ data.

Requirements

You will need coding and statistical skills. An interest in Math Education or Bayesian Statistics would be helpful.

Nested Sampling vs. Hamiltonian MCMC

Project code: SCI127

Supervisor(s):

Brendon Brewer 

Matt Edwards

Discipline(s):

Statistics

Project

Nested Sampling and Hamiltonian Markov Chain Monte Carlo (MCMC) are considered among the most powerful techniques for Bayesian computation. Despite approaching things from very different angles, both are able to handle challenging shapes in the posterior distribution that other methods cannot. However, it is not clear whether the set of solvable problems is the same size for both methods.

The role

In this project, you will construct challenging high-dimensional probability distributions and test both methods on them. The goal is to clarify what kinds of problems each method works better on, and to find and bring to light problems where only one of the two approaches works. This could lead to useful practical advice for practitioners who need to decide what methods to use.
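As an illustration of what a "challenging" target can look like, the sketch below codes the log-density of Neal's funnel, a standard test distribution whose scale varies by orders of magnitude along one axis. It is only an assumed example; the project description does not say which distributions will be used, and the function name is hypothetical.

```python
import math

def neal_funnel_logpdf(v, xs):
    """Log-density of Neal's funnel.

    v ~ Normal(0, sd=3) and each x_i | v ~ Normal(0, variance=exp(v)).
    The dimension is 1 + len(xs), so it can be made as high as needed.
    """
    # Log-density of v under Normal(0, variance 9).
    log_p_v = -0.5 * math.log(2.0 * math.pi * 9.0) - v * v / 18.0
    # Conditional log-density of the x_i given v.
    var = math.exp(v)
    log_p_x = sum(
        -0.5 * math.log(2.0 * math.pi * var) - x * x / (2.0 * var) for x in xs
    )
    return log_p_v + log_p_x

# In the narrow neck (v = -6), moving one unit away from zero costs far more
# log-density than the same move in the wide mouth (v = 0).
neck = neal_funnel_logpdf(-6.0, [1.0])
mouth = neal_funnel_logpdf(0.0, [1.0])
```

A sampler that uses a single step size tends to struggle here, because a step that suits the mouth of the funnel is far too large for the neck.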

Requirements

This project requires good grades in STATS 331 and very good programming skills.

Best practice regression modelling with iNZight

Project code: SCI128

Supervisor(s):

Dr Tom Elliott

Dr Matt Edwards

Discipline(s):

Statistics

Project

Regression modelling involves a set of basic assumptions that require checking. Unchecked or unsatisfied assumptions should be handled, or at least commented on in text and graphical outputs.

The role

Working with the iNZight development team, you will develop R software for fitting regression models with interactive diagnostics and assumption checking.

Requirements

STATS 380, STATS 330, and very good R skills.

While the project will not involve developing the GUI directly, some familiarity with either version of iNZight is expected (STATS 10x).

Bayesian inference with iNZight

Project code: SCI129

Supervisor(s):

Dr Tom Elliott

Dr Matt Edwards

Discipline(s):

Statistics

Project

iNZight currently provides Normal theory and bootstrap methods for inference and hypothesis testing.

The role

Working with the iNZight development team, you will:

  • Perform a short literature/course review of Bayesian approaches to common inference methods
  • Design a simple framework for guiding users through a Bayesian inference problem
  • Complete an R implementation of one or two methods to demonstrate the framework.
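
To give a flavour of a Bayesian counterpart to a common inference method, here is a minimal sketch of inference for a single proportion using a conjugate Beta prior. It is written in Python for illustration only (the project itself targets R), the function names are hypothetical, and the uniform Beta(1, 1) prior is an assumption rather than anything iNZight prescribes.

```python
def beta_binomial_posterior(prior_a, prior_b, successes, trials):
    """Return the Beta posterior parameters for a binomial proportion.

    With a Beta(prior_a, prior_b) prior and `successes` out of `trials`,
    the posterior is Beta(prior_a + successes, prior_b + failures).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return prior_a + successes, prior_b + (trials - successes)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Made-up data: 7 successes in 10 trials, uniform prior.
post_a, post_b = beta_binomial_posterior(1, 1, 7, 10)
posterior_mean = beta_mean(post_a, post_b)
```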


Requirements

STATS 380, STATS 331, and very good R skills. Familiarity with iNZight (i.e., STATS 10x).

Developing smarter ways of marking short answer questions in large enrolment statistics and data science courses

Project code: SCI130

Supervisor(s):

Anna Fergusson

Discipline(s):

Statistics

Project

This project involves exploring text response data from large introductory statistics and data science courses to understand which features of the answers are related to higher and lower levels of statistical, computational, and creative thinking, and how.

The role

Understanding these features will enable the development of computational approaches and marking applications/tools that can help human graders to organise similar answers and inform teaching practice.

Requirements

STATS 220 and STATS 380, and a strong interest in data science and statistics education.

Adapting the saddlepoint method for discrete random variables

Project code: SCI131

Supervisor(s):

Jesse Goodman

Discipline(s):

Statistics

Project

The saddlepoint approximation is a systematic method for approximating an unknown density function in terms of a known moment generating function. It is useful when each individual in a large population contributes to a single random variable, and has often been used in statistical ecology.

The saddlepoint approximation works best for densities, when the underlying random variable is continuous. For discrete random variables, the traditional saddlepoint approximation works less well, and always fails at the boundary.
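
For the continuous case, the standard first-order saddlepoint density is exp(K(s) - s x) / sqrt(2 pi K''(s)), where K is the cumulant generating function (the log of the moment generating function) and s solves K'(s) = x. The sketch below applies this to a Gamma distribution with rate 1, chosen only because its K has a closed form. The function names are hypothetical and this is not the project's code framework.

```python
import math

def saddlepoint_density_gamma(x, shape):
    """Saddlepoint approximation to the Gamma(shape, rate=1) density at x > 0."""
    # CGF: K(s) = -shape * log(1 - s), defined for s < 1.
    s_hat = 1.0 - shape / x              # solves K'(s) = shape / (1 - s) = x
    k = -shape * math.log(1.0 - s_hat)   # K(s_hat)
    k2 = shape / (1.0 - s_hat) ** 2      # K''(s_hat)
    return math.exp(k - s_hat * x) / math.sqrt(2.0 * math.pi * k2)

def exact_density_gamma(x, shape):
    """Exact Gamma(shape, rate=1) density, computed on the log scale."""
    return math.exp((shape - 1.0) * math.log(x) - x - math.lgamma(shape))

# Ratio of approximate to exact density at several points.
ratios = [
    saddlepoint_density_gamma(x, 5.0) / exact_density_gamma(x, 5.0)
    for x in (1.0, 5.0, 12.0)
]
```

For this family the ratio is the same at every x (it equals the Gamma function divided by its Stirling approximation, roughly 1.017 for shape 5), so renormalising the approximation would recover the exact density.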

The role

This project will develop tools for automating saddlepoint techniques in terms of probability generating functions, extending a code framework currently based around moment generating functions.

Ideal student

Experience with R programming and simulation would be a plus. The mathematical aspects of the saddlepoint approximation are not prerequisites, but mathematical applications could be explored as part of the project depending on the student. This project will also involve some of the optimizations needed for high-accuracy scientific programming and numerical calculation.

Estimating the Local Burden of Disease of Motor Neuron Disease

Project code: SCI132

Supervisor(s):

Dr Priya Parmar

Discipline(s):

Statistics

Project

We will use the IDI to examine the national burden of Motor Neuron Disease, estimating:

(a) Fundamental epidemiological estimates of prevalence, incidence and mortality, compared with 2018 estimates
(b) Advanced epidemiological measures of Disability-adjusted life years (DALYs), Quality-adjusted life years (QALYs), Years lived with disability (YLDs) and Years of life lost (YLLs)
(c) A risk profile for those with Motor Neuron Disease
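
The burden measures are linked: DALYs are the sum of YLLs and YLDs. The sketch below shows that arithmetic, assuming a simple prevalence-based YLD (prevalent cases times a disability weight) and YLLs as deaths times remaining standard life expectancy. All function names and numbers are made up for illustration and are not estimates for any disease.

```python
def years_of_life_lost(deaths_by_age, remaining_life_expectancy):
    """YLL: deaths at each age times the remaining standard life expectancy."""
    return sum(
        count * remaining_life_expectancy[age]
        for age, count in deaths_by_age.items()
    )

def years_lived_with_disability(prevalent_cases, disability_weight):
    """Prevalence-based YLD: prevalent cases times a weight between 0 and 1."""
    return prevalent_cases * disability_weight

def disability_adjusted_life_years(yll, yld):
    """DALY = YLL + YLD."""
    return yll + yld

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
yll = years_of_life_lost({60: 10, 70: 20}, {60: 25.0, 70: 17.0})
yld = years_lived_with_disability(prevalent_cases=400, disability_weight=0.25)
daly = disability_adjusted_life_years(yll, yld)
```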

Requirements

Applied statistics using R (STATS20X) and MySQL. Some understanding of epidemiology (POPLHLTH 708) would be useful.

Estimating the Local Burden of Disease of Multiple Sclerosis

Project code: SCI133

Supervisor(s):

Dr Priya Parmar

Discipline(s):

Statistics

Project

We will use the IDI to examine the national burden of Multiple Sclerosis, estimating:

(a) Fundamental epidemiological estimates of prevalence, incidence and mortality, compared with 2018 estimates
(b) Advanced epidemiological measures of Disability-adjusted life years (DALYs), Quality-adjusted life years (QALYs), Years lived with disability (YLDs) and Years of life lost (YLLs)
(c) A risk profile for those with Multiple Sclerosis

Requirements

Applied statistics using R (STATS20X) and MySQL. Some understanding of epidemiology (POPLHLTH 708) would be useful.

Modelling intra-ethnic heterogeneity and multiple ethnicities in health and social outcomes

Project code: SCI134

Supervisor(s):

Andrew Sporle

Discipline(s):

Statistics

Project

New Zealand’s population is increasingly ethnically diverse, with multiple ethnic identities becoming more common in younger populations.

The role

This project explores ways to include more heterogeneous and overlapping ethnic identities in statistical modelling and analysis of ethnic-specific health and social outcomes. This involves documenting the limitations of existing approaches and investigating methods for including ethnic intersectionality in analysis and modelling.

Requirements

Some familiarity with official statistics, some understanding of ethnic population statistics and some statistical modelling experience are required.

Population projection methods for dynamic population structures

Project code: SCI135

Supervisor(s):

Andrew Sporle

Discipline(s):

Statistics

Project

Population projections are a routine part of official population statistics in New Zealand, with annual national projections produced after every Census. However, where the contributors to population change are highly variable, projections are produced for 5-year intervals or even not at all.

The role

This project explores options for creating projections for dynamic sub-populations in NZ that can include explicit levels of uncertainty in the factors affecting change.

Requirements

Some familiarity with official statistics and some statistical modelling experience are required.

Statistical modelling for veterinary science studies

Project code: SCI136

Supervisor(s):

Ben Stevenson

Hamish Baron (Unusual Pet Vets, Australia)

Discipline(s):

Statistics

Project

This project involves statistical modelling for a veterinary science research study, potentially on feather-destructive behaviour directed at nestlings by orange-bellied parrot parents. The exact study is not known at the time of writing this description, but further details will be available leading up to the application deadline. Please get in touch if you are interested.

Skills

Regardless of which study is selected, this is an opportunity for a student to apply the statistical skills they've acquired during their degree to a real-world research problem. This project is likely to contribute to a research publication.

Requirements

  • Excellent grades in data analysis courses, such as STATS 201/208 and STATS 330.
  • Excellent R programming skills.


Desirable, but not required

  • Excellent grades in statistical theory courses, such as STATS 210 or 310.
  • Previous experience with methods such as (generalised) linear models, (generalised) linear mixed-effects models, and survival analysis.

Benchmarking Polygenic Risk Scores Across Diverse Cohorts and Disease Traits

Project code: SCI137

Supervisor(s):

Yalu Wen

Discipline(s):

Statistics

Project

Polygenic risk scores (PRS) are increasingly used to quantify genetic susceptibility to complex diseases by aggregating the effects of multiple genetic variants. Despite rapid advances in PRS methodology, there remains a significant gap in understanding how different PRS construction and validation strategies perform across diverse populations, disease domains, and data types.
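
In its simplest additive form, a PRS is a weighted sum of a person's allele counts, with the weights taken from per-variant effect size estimates. The sketch below shows only that basic aggregation. The function name and numbers are hypothetical, and the PRS methods to be benchmarked are generally more elaborate in how they choose and shrink the weights.

```python
def polygenic_risk_score(effect_sizes, allele_dosages):
    """Additive PRS: sum over variants of effect size times allele dosage.

    effect_sizes: per-variant weights (for example, log odds ratios).
    allele_dosages: per-variant counts of the effect allele (0, 1 or 2,
    or fractional values if the genotypes are imputed).
    """
    if len(effect_sizes) != len(allele_dosages):
        raise ValueError("one dosage is needed for every effect size")
    return sum(beta * dose for beta, dose in zip(effect_sizes, allele_dosages))

# Made-up example with three variants.
score = polygenic_risk_score([0.1, -0.2, 0.05], [2, 0, 1])
```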

The role

This benchmarking project aims to systematically evaluate and compare the performance of existing PRS methods using harmonized pipelines across multiple biobanks and datasets.