Statistics
Applications for 2025-2026 open on 1 July 2025
Beyond Prediction: Data Science summer projects
Project code: SCI057
Supervisor(s):
Assoc Prof Lara Greaves (Ngāpuhi)
Prof Mark Gahegan (Pākehā)
Eric Marshall (Ngāpuhi)
Tori Diamond (Ngāpuhi)
Assoc Prof Phil Wilcox (Ngāti Kahungunu, Rongomaiwahine, Ngāti Rakaipaaka)
Otago University, Assoc Prof Andrew Sporle (Ngāti Apa, Rangitāne, Te Rarawa)
Discipline(s):
Ngā Motu Whakahī
School of Computer Science
Statistics
Project
Up to five students
This scholarship will be hosted by Ngā Motu Whakahī and is specifically targeted toward students who whakapapa Māori.
We are offering up to five summer projects for tauira Māori, through Beyond Prediction, an MBIE-funded data science platform.
The role
Students will be matched with a supervisor and topic based on their skills and interests. Past students have explored population data, epidemiology and health data, statistics and maths education, Māori data sovereignty, AI, data visualisation, and applying tikanga and mātauranga to data science.
Ideal student
We are particularly interested in students with coding skills. Please get in touch with Lara to discuss topics/supervision.
The cascading impact of poor-quality Census data on official health and social statistics for Pacific populations in Aotearoa
Project code: SCI058
Supervisor(s):
Assoc Prof Lara Greaves (Ngāpuhi)
Otago University, Assoc Prof Andrew Sporle (Ngāti Apa, Rangitāne, Te Rarawa)
Discipline(s):
Ngā Motu Whakahī
Statistics
Project
This is a Ngā Motu Whakahī scholarship specifically targeted toward students who are of Pacific heritage.
The role
This project explores the impact of poor Census coverage of the various Pacific populations on the accuracy of Pacific health and social statistics. A combination of Census data, health data and research publications will be used to explore the possible extent of the uncertainty in key official population measures at the national and regional level.
Ideal student
The student should have some familiarity with official statistics and some understanding of Pacific population statistics.
Connecting Community and Retention in STEM through the Tuākana Program
Project code: SCI063
Supervisor(s):
Discipline(s):
Ngā Motu Whakahī
Physics
Statistics
Project
This is a Ngā Motu Whakahī scholarship specifically targeted toward students who whakapapa Māori or are of Pacific heritage.
The role
We will develop a survey for participants in the Tuākana program in Physics and Stats (possibly also Maths) at the University of Auckland Waipapa Taumata Rau. The survey, consisting of mostly open-ended items, will help to understand previous research that suggested a connection between retention in physics and participation in Tuākana. The survey will invite descriptions of whakawhanaungatanga as it relates to persisting through a STEM degree. The survey will be sent to Tuākana participants after this summer project.
How much homework is too much?
Project code: SCI126
Supervisor(s):
Tanya Evans
Discipline(s):
Statistics
Project
The TIMSS study collects information about school math and science achievement around the world, alongside a wealth of information about students, schools, and classroom practices. A recent analysis of the Irish TIMSS data suggests short daily homework is best for mathematics achievement.
The role
We will repeat this analysis for the NZ data.
Requirements
You will need coding and statistical skills. An interest in Math Education or Bayesian Statistics would be helpful.
Nested Sampling vs. Hamiltonian MCMC
Project code: SCI127
Supervisor(s):
Discipline(s):
Statistics
Project
Nested Sampling and Hamiltonian Markov Chain Monte Carlo (MCMC) are considered among the most powerful techniques for Bayesian computation. Despite approaching things from very different angles, both are able to handle challenging shapes in the posterior distribution that other methods cannot. However, it is not clear whether the set of solvable problems is the same size for both methods.
The role
In this project, you will construct challenging high-dimensional probability distributions and test both methods on them. The goal is to clarify what kinds of problems each method works better on, and to find and bring to light problems where only one of the two approaches works. This could lead to useful practical advice for practitioners who need to decide what methods to use.
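As one illustration of what a "challenging" test distribution can look like, the sketch below evaluates the log density of Neal's funnel, a standard benchmark whose narrow neck is known to trouble many samplers. It is not taken from the project itself; the function names and the default scale of 3 are assumptions chosen for the example.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a univariate Normal(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def funnel_logpdf(v, xs, scale_v=3.0):
    """Log density of Neal's funnel (illustrative benchmark, not the project's).

    v ~ Normal(0, scale_v^2) and, given v, each x_i ~ Normal(0, exp(v)),
    so the conditional standard deviation of each x_i is exp(v / 2).
    """
    sd_x = math.exp(0.5 * v)
    log_p = normal_logpdf(v, 0.0, scale_v)
    for x in xs:
        log_p += normal_logpdf(x, 0.0, sd_x)
    return log_p
```

The dimension is set by the length of `xs`, so the same function can be used to see how a sampler copes as the problem grows.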
Requirements
This project requires good grades in STATS 331 and very good programming skills.
Best practice regression modelling with iNZight
Project code: SCI128
Supervisor(s):
Discipline(s):
Statistics
Project
Regression modelling involves a set of basic assumptions that require checking. Unchecked or unsatisfied assumptions should be handled, or at least commented on in text and graphical outputs.
The role
Working with the iNZight development team, you will develop R software for fitting regression models with interactive diagnostics and assumption checking.
Requirements
STATS 380, STATS 330, and very good R skills.
While the project will not involve developing the GUI directly, some familiarity with either version of iNZight is expected (STATS 10x).
Bayesian inference with iNZight
Project code: SCI129
Supervisor(s):
Discipline(s):
Statistics
Project
iNZight currently provides Normal theory and bootstrap methods for inference and hypothesis testing.
The role
Working with the iNZight development team, you will:
- Perform a short literature/course review of Bayesian approaches to common inference methods
- Design a simple framework for guiding users through a Bayesian inference problem
- Complete an R implementation of at least one method, and ideally two, to demonstrate the framework.
Requirements
STATS 380, STATS 331, and very good R skills. Familiarity with iNZight (i.e., STATS 10x).
Developing smarter ways of marking short answer questions in large enrolment statistics and data science courses
Project code: SCI130
Supervisor(s):
Discipline(s):
Statistics
Project
This project involves exploring text response data from large introductory statistics and data science courses to understand which features of the answers are related to higher and lower levels of statistical, computational, and creative thinking, and how.
The role
Understanding these features will enable the development of computational approaches and marking applications/tools that can help human graders to organise similar answers and inform teaching practice.
Requirements
STATS 220 and STATS 380, and a strong interest in data science and statistics education.
Adapting the saddlepoint method for discrete random variables
Project code: SCI131
Supervisor(s):
Discipline(s):
Statistics
Project
The saddlepoint approximation is a systematic method for approximating an unknown density function in terms of a known moment generating function. It is useful when each individual in a large population contributes to a single random variable, and has often been used in statistical ecology.
The saddlepoint approximation works best for densities, when the underlying random variable is continuous. For discrete random variables, the traditional saddlepoint approximation works less well, and always fails at the boundary.
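The boundary failure can be seen in a small worked case. The sketch below applies the standard saddlepoint density formula, exp(K(s) - s x) / sqrt(2 pi K''(s)) with K the cumulant generating function and s solving K'(s) = x, to a Poisson variable and compares it with the exact probability. The Poisson example and the function names are assumptions made for illustration and are not part of the project's code framework.

```python
import math


def poisson_pmf(x, lam):
    """Exact Poisson(lam) probability of the count x."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def saddlepoint_poisson(x, lam):
    """Traditional saddlepoint approximation to the Poisson(lam) mass at x.

    The cumulant generating function is K(s) = lam * (exp(s) - 1), so the
    saddlepoint equation K'(s) = x gives s = log(x / lam) and K''(s) = x.
    """
    if x <= 0:
        # At the boundary x = 0 the saddlepoint equation has no finite solution.
        raise ValueError("saddlepoint equation has no finite solution at x = 0")
    s_hat = math.log(x / lam)
    k = lam * (math.exp(s_hat) - 1.0)   # K(s_hat)
    k2 = lam * math.exp(s_hat)          # K''(s_hat), equal to x
    return math.exp(k - s_hat * x) / math.sqrt(2.0 * math.pi * k2)
```

Away from the boundary the approximation is close (it reduces to Stirling's formula for x!), while at x = 0 it cannot be evaluated at all.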
The role
This project will develop tools for automating saddlepoint techniques in terms of probability generating functions, extending a code framework currently based around moment generating functions.
Ideal student
Experience with R programming and simulation would be a plus. The mathematical aspects of the saddlepoint approximation are not prerequisites, but mathematical applications could be explored as part of the project depending on the student. This project will also involve some of the optimizations needed for high-accuracy scientific programming and numerical calculation.
Estimating the Local Burden of Disease of Motor Neuron Disease
Project code: SCI132
Supervisor(s):
Discipline(s):
Statistics
Project
We will use the IDI to examine the national burden of Motor Neuron Disease, estimating:
(a) Fundamental epidemiological estimates of prevalence, incidence and mortality, compared with 2018 estimates
(b) Advanced epidemiological measures of Disability-adjusted life years (DALYs), Quality-adjusted life years (QALYs), Years lived with disability (YLDs) and Years of life lost (YLLs)
(c) A risk profile for those with Motor Neuron Disease
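For orientation, the burden measures listed above are linked by a simple identity. The block below states the usual prevalence-based definitions as used in burden-of-disease work; the symbols N, L, P and DW are introduced here for illustration, and the project may use a different variant.

```latex
\mathrm{DALY} = \mathrm{YLL} + \mathrm{YLD}, \qquad
\mathrm{YLL} = N \times L, \qquad
\mathrm{YLD} = P \times DW
% N  : number of deaths
% L  : standard remaining life expectancy at the age of death (years)
% P  : number of prevalent cases
% DW : disability weight, between 0 (full health) and 1 (equivalent to death)
```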
Requirements
Applied statistics using R (STATS20X) and MySQL. Some understanding of epidemiology (POPLHLTH 708) would be useful.
Estimating the Local Burden of Disease of Multiple Sclerosis
Project code: SCI133
Supervisor(s):
Discipline(s):
Statistics
Project
We will use the IDI to examine the national burden of Multiple Sclerosis, estimating:
(a) Fundamental epidemiological estimates of prevalence, incidence and mortality, compared with 2018 estimates
(b) Advanced epidemiological measures of Disability-adjusted life years (DALYs), Quality-adjusted life years (QALYs), Years lived with disability (YLDs) and Years of life lost (YLLs)
(c) A risk profile for those with Multiple Sclerosis
Requirements
Applied statistics using R (STATS20X) and MySQL. Some understanding of epidemiology (POPLHLTH 708) would be useful.
Modelling intra-ethnic heterogeneity and multiple ethnicities in health and social outcomes
Project code: SCI134
Supervisor(s):
Discipline(s):
Statistics
Project
New Zealand’s population is increasingly ethnically diverse, with multiple ethnic identities becoming more common in younger populations.
The role
This project explores ways to include more heterogeneous and overlapping ethnic identities in statistical modelling and analysis of ethnic-specific health and social outcomes. This involves documenting the limitations of existing approaches and investigating methods for including ethnic intersectionality in analysis and modelling.
Requirements
Some familiarity with official statistics, some understanding of ethnic population statistics and some statistical modelling experience are required.
Population projection methods for dynamic population structures
Project code: SCI135
Supervisor(s):
Discipline(s):
Statistics
Project
Population projections are a routine part of official population statistics in New Zealand, with annual national projections produced after every Census. However, where the contributors to population change are highly variable, projections are produced for 5-year intervals or even not at all.
The role
This project explores options for creating projections for dynamic sub-populations in NZ that can include explicit levels of uncertainty in the factors affecting change.
Requirements
Some familiarity with official statistics and some statistical modelling experience are required.
Statistical modelling for veterinary science studies
Project code: SCI136
Supervisor(s):
Hamish Baron (Unusual Pet Vets, Australia)
Discipline(s):
Statistics
Project
This project involves statistical modelling for a veterinary science research study, potentially on feather-destructive behaviour directed at nestlings by orange-bellied parrot parents. The exact study is not known at the time of writing this description, but further details will be available leading up to the application deadline. Please get in touch if you are interested.
Skills
Regardless of which study is selected, this is an opportunity for a student to apply the statistical skills they've acquired during their degree to a real-world research problem. This project is likely to contribute to a research publication.
Requirements
- Excellent grades in data analysis courses, such as STATS 201/208 and STATS 330.
- Excellent R programming skills.
Desirable, but not required
- Excellent grades in statistical theory courses, such as STATS 210 or 310.
- Previous experience with methods such as (generalised) linear models, (generalised) linear mixed-effects models, and survival analysis.
Benchmarking Polygenic Risk Scores Across Diverse Cohorts and Disease Traits
Project code: SCI137
Supervisor(s):
Discipline(s):
Statistics
Project
Polygenic risk scores (PRS) are increasingly used to quantify genetic susceptibility to complex diseases by aggregating the effects of multiple genetic variants. Despite rapid advances in PRS methodology, there remains a significant gap in understanding how different PRS construction and validation strategies perform across diverse populations, disease domains, and data types.
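In its simplest additive form, a PRS is a weighted sum of a person's allele counts, with one effect-size weight per variant. The sketch below shows only that basic aggregation; the function name and the idea of passing plain lists are assumptions for illustration, and real PRS methods add steps such as variant selection and shrinkage of the weights.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Additive polygenic score for one individual (illustrative sketch).

    dosages      : allele counts per variant (0, 1 or 2, or imputed values)
    effect_sizes : per-variant weights, e.g. log odds ratios from a GWAS
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    # Sum of weight * allele count over all variants.
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))
```

Benchmarking then amounts to comparing how scores built from differently derived weights predict the trait in each cohort.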
The role
This benchmarking project aims to systematically evaluate and compare the performance of existing PRS methods using harmonized pipelines across multiple biobanks and datasets.