# Statistics

Applications for 2023-2024 are now closed.

## Theoretical foundations of machine learning

Supervisors

Discipline

School of Computer Science

Mathematics

Statistics

Project code: SCI078

### Project

Machine learning and, more broadly, artificial intelligence will continue to change everyone’s life profoundly. Mathematics, statistics, and computer science play an important role in advancing machine learning algorithms (e.g., making them more reliable), and theoretical research into machine learning can advance our understanding of why certain methods succeed or fail.

In this project, you will study some theoretical foundations of machine learning algorithms and how techniques from probability theory, geometry, and graph theory can be leveraged to aid the design of machine learning algorithms.

The exact direction of the project will be decided at its start and will depend on the interests and experience of the summer student.

## Extending JAGS to include the Zeta distribution

Supervisor

Discipline

Statistics

Project code: SCI175

### Project

The project supervisor is interested in extending JAGS so that it can sample from the zeta distribution. In theory this is straightforward; in practice, it requires a stable implementation of the Riemann zeta function.
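To make the dependence on the Riemann zeta function concrete, here is a minimal Python sketch (not JAGS code, and not the supervisor's implementation). It assumes the usual definition of the zeta distribution, P(X = k) = k^(-s) / ζ(s) for k = 1, 2, ... and s > 1. The function names `riemann_zeta`, `zeta_log_pmf`, and `sample_zeta` are hypothetical. The zeta approximation is a plain partial sum with a short Euler-Maclaurin tail correction, which is adequate for moderate s but is not the stable implementation the project calls for. The sampler uses the rejection scheme commonly attributed to Devroye, which happens not to need ζ(s) at all; the density evaluation is where ζ(s) is unavoidable.

```python
import math
import random


def riemann_zeta(s, n_terms=50):
    """Approximate zeta(s) for real s > 1 (illustrative, not production quality)."""
    if s <= 1.0:
        raise ValueError("this sketch only handles s > 1")
    n = n_terms
    # Head: direct sum of the first n - 1 terms.
    head = sum(k ** -s for k in range(1, n))
    # Tail sum_{k >= n} k^-s via Euler-Maclaurin:
    # integral + half the first term + first derivative correction.
    tail = n ** (1.0 - s) / (s - 1.0) + 0.5 * n ** -s + s * n ** (-s - 1.0) / 12.0
    return head + tail


def zeta_log_pmf(k, s):
    """log P(X = k) = -s * log(k) - log(zeta(s)) for k = 1, 2, ..."""
    if k < 1:
        return float("-inf")
    return -s * math.log(k) - math.log(riemann_zeta(s))


def sample_zeta(s, rng=random):
    """Draw one zeta(s) variate by rejection sampling (no zeta(s) needed)."""
    if s <= 1.0:
        raise ValueError("this sketch only handles s > 1")
    b = 2.0 ** (s - 1.0)
    while True:
        u = 1.0 - rng.random()  # lies in (0, 1], so the power below is finite
        v = rng.random()
        x = math.floor(u ** (-1.0 / (s - 1.0)))  # proposal, always >= 1
        t = (1.0 + 1.0 / x) ** (s - 1.0)
        if v * x * (t - 1.0) / (b - 1.0) <= t / b:
            return int(x)


if __name__ == "__main__":
    print(riemann_zeta(2.0), math.pi ** 2 / 6)  # the two values should agree closely
    print([sample_zeta(2.0) for _ in range(10)])
```

For s very close to 1 both the approximation and the proposal step become numerically delicate, which is one reason a carefully tested zeta implementation matters for a JAGS extension.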

This project requires:

• A good knowledge of C++ and compilation of C++
• Some Bayesian statistics – STATS 331 would be a good place to start
• R Programming
• Some understanding of random number generation.

This is a project for someone who really likes to program and who is, ideally, at ease at the command line on more than one operating system (e.g. Windows and Linux or OS X).

## Classifying the reproductive state of pest snails

Supervisor

Discipline

Statistics

Project code: SCI176

### Project

Contamination of crops, such as barley and wheat, by invasive snail species is a constant threat to the Australian cereal crops export market. It results in the devaluation of crops at receival and/or blocking of market access. Control of this crop pest is, therefore, of critical economic importance.

Snail population control requires a combination of cultural, biological, and chemical (baiting) methods. Optimal efficacy of baits is achieved when their application coincides with the set of environmental conditions which trigger reproductive activity in adult snails, usually occurring around mid- to late-autumn in southern Australia.

The length of an adult snail’s albumen gland, which enlarges in preparation for reproduction, is considered an indicator of its reproductive state (i.e. active or inactive). Our data show that albumen gland lengths vary considerably among the individuals collected in the monthly samples for our study, indicating that the population does not become reproductively active (and return to being inactive) in unison, but in waves across time. This is an important finding since, in order to identify the environmental variables predictive of reproductive state, we must first be able to label individuals as being in a reproductive or non-reproductive state.

In this project the student will:

• Apply unsupervised classification methods to identify reproductively active and inactive snails across time
• Explore the availability and use of R packages for performing unsupervised classification
• Time permitting, explore the use of resampling methods to estimate clustering validity
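As one illustration of what unsupervised classification on a single measurement could look like, the sketch below fits a two-component Gaussian mixture to gland lengths by EM and labels each snail by its posterior probability of belonging to the larger-gland component. This is an assumption about a possible approach, not the method the project will use (which would be carried out in R). The function name `fit_two_group_mixture` and all numbers in the usage example are hypothetical, and the data are simulated.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_group_mixture(lengths, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (means, sds, weights, prob_active); prob_active[i] is the posterior
    probability that observation i belongs to the larger-mean component.
    """
    xs = sorted(lengths)
    n = len(xs)
    half = n // 2
    # Crude start: means of the lower and upper halves, common spread.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand = sum(xs) / n
    sd0 = math.sqrt(sum((x - grand) ** 2 for x in xs) / n)
    sds = [sd0, sd0]
    weights = [0.5, 0.5]
    prob_active = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability of the second component.
        prob_active = []
        for x in lengths:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            prob_active.append(p1 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: weighted means, standard deviations, and mixing weights.
        n1 = max(sum(prob_active), 1e-9)
        n0 = max(n - sum(prob_active), 1e-9)
        pairs = list(zip(prob_active, lengths))
        means = [sum((1 - r) * x for r, x in pairs) / n0,
                 sum(r * x for r, x in pairs) / n1]
        sds = [max(math.sqrt(sum((1 - r) * (x - means[0]) ** 2 for r, x in pairs) / n0), 1e-6),
               max(math.sqrt(sum(r * (x - means[1]) ** 2 for r, x in pairs) / n1), 1e-6)]
        weights = [n0 / n, n1 / n]
    return means, sds, weights, prob_active


if __name__ == "__main__":
    rng = random.Random(1)
    # Simulated gland lengths (arbitrary units): 150 inactive, 100 active.
    data = [rng.gauss(4.0, 1.0) for _ in range(150)] + [rng.gauss(10.0, 1.5) for _ in range(100)]
    means, sds, weights, prob_active = fit_two_group_mixture(data)
    print(means, weights)
```

Refitting the mixture on bootstrap resamples and checking how often each snail keeps its label is one simple way to approach the resampling-based validity question in the last bullet.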

Preferred skills: A student who is a confident R user and who has achieved at least a B+ grade in both STATS 330 and STATS 369. Successful completion of either STATS 240 or STATS 340 is preferred but not essential.

## Bioacoustics for animal density estimation

Supervisor

Discipline

Statistics

Project code: SCI177

### Project

Some animals are difficult to see or catch but are very easy to hear. Wildlife populations of such species are now routinely monitored using passive acoustics: a researcher deploys devices such as microphones or hydrophones into a survey area and collects recordings of animal vocalisations.

The ultimate goal for some of these studies is to estimate animal density from the acoustic data. However, converting sound files to animal density estimates can be a long and complicated procedure involving study design, signal processing techniques, machine learning methods, expert input, and statistical modelling.

The broad aim of this project is to help make one or more of these steps easier for researchers conducting passive acoustic surveys. The specific goals can be tailored to the skills and interests of the student.

Experience with the following topics would be useful, but students without such experience should not feel dissuaded from applying:

• Creating Shiny apps
• R programming
• Science communication
• Signal processing
• Machine learning (e.g., neural networks)

Please feel free to get in touch if you would like to discuss any details with me in advance.

## Modelling morphometric data of animal populations collected by drones

Supervisor

Discipline

Statistics

Project code: SCI178

### Project

Measuring the size of animals by hand can be difficult because they might be big and scary (e.g., grizzly bears), react negatively to handling (e.g., manta rays), be difficult to access (e.g., mountain goats), or be too large for your tape measure (e.g., blue whales). We can avoid these problems by flying drones equipped with cameras over animals, but this method introduces the issue of measurement error: when we measure the size of an animal using a drone we don't get the answer quite right.

My collaborators and I have developed a method to estimate population-level distributions of morphometric measurements in a way that accommodates drone measurement error. This project involves extending our existing model and developing a user-friendly R implementation.
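To see why measurement error matters for a population-level distribution, consider a deliberately simple sketch. It is not the supervisors' model; it assumes additive, independent, zero-mean error with a known standard deviation, in which case Var(observed) = Var(true) + Var(error), so the naive spread of drone measurements overstates the true spread. The function name `corrected_sd` and all simulated values are hypothetical.

```python
import math
import random
import statistics


def corrected_sd(observed, error_sd):
    """Method-of-moments estimate of the true-size standard deviation.

    Assumes observed = true + error, with error independent of true size,
    zero-mean, and of known standard deviation error_sd.
    """
    var_true = statistics.pvariance(observed) - error_sd ** 2
    return math.sqrt(max(var_true, 0.0))  # floor at zero if the error dominates


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated true body lengths (metres) and noisy drone measurements.
    true_lengths = [rng.gauss(2.0, 0.3) for _ in range(5000)]
    observed = [x + rng.gauss(0.0, 0.2) for x in true_lengths]
    print("naive sd:    ", statistics.pstdev(observed))
    print("corrected sd:", corrected_sd(observed, 0.2))
```

A full model would estimate the error distribution rather than assume it is known, and would recover the whole distribution of true sizes rather than a single spread parameter.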

Requirements:

• Excellent programming skills
• A solid understanding of statistical theory, such as the content in STATS 310
• Previous experience with R
• Either previous experience with C++, or a willingness to learn.

## Validating the Use of THz Spectroscopy to Measure the Water Content in Ryegrass

Supervisor

Discipline

Statistics

Project code: SCI179

### Project

Ryegrass is the predominant outdoor feed for grazing animals in Aotearoa New Zealand. Being able to estimate the water content of ryegrass can help farmers assess the state of their grazing fields, and help plant breeders develop drought-resistant ryegrass cultivars. Ryegrass is almost completely transparent in the Terahertz (THz) domain apart from the water within it. This, coupled with its non-destructive nature, makes THz radiation a much more attractive option than the traditional “cutting and weighing” method of estimating water content.

The overall aim of this project is to demonstrate the feasibility of measuring water content in ryegrass using THz spectroscopy.

The project is experimental; you will learn how to design an experiment to assess the feasibility, accuracy and reliability of using THz spectroscopy to estimate water content, and how to analyse the resulting data (in Python and/or R). In addition, you will get real-world experience of data collection in experimental research, including how to operate a state-of-the-art THz spectrometer.

An interest in experimental research and applied statistics, including experimental design, data collection and data analysis, is essential. However, no previous knowledge of THz technology is required. Experience in Python and/or R would be advantageous.

## What do you do when the data is unreliable?

Supervisor

Chaitanya Joshi

Discipline

Statistics

Project code: SCI180

### Project

Today we live in the age of data and critical decisions are often made based on the insights generated from modelling the data. However, uncertainty in the data can pose challenges in several important applications. For example, crimes such as family violence are notoriously under-reported, data on past extreme/rare events may not be available because they haven’t happened recently (but could happen tomorrow), many species may not be observed accurately because of the nature of the habitat, an adversary could corrupt your data in a cyber-attack, etc.

My collaborator and I have developed a novel method to quantify the uncertainty in Bayesian inference that is due to unreliable data. In this project you will work on this cutting-edge method to develop solutions for a real-life application. This is a mathematical and computational project.

What you need:

• A sound understanding of, and interest in, statistical inference, particularly Bayesian inference
• An interest in basic calculus and mathematical manipulations
• Good coding skills, ideally in R (but other languages are fine too)

## Better questions, better answers: Exploring responses to short answer questions as text data to improve question design, student learning, and grading in large enrolment courses

Supervisors

Anna Fergusson
Emma Lehrke
Liza Bolton

Discipline

Statistics

Project code: SCI181

### Project

Written answers can be a valuable assessment format for learners to demonstrate their higher order reasoning and critical thinking skills. In large enrolment courses, ensuring efficient and consistent marking of these kinds of questions can be challenging.

In this project, the selected student will explore text response data from a large introductory statistics course to understand which features of the answers are related to higher and lower levels of thinking (and so grades), and how. Understanding these features will enable the development of grading support tools that can help human graders to organise similar questions for batch grading. This research will be supported in the context of a wider statistics education project about designing assessment and grading tools for large enrolment statistics courses.

This project will be an excellent opportunity to learn about wrangling and analysing text data, with the possibility to extend to supervised machine learning approaches. As this research project will involve analysing data, we are seeking a student with experience/skills with coding and statistical methods, preferably in R. Coursework experience equivalent to at least STATS 220 would be a good indicator. An interest in data science and/or statistics education would be advantageous.

## Improving the accuracy of the saddlepoint approximation for count data

Supervisor

Discipline

Statistics

Project code: SCI183

### Project

The saddlepoint approximation is a systematic method for approximating an unknown density function in terms of a known moment generating function. It is useful when each individual in a large population contributes to a single random variable, and has often been used in statistical ecology.

The saddlepoint approximation works best for densities, that is, when the underlying random variable is continuous. For discrete random variables, the traditional saddlepoint approximation works less well, and always fails at the boundary.
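A small worked example may help show both the approximation and its boundary failure. The sketch below assumes the standard first-order saddlepoint formula, f̂(x) = exp(K(ŝ) − ŝx) / sqrt(2π K''(ŝ)) with K'(ŝ) = x, and applies it to a Poisson(λ) count, for which the cumulant generating function is K(s) = λ(e^s − 1). The Poisson example and the function names are illustrative choices, not taken from the project. Here ŝ = log(x/λ), so no finite saddlepoint exists at the boundary x = 0.

```python
import math


def poisson_pmf(x, lam):
    """Exact Poisson(lam) probability at the non-negative integer x."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def saddlepoint_poisson(x, lam):
    """First-order saddlepoint approximation to the Poisson(lam) pmf at x."""
    if x <= 0:
        # K'(s) = lam * exp(s) = 0 has no finite solution.
        raise ValueError("no finite saddlepoint at the boundary x = 0")
    s_hat = math.log(x / lam)              # solves K'(s) = x
    k = lam * (math.exp(s_hat) - 1.0)      # K(s_hat)
    k2 = lam * math.exp(s_hat)             # K''(s_hat), equal to x here
    return math.exp(k - s_hat * x) / math.sqrt(2.0 * math.pi * k2)


if __name__ == "__main__":
    lam = 3.0
    for x in range(1, 8):
        exact = poisson_pmf(x, lam)
        approx = saddlepoint_poisson(x, lam)
        print(x, exact, approx, approx / exact)
```

In this case the approximation amounts to replacing x! by Stirling's formula, so the relative error is largest for small counts and shrinks as x grows.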

This project will implement new alternative saddlepoint approximations for some simple models and assess how these proposed alternatives compare to existing methods.

Experience with R programming and simulation would be a plus. The mathematical aspects of the saddlepoint approximation are not prerequisites, but mathematical applications could be explored as part of the project depending on the student. Saddlepoint approximations are related to certain contour integrals, so for a student with an interest in complex variables this project could look at complex variable methods and techniques.

## Investigating loss-based Bayesian Adaptive Randomisation

Supervisor

Discipline

Statistics

Project code: SCI184

### Project

The purpose of this project is to investigate the loss-based response-adaptive randomisation approach of Cheng and Shen (2005) by coding the approaches for continuous and binary outcomes, and identifying their operating characteristics via simulation.
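For readers new to response-adaptive randomisation, the following sketch shows what "identifying operating characteristics via simulation" can look like for a binary outcome. It does not implement the loss-based design of Cheng and Shen (2005). Instead it assumes a simpler, generic rule: each patient is assigned to arm B with probability equal to the current posterior probability that B has the higher response rate, under independent Beta(1, 1) priors. All function names and numerical settings are hypothetical.

```python
import random


def prob_b_better(succ, fail, rng, n_draws=100):
    """Monte Carlo estimate of P(p_B > p_A | data) under Beta(1, 1) priors."""
    wins = 0
    for _ in range(n_draws):
        pa = rng.betavariate(1 + succ[0], 1 + fail[0])
        pb = rng.betavariate(1 + succ[1], 1 + fail[1])
        wins += pb > pa
    return wins / n_draws


def simulate_trial(true_rates, n_patients, rng):
    """Run one two-arm trial; return the number of patients on each arm."""
    succ, fail = [0, 0], [0, 0]
    for _ in range(n_patients):
        # Allocate to arm B (index 1) with the current posterior probability.
        arm = 1 if rng.random() < prob_b_better(succ, fail, rng) else 0
        if rng.random() < true_rates[arm]:
            succ[arm] += 1
        else:
            fail[arm] += 1
    return [succ[0] + fail[0], succ[1] + fail[1]]


def mean_allocation_to_b(true_rates, n_patients=100, n_trials=20, seed=0):
    """One operating characteristic: average share of patients assigned to arm B."""
    rng = random.Random(seed)
    total = sum(simulate_trial(true_rates, n_patients, rng)[1] for _ in range(n_trials))
    return total / (n_trials * n_patients)


if __name__ == "__main__":
    print(mean_allocation_to_b((0.3, 0.6)))
```

Other operating characteristics, such as power, type I error, or the expected number of treatment failures, can be estimated by recording further summaries inside the same simulation loop.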

Good coding skills and knowledge of R are a must. Knowledge of response-adaptive randomisation is a plus.

Reference: Cheng, Y., & Shen, Y. (2005). Bayesian adaptive designs for clinical trials. Biometrika, 92(3), 633–646. https://doi.org/10.1093/biomet/92.3.633

## Survival analysis of child maltreatment recurrent events

Supervisor

Discipline

Statistics

Project code: SCI185

### Project

This study compares the risk of recurrence of child maltreatment (CM) events among children exposed versus not exposed to parents with heavy alcohol use (alcohol-attributable hospitalisation or service use for mental health/addiction).
To this end, a cohort of all live births in New Zealand in 2000, together with their parents, was followed from age 0 to 17 years. The data were obtained from the Statistics NZ Integrated Data Infrastructure (IDI).

As times between events (gap times) of one subject are neither independent nor identically distributed, in Meyer & Romeo (2015) we proposed a methodology based on copulas to analyse recurrent events data.

In this project we aim to extend this methodology by using elliptical copulas with serial correlation and/or pair-copulas to increase the flexibility of modelling the association structure of the CM recurrent events.

Recently, a univariate survival analysis was performed in which only the time to the first CM event was considered; see Huckle & Romeo (2023).

Skills required/prerequisites: Survival analysis, Multivariate analysis.

References:

Huckle, T. and Romeo, J.S. (2023). Estimating child maltreatment cases that could be alcohol-attributable in New Zealand. Addiction, 118, 669–677.

Meyer, R. and Romeo, J.S. (2015). Bayesian semiparametric analysis of recurrent failure time data using copulas. Biometrical Journal, 57, 982–1001.

## A census of recent and current trials using response-adaptive randomisation in Australia and New Zealand

Supervisor

Discipline

Statistics

Project code: SCI186

### Project

The purpose of this project is to provide a picture of the use of adaptive randomisation in current and recent randomised trials.

The source of the data will be the ANZCTR trial registry; the deliverable will be a synopsis of the trial characteristics (phase, number of arms, pharmaceutical vs non, medical field, target population, etc.) as well as the randomisation approach used (non-response-adaptive, Bayesian response-adaptive, loss-based, etc.) of trials registered within the last 3 years on ANZCTR.

Good synthesis skills and critical thinking are necessary.

## Analysing pressure profiles from wearable technologies

Supervisors

Stephanie Budgett
Jenny Kruger

Discipline

Statistics

Project code: SCI187

### Project

This project involves working with real data from a pressure sensing device, examining trends and changes in the pressure profile over time due to a prescribed exercise programme or specific activities. The pressure sensing device has eight independent sensors, each generating a unique pressure trace to create a profile. These are large datasets which have had some pre-processing but further work is needed to produce useful and informative graphics and statistical analysis.

The student will require strong R/Python coding and data wrangling skills.