# Research Paper Topics Related To Statistics

Perhaps you like the paper-writing phase of research; maybe you dread it. The difference usually hinges on whether you regard yourself as a "good writer"--as determined by grades earned on countless other writing assignments. My experience with student research papers suggests that reporting the results of quantitative research is very different from other types of writing. Students who do well in creative writing may find this form of exposition more challenging; others rarely applauded for clever turns of phrase may receive compliments on their clarity of expression. Writing a research report can be a challenge for students who excel at writing essays and an opportunity to shine for those who do not ordinarily "write well." You can improve your writing performance by paying close attention to these suggestions for reporting your research.

The watchword for this type of writing is *structure*. The format of your paper should reveal the structure of your thinking. Devices such as paragraphing, headings, indentation, and enumeration help your reader see the major points you want to make. If you tend to string sentences together without organizing your thoughts into paragraphs, you are not helping your reader make sense of your writing. As a rule of thumb, if you type a full page (double spaced) without indenting for a new paragraph, you almost certainly have run one thought into another and have missed an opportunity to differentiate your ideas.

Headings can convey the major topics discussed in your paper. A research report (see the **Lacy** article on analysis of variance) typically contains four basic components:

1. Statement of the *problem* that gave rise to the research
2. Discussion of how the research was *designed* to clarify the problem
3. *Analysis* of the data produced by the research
4. *Summary* and conclusion of the study

Although you could include those sections in your report without separate headings, the underlying logic of your paper will be readily apparent with headings that identify its basic components: (1) the problem, (2) research design, (3) data analysis, (4) summary and conclusion.

For MSc and BSc(Hons) research topics that are not listed, contact Dr Brendon Brewer (bj.brewer@auckland.ac.nz). For other staff members, refer to the staff directory page.

**Is LIBS a reliable technology for the forensic analysis of glass?**

Laser Induced Breakdown Spectroscopy (LIBS) is an analytical chemistry technique used to identify and quantify the elements in a substance of interest. It has a number of advantages over competing technologies, e.g. it requires minimal sample preparation, offers high throughput and has a low cost. However, concerns have been raised regarding the repeatability (i.e. consistency of measurements made under identical experimental conditions) and reproducibility (i.e. consistency of measurements under different conditions, such as measurements made on different days) of measurements made using LIBS.

This project is concerned with the study of a unique dataset from an experiment designed to assess repeatability and reproducibility of LIBS measurements made on different types of glass and different subclasses within each type. The results of this work will help to inform the forensic community on the conditions under which LIBS analysis of glass may be considered reliable.
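As a rough illustration of the two notions defined above, the sketch below computes a pooled within-day variance (a repeatability indicator) and the variance of the day means (a crude day-to-day indicator). The readings, the function name `within_and_between`, and the equal number of replicates per day are all assumptions made for this example; a real analysis would more likely use a variance-components or mixed model.

```python
from statistics import mean, variance

def within_and_between(measurements_by_day):
    """Return (pooled within-day variance, variance of the day means).

    Assumes every day has the same number of replicate measurements,
    so a plain average of the per-day variances is the pooled variance.
    """
    day_means = [mean(day) for day in measurements_by_day]
    within = mean([variance(day) for day in measurements_by_day])
    between = variance(day_means)
    return within, between

# Hypothetical readings: three replicates on each of three days.
readings = [
    [10.0, 10.2, 9.8],
    [11.0, 11.2, 10.8],
    [9.0, 9.2, 8.8],
]
within, between = within_and_between(readings)
```

Here the replicates agree closely within each day while the day means differ by much more, which is the pattern that would raise a reproducibility concern.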

**Prerequisites:** The project will require a student who is a confident R user, and who has passed STATS 330, and preferably STATS 340.

**Supervisor**

Kathy Ruggiero

k.ruggiero@auckland.ac.nz

X---X

**De-batching data from a complex experiment**

‘Omic’ technologies are used to detect and, usually, quantify the entire complement of molecules of a particular species in a specific biological sample, e.g. genes (genomics), mRNA (transcriptomics), proteins (proteomics) and metabolites (metabolomics). Depending on the technology and molecular species under investigation, omic experiments may yield hundreds, thousands or even tens of thousands of variables. This holistic approach to exploring the molecules within a cell is motivated by the view that complex systems may be better understood if considered as a whole. Consequently, the datasets generated by such technologies are analysed using multivariate statistical methods. When the experimental design used to collect the data introduces non-negligible nuisance sources of systematic variation, the multivariate analysis generally detects patterns due to the nuisance variables rather than the experimental treatments of interest.

This project will explore the effectiveness of projection matrices, in the case of balanced designs, and estimated random effects from mixed models, in the case of unbalanced designs, in de-batching (i.e. removing unwanted systematic effects) omics datasets. We will first apply these approaches to simulated datasets, and then to actual experimental data collected to study the effects of ocean acidification on a New Zealand species of kina. If time permits, the effectiveness of these approaches will be compared to alternative de-batching approaches which are currently available.
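For a single variable and a single batch factor, projecting the data onto the orthogonal complement of the batch indicators amounts to subtracting each batch's mean. The sketch below does that and adds the grand mean back so the values stay on their original scale. It is only a minimal illustration: the function name `remove_batch_means` and the toy data are assumptions, and it presumes a balanced design in which treatments are spread evenly across batches, so that removing batch means does not also remove treatment effects.

```python
from collections import defaultdict

def remove_batch_means(values, batches):
    """Subtract each observation's batch mean, then add back the grand mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, batch in zip(values, batches):
        sums[batch] += value
        counts[batch] += 1
    grand_mean = sum(values) / len(values)
    return [
        value - sums[batch] / counts[batch] + grand_mean
        for value, batch in zip(values, batches)
    ]

# Hypothetical measurements of one variable, run in two batches.
values = [10.0, 12.0, 20.0, 22.0]
batches = ["A", "A", "B", "B"]
adjusted = remove_batch_means(values, batches)
```

After adjustment both batches share the same mean, while the differences within each batch are preserved.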

**Prerequisites:** The project will require a student who is a confident R user, and who has passed STATS 330, and preferably STATS 340.

**Supervisor**

Kathy Ruggiero

k.ruggiero@auckland.ac.nz

X---X

**Title: Using Bayesian lasso to estimate the structure of a small sparse Gaussian network**

The research project is to use Bayesian methods to model the adjacency matrix of an undirected sparse Gaussian network with fewer than 20 nodes. In a dense Gaussian network, the structure is generally estimated through the precision matrix. In a very sparse network (the number of edges q << p(p-1)/2, where p is the number of nodes), given a *partially known* structure in the network, the joint distribution of the entries of the precision matrix can be estimated by shrinking the entries of a sample precision matrix. The *partially known* structure can be informed by published studies which have revealed the dependences among some variables (nodes of the network) of interest.

In this project, we propose using Bayesian lasso regression to estimate the partial correlation coefficients. Several hyperprior distributions for the regression coefficients are proposed, and the joint posterior distribution of the partial correlation coefficients is derived and estimated through the regression coefficients from the observed data. The structure of the new adjacency matrix is estimated through the credible interval or a threshold of the partial correlation coefficients. The model will be validated on simulated data and applied to a real study.
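The final step, going from partial correlations to an adjacency matrix, can be sketched without any Bayesian machinery. The example below uses the standard identity that the partial correlation between nodes i and j equals minus the (i, j) entry of the precision matrix divided by the square root of the product of the two diagonal entries, and then applies a threshold. The precision matrix, the threshold value and the function names are assumptions for illustration; this is not the Bayesian lasso procedure proposed in the project.

```python
import math

def partial_correlations(precision):
    """Partial correlation matrix from a precision matrix (list of lists)."""
    p = len(precision)
    rho = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                rho[i][j] = -precision[i][j] / scale
    return rho

def adjacency_from_threshold(rho, threshold):
    """Declare an edge wherever the absolute partial correlation exceeds the threshold."""
    p = len(rho)
    return [
        [1 if i != j and abs(rho[i][j]) > threshold else 0 for j in range(p)]
        for i in range(p)
    ]

# Hypothetical precision matrix for a three-node chain 0 - 1 - 2.
precision = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]
rho = partial_correlations(precision)
adjacency = adjacency_from_threshold(rho, threshold=0.1)
```

The zero entry in the precision matrix gives a zero partial correlation, so nodes 0 and 2 are not joined in the recovered adjacency matrix.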

Prerequisites: the Bayesian course (331) and the ability to use at least one tool for Bayesian computing.

Summary of the steps and milestones of the project (The time distribution assumes that student is working on the project full time):

1. Review the literature (focusing on lasso and Bayesian lasso) and use existing software packages to model simulated data (4 weeks, including writing task);

2. Derive the posterior distribution analytically (2-4 weeks, including writing task);

3. Write the code for the model (2-4 weeks);

Semester break;

4. Test the code using simulated data (1-4 weeks);

5. Optimize the program and re-test the code (1-2 weeks, including writing);

6. Use the final program on the real data (1-2 weeks);

7. Finalize writing (1 month).

**Supervisors:**

Irene Zeng (i.zeng@auckland.ac.nz) and Thomas Lumley (ts.lumley@auckland.ac.nz)

X---X

**Predicting patronage**

In recent years, internet creators have had a new way to fund their activities: by having patrons sign up for a small monthly donation, e.g., through the website patreon.com, sometimes in exchange for bonus features. However, the income can be unsteady as patrons can subscribe, unsubscribe, or change their monthly donation amounts at any time. Therefore, it would be useful to develop a probabilistic model for forecasting future income, at least in the short term, and providing justified uncertainties on these. Technically, this will involve ideas such as variable-rate Poisson processes and Bayesian inference.
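To give a flavour of a variable-rate Poisson process, the sketch below simulates sign-up times by thinning: candidate events are generated at a constant upper-bound rate and each is kept with probability equal to the true rate divided by that bound. The rate function `signup_rate`, its 30-day cycle, the bound of 3.5 per day and the 90-day horizon are all invented for this example and are not taken from any Patreon data.

```python
import math
import random

def simulate_poisson_process(rate, rate_max, t_end, rng):
    """Simulate event times on (0, t_end] for a rate function bounded by rate_max."""
    events = []
    t = 0.0
    while True:
        # Candidate events arrive as a homogeneous process with rate rate_max.
        t += rng.expovariate(rate_max)
        if t > t_end:
            return events
        # Keep the candidate with probability rate(t) / rate_max.
        if rng.random() < rate(t) / rate_max:
            events.append(t)

def signup_rate(t):
    """Hypothetical sign-ups per day, with a 30-day cycle (always between 0.5 and 3.5)."""
    return 2.0 + 1.5 * math.sin(2.0 * math.pi * t / 30.0)

rng = random.Random(1)
signups = simulate_poisson_process(signup_rate, rate_max=3.5, t_end=90.0, rng=rng)
```

Repeating the simulation many times would give a distribution of sign-up counts, which is the kind of forecast uncertainty the project is after.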

The ideal student would have good grades in both STATS 331 and STATS 320, and be a confident programmer (or at least willing to become one in a relatively short timespan). If you haven't done these exact courses I will still consider you if you have good grades overall. This is a non-trivial project and would suit a student who is up for a challenge.

Supervisor: Brendon Brewer (bj.brewer@auckland.ac.nz)

X---X

**Title: Multi-catchment streamflow modelling by reduced-rank regression**

Abstract: Streamflow data (river flows, lake inflows, etc.) exhibit both temporal and spatial correlations. In this project we will fit models that aim to reproduce these characteristics with minimal numbers of state variables using the technique of reduced-rank regression. We will use two datasets: one collected from New Zealand lakes and rivers, and, more ambitiously, a higher-dimensional one from Brazil.

Pre-requisites: STATS 20x; R programming; familiarity with the VGAM package would be an advantage.

**Supervisor:**

Geoffrey Pritchard

g.pritchard@auckland.ac.nz

X--X

**q-type functions**

The VGAM R package estimates many distributions by full maximum likelihood estimation (see the R CRAN task views for distributions). Associated with these distributions are d-, p-, q-, and r-type functions, e.g., dunif(), punif(), qunif(), runif(). The aim of this project is to write many q-type functions that are currently unimplemented. Special attention is needed to make the computations fast and reliable, without excessive memory requirements. This project would be of interest to those with a numerical analysis background and strong R programming skills.
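When a distribution has a computable p-type function (CDF) but no closed-form quantile function, one generic fallback is to invert the CDF numerically. The sketch below does this by bisection and is written in Python purely for illustration; the function names, the bracketing interval and the tolerance are assumptions, and it is not how VGAM implements any of its q-type functions. The exponential distribution is used only because its true quantile is known, which makes the result easy to check.

```python
import math

def quantile_by_bisection(cdf, p, lower, upper, tol=1e-10):
    """Find x in [lower, upper] with cdf(x) approximately equal to p.

    Assumes cdf is non-decreasing and that cdf(lower) <= p <= cdf(upper).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if cdf(mid) < p:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)

def exp_cdf(x, rate=2.0):
    """CDF of an exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

median = quantile_by_bisection(exp_cdf, 0.5, lower=0.0, upper=50.0)
```

Bisection is slow but very robust; a faster production version would typically combine a good starting value with Newton-type steps and fall back to bisection when those fail.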

**Prerequisites:** STATS 310, numerical computing skills and good R programming skills

**Supervisor:** Thomas Yee (t.yee@auckland.ac.nz)

X--X

**Multinomial Logit Model**

The aim of this project is to improve the multinomial() family function in the VGAM R package. The multinomial logit model is the standard model for regressing a nominal categorical response against a set of explanatory variables. It can suffer from numerical problems with sparse data; however, bias reduction can be a solution for this (Ding and Gentleman, JCGS, 2005). One task is to implement this within the function. Also, we could write functions to conduct a score test, as well as the Hausman-McFadden test for independence of irrelevant alternatives (IIA). Time permitting, another useful feature would be to handle the nested multinomial logit model, although this would be quite a challenge. This project would suit a student who has good R programming skills and has done STATS 310 and STATS 330.
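For readers unfamiliar with the model, the sketch below shows how category probabilities are obtained from linear predictors in a multinomial logit model, assuming a reference category whose linear predictor is fixed at zero. It is a plain Python illustration with a hypothetical function name, not VGAM code; the subtraction of the largest predictor is a common guard against overflow.

```python
import math

def multinomial_logit_probs(linear_predictors):
    """Category probabilities for one observation.

    linear_predictors holds one value per non-reference category;
    the reference category is appended with a linear predictor of zero.
    """
    etas = list(linear_predictors) + [0.0]
    shift = max(etas)  # subtracting the maximum avoids overflow in exp()
    weights = [math.exp(eta - shift) for eta in etas]
    total = sum(weights)
    return [w / total for w in weights]

equal_probs = multinomial_logit_probs([0.0, 0.0])
extreme_probs = multinomial_logit_probs([1000.0, 0.0])
```

With all linear predictors equal the three categories are equally likely, and a very large predictor pushes essentially all probability onto one category without a numerical error.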

**Prerequisites**: STATS 310, STATS 330 and R programming skills

**Supervisor:**

Thomas Yee (t.yee@auckland.ac.nz)

X--X

**Analysis of Fly Fishing Data**

This project ideally needs somebody who is familiar with (freshwater) fly fishing and has good R programming skills. Much data processing is first required to get the data into shape, and it is essential to know about the main fly fishing techniques such as nymphing, wetlining, dry fly, etc. The data was collected by the Department of Conservation (DOC) over a long period of time, and this work is in collaboration with a DOC scientist.

**Prerequisites:** knowledge about flyfishing, STATS 330 and good R programming skills

**Supervisor:**

Thomas Yee (t.yee@auckland.ac.nz)

X--X

**Expected Information Matrices**

This is a project that is suitable for a student interested in mathematical statistics. The aim is to derive the expected information matrices (EIMs) for various statistical distributions and/or to search the literature for such. One can also derive EIMs for certain variants of some distributions, such as to allow for 0-inflation or 1-inflation, etc. We can also look at survival distributions having certain censoring mechanisms. Ideally the student has a good grade in STATS 310 or equivalent, and has a sound mathematical background.

Prerequisites: STATS 310 and some maths papers

**Supervisor:**

Thomas Yee (t.yee@auckland.ac.nz)

X--X

**Bias Reduction**

The VGAM R package implements bias reduction for some special cases of generalized linear models (GLMs). They appear as family functions with an argument called 'bred'. This project will involve simulations and developing new implementations of bias reduction for other types of GLMs, including the multinomial logit model. This project could be of interest to those with a STATS 310 and 330 bent. Some maturity in mathematical statistics is needed.

**Prerequisites:** STATS 310 and some R programming

**Supervisor:**

Thomas Yee (t.yee@auckland.ac.nz)

X--X

**Tools for climate data**

NIWA maintains the National Climate Database (CliFlo, https://cliflo.niwa.co.nz/), and is developing an interface system (CliDEsc), from which users can select data from CliFlo and then undertake various operations on it (e.g., select a weather station and plot seasonal rainfall patterns) using inbuilt CliDEsc tools. This project would develop an R package to perform various operations on the climate database, including extracting a range of rainfall characteristics, interactively conducting water balance calculations (relating water use to rainfall, storage tank size, etc.) and mapping water shortage probability. Depending on progress, other tools could be considered.

A good knowledge of R programming would be an advantage.

**Supervisors:**

Thomas Lumley (t.lumley@auckland.ac.nz), Ian Tuck (ian.tuck@niwa.co.nz)

X--X

**Iterative proportional fitting**

Iterative proportional fitting (IPF) is a commonly used algorithm for maximum likelihood estimation in loglinear models, first proposed by Deming and Stephan in 1940. In this project, an overview should be given about the analysis of contingency tables, the IPF algorithm should be introduced, contrasted to other algorithms such as Newton-Raphson, and a convergence proof given.

Its application should be demonstrated using nonlinear models for two- and higher-dimensional contingency tables. Familiarity with loglinear or generalised linear models is an advantage.
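The algorithm itself is short. The sketch below applies IPF to a two-way table, alternately rescaling rows and columns until the fitted table matches the target margins. Starting from a table of ones and matching the observed margins yields the fitted values of the independence loglinear model. The function name, the fixed iteration count and the margins used here are assumptions for illustration.

```python
def ipf(table, row_targets, col_targets, iterations=50):
    """Iterative proportional fitting for a two-way table (list of lists)."""
    fitted = [row[:] for row in table]
    for _ in range(iterations):
        # Rescale each row so that it sums to its target margin.
        for i, target in enumerate(row_targets):
            row_sum = sum(fitted[i])
            fitted[i] = [x * target / row_sum for x in fitted[i]]
        # Rescale each column so that it sums to its target margin.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in fitted)
            for row in fitted:
                row[j] *= target / col_sum
    return fitted

# Hypothetical margins for a 2 x 2 table with 100 observations.
start = [[1.0, 1.0], [1.0, 1.0]]
fitted = ipf(start, row_targets=[30.0, 70.0], col_targets=[40.0, 60.0])
```

For this independence model the procedure converges after a single cycle; models with higher-order terms in higher-dimensional tables generally need several cycles, which is where a convergence proof becomes relevant.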

**Supervisor**

Renate Meyer

renate.meyer@auckland.ac.nz

X--X

**Short paths in random graphs**

Take *n* locations and join every pair of locations by an edge of random length. Then, for any two locations, we can find a *shortest path* joining the two locations, where the length of a path means the sum of lengths of the edges along the path. Typically, the shortest path between two locations will not consist of just one edge (even though we declared that every pair of locations had an edge directly joining them) because both locations are likely to have many edges that are much shorter than the single edge between them. Thus, the shortest path between two typical locations will be random, with a distribution that depends on *n*, the total number of locations.

The distribution of the shortest path will also depend on the *distribution of edge lengths*, and we want to know how changing edge length distribution affects the resulting shortest path distribution. Some changes have a simple effect: for instance, doubling all the edge lengths makes the shortest path twice as long too, but does not change which locations the shortest path passes through. Could we change the edge length distribution to make the shortest path pass through more or fewer locations, on average?

The brief answer is yes, by changing the tail behaviour of the edge length distribution. Indeed, it is possible to achieve a range of shortest path behaviours, from paths that pass through almost *n^{1/3}* locations, to paths that are as short as possible with only one extra location. However, the full picture is not known. For instance, in some cases the shortest path will pass through an essentially random number of locations, with an approximately normal distribution, while in other cases the shortest path passes through an essentially non-random number of locations that can be determined in advance.

This project aims to determine which edge length distributions result in each of these possible behaviours. More generally, the aim is to classify edge length distributions, and to determine all possible behaviours, along a spectrum linking these two extremes. The project will involve branching processes, generating functions, and limits. It is suitable for a student familiar with stochastic processes and with proofs; Stats 325 would be recommended.
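The setup described above is easy to explore by simulation. The sketch below builds a complete graph on *n* locations with independent random edge lengths and uses Dijkstra's algorithm to find the shortest path between two locations, reporting both its length and the number of edges it uses. The exponential edge lengths, the choice n = 200 and the function name are assumptions made for this example only.

```python
import heapq
import random

def shortest_path(n, length, source, target):
    """Dijkstra on the complete graph; returns (path length, number of edges).

    length[i][j] with i < j holds the length of the edge joining i and j.
    """
    dist = {source: 0.0}
    hops = {source: 0}
    done = set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            return d, hops[u]
        for v in range(n):
            if v == u or v in done:
                continue
            candidate = d + length[min(u, v)][max(u, v)]
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                hops[v] = hops[u] + 1
                heapq.heappush(heap, (candidate, v))
    raise ValueError("target not reachable")

rng = random.Random(0)
n = 200
# Independent exponential edge lengths, stored in the upper triangle.
length = [[rng.expovariate(1.0) if j > i else 0.0 for j in range(n)] for i in range(n)]
path_length, edges_used = shortest_path(n, length, source=0, target=1)
```

Swapping `rng.expovariate(1.0)` for another positive distribution and recording `edges_used` over many repetitions gives an empirical view of how the edge length distribution affects the number of locations on the shortest path.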

**Supervisor:**

Jesse Goodman

jesse.goodman@auckland.ac.nz

X--X

**Beyond temperature: climatic drivers of snapper recruitment**

This project will explore trends in an extensive time series of snapper recruitment strength collected from catch-at-age sampling. We know that in northern New Zealand there is a strong correlation between snapper recruitment and sea surface temperature, but it is possible that this relationship is actually driven by other climatic variables. Establishing the primary environmental drivers of snapper spawning success is highly relevant considering our changing climate and also for developing an early indicator of recruitment strength for snapper stocks where recruitment is highly variable.

Could be an honours or masters project. Knowledge of advanced regression, time series and multivariate approaches would be useful.

Supervisors: Ian Tuck (Stats), Darren Parsons (Marine Science)

ian.tuck@niwa.co.nz

X--X

**Constrained Estimation using group-summary level Information**

In public health research, information that is readily available may be insufficient to address the primary question(s) of interest. In resource-limited settings, one cost-efficient way forward is to make use of information available at phase I. This information is usually group-level information. Some existing methods, such as calibration or estimation of the sampling weights, have been shown to yield large improvements in efficiency. However, this process can be cumbersome because appropriate covariates must be found to estimate or calibrate the weights. An additional method has recently been proposed by [Chatterjee et al., 2016]. They considered the problem of building regression models based on individual-level data from an internal study while using summary-level information, such as information on parameters for reduced models, from an 'external' big data source. They identified a set of very general constraints that link internal and external models and developed a framework for semiparametric maximum likelihood inference. This method is 'cleaner' because the investigator does not intervene in the process. This project consists of extending their work to cases where the external model is not an individual-level model. What would happen if the external model is at the aggregated or cluster level? For example, the external outcome is the mean or counts in different clusters and the external covariates are also cluster level. I have written some short notes if somebody is interested. The student should have some R, sampling and modeling knowledge.

Level: PhD or Masters

Supervisor: Claudia Rivera-Rodriguez

c.rodriguez@auckland.ac.nz

X--X

**WGEE (Inverse probability weighting for GEE) estimation for Normal, Poisson and Gamma models**

One cost-efficient way to study health outcomes in resource-limited settings is to conduct a two-phase study in which the population is initially stratified, at phase I, by the outcome and/or some categorical risk factor(s). At phase II, detailed covariate data is ascertained on a sub-sample within each phase I stratum. While analysis methods for two-phase designs are well established, they have focused exclusively on settings in which participants are assumed to be independent. WGEE is an analysis approach based on inverse-probability weighting (IPW) that permits researchers to specify some working covariance structure, appropriately accounts for the sampling design and ensures valid inference via a robust sandwich estimator when the data is clustered. In addition, to enhance statistical efficiency, it is possible to use a calibrated IPW estimator that makes use of information available at phase I. This method and the appropriate estimator of the variance have been programmed for binary outcomes with canonical and logit links (binomial). One of the goals is to demonstrate the efficiency of the estimators for different model types. The second aim is to construct an R package and release it on CRAN. The initial package (for binomial models) is about to be released on CRAN, but the idea is to continue adding the following methods:

• Normal: identity, log and inverse

• Binary: probit, log-log links

• Poisson: log, identity and square root links

• Gamma: inverse, log and identity links
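The core idea of inverse-probability weighting can be shown with a weighted mean: each phase II observation is weighted by the reciprocal of its sampling probability, so that over-sampled groups do not dominate the estimate. The sketch below is a minimal, hypothetical illustration (a ratio-type weighted mean with made-up sampling probabilities); it is not the WGEE estimator, and it ignores clustering and variance estimation.

```python
def ipw_mean(values, sampling_probs):
    """Weighted mean with weights equal to 1 / sampling probability."""
    weights = [1.0 / p for p in sampling_probs]
    weighted_sum = sum(w * y for w, y in zip(weights, values))
    return weighted_sum / sum(weights)

# Hypothetical phase II sample: cases (1) were sampled with probability 0.5,
# non-cases (0) with probability 0.1.
outcomes = [1, 1, 0, 0]
probs = [0.5, 0.5, 0.1, 0.1]
naive = sum(outcomes) / len(outcomes)
weighted = ipw_mean(outcomes, probs)
```

The unweighted proportion of cases is 0.5, whereas the weighted estimate is much smaller because each sampled non-case stands in for more members of the population.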

Level: PhD or Masters

Supervisor: Claudia Rivera-Rodriguez

c.rodriguez@auckland.ac.nz

X--X

**Comparing different estimation methods for misspecified models**

This project consists of comparing the following estimation methods under true and misspecified models:

- Weighted likelihood [Saegusa and Wellner, 2013]
- Conditional maximum likelihood [Scott and Wild, 2002]
- Weighted likelihood with estimated/calibrated weights [Breslow et al., 2009]
- Conditional maximum likelihood with estimated weights [Breslow et al., 2009]
- Constrained maximum likelihood [Chatterjee et al., 2016]

If time allows (PhD), the ultimate goal is to incorporate clustered correlated data and propose robust estimation of the variances for some of the methods. This has already been done for weighted likelihood and weighted likelihood with calibrated/estimated weights. The student should have good R, sampling and modeling knowledge.

Level: PhD or Masters

Supervisor: Claudia Rivera-Rodriguez

c.rodriguez@auckland.ac.nz

X--X

**Optimal sample sizes to improve estimates of regression parameters**

The majority of methods to improve efficiency of estimators have been focused on ad-hoc methods or balanced stratified designs [Breslow et al., 2009]. When targeting estimates of totals with stratified designs, survey statisticians usually allocate their sample size using a variety of optimal allocation methods, which are described in Sarndal et al. [1992]. Generally, one makes use of a variable that is highly related to the variable of interest. For regression parameters, on the other hand, it is not clear how to allocate the sample. One of the reasons is that the regression parameters are not exactly a weighted total. However, a linear approximation can be found and used to allocate the sample. If this variable was available for all subjects, then standard methods from survey statisticians could be used, but it is usually unknown. One way to solve the problem is to estimate this variable. This project consists in evaluating different methods to estimate this variable and to assess the efficiency of the estimators. Additionally, it is also of interest to compare the results to methods such as balanced or proportional sampling designs. The student should have good R, sampling and modeling knowledge.
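One standard example of optimal allocation for estimating a total is Neyman allocation, in which the sample size in each stratum is proportional to the stratum size multiplied by the stratum standard deviation of the variable used for allocation. The sketch below implements that rule; the function name and the stratum sizes and standard deviations are invented for illustration, and the result is left unrounded.

```python
def neyman_allocation(total_n, stratum_sizes, stratum_sds):
    """Allocate total_n across strata in proportion to N_h * S_h."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total_weight = sum(weights)
    return [total_n * w / total_weight for w in weights]

# Two hypothetical strata of equal size; the second is three times as variable.
allocation = neyman_allocation(100, stratum_sizes=[1000, 1000], stratum_sds=[1.0, 3.0])
```

For regression parameters, the allocation variable would be the (estimated) linearised quantity discussed above rather than the outcome itself.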

Level: PhD or Masters

Supervisor:

Claudia Rivera-Rodriguez

c.rodriguez@auckland.ac.nz

X--X

**Study on pass rates by whether students have English as a first or second language**

Equity in education is one of the main goals of the University of Auckland. Students from a non-English speaking background, however, face many challenges that may affect this equity goal. They need to function in English to succeed as their English-speaking peers do. Even though a certain level of English is required to be accepted to the university, students face many language difficulties, such as speed and accents, among others. Mathematical English, in particular, presents additional challenges as it is different from natural English: it is very precise and dense. In addition, there is a lack of resources for non-English speakers to learn the terminology. This makes communication between lecturers/teachers and students difficult. For example, a recurrent issue with students whose first language is not English is that they do not attend lectures because they cannot follow the content. This can be mainly due to language barriers. It is not only that students do not attend lectures; they also do not attend tutorials and other forms of help as much as they should. From my experience when I was a GTA, some of the students who attend tutorial rooms show a concerning lack of understanding due to language barriers. One of our interests is evaluating the relationship between a student's first language and their final grades. In particular, are students with English as a second language more likely to give up on papers and get a DNS grade? The project will also take into account other demographic characteristics such as ethnicity and gender. Specifically, we want to use records of students enrolled in a range of statistics courses: STATS 101/108, STATS 125 and STATS 201/208. It is of interest to compare how any effects of English as a second language differ between these papers. The project will focus on first semester 2017 results.

Level: Masters

Supervisor: Claudia Rivera-Rodriguez

c.rodriguez@auckland.ac.nz

**Reference list for Rivera-Rodriguez's projects**

N. Breslow, T. Lumley, C. Ballantyne, L. Chambless, and M. Kulich. Improved Horvitz–Thompson estimation of model parameters from two-phase stratified samples: Applications in Epidemiology. Statistics in Biosciences, 1(1):32–49, 2009. ISSN 1867-1772. doi: 10.1007/s12561-009-9001-6. URL http://dx.doi.org/10.1007/s12561-009-9001-6.

N. Chatterjee, Y.-H. Chen, P. Maas, and R. J. Carroll. Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513):107–117, 2016. doi: 10.1080/01621459.2015.1123157. URL http://dx.doi.org/10.1080/01621459.2015.1123157.

T. Saegusa and T. Wellner. Weighted likelihood estimation under two-phase sampling. The Annals of Statistics, 41(1):269–295, 2013. URL http://dx.doi.org/10.1214/12-AOS1073.

C. E. Sarndal, B. Swensson, and J. Wretman. Model Assisted Survey Sampling. Springer Series in Statistics, 1992.

A. Scott and C. Wild. On the robustness of weighted methods for fitting models to case-control data. Journal of the Royal Statistical Society, 64(2):207–219, March 2002.

X--X

**Effects of replication and site representativeness on estuarine state**

Environmental and ecological data are collected from estuaries around New Zealand. However, the level of replication and site position within the estuary is variable. It is unknown how much these differences may affect comparisons across the country of degree of muddiness, metal contamination or biodiversity. This project would interrogate present data to determine the likely effects of differences in replication and site position.

Prerequisites: R programming to extract and analyse random subsets, a marine science course, ability to calculate biodiversity indices.

Supervisor: Prof. Judi Hewitt (NIWA)

Judi.Hewitt@niwa.co.nz
