Symposium
Leveraging Advanced Statistical Methods in Empirical Educational Research: Handling Missing Data and Harnessing Machine Learning Methods
Chair(s): Jakob Schwerter (Technische Universität Dortmund, Germany)
Discussant(s): Sven Hilbert (Universität Regensburg)
Empirical educational research is an evolving field that relies heavily on the careful application of statistical methods to validate educational theories. Given the diverse nature of educational data, the choice of appropriate methods is critical, and novel data types often require new methods to be applied or even developed. In particular, empirical research must deal with missing values (van Buuren, 2018): when should which method be used to analyze data with missing values without biasing the results? For missing data treatment and beyond, machine learning (ML) methods are becoming increasingly important and offer possibilities that did not exist before. For example, ML methods can help process a large number of variables in an interpretable way, selecting only the important variables to make regressions more robust or highlighting the relative importance of variables compared to others (Schwerter et al., 2022; van Lissa et al., 2023). This symposium will highlight cutting-edge ML methods and how they advance our ability to process diverse data efficiently and interpretably.
Our session will begin with two presentations focusing on the handling of data with missing information. The first presentation will focus on challenges and solutions related to smaller datasets where multiple imputation cannot be applied. The second presentation will explore the performance of tree-based imputation methods compared to multiple imputation by chained equations with predictive mean matching (MICE PMM) applied to larger datasets. In doing so, the two papers highlight recent methodological advances in dealing with data containing missing information, thus assisting the researcher in analyzing the data.
The third paper focuses on Prediction Rule Ensembles (Fokkema & Strobl, 2020), a tree-based interpretable machine learning technique that provides researchers with prediction rules, i.e., nonlinear interactions of variables. The presentation shows how it can be implemented and emphasizes its effectiveness for missing data. Our final paper focuses on presenting the use of transformer models to categorize open-ended responses in educational assessments and highlighting the ethical implications involved. Thus, the second part of the symposium focuses on two different use cases where ML methods assist the researcher in analyzing the data.
After a short introduction by the symposium chairs (3-5 minutes), all symposium participants will have 15 minutes to present their study and 1-2 minutes for clarifying questions. This will be followed by a critical discussion of the papers by the discussant, who is an expert in educational data science, ML, and statistics. The symposium will conclude with an open discussion (5-10 minutes). The symposium will highlight some of the intricacies of current quantitative methods and their growing potential in empirical educational research.
Contributions of the Symposium
Using tree-based imputation methods in comparison to MICE for longitudinal and multilevel data
Jakob Schwerter, Ketevan Gurtskaia, Andres Romero, Birgit Zeyer-Gliozzo, Philipp Doebler, Markus Pauly (TU Dortmund University)
Theoretical Background
Missing information is common in research and can have a significant impact on statistical analyses. Dealing with missing data is therefore critical to drawing reliable conclusions, as simply ignoring it often leads to biased results and incorrect conclusions (Collins et al., 2001; van Buuren, 2018). To provide robust and reliable results, multiple imputation (MI) is one of the most commonly used methods for dealing with missing data, as it adequately accounts for the uncertainty caused by missing values (Rubin, 1987; van Buuren, 2018). MI creates multiple plausible imputed datasets and performs the analysis on each of them, allowing the uncertainty to be captured realistically. MI takes data structures into account in the imputation process; however, there is a risk of using the wrong imputation model. This is particularly problematic for multilevel data, where individual observations are organized at different levels or hierarchies, and for panel data, also known as longitudinal data, where data are collected repeatedly over time for the same observation units. Both types of data are common in empirical research, for example in educational research, the social sciences, epidemiology, economics, and environmental research, and they require special statistical models and analysis techniques to adequately account for relationships between levels and temporal changes in the data. The current standard in the empirical literature is the MICE imputation method with predictive mean matching (PMM; van Buuren & Groothuis-Oudshoorn, 2011). However, as datasets become more complex, the traditional MICE approach reaches its limits, and researchers increasingly turn to tree-based imputation methods. Recent studies have already used such methods, although their performance and validity are not clear in all settings, especially compared to the standard MICE PMM.
Research question
In this study, we use two simulation studies to investigate how different imputation methods affect coefficient estimation and Type I and Type II errors, in order to gain insights that can help empirical researchers deal with missing values more effectively. To this end, we compare MICE with predictive mean matching (PMM) to different tree-based approaches, such as MICE with random forests (RF) and chained random forests with and without PMM (missRanger; Mayer, 2019).
Method
We conducted two simulation studies to address our research question. The first uses a data structure motivated by the longitudinal data of the 6th cohort of the National Educational Panel Study (NEPS; see Blossfeld, Rossbach, & von Maurice, 2011) with 5 waves, while the second uses synthetic cross-sectional data with a two-level structure. In both cases, we simulate the dataset, impute the missing data using the imputation methods mentioned above, and then run OLS and fixed-effects regressions in Study 1 and OLS, random-intercept, and random-slope regressions in Study 2. We examine both bias and power over 1,000 replications in order to generalize our results.
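To illustrate the kind of workflow compared in the simulations, the following sketch shows a chained random-forest imputation in Python. It is only an analogue under assumptions: the study itself uses the R packages mice and missRanger, and all data objects below are synthetic placeholders.

```python
# Minimal sketch of chained random-forest imputation (in the spirit of
# missRanger or MICE with random forests), using scikit-learn as a Python
# analogue of the R packages used in the study.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # placeholder data
X[rng.random(X.shape) < 0.2] = np.nan      # 20% of values set to missing

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)

# Note: this yields a single completed dataset; rerunning with different
# seeds gives several completions, which only approximates proper
# multiple imputation.
```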
Results
For Study 1, our results show that random forest-based imputations, especially MICE with random forests and missRanger with PMM, consistently perform better in most scenarios. Standard MICE with PMM partly shows increased bias and (overly) conservative test decisions for coefficients whose true value is non-zero. Thus, our results demonstrate the advantages of tree-based imputation methods.
For Study 2, our results show that accounting for the multilevel structure in missRanger reduces Type I errors and improves test decisions at level 1 in most cases. While missRanger matches or even outperforms MICE for level-1 variables, MICE remains superior for level-2 variables.
In general, our results show the potential of tree-based methods. In our presentation, we will also briefly demonstrate how easily tree-based methods can be used to impute missing data.
Resampling-Based Approaches for Nonparametric MANOVA in the Presence of Missing Data
Lubna Amro, Markus Pauly (TU Dortmund University)
Theoretical background
Repeated measures designs and split-plot designs are widely employed in scientific and medical research. The analysis of such designs is typically based on MANOVA models, which require complete data and certain assumptions on the underlying parametric distribution, such as normality or homogeneity of covariance matrices among groups. While various nonparametric multivariate methods have been proposed to address the distributional assumptions, the issue of missing data remains. A simple way to tackle this issue is to perform single or multiple imputation of missing values and subsequently conduct statistical tests as if there had been no missing values at all. However, these approaches may result in an inflated type-I error rate or remarkably low statistical power if the dataset is small (van Buuren, 2018; Ramosaj et al., 2020). Therefore, we do not follow this approach here.
Research question
In this study, we address the challenge of missing data in MANOVA analyses. We present our recently developed asymptotically correct procedures, which are capable of effectively handling data with missing values without requiring imputation or the exclusion of observations. Importantly, these procedures do not require the assumptions of multivariate normality and allow for covariance matrices that are heterogeneous between groups.
Method
To achieve this, we propose applying an appropriate resampling method in combination with quadratic form-type test statistics (specifically, the Wald-type statistic, ANOVA-type statistic, and modified ANOVA-type statistic). This approach allows us to overcome the limitations posed by missing data and relaxes the distributional assumptions that are typically required in MANOVA.
In addition to proving the asymptotic validity of our methods, we analyze the finite sample behavior of the asymptotic quadratic tests and their wild bootstrap counterparts in an extensive simulation study. As evaluation criteria, all procedures were examined for type-I error rate control at significance level α = 5% and assessed for their power to detect deviations from the null hypothesis. We consider various scenarios involving different numbers of groups and time points under symmetric and skewed distributions. Additionally, we explore three distinct covariance structures: an autoregressive structure, a compound symmetry pattern, and a linear Toeplitz covariance structure. Missingness was generated within both the Missing Completely At Random (MCAR) and Missing At Random (MAR) frameworks. Each setting was based on 10,000 simulation runs, with B = 1,000 bootstrap runs.
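To convey the core resampling idea, the sketch below shows a wild bootstrap for a Wald-type statistic in a deliberately simplified one-sample, complete-data setting; the procedures presented in the paper are more general (several groups, repeated measures, and missing values), and all names in the code are illustrative assumptions.

```python
# Simplified illustration of a wild bootstrap for a quadratic-form (Wald-type)
# test of H0: H @ mu = 0, based on complete one-sample multivariate data X
# with shape (n, d) and a contrast matrix H with shape (q, d).
import numpy as np

def wald_type_statistic(X, H):
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    sigma_hat = np.cov(X, rowvar=False)
    middle = np.linalg.pinv(H @ sigma_hat @ H.T)        # Moore-Penrose inverse
    return n * (H @ mu_hat) @ middle @ (H @ mu_hat)

def wild_bootstrap_pvalue(X, H, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    stat_obs = wald_type_statistic(X, H)
    centered = X - X.mean(axis=0)                       # impose the null hypothesis
    stats_boot = np.empty(B)
    for b in range(B):
        w = rng.choice([-1.0, 1.0], size=X.shape[0])    # Rademacher weights
        stats_boot[b] = wald_type_statistic(centered * w[:, None], H)
    return float(np.mean(stats_boot >= stat_obs))       # bootstrap p-value
```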
Results
Our preliminary results show that the proposed bootstrap methods, based on quadratic-form tests, tend to provide quite accurate type-I error rate control in most settings, whether the data distribution is symmetric or skewed and regardless of whether the data follow an MCAR or MAR mechanism. In contrast, the asymptotic Wald-type test shows extremely liberal behavior in all tested scenarios and under all investigated missing data mechanisms. As a result, the combination of simulation results and theoretical validity makes the new bootstrap procedures recommendable in practice. In particular, we recommend the wild bootstrap ANOVA-type statistic, as it offers the best overall type-I error control and good power.
Prediction Rule Ensembles: Introduction and Application with Multiple Imputation
Philipp Doebler1, Marjolein Fokkema2, Vincent Schroeder1, Jakob Schwerter1 (1 TU Dortmund University, 2 Leiden University)
Theoretical Background
Statistical prediction is a cornerstone of disciplines such as empirical educational research and psychology. It involves building statistical models to predict the value of a target variable from predictor variables. Examples of applications range from predicting dropout rates (Niessen et al., 2016) to predicting personality traits and personality disorders (Kosinski et al., 2016). With the increasing availability of data, there has been a paradigm shift towards machine learning techniques that focus on optimizing predictive accuracy on unseen data. This contrasts with traditional statistical, explanatory approaches, which aim to understand the data-generating process. These traditional methods are often based on assumptions, such as normally distributed residuals, that do not always match actual datasets, and they reach their limits when the number of predictor variables becomes large. Given these challenges, prediction rule ensembles (PREs; Fokkema, 2020; Fokkema & Strobl, 2020) have emerged as a promising approach to interpretable machine learning. Unlike other ensemble models, such as random forests, PREs strike a balance between predictive accuracy and model interpretability. Derived from decision tree ensembles, PREs are condensed into a model that contains only a selected subset of tree nodes. These nodes take the form of simple rules, making PREs less complex than full tree ensembles without compromising their predictive power. Given this reduced complexity, PREs belong to the domain of interpretable machine learning. Their strength lies in achieving predictive accuracy comparable to other ensemble models while maintaining model transparency. In the final version of a PRE, only a limited number of rules is retained through variable selection, so that the predictions remain understandable.
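To make the model class concrete, a PRE can be written (in a simplified sketch that omits optional linear terms) as a sparse linear combination of rules:

\[
\hat{y}(\mathbf{x}) \;=\; \hat{\beta}_0 + \sum_{k=1}^{K} \hat{\beta}_k \, r_k(\mathbf{x}), \qquad r_k(\mathbf{x}) \in \{0, 1\},
\]

where each rule r_k indicates whether an observation satisfies a conjunction of simple conditions extracted from the tree ensemble (for instance, "learning time > 2 hours and absences <= 5"), and a lasso-type penalty in the fitting step sets most coefficients to zero.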
The purpose of this presentation is to show how and to what extent the handling of missing data affects the performance and structure of PREs. Unlike in linear regression, it is not possible to simply run the analysis on each imputed dataset and then pool the results. One promising approach is to "stack" the data, combining all imputed datasets into one large dataset (Du et al., 2022; Gunn et al., 2022), as sketched below.
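A minimal sketch of the stacking step, under assumptions: the study itself uses the R package pre, the imodels package's RuleFitRegressor is used here only as an assumed Python stand-in, and the list of imputed datasets is a hypothetical placeholder.

```python
# Sketch: stack D multiply imputed datasets and fit one rule ensemble on the
# resulting long data. imputed_dfs is a hypothetical list of D completed
# pandas DataFrames with identical columns; RuleFitRegressor (imodels) stands
# in for the R 'pre' package used in the study.
import pandas as pd
from imodels import RuleFitRegressor

def fit_rules_on_stacked(imputed_dfs, outcome="y"):
    stacked = pd.concat(imputed_dfs, ignore_index=True)   # N * D rows
    X = stacked.drop(columns=[outcome]).to_numpy()
    y = stacked[outcome].to_numpy()
    # Du et al. (2022) additionally down-weight each stacked copy by 1/D so
    # the effective sample size matches a single dataset; this simplified
    # sketch omits that weighting.
    model = RuleFitRegressor()
    model.fit(X, y)
    return model
```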
Methods
In a simulation study, we compared the performance of PREs on complete datasets, datasets with missing values, and multiply imputed datasets to examine whether multiple imputation is suitable for use with PREs in interpretable machine learning practice. We used a full factorial design with 64 different simulation conditions, each replicated 1,000 times. We varied the sample size (N = 400, N = 1000), the number of variables (p = 10, p = 20), the proportion of influential variables and their interactions with non-zero coefficients (sparsity; 1%, 10%), the number of multiply imputed datasets (D = 5, D = 20), the proportion of rows with missing values (5%, 40%), and the mechanism generating the missing values (MCAR, MAR).
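The six two-level factors cross into 2^6 = 64 conditions; a small sketch (with illustrative labels) enumerates them:

```python
# Enumerate the 2^6 = 64 conditions of the full factorial simulation design.
from itertools import product

factors = {
    "n": [400, 1000],           # sample size
    "p": [10, 20],              # number of variables
    "sparsity": [0.01, 0.10],   # share of influential variables/interactions
    "D": [5, 20],               # number of imputed datasets
    "missing": [0.05, 0.40],    # share of rows with missing values
    "mechanism": ["MCAR", "MAR"],
}
conditions = [dict(zip(factors, combo)) for combo in product(*factors.values())]
assert len(conditions) == 64
```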
Results
MI showed improved prediction accuracy compared to listwise deletion (LD), especially in scenarios with smaller sample sizes and higher proportions of missing data. Models trained with MI contained more selected rules than those trained with LD, but fewer than models based on complete data sets. Overall, MI appears to improve prediction quality without significantly changing the structure of the model, making it suitable for interpretable machine learning applications.
Between now and the conference, the methodology will be further developed to add a measure of stability and guidelines for the use of PREs.
A Pilot Study on the Use of Transformer Models to Evaluate Open-Ended Response Formats in Educational Assessments
Rudolf Debelak1, Benjamin Wolf2 (1 Department of Psychology, University of Zurich; 2 Institute of Education, University of Zurich)
Theoretical background
In recent years, numerous software tools have been made available in programming languages such as R and Python that allow the application of deep learning models to evaluate text, images, and numeric variables. Because of their flexibility, such models can be used to assist human raters in evaluating open-ended response formats. One challenge of these applications is the necessity to investigate the validity of the models, since deep learning models are black-box models.
Research question
This study investigates the feasibility of using transformer models for the automated evaluation of open-ended responses in the educational domain. In addition to the practical application, we discuss the use of tools from the field of interpretable machine learning to investigate the validity of the model.
Method
Our dataset consists of 125 texts collected as part of the educational assessment "Check your knowledge" in Switzerland and rated by a human rater. 100 texts were used as the training set and 25 texts as the test set. The automated rating of these texts was obtained by combining a deep learning model (Bidirectional Encoder Representations from Transformers, BERT; Devlin et al., 2018) with regularized linear regression.
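A condensed sketch of such a pipeline is shown below; the checkpoint name, the placeholder data, and the choice of ridge regression as the regularized linear model are illustrative assumptions rather than the exact setup used in the study.

```python
# Sketch: score open-ended responses by feeding BERT text embeddings into a
# regularized linear regression. Checkpoint, data, and Ridge are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import Ridge

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
bert = AutoModel.from_pretrained("bert-base-german-cased")

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state[:, 0, :].numpy()   # [CLS] token embeddings

# Placeholder texts and ratings; the study uses 100 training and 25 test texts.
train_texts = ["Antwort eins ...", "Antwort zwei ...", "Antwort drei ..."]
train_scores = [2.0, 4.5, 3.0]
test_texts = ["Eine weitere Antwort ..."]

reg = Ridge(alpha=1.0).fit(embed(train_texts), train_scores)
test_predictions = reg.predict(embed(test_texts))
```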
Results
As a preliminary result, we found a correlation of 0.93 between predicted and observed values in the training dataset and a corresponding correlation of 0.71 in the test dataset. We also illustrate the use of interpretable machine learning methods to investigate the validity of the estimated model.