Paper Session
Examining the Analytic Reproducibility of Secondary Data Analyses in Educational Research
Aishvarya Aravindan Rajagopal, Aleksander Kocaj, Malte Jansen
Institut zur Qualitätsentwicklung im Bildungswesen, Humboldt-Universität zu Berlin
Theoretical background and research questions
Evaluating the reproducibility of research findings is an important step in ensuring sound, efficient, and trustworthy research. There are different approaches to assessing the reproducibility of published results (i.e., the numerical correctness and consistency), depending on whether the data and the analytic code are available (Hardwicke et al., 2018, 2021; Laurinavichyute et al., 2022; Stodden et al., 2018). We use the term analytic reproducibility to refer to re-analyzing published results by an independent research team using the same dataset and applying the same statistical analyses, but without the original code. Using a similar approach, Artner et al. (2021) examined the analytic reproducibility of 46 articles from three psychology journals published in 2012. Overall, they could reproduce 70% (n = 163) of the 232 key statistical claims in the articles. However, the authors described the reproduction process as laborious and time-consuming, involving a lot of trial and error, which was exacerbated by vague documentation of data processing and data analyses in the original manuscripts.
Our study aims to examine the reproducibility of research based on secondary data analysis in education. The project is part of the priority program "META-REP: A Meta-scientific Programme to Analyse and Optimise Replicability in the Behavioural, Social, and Cognitive Sciences," funded by the German Research Foundation (DFG). We aim to test the reproducibility of results from papers based on large-scale school assessment data that are available at the research data centre of the Institute for Educational Quality Improvement (e.g., data from national and international large-scale assessments in Germany). Between 2012 and 2022, 82 publications in peer-reviewed journals were based on these datasets.
Method
We selected a sample of 30 articles from those publications to reproduce. Sample selection was based on the reproduction team’s expertise and familiarity with the research topics, datasets, and statistical methods, which might positively bias our reproduction estimates. Each of the 30 selected articles will be reproduced by one researcher who leads the reproduction (the reproducer). Based on previous research, we anticipate that at least 60% of the central claims will be reproducible (Artner et al., 2021; Hardwicke et al., 2018, 2021).
In the first step of the reproduction, key information from the paper will be identified and entered into a template developed for this study. A stepwise reproduction effort follows, which starts with a trial-and-error phase. Reproduction success will be determined based on the difference between the original and the reproduced numerical values associated with the central claims (e.g., regression coefficients, standardized mean differences, correlation coefficients). We will differentiate three levels of reproduction: precise reproduction (original and reproduced values match), approximate reproduction (≤10% difference between the estimates of the original study and the reproduction effort), and non-reproduction (>10% difference between original and reproduced values). When initial results fall under non-reproduction, we will seek author assistance as a next step. Once the required information is available, the reproduction will be repeated to obtain a conclusive outcome.
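The three-level classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not specify how the difference is computed, so the sketch assumes a difference relative to the original estimate, and it assumes that a "match" means agreement after rounding to the reported number of decimals. The function name and its parameters are hypothetical.

```python
def classify_reproduction(original, reproduced, decimals=2, tolerance=0.10):
    """Classify one reproduced estimate against the originally reported value."""
    # Assumption: values "match" if they agree after rounding to the
    # precision reported in the paper (two decimals by default).
    if round(original, decimals) == round(reproduced, decimals):
        return "precise"
    # Assumption: the difference is expressed relative to the original value.
    if original == 0:
        # A relative difference is undefined here; treat as non-reproduction
        # so that the case is flagged for manual inspection.
        return "non-reproduction"
    relative_diff = abs(reproduced - original) / abs(original)
    if relative_diff <= tolerance:
        return "approximate"
    return "non-reproduction"


# Hypothetical values for a single regression coefficient.
for reproduced in (0.351, 0.37, 0.45):
    print(reproduced, classify_reproduction(0.35, reproduced))
```

A relative criterion becomes unstable for estimates close to zero, so an absolute tolerance might be a sensible complement in practice.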
In the proposed presentation, we will present our study design, template, and the first results of our reproduction efforts. Furthermore, we will discuss multiple error sources of our reproduction approach (e.g., deviations between the analysis code of the original researchers and our interpretation of this code based on the description in the paper). Our results might provide hints on improving the description of the research process (e.g., recommendations for reporting that aid the reproduction of results; Artner et al., 2021).
Paper Session
Heating Up! Using the MAGMA Algorithm to Balance out Complex Study Designs in Educational Field Research
Julian Urban1,2, Markus Daniel Feuchter1, Franzis Preckel1
1Universität Trier; 2GESIS - Leibniz Institut für Sozialwissenschaften
Theoretical background
Many studies in educational contexts are observational, and study participants cannot be randomized. To deal with the resulting lack of experimental control, propensity score matching (PSM; Rosenbaum & Rubin, 1985) has become a common procedure for post-hoc balance control in educational research. It accounts for systematic differences in baseline characteristics (i.e., covariates) between treated and untreated subjects by matching them individually based on a distance measure (i.e., the propensity score; PS).
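As an illustration of the matching step described above, the following minimal sketch pairs treated and untreated units by greedy 1:1 nearest-neighbor matching on propensity scores that have already been estimated. It is a generic variant of PSM, not the MAGMA algorithm and not necessarily what any particular package does by default. The function name, the unit IDs, the processing order, and the caliper of 0.1 are assumptions made for the example.

```python
def greedy_match(treated, control, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching on propensity scores, without replacement.

    treated, control: dicts mapping a unit ID to its propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # controls that have not been matched yet
    pairs = []
    # Assumption: treated units are processed from highest to lowest score.
    for t_id, t_ps in sorted(treated.items(), key=lambda kv: kv[1], reverse=True):
        if not available:
            break
        # Closest remaining control unit on the propensity score.
        c_id = min(available, key=lambda cid: abs(available[cid] - t_ps))
        # Keep the pair only if the distance lies within the caliper.
        if abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Hypothetical propensity scores.
treated = {"t1": 0.80, "t2": 0.40}
control = {"c1": 0.78, "c2": 0.45, "c3": 0.10}
print(greedy_match(treated, control))
```

Because the procedure is greedy, the result can depend on the order in which treated units are processed.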
However, current PSM applications have several limitations. First, PSM is restricted to two-group designs. Second, in common matching packages (e.g., MatchIt; Ho et al., 2011), different matching solutions must be extracted and compared one after another. Third, beyond comparing pairwise standardized mean differences (i.e., Cohen’s d), a comprehensive framework for evaluating post-matching covariate balance (i.e., the matching quality) is missing.
To address these limitations, we developed the Many-Group Matching (MAGMA) algorithm and the MAGMA R package (Urban et al., 2023a). MAGMA uses a systematic nearest-neighbor matching approach that leads to one unambiguous matching solution and can be applied to two or more groups. Furthermore, we developed a balance estimation framework based on four balance criteria, namely Pillai’s Trace, d-ratio, mean g, and adjusted d-ratio (Feuchter et al., 2023; Urban et al., 2023b), which is embedded in the MAGMA package.
Research question
The aim of this study was to (1) validate MAGMA using a two- and a three-group example and to (2) compare matching solutions for the two-group example produced by MAGMA and MatchIt side-by-side.
Methods
We used two data sets taken from longitudinal educational studies conducted in German schools. Data Set 1 (N = 914 fifth graders, Mage = 10.53 years, SDage = 0.55 years, 41% female) was used as the two-group example. The grouping variable distinguished regular classrooms (RC, n = 631) from gifted classrooms (GC, n = 283). We considered 13 covariates, including demographics, achievement tests, IQ scores, and questionnaire scales (e.g., need for cognition).
Data Set 2 (N = 1,238 fifth graders, Mage = 10.11 years, SDage = 0.58 years, 46% female) was used as the three-group example. We grouped the students using an IQ-range variable (1 = IQ ≤ 106, n = 453; 2 = 106 < IQ ≤ 115, n = 391; 3 = IQ > 115, n = 394). We considered 32 covariates covering constructs similar to those in the two-group example.
We conducted all analyses in R (4.1.2; R Core Team, 2021). PSs were estimated with twang (Ridgeway et al., 2015), and the data were matched on these PSs using either MatchIt (Ho et al., 2011) or MAGMA (Urban et al., 2023a). We extracted the respective matching solutions and examined their quality using our four balance criteria. Additionally, we compared the balance criteria for the MatchIt and MAGMA solutions of the two-group example.
Results and discussion
For the two-group example, both algorithms markedly reduced the effects of the covariates (e.g., all pairwise effects |d| < 0.20; Pillai’s Trace reduced from V = .41 to V < .05). However, MAGMA achieved comparable or better balance and retained a larger post-matching sample than MatchIt. Moreover, MAGMA found a well-balanced solution in the three-group example (e.g., Pillai’s Trace reduced from V = .26 to V = .06). Thus, we found initial evidence for the usefulness of MAGMA, which we plan to extend by presenting results based on simulated data.
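The pairwise balance check mentioned above (all |d| < 0.20) could be computed along the following lines. This is a minimal sketch that assumes Cohen’s d with a pooled standard deviation. The function names and the toy data are hypothetical, and the sketch covers only the pairwise criterion, not Pillai’s Trace or the other balance criteria of the MAGMA framework.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def is_balanced(covariates_a, covariates_b, threshold=0.20):
    """Return True if every covariate shows |d| below the threshold.

    covariates_a, covariates_b: dicts mapping a covariate name to the
    list of values observed in the respective group.
    """
    return all(
        abs(cohens_d(covariates_a[name], covariates_b[name])) < threshold
        for name in covariates_a
    )


# Hypothetical toy data for one covariate in two groups.
print(cohens_d([2, 4, 6], [1, 3, 5]))
print(is_balanced({"iq": [2, 4, 6]}, {"iq": [1, 3, 5]}))
```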
MAGMA not only addresses drawbacks of current PSM applications but also extends existing algorithms to three-group, four-group, and 2×2 designs. This enables applied researchers to approximate causal inference in more complex, non-randomized research designs in education.
Paper Session
Intensive Longitudinal Methods in School Research: A Systematic Literature Review
Carina Schreiber1, Michael Becker1,2
1TU Dortmund, Germany; 2Leibniz-Institut für Bildungsforschung und Bildungsinformation (DIPF), Germany
Everyday school life is full of dynamic processes: students’ emotional and cognitive experiences in the classroom, teacher-student and peer interactions, individual learning processes, changing contexts in different classes, and different teachers and their individual instructional behaviors, to name but a few. These dynamic processes have crucial effects on central factors in the school context such as students’ educational success (Blume et al., 2022), quality of instruction (Janna et al., 2019; Järvinen et al., 2022), or teachers’ well-being (Aldrup et al., 2017; Jõgi et al., 2023). To capture the dynamics of everyday school life, cross-sectional or conventional longitudinal methods do not suffice; researchers have to zoom in on students’ and teachers’ experiences at a finer level. Intensive longitudinal methods such as the Experience Sampling Method (ESM), daily diaries, or ambulatory assessments allow researchers to do exactly that. Over the span of days or weeks, intensive longitudinal studies repeatedly ask participants about their experiences, emotions, cognitions, and behavior as they occur in everyday life. As such, these methods allow for innovative ways of collecting data and can literally open the classroom door to researchers, revealing the dynamic processes behind it and enabling new approaches to descriptive and causal analyses. Given these benefits and possibilities, it is no surprise that intensive longitudinal methods have grown considerably in popularity in recent years (Kirtley et al., 2021) and have attracted interest from various research fields. The design and implementation of these studies, however, require special consideration, as they differ from more established research methods and may confront researchers new to the method with unfamiliar difficulties. Considering that school research became aware of the methods’ potential only rather recently (Zirkel et al., 2015), it is unclear where and how the field is applying intensive longitudinal methods.
Furthermore, other research fields in which these methods have long been established have still been shown to report the methods incompletely and to omit rationales for methodological choices (Trull & Ebner-Priemer, 2020), which poses a substantial problem for the quality and replicability of research.
The aim of this contribution is to gain a systematic overview of the use of the Experience Sampling Method (ESM) in school research and to point out possible shortcomings and threats in the (reporting of) methodological choices and rationales. We further aim to raise awareness of the central role that appropriate methodological choices and consistent, accurate, and transparent reporting of these choices play for the quality and replicability of research, and of their implications for interpretation. Finally, the present review is intended to unify and guide the design of future ESM studies and articles and to support school research in making use of the method’s full potential.
This systematic literature review was conducted in line with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA; Moher et al., 2009). Accordingly, the databases Web of Science, Scopus, ERIC, and PsycINFO were systematically searched for studies applying intensive longitudinal methods in school research. The literature search yielded 993 papers, of which 288 qualified for full-text screening. These remaining papers are currently being examined for eligibility.
We expect the studies to show wide variety in the application of intensive longitudinal methods with regard to both the constructs under investigation and the methodology. Considering that methodological choices and their rationales are reported insufficiently even in research fields in which these methods are long established (Trull & Ebner-Priemer, 2020), we expect that a substantial number of intensive longitudinal studies in school research will also lack transparency in reporting methodological choices and rationales.
Paper Session
Idiographic and Nomothetic Network Analyses for Integrating the Situated Expectancy-Value Theory of Achievement Motivation with the Control-Value Theory of Academic Emotions: First Results from the ManyMoments Project
Jessica Baars1, Miriam Francesca Jähne2, Julia Dietrich2, Jana Holtmann1, Martin Daumiller3, Julia Moeller1
1Universität Leipzig, Germany; 2Friedrich-Schiller-Universität Jena, Germany; 3Universität Augsburg, Germany
Brief summary:
This presentation reports results from a collaborative data collection, the ManyMoments project. Using the Experience Sampling Method, data on situation-specific academic emotions and motivation were collected in university courses, following Pekrun’s (e.g., 2006) Control-Value Theory and Eccles and Wigfield’s (e.g., 2002; 2020) Situated Expectancy-Value Theory. These data were used to empirically examine the newly developed DYNAMICS framework (Moeller et al., 2022), which opens up methodological innovations for integrating the two theories (e.g., the inclusion of network analyses, the distinction between idiographic and nomothetic models, and the distinction between state and trait systems).
Theoretical background:
This study integrates two theories that partly build on each other and are informative for each other, but have often been studied separately: the Situated Expectancy-Value Theory of achievement motivation ("SEVT"; e.g., Eccles & Wigfield, 2002; 2020) and the Control-Value Theory of academic emotions ("CVT"; e.g., Pekrun, 2006).
The DYNAMICS framework (DYNamics of Achievement Motivation In Concrete Situations; Moeller et al., 2022) was recently proposed for the theoretical and methodological integration of insights from both theories. It combines both theories with concepts and methods from dynamic systems theories and aims to make the latter fruitful for research on learning-related motivational components and emotions. The DYNAMICS framework describes change in motivation and emotions as a complex system in which time- and context-dependent states and stable person characteristics (traits) interact. To better describe the systems at the state and person levels, the framework proposes the use of network analyses. The framework takes current methodological debates into account (e.g., Molenaar, 2004) by distinguishing between person-specific idiographic state models and generalizable nomothetic state models in order to address the problem of a lack of ergodicity in intensive longitudinal data (Voelkle et al., 2014). To this end, each association coefficient is first computed within each person, which yields for each person an idiographic network of associations between motivation and emotion facets over time. Subsequently, it is analyzed which of these idiographic coefficients generalize across persons and how often (see Asendorpf, 1993; 2000).
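The two-step logic described above (first one coefficient per person, then a check of how often it generalizes across persons) can be illustrated with a minimal Python sketch for a single pair of variables. The abstract does not state the criterion for generalizability, so the threshold of 0.3, the function names, and the toy time series are purely hypothetical assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def share_generalizing(series_by_person, threshold=0.3):
    """Compute one within-person correlation per person (idiographic step)
    and the share of persons whose coefficient reaches the threshold."""
    coefficients = {
        person: pearson(x, y) for person, (x, y) in series_by_person.items()
    }
    n_reaching = sum(1 for r in coefficients.values() if r >= threshold)
    return coefficients, n_reaching / len(coefficients)


# Hypothetical repeated measures of two facets for three persons.
data = {
    "A": ([1, 2, 3, 4], [2, 4, 6, 8]),
    "B": ([1, 2, 3, 4], [4, 3, 2, 1]),
    "C": ([1, 2, 3, 4], [1, 3, 2, 4]),
}
coefficients, share = share_generalizing(data)
print(coefficients, share)
```

In the toy data, the association is positive for two persons and negative for the third, which is the kind of heterogeneity that a purely nomothetic coefficient would conceal.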
Data:
The research design follows that of Moeller, Dietrich, and colleagues (2022; 2020; Dietrich et al., 2017; 2019). Well-established measurement instruments were used. The collaborative ManyMoments project provided situation-specific measurements (Experience Sampling Method) of achievement emotions and learning motivation. Students were surveyed three times per lecture. Data from the first four courses (summer semester 2022) have already been analyzed. The presentation will also cover results from data that are being collected in the current winter semester and will be analyzed by February 2024 (data from 11 lectures and 20 seminars at German universities).
Analyses:
This study analyzes associations (correlations, co-occurrences) between components of the Situated Expectancy-Value Theory (SEVT) and the Control-Value Theory (CVT). To this end, both the covariances (zero-order correlations and partial correlations) and the bivariate joint endorsements (co-occurrences) between the facets of learning-related motivation and emotion are analyzed, displayed as networks, and compared with regard to the insights they yield (Figure 1: online at https://speicherwolke.uni-leipzig.de/index.php/s/aZ4ssLjzsNwNP3C).
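Co-occurrences in the sense used above (how often two facets are endorsed at the same measurement occasion) can be counted with a short Python sketch. The endorsement cutoff of 4 on a hypothetical rating scale, the facet names, and the function name are assumptions made for illustration; the abstract does not specify how endorsement was operationalized.

```python
from collections import Counter
from itertools import combinations


def co_occurrences(responses, cutoff=4):
    """Count how often each pair of facets is endorsed at the same occasion.

    responses: list of dicts, one per measurement occasion, mapping a
    facet name to its rating. A facet counts as endorsed if its rating
    is at least `cutoff` (hypothetical criterion).
    """
    counts = Counter()
    for occasion in responses:
        endorsed = sorted(f for f, rating in occasion.items() if rating >= cutoff)
        # Every unordered pair of endorsed facets co-occurs once.
        counts.update(combinations(endorsed, 2))
    return counts


# Hypothetical ratings from three measurement occasions.
responses = [
    {"enjoyment": 5, "interest": 4, "anxiety": 1},
    {"enjoyment": 4, "interest": 2, "anxiety": 5},
    {"enjoyment": 5, "interest": 5, "anxiety": 2},
]
print(co_occurrences(responses))
```

The resulting pair counts can serve as edge weights of a co-occurrence network, alongside the correlation-based networks.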
The two correlation-based networks are each estimated with multilevel analyses at the intra-individual state level (Level 1) and the inter-individual level (Level 2), with Level 1 representing fluctuations between measurement occasions and Level 2 representing the more stable differences between persons. At Level 1, an idiographic network of zero-order correlations and an idiographic network of partial correlations (controlling for the influence of all other variables in the network) were computed for each person. Subsequently, various methods (oriented toward the GIMME method; Beltz et al., 2016) were used to examine which of the individual paths generalized across persons and how often. The nomothetic network (separately for zero-order and partial correlations) was constructed from the generalizable paths.
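The difference between zero-order and partial correlations can be made concrete for the smallest possible network of three variables, where controlling for "all other variables" means controlling for one. The sketch below uses the standard formula for the first-order partial correlation. The function name and the example values are hypothetical, and larger networks would require a matrix-based computation that is not shown here.

```python
from math import sqrt


def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of x and y, controlling for z, computed from
    the three zero-order correlations."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical zero-order correlations among three facets.
print(partial_corr(0.5, 0.5, 0.5))
```

With all three zero-order correlations at 0.5, the partial correlation drops to about 0.33, which shows why the two types of networks can lead to different conclusions.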
The findings were compared between the two covariance-based methods, as the two methods often differ in the insights they yield (e.g., Jähne et al., in prep.; Kulakow et al., in prep.). Their results were contrasted with the co-occurrence networks in order to determine how often which motivation and emotion facets were jointly endorsed (see Moeller et al., 2018). The comparison of the three model types provides differential insights, tests the framework empirically, and supports it.