Current measurement issues in the Programme for International Student Assessment (PISA)
The latest results of the Programme for International Student Assessment (PISA) are due to be published in December 2019. This seventh round of PISA builds on 20 years of experience in international large-scale assessments, especially with internationally comparable measurement approaches. The contributors to this symposium have been involved in the development and analysis of the PISA measures as international contractors and national experts in Germany for more than a decade. The symposium will highlight the latest developments and discuss current measurement issues from a methodological point of view.
Designed to assess and compare the learning contexts and cognitive outcomes of 15-year-olds around the world, the latest study features new methods and measurement approaches. Computer-based assessment had already been introduced in PISA 2015, opening up the discussion about potential mode effects and their influence on trend comparisons over time. PISA 2018 introduced adaptive testing approaches, which pose additional challenges for international comparison. Besides cognitive outcomes, the study describes learning settings around the world and tries to relate them to students’ achievements. To allow for the comparison of education systems, context questionnaire scales need to be evaluated regarding their measurement invariance. This includes newly developed measures of self-efficacy and self-concept related to the major domain of reading.
This symposium combines four presentations highlighting different aspects of educational measurement in international large scale studies. It will point out significant changes in assessment design and analytical methods over the last cycles, discussing challenges for international large scale assessment.
The presentation by Dominique Lafontaine and Nina Jude describes the process of developing questionnaire scales that assess different dimensions of self-efficacy. They will cover all steps of the development process, including the evaluation of the scales’ dimensionality based on theoretical assumptions, the testing of invariance, and predictive validity across all 80 countries participating in PISA. The question of measurement invariance in context measures is further discussed in the presentation by Janine Buchholz. She will provide a comprehensive overview of different approaches to testing measurement invariance and focus on the latest findings from scaling the PISA 2018 questionnaire data with the Generalized Partial Credit Model.
The second part of the symposium will present results from a German add-on study that was conducted by the Centre for International Student Assessment (ZIB) in PISA 2018. The add-on study investigates the mode change from paper-based to computer-based assessment that took place in PISA 2015 and addresses questions of comparability and trend estimation in depth. The presentation by Scott Harrison and colleagues will investigate the construct equivalence between paper-based and computer-based assessment by comparing the influence of construct-relevant item characteristics on item difficulty. The presentation by Alexander Robitzsch and colleagues will address how the mode change may have affected comparability with the results of earlier PISA rounds in Germany. Finally, Claus Carstensen will discuss all four contributions and share his view on the PISA measurement issues highlighted in this symposium.
Symposium contributions
Developing measures for self-concept and self-efficacy in reading for PISA 2018
Context and state of the art
In PISA 2018, reading was the major domain for the third time. A new reading framework was developed to address the differences between print and online reading (Afflerbach & Cho, 2010). In parallel, all non-cognitive reading constructs in the context questionnaires were revisited; new scales were developed to cover constructs missing from previous cycles and aspects linked to online reading.
Because self-efficacy and self-concept are important motivational attributes and have proved to be strong correlates of reading achievement (Baker & Wigfield, 1999; Marsh & Craven, 2006; Solheim, 2011; Morgan & Fuchs, 2007; Retelsdorf, Köller, & Möller, 2011), a self-concept scale and a self-efficacy scale were developed for the PISA 2018 student questionnaire.
Typically, self-efficacy (Bandura, 1997) refers to an individual’s perceived capacity to perform specific tasks, whereas self-concept is a general measure of an individual’s perceived abilities in a domain (e.g., reading) (Marsh & Craven, 1997). The scales were successfully tested in the PISA Field Trial in 2017 and implemented in the Main Survey in 80 participating countries.
Following Chapman and Tunmer’s recommendations (1995), the self-concept scale comprises perceptions of competence in reading (3 items, e.g., “I am a good reader”) and of difficulty in reading (3 items, e.g., “I always had difficulties with reading”).
The self-efficacy scale comprises four items, one positively and three negatively oriented. Students were asked to consider the reading part of the PISA test and to evaluate their capacity to perform the test (e.g., “I understood most of the texts”, “I was lost when I had to navigate between different pages”). To our knowledge, this is the first time a reading self-efficacy scale has been developed for comparative studies. Many studies have claimed to include self-efficacy measures, but most of these scales are in fact self-concept measures (Schiefele, Schaffner, Möller, & Wigfield, 2012).
Aims of the study
The aims of the study were to validate the new self-concept and self-efficacy scales of PISA 2018. More specifically, we wanted to test:
- whether the self-efficacy scale measures a specific construct distinct from the self-concept;
- whether the self-concept scale is unidimensional or bidimensional;
- whether the new scales are cross-culturally invariant and whether an attitudes-achievement paradox is observed (He & Van de Vijver, 2016);
- to what extent the self-concept and self-efficacy are related to reading proficiency (predictive validity).
The analyses were performed for both OECD countries and partner economies participating in PISA 2018. The quality of the scales was evaluated by their internal consistency across countries as well as by factor analysis. Moreover, multigroup confirmatory factor analysis (MGCFA) models were used to test cross-cultural invariance. To analyse the attitudes-achievement paradox, the students’ plausible values in reading were used to model relationships at both the individual and the country level.
[For the reviewers: Results of PISA 2018 are embargoed until December 2019; thus only technical findings from the Field Trial can be reported at this stage.]
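The internal-consistency check mentioned above can be illustrated with a minimal sketch. It assumes Cronbach's alpha as the reliability coefficient, which the abstract does not name. The function name and the response data are hypothetical, and negatively worded items are assumed to have been reverse-coded beforehand.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete responses (one row per student)."""
    k = len(rows[0])  # number of items in the scale
    # Sample variance of each item across students.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the students' sum scores.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of five students to a four-item scale
# (1 = strongly disagree ... 4 = strongly agree); not PISA data.
responses = [
    [4, 3, 4, 4],
    [2, 2, 1, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 1],
    [4, 4, 3, 4],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.3f}")
```

In a cross-national setting, the same function would be applied to each country's data separately so that reliabilities can be compared across countries.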
Technical results from the Field Trial showed good scale reliabilities for self-concept and self-efficacy scales for all countries, indicating that these measures can be implemented in an international large scale assessment. The factor analysis showed clearly distinct constructs in all countries, again pointing to a valid measure. Results from IRT scaling and country specific correlations with reading competence will be presented at the conference.
Measurement invariance across the PISA 2018 Questionnaires: A comprehensive overview of findings
Questionnaires for the assessment of constructs such as attitudes, values and beliefs are essential in educational and psychological research. Many international large-scale assessments (ILSAs) such as the Programme for International Student Assessment (PISA) aim at comparing these latent constructs between respondents from a large number of participating countries, an endeavor which requires measurement invariance (MI) to be established across all countries. Several statistical approaches have been developed to test for MI. Of these, Multigroup Confirmatory Factor Analysis (MGCFA; Jöreskog, 1971) was found to be the most common one (e.g., Boer et al., 2018). However, given the large number of groups (i.e., participating countries) in ILSAs, the approach has not proved useful for operational application (Rutkowski & Svetina, 2014). In addition, it has been repeatedly noted that MI testing in ILSAs has focused almost exclusively on the cognitive part of the assessments (Braeken & Blömeke, 2016; Hopfenbeck et al., 2018). This imbalance understates the importance of questionnaire data, which contribute to the achievement estimation and allow for the “contextualization” of student performance in participating countries (Rutkowski & Rutkowski, 2010). In fact, a recent literature review on the nature of PISA-related publications demonstrated that the majority of secondary research focused on constructs administered with questionnaires (Hopfenbeck et al., 2018).
For PISA 2015, an innovative approach was implemented for testing the invariance of IRT-scaled constructs in the context questionnaires administered to students, parents, school principals and teachers (OECD, 2016). The aim of this presentation is to provide a comprehensive overview of findings on MI for the constructs administered with the questionnaires in PISA 2018 using this relatively new method.
Data pertaining to all scaled constructs in the PISA 2018 questionnaires are used for analysis, and MI is tested following the operational procedure in PISA 2015 (OECD, 2017) using mdltm (von Davier, 2005). On the basis of a concurrent calibration with equal item parameters across all groups (i.e., languages within countries) using the Generalized Partial Credit Model (GPCM; Muraki, 1992), a group-specific item-fit statistic (root-mean-square deviance; RMSD) is calculated, indicating whether a particular group’s data can be described well by the international parameters. The operational cutoff criterion in PISA 2015 (i.e., RMSD < .3) is used to determine the presence of MI.
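The logic of the group-specific fit statistic can be sketched as follows. This is a simplified illustration and not the mdltm implementation: the item parameters, the quadrature grid and the "observed" group curve are hypothetical, the deviance is computed for a single response category, and the way mdltm derives observed curves and aggregates over categories is not reproduced here.

```python
import math


def gpcm_probs(theta, a, steps):
    """Category probabilities 0..m under the GPCM for one item.

    a is the slope and steps holds the m step parameters b_1..b_m.
    """
    # Cumulative sums of a * (theta - b_v); category 0 has an empty sum (0.0).
    cum = [0.0]
    for b in steps:
        cum.append(cum[-1] + a * (theta - b))
    top = max(cum)  # subtract the maximum for numerical stability
    expo = [math.exp(c - top) for c in cum]
    total = sum(expo)
    return [e / total for e in expo]


def rmsd(observed, expected, weights):
    """Weighted root-mean-square deviance between two curves on a theta grid."""
    sq = sum(w * (o - e) ** 2 for o, e, w in zip(observed, expected, weights))
    return math.sqrt(sq / sum(weights))


# Hypothetical quadrature grid with standard-normal weights for one group.
nodes = [-3 + 0.5 * i for i in range(13)]
weights = [math.exp(-0.5 * t * t) for t in nodes]

# Expected curve for the top category under hypothetical international parameters.
expected = [gpcm_probs(t, 1.0, [-0.5, 0.5])[2] for t in nodes]
# Stand-in for a group's observed curve: the same item with shifted steps.
observed = [gpcm_probs(t, 1.0, [0.0, 1.0])[2] for t in nodes]

value = rmsd(observed, expected, weights)
invariant = value < 0.3  # cutoff quoted in the text
print(f"RMSD = {value:.3f}, treated as invariant: {invariant}")
```

A value of zero means the group's curve coincides with the curve implied by the international parameters; larger values indicate that the common parameters describe the group less well.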
Unfortunately, results can only be presented after the embargo for PISA 2018 and can therefore not be discussed here. In the presentation, the results on MI will be summarized from two different angles: scales and countries. Patterns can then be described with respect to properties of scales (e.g., number of items, content domain) and countries (e.g., geographic region, language group), respectively, providing an insight into MI regarding the PISA 2018 questionnaire scales.
Mode Effect, the PISA assessment framework, and construct equivalence – is there a link?
The term mode effect refers to non-equivalence in psychometric items and tests arising from the mode of test administration. In PISA 2015, where the administration mode was changed, mode effects were reported, with selected items being harder in computer-based assessment (CBA) than in paper-based assessment (PBA) (OECD, 2016, Appendix 6, p. 6). The objective of this study is to further investigate the evidence regarding construct equivalence through a construct representation approach. To do this, the underlying PISA assessment framework is taken into account.
One way to investigate construct validity is the construct representation approach described by Embretson (1983). Construct-relevant facets as defined by the PISA assessment framework are expected to determine item difficulty. If item difficulty can be explained by facets as expected, this provides validity evidence for the construct interpretation of the test score. For construct equivalence across modes, it is expected that this pattern of effects of facets does not change across modes, and thus, if there is empirically no interaction between mode and facet, this provides evidence for construct equivalence across modes (and vice versa).
The PISA 2015 assessment framework comprises a number of facets within each domain: three in mathematics, three in reading, and six in science (Vayssettes, 2016). For example, the mathematics domain is divided into three facets: content, situation/context, and process. The content facet has four levels: space and shape, quantity, change and relationships, and uncertainty and data. Each item is classified according to the facets and levels of this framework, which reflect what the student is required to do in answering the question.
The study combines PISA 2015 field trial data from twelve countries to address the main research question:
Is there a relationship between the PISA construct facets represented in the Assessment Framework, and the mode of assessment used by test takers?
For construct equivalence, it is expected that any mode effects will be evenly distributed among the levels of a construct facet; that is, a particular facet of the assessment framework determines item difficulty comparably across modes.
The data from the twelve participating countries were pooled to obtain a sample large enough to estimate a 2PL model. Sample sizes were: N_Maths = 10,017; N_Reading = 9,891; N_Science = 9,907. Using Mplus (Muthén & Muthén, 2017), a complex mixture model with maximum likelihood estimation and clustering by school ID was used to estimate item discrimination and difficulty on the IRT scale for both the PBA and CBA questions. The difference in difficulties was estimated (Δβ = PBA – CBA) and then related to the levels within each facet, for each facet independently.
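The last step, relating the difficulty differences to facet levels, can be sketched in a few lines. The item records below are hypothetical and only illustrate the bookkeeping. The estimation in Mplus and the significance tests reported in the abstract are not reproduced, and averaging Δβ within each level is an assumed simplification of the analysis.

```python
from collections import defaultdict
from statistics import mean


def delta_beta_by_level(items):
    """Mean difficulty difference (PBA minus CBA) for each facet level."""
    groups = defaultdict(list)
    for item in items:
        # Delta beta for one item: paper-based minus computer-based difficulty.
        groups[item["level"]].append(item["b_pba"] - item["b_cba"])
    return {level: mean(diffs) for level, diffs in groups.items()}


# Hypothetical item difficulties on a common IRT scale (not PISA estimates).
items = [
    {"id": "S01", "level": "personal", "b_pba": 0.50, "b_cba": 0.50},
    {"id": "S02", "level": "personal", "b_pba": -0.25, "b_cba": 0.00},
    {"id": "S03", "level": "global", "b_pba": 0.00, "b_cba": 0.25},
    {"id": "S04", "level": "global", "b_pba": 0.25, "b_cba": 1.00},
]

summary = delta_beta_by_level(items)
for level, d in summary.items():
    print(f"{level}: mean delta beta = {d:+.3f}")
```

A negative mean indicates that items at that level were, on average, harder in the computer-based mode. Under construct equivalence, the means would be expected to be similar across the levels of a facet.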
Expected Research Contributions
Preliminary results confirm that items in all domains were affected by mode. Importantly, however, the size of the mode effect was not consistent across the levels within each facet. For example, the science facet Context 1 classifies science items by personal, local/national, or global context. Results show a significant mode effect on questions with a local/national context (Δβ = -0.225, p < 0.001) and with a global context (Δβ = -0.294, p < 0.001). However, there was no significant difference between the PBA and CBA questions with a personal context (Δβ = -0.099, p = 0.069). Such differences between facet levels indicate that mode effects within PISA can be linked to the underlying assessment framework and warrant further investigation with respect to construct equivalence.
Marginal trend estimation of the PISA 2009 and 2018 trends: Comparison of the German results for computer- vs. paper-based assessments
In PISA 2015, the assessment mode was changed from paper to computer, giving rise to questions of comparability and trend estimation. One of the aims of the national extension study conducted as part of PISA 2018 is to carry out in-depth analyses of the trend for Germany in the domains of reading, mathematics and science. For each domain, we examine the extent to which the internationally reported trend estimate (original trend) between PISA 2009 and PISA 2018 differs from a national trend estimate (marginal trend) that is based only on German PISA data, without recourse to international item parameters (Robitzsch et al., 2017). While the change of test mode from paper to computer in PISA 2015 has to be considered when interpreting the original trend estimate from PISA 2009 to 2018, the national extension study allows a marginal trend from 2009 to 2018 to be estimated on the basis of a paper-based test only. The marginal trend provides information on how competencies in Germany have developed across different cohorts of fifteen-year-old students while the test instruments and test design were held identical.
The marginal trend for Germany was estimated using three different scaling methods. For the technical implementation of the scaling, a distinction can be made between separate scaling with subsequent linking and a concurrent scaling procedure (Kolen & Brennan, 2014). In the separate scaling, the individual PISA surveys (2009, 2012, 2015, and 2018) were first scaled separately, and the resulting item parameters were then transformed to a common metric in a simultaneous linking according to the Haebara method (see Kolen & Brennan, 2014). The 1PL model and the 2PL model were used as scaling approaches. Since the PISA study used a 1PL model as the scaling model until 2012 and then switched to a 2PL model (in 2015), it seemed appropriate to check the sensitivity of the marginal trend estimate to the choice of scaling model. In addition, a concurrent scaling based on the 2PL model was applied, in which the individual surveys were treated as groups in an IRT multi-group model. For all three methods, only the items (trend items) identified by the OECD as invariant (between PISA 2015 and 2018) were used for the (computer-based) link from 2015 to 2018. All scaling procedures were performed using student weights. One-dimensional models were used for each domain, including only those students to whom items in the domain were administered.
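The idea behind Haebara linking can be illustrated with a minimal sketch for two surveys and the 2PL model. It is not the procedure used in the study: the simultaneous linking of four surveys, student weights and the actual item parameters are not reproduced, the item values are hypothetical, only one direction of the criterion is used, and a simple compass search stands in for a proper optimiser.

```python
import math


def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def haebara_loss(A, B, ref_items, new_items, nodes, weights):
    """Weighted squared distance between item response curves after linking.

    New-scale parameters (a, b) are mapped to the reference scale
    as (a / A, A * b + B).
    """
    loss = 0.0
    for (a_r, b_r), (a_n, b_n) in zip(ref_items, new_items):
        for t, w in zip(nodes, weights):
            diff = p2pl(t, a_r, b_r) - p2pl(t, a_n / A, A * b_n + B)
            loss += w * diff * diff
    return loss


def haebara_link(ref_items, new_items, nodes, weights, iters=500):
    """Compass search for the linking constants A (slope) and B (intercept)."""
    A, B, step = 1.0, 0.0, 0.5
    best = haebara_loss(A, B, ref_items, new_items, nodes, weights)
    for _ in range(iters):
        improved = False
        for dA, dB in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand_A, cand_B = A + dA, B + dB
            if cand_A <= 0.05:  # keep the slope strictly positive
                continue
            val = haebara_loss(cand_A, cand_B, ref_items, new_items, nodes, weights)
            if val < best:
                A, B, best, improved = cand_A, cand_B, val, True
        if not improved:
            step /= 2.0  # refine the search once no neighbour improves
    return A, B


# Hypothetical common items (a, b) on the reference scale.
ref_items = [(0.8, -1.0), (1.2, -0.3), (1.0, 0.4), (1.5, 1.1)]
# The same items expressed on a second scale with theta_ref = 1.2 * theta_new + 0.3.
A_true, B_true = 1.2, 0.3
new_items = [(a * A_true, (b - B_true) / A_true) for a, b in ref_items]

nodes = [-4 + 0.5 * i for i in range(17)]
weights = [math.exp(-0.5 * t * t) for t in nodes]

A_hat, B_hat = haebara_link(ref_items, new_items, nodes, weights)
print(f"A = {A_hat:.3f}, B = {B_hat:.3f}")
```

Because the second set of parameters is constructed from a known transformation, the search should recover linking constants close to the values used to generate it. With real data, the common items never fit perfectly and the criterion remains above zero at its minimum.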
The changes between PISA 2009 and 2018 are slightly smaller for the marginal trend, which is based solely on paper-based measurements, than for the original trend. Only minor differences were found between the three scaling approaches. Moreover, the original and marginal trends for the changes between 2015 and 2018 appear to be very similar.