Conference Agenda

Overview and details of the sessions of this conference.

Session Overview
Session
S3: Prediction models
Time:
Monday, 04/Sept/2023:
11:00am - 12:40pm

Session Chair: Marvin N. Wright
Session Chair: Mouna Akacha
Location: Lecture Room U1.141 (hybrid)


Presentations
11:00am - 11:20am

Calibrating machine learning approaches for probability estimation: a comparison

Max Louis Jansen1, Francisco Miguel Ojeda2, Alexandre Thiéry1, Stefan Blankenberg1, Christian Weimar3,4, Matthias Schmid5, Andreas Ziegler1,2,6

1Cardio-CARE, Medizincampus Davos, Graubünden, Switzerland; 2University Heart & Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; 3BDH-Klinik Elzach, Baden-Württemberg, Germany; 4Institute for Medical Informatics, Biometry and Epidemiology, University of Duisburg-Essen, North Rhine-Westphalia, Germany; 5Institute of Medical Biometry, Informatics and Epidemiology, University of Bonn, North Rhine-Westphalia, Germany; 6School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, South Africa

Statistical prediction models have gained popularity in applied research. One challenge is the transfer of a prediction model to a population that may be structurally different from the one for which it was developed. An adaptation to the new population can be achieved by calibrating the model to the characteristics of the target population, and numerous calibration techniques exist for this purpose. In view of this diversity, we performed a systematic evaluation of popular calibration approaches used by the statistical and machine learning communities for estimating two-class probabilities. In this presentation, we present the results of a comprehensive simulation study and an application to real data. The calibration approaches are compared with respect to their empirical properties and relationships, their ability to generalize precise probability estimates to external populations, and their availability in terms of easy-to-use software implementations. Calibration methods that estimated one or two slope parameters in addition to an intercept consistently showed the best results in the simulation studies. Calibration on logit-transformed probability estimates generally outperformed calibration on non-transformed estimates. In case of structural differences between training and validation data, re-estimation of the entire prediction model should be weighed against the sample size of the validation data. We recommend regression-based calibration approaches using transformed probability estimates, in which at least one slope is estimated in addition to an intercept, for updating probability estimates in validation studies.
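
As a hedged illustration of the recommended family of methods, the sketch below fits an intercept and one slope to logit-transformed probability estimates on validation data (logistic recalibration). The data, function names, and the use of scikit-learn are assumptions for illustration, not the authors' implementation.

# Minimal sketch of regression-based recalibration on the logit scale
# (one intercept and one slope); illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def to_logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def fit_logit_recalibration(p_hat, y):
    """Fit intercept + slope on the logit of the original probability estimates."""
    model = LogisticRegression(penalty=None)  # unpenalized; requires scikit-learn >= 1.2
    model.fit(to_logit(p_hat).reshape(-1, 1), y)
    return model

def apply_recalibration(model, p_new):
    return model.predict_proba(to_logit(p_new).reshape(-1, 1))[:, 1]

# Illustration on synthetic validation data from a (shifted) target population
rng = np.random.default_rng(0)
p_source = rng.uniform(0.05, 0.95, 500)                        # estimates from the original model
y_target = rng.binomial(1, np.clip(1.3 * p_source - 0.1, 0, 1))
recal = fit_logit_recalibration(p_source, y_target)
p_calibrated = apply_recalibration(recal, p_source)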



11:20am - 11:40am

Advanced statistical modelling for polygenic risk scores by incorporating alternative loss functions

Hannah Klinkhammer1,2, Christian Staerk1, Carlo Maj3, Peter Krawitz2, Andreas Mayr1

1Institute for Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Germany; 2Institute for Genomic Statistics and Bioinformatics, University Hospital Bonn, Germany; 3Center for Human Genetics, Philipps University Marburg, Germany

In clinical genetics, it is of interest to predict a trait or phenotype based on the patient's genetic information. Polygenic risk scores (PRS) are based on common genetic variants with low to medium effect sizes and aim to capture this genetic predisposition. As genotype data are high-dimensional in nature, from a technical perspective it is crucial to develop algorithms that can be applied to large-scale data (large n and large p). A wide range of PRS methods focus on summary statistics from genome-wide association studies (GWAS) based on univariate effect estimates and combine them into a single score (e.g. PRScs, LDpred2, lassosum). More recently, methods have been developed that can be applied directly to individual-level genotype data to model the variants' effects simultaneously (e.g. BayesR, snpnet). In this context, we introduced snpboost, a framework that applies statistical boosting to individual-level genotype data to estimate PRS directly via multivariable regression models. By iteratively working on batches of variants, snpboost can deal with large-scale cohort data, e.g. from the UK Biobank.
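
As a rough illustration of the batchwise idea (not the actual interface or algorithmic details of the snpboost R package), the following sketch runs componentwise L2-boosting within batches of the variants most correlated with the current residuals; all names and tuning values are illustrative.

# Conceptual sketch of componentwise L2-boosting on batches of variants
import numpy as np

def componentwise_boost(G, y, n_outer=20, batch_size=100, n_inner=50, nu=0.1):
    """G: n x p matrix of (centered) genotype dosages, y: quantitative trait."""
    n, p = G.shape
    beta = np.zeros(p)
    resid = y - y.mean()
    for _ in range(n_outer):
        # pick the batch of variants most correlated with the current residuals
        score = np.abs(G.T @ resid)
        batch = np.argsort(score)[-batch_size:]
        for _ in range(n_inner):
            # componentwise base-learners: univariate least squares within the batch
            num = G[:, batch].T @ resid
            den = (G[:, batch] ** 2).sum(axis=0)
            coef = num / den
            rss = ((resid[:, None] - G[:, batch] * coef) ** 2).sum(axis=0)
            j = int(np.argmin(rss))                  # best-fitting variant in the batch
            beta[batch[j]] += nu * coef[j]           # weak update with learning rate nu
            resid = resid - nu * coef[j] * G[:, batch[j]]
    return beta

# toy data: 1,000 individuals, 5,000 variants, sparse true signal
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(1000, 5000)).astype(float)
G -= G.mean(axis=0)                                  # center genotype columns
beta_true = np.zeros(5000)
beta_true[:10] = 0.5
y = G @ beta_true + rng.normal(size=1000)
beta_hat = componentwise_boost(G, y)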

With these technical obstacles solved, the methodological scope can now be broadened, focusing on the objectives that are key for the clinical application of PRS. Like many other methods, snpboost has so far focused solely on quantitative and binary traits based on common loss functions such as the squared error and logistic loss. Exploiting the modular structure of statistical boosting, we have now incorporated alternatives. As the loss function defines the type of regression problem that is optimized, we effectively extended the snpboost framework to further data situations such as time-to-event and count data. Furthermore, alternative loss functions allow us to focus not only on the mean of the conditional distribution but also on other aspects that may be more helpful in the risk stratification of individual patients. In particular, we illustrate two main applications:

First, for time-to-event data, it is of interest to stratify the lifetime risk with respect to the genetic predisposition, e.g. to implement earlier preventive examinations. In the field of PRS modelling it is common practice to derive a PRS for the binary response of the occurrence of a disease and, in a second step, to incorporate this PRS in a Cox proportional hazards model to stratify lifetime risk. In contrast to this approach, we specifically model time-to-event data by using appropriate loss functions (i.e. weighted L2-boosting or Cox proportional hazards models) and show that optimizing the PRS directly with respect to the aim of predicting the course of the disease is favorable for time-to-event data.

Second, we include a loss function to fit quantile regression within the snpboost framework. While most commonly used methods only provide point estimates for a trait, quantile regression enables us to construct individual prediction intervals quantifying the uncertainty of the prediction for a single patient. Furthermore, quantile regression includes median regression as a special case; median regression is a robust alternative to classical mean regression and might be more suitable for traits with outlier measurements.
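
To illustrate how swapping the loss changes the target of the fit, the snippet below defines the check (pinball) loss and its negative gradient for a quantile level tau. This is a generic sketch of the loss only, not code from the snpboost framework.

# Check (pinball) loss for quantile regression; tau = 0.5 gives median regression.
import numpy as np

def pinball_loss(y, f, tau):
    """Mean check loss of predictions f at quantile level tau."""
    u = y - f
    return np.mean(np.where(u >= 0, tau * u, (tau - 1) * u))

def pinball_negative_gradient(y, f, tau):
    """Negative gradient; replaces the residuals used with the squared-error loss."""
    return np.where(y - f >= 0, tau, tau - 1)

# An 80% prediction interval could be formed from the 0.1 and 0.9 quantile fits:
# lower = boosted_fit(tau=0.1); upper = boosted_fit(tau=0.9)   # hypothetical fits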



11:40am - 12:00pm

Comparison of classic polygenic scores with machine learning algorithms to predict blood pressure

Tanja K. Rausch, Silke Szymczak, Inke R. König

Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

Blood pressure is a frequently measured clinical parameter, with hypertension being the leading risk factor for the development of cardiovascular disease. Based on the polygenic heritability shown for complex traits like blood pressure, polygenic scores (PGS) are increasingly being used in preclinical and clinical research to stratify individuals according to their genetic susceptibility for targeted prevention, therapy, or prognosis. However, classical PGS use a simple sum of individual genotypes, weighted by the association estimates from single-variant genome-wide association studies. Thus, multivariable and non-linear effects are not taken into account. Alternatively, machine learning algorithms can be used for such a score construction.
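
For concreteness, a classical weighted-sum PGS can be sketched as follows; the dosage matrix and effect estimates here are synthetic placeholders, and the variable names are illustrative.

# Classical polygenic score: allele dosages weighted by per-variant GWAS effect estimates.
import numpy as np

def classical_pgs(dosages, gwas_betas):
    """dosages: n_individuals x n_variants (0/1/2), gwas_betas: n_variants."""
    return dosages @ gwas_betas          # simple weighted sum, no interactions or non-linearities

rng = np.random.default_rng(42)
dosages = rng.binomial(2, 0.3, size=(100, 1000)).astype(float)
gwas_betas = rng.normal(0, 0.05, size=1000)   # univariate GWAS effect estimates
pgs = classical_pgs(dosages, gwas_betas)      # one score per individual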

Machine learning algorithms have not yet been applied to construct polygenic scores to predict blood pressure. Therefore, it is unclear whether more complex algorithms are better able to predict blood pressure than classical scores. This study aims to investigate this question by using different machine learning algorithms suitable for regression problems, such as random forest, linear regression, support vector regression, and k-nearest neighbors regression. For the benchmarking, data from the UK Biobank were used, a biomedical database containing genetic and health information from half a million participants from the United Kingdom. The data set was split into a training and a test data set. The training data set was used to generate a simple weighted PGS for blood pressure by performing a genome-wide association study. Moreover, it was used to train several more complex machine learning algorithms. Hyperparameter tuning and, where applicable, variable selection were performed. Prediction performance of the resulting models was compared on the independent test data set using the mean squared error (MSE) and the coefficient of determination (R2).
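
A hedged sketch of such a benchmark is given below, using scikit-learn implementations of the four learners and evaluating MSE and R2 on a held-out test set. The synthetic data stand in for the UK Biobank genotypes and blood pressure values, and hyperparameter tuning is omitted for brevity.

# Benchmark sketch: several regression learners compared on a held-out test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(2000, 200)).astype(float)          # toy genotype dosages
y = X[:, :20] @ rng.normal(0, 0.3, 20) + rng.normal(size=2000)    # toy blood pressure

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "linear regression": LinearRegression(),
    "support vector regression": SVR(),
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=10),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(name, mean_squared_error(y_test, pred), r2_score(y_test, pred))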

The study results provide better insight into whether compressed genetic information obtained by complex machine learning algorithms performs better than classical PGS in predicting blood pressure.



12:00pm - 12:20pm

Investigating different numbers of variants in polygenic scores using the ALLIANCE cohort

Lisa-Marie Nuxoll1, Lea Louisa Kronziel1,2, Inke R. König1,2

1Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany; 2Airway Research Centre North (ARCN), Member of the German Centre for Lung Research (DZL), Lübeck, Germany

A polygenic score (PGS) can be used to estimate an individual’s genetic liability to a trait or disease. For this score, the individual’s genotype information is weighted with results from a genome-wide association study (GWAS) to calculate an individual score for the observed trait or disease. As sample sizes of GWAS increase, PGS may become more powerful in the near future and will be valuable in personalized medicine. However, the value and usefulness of PGS depend on method development to construct PGS, proper use of PGS in analysis and appropriate interpretation of results.

To date, new PGS for various traits are constantly being developed and published. Importantly, new PGS may be developed for a particular trait (e.g., lung function), even though PGS are already available for that trait. These PGS for the same trait might then differ in terms of the selection of variants and/or in the number of variants they contain. The number of variants in a PGS is likely to affect the strength of association with the trait and may also play a role in subsequent clinical applications. Thus, a larger number of variants in a PGS may provide a more precise prediction, but, at the same time, may have the disadvantage that many required variants may not be present in the target dataset, making replication challenging. In addition, a larger number of variants in a PGS may lead to overfitting. This trade-off has not been analyzed systematically.

Therefore, our study investigates the effect of the number of variants in previously published PGS on prediction performance for the lung function traits FEV1/FVC [1,2] and FEV1 [2]. The PGS were calculated for participants of a German pediatric asthma cohort (ALLIANCE) [3] comprising 526 children with asthma and 249 children without asthma. The considered PGS use 279 variants [1] and 1,713,430 and 1,232,916 variants [2], respectively. After calculating the PGS for the ALLIANCE participants, the distributions of the PGS were examined and various association analyses were performed. Additional clinical variables were also considered, as models with clinical and genetic information can provide higher accuracy than models containing only genetic information.
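
As a purely illustrative sketch (not the authors' analysis), an association model using the PGS alone can be compared with one that adds clinical covariates; the covariates, cohort structure, and simulated data below are hypothetical.

# Comparing PGS-only and PGS + clinical covariate models for a binary phenotype.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 775                                    # roughly 526 + 249 participants as in the abstract
pgs = rng.normal(size=n)
age = rng.uniform(4, 18, size=n)           # hypothetical clinical covariate
sex = rng.binomial(1, 0.5, size=n)         # hypothetical clinical covariate
asthma = rng.binomial(1, 1 / (1 + np.exp(-(0.2 * pgs + 0.03 * age - 0.5))))

X_pgs = sm.add_constant(np.column_stack([pgs]))
X_full = sm.add_constant(np.column_stack([pgs, age, sex]))
fit_pgs = sm.Logit(asthma, X_pgs).fit(disp=0)
fit_full = sm.Logit(asthma, X_full).fit(disp=0)
print(fit_pgs.llf, fit_full.llf)           # compare models, e.g. via log-likelihood or AIC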

The calculation of the PGS published by Moll et al. was found to be problematic because they contain many variants that are not present in the available genetic data of the ALLIANCE cohort. Therefore, an extensive proxy search had to be performed, which may reduce accuracy and introduce bias. Subsequent association analyses showed no associations between the PGS by Shrine et al. and the asthma phenotypes in the ALLIANCE cohort. However, since the numbers of variants in the considered PGS differ greatly, no conclusion about overfitting can be drawn.

Literature:

  1. Shrine, N., Guyatt, A.L., Erzurumluoglu, A.M. et al. 2019 Nat Genet 51:481–93; doi: https://doi.org/10.1038/s41588-018-0321-7
  2. Moll, M., Sakornsakolpat, P. et al. 2020 Lancet Respir Med 8:696–708; doi: https://doi.org/10.1016/S2213-2600(20)30101-6
  3. Fuchs, O., Bahmer, T., Weckmann, M. et al. 2018 BMC Pulm Med 18:140; doi: https://doi.org/10.1186/s12890-018-0705-6


12:20pm - 12:40pm

Similarity as a basis for data pooling: improving local prediction models using external data

Max Behrens1, Maryam Farhadizadeh1, Astrid Pechmann3, Janbernd Kirschner3, Angelika Rohde2, Daniela Zöller1

1Institute of Medical Biometry and Statistics, University of Freiburg, Germany; 2University of Freiburg, Department of Mathematical Stochastics, Freiburg im Breisgau, Germany; 3Department of Neuropediatrics and Muscle Disorders, Faculty of Medicine, Medical Center – University of Freiburg

Combining data from different sites can provide a larger and more diverse dataset for analysis, which can lead to improved prediction models. For example, when studying a rare disease, a single site may not have enough patients to build a reliable prediction model. However, differences in patient care, case mix, and other factors can pose significant challenges when including data from other sites. Specifically, this heterogeneity may introduce bias and result in decreased prediction performance for the target site population if not addressed properly. Further challenges arise when sample sizes are small, making approaches with a high number of parameters unsuitable. As an example, we consider data from the SMArtCARE registry on patients diagnosed with the rare genetic disease spinal muscular atrophy (SMA). Treatment and the evaluation of disease progression include physiotherapeutic assessments, which depend strongly on the data site, requiring site-specific prediction models for the time to reach a mobility milestone in SMA patients or for the mobility score at a specific time point. To address this problem, we propose to quantify the similarity between the target site and each external site and to use this information to include external sites in a weighted manner.

Specifically, we propose to estimate the probability of an individual belonging to the target site using pairwise logistic regression models and to use this probability to assign higher weights to external individuals who are similar to the target site individuals than to less similar external individuals when building the prediction model. This process is repeated for all pairwise comparisons between the target site and each of the external sites. To incorporate multiple external sites, we standardize the weights across all of them. Since the approach is based on weights, it can easily be applied to different types of outcomes and prediction models.
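
A minimal sketch of this weighting idea, under simplifying assumptions (continuous outcome, scikit-learn estimators, one possible standardization of the weights), could look as follows; the function and variable names are illustrative, not the authors' implementation.

# Pairwise logistic regression gives each external individual a probability of
# "looking like" the target site; that probability is used as a case weight.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def similarity_weights(X_target, X_external):
    """Estimated probability that an external individual belongs to the target site."""
    X = np.vstack([X_target, X_external])
    site = np.r_[np.ones(len(X_target)), np.zeros(len(X_external))]   # 1 = target site
    clf = LogisticRegression(max_iter=1000).fit(X, site)
    return clf.predict_proba(X_external)[:, 1]

# toy data: one target site and two external sites with shifted covariates
rng = np.random.default_rng(7)
X_t, y_t = rng.normal(0, 1, (80, 5)), rng.normal(size=80)
externals = [(rng.normal(0.5, 1, (200, 5)), rng.normal(size=200)),
             (rng.normal(2.0, 1, (200, 5)), rng.normal(size=200))]

w_all = np.concatenate([similarity_weights(X_t, X_e) for X_e, _ in externals])
w_all = w_all / w_all.sum() * len(w_all)       # one possible standardization across external sites

X_pool = np.vstack([X_t] + [X_e for X_e, _ in externals])
y_pool = np.concatenate([y_t] + [y_e for _, y_e in externals])
weights = np.concatenate([np.ones(len(X_t)), w_all])   # target individuals keep weight 1
model = LinearRegression().fit(X_pool, y_pool, sample_weight=weights)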

In addition to demonstrating the approach using the SMArtCARE registry data, we will evaluate our proposed method using an extensive simulation study, also comparing it to classical approaches like mixed models and regression models with interactions. We demonstrate that the proposed method can overcome the challenges posed by heterogeneity between sites in multi-site data settings and improve the prediction performance of models for a target site. Our approach quantifies similarity between sites using logistic regression and incorporates this information to include external data when building prediction models.


