Conference Agenda

Session Overview
Session
S6: Statistical Modeling I
Time:
Monday, 04/Sept/2023:
11:00am - 12:40pm

Session Chair: Sereina Herzog
Session Chair: Achim Guettner
Location: Seminar Room U1.197 hybrid


Presentations
11:00am - 11:20am

Modeling the Ratio of Gamma Distributed Random Variables using Frank's Copula

Moritz Berger1, Nadja Klein2, Matthias Schmid1

1Institute of Medical Biometry, Informatics and Epidemiology, Medical Faculty, University of Bonn; 2Chair of Uncertainty Quantification and Statistical Learning, Research Center for Trustworthy Data Science and Security (UA Ruhr) and Department of Statistics (Technical University Dortmund)

In clinical and epidemiological studies one frequently encounters the ratio of two possibly correlated components. Typical examples include the LDL/HDL cholesterol ratio in cardiovascular research, the CD4/CD8 ratio in HIV research and the GEFC/REFC ratio in fundus autofluorescence imaging. In regression analysis with a ratio outcome, a reasonable assumption is that each of the two components follows a gamma distribution, which accounts for the positivity of the component values and the skewness of their distributions. If independence between the two components can be assumed, the ratio of the two gamma distributed variables follows a generalized beta distribution of the second kind (GB2; Kleiber and Kotz, 2003). Several regression approaches for the GB2 distribution have been proposed recently. For positively correlated components, Berger et al. (2019) developed a regression model based on Kibble’s bivariate gamma distribution, where one of the parameters is directly interpretable in terms of the Pearson correlation coefficient between the two components. For the ratio of two negatively correlated components, however, no regression modeling strategy exists so far.

To address this issue, we propose a regression model where the joint bivariate distribution of the two gamma distributed random variables is given by Frank’s copula (Genest, 1987). The model explicitly accounts for a negative (or positive) correlation between the two components. It also allows for different forms of the two marginal distributions with possibly unequal rate and shape parameters. The probability density function of the ratio, conditional on covariate values and the distributional parameters of interest, can be derived in a very flexible way. We illustrate the approach by analyzing data from dementia research, where cerebrospinal fluid biomarkers are used for the early diagnosis of Alzheimer’s disease. In this application, measurements of the amyloid-beta 42 protein and total tau protein exhibit a clearly negative correlation.

  • M. Berger, M. Wagner, and M. Schmid. Modeling biomarker ratios with gamma distributed components. The Annals of Applied Statistics, 13:548–572, 2019.
  • C. Genest. Frank’s family of bivariate distributions. Biometrika, 74:549–555, 1987.
  • C. Kleiber and S. Kotz. Statistical Size Distributions in Economics and Actuarial Sciences. Wiley, Hoboken, 2003.
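The construction described in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' regression model: it draws pairs from Frank's copula by inverting the conditional distribution function and forms their ratio. To stay within the Python standard library it assumes exponential margins (gamma distributions with shape 1), and the function names, the dependence parameter `theta` and the rates are hypothetical choices made only for illustration.

```python
import math
import random


def frank_conditional_sample(u, w, theta):
    """Draw v given u from Frank's copula by inverting the conditional CDF.

    theta != 0; theta < 0 gives negative and theta > 0 positive dependence.
    """
    g = math.exp(-theta) - 1.0
    a = math.exp(-theta * u)
    b = w * g / (w + (1.0 - w) * a)
    return -math.log1p(b) / theta


def simulate_ratio(n, theta, rate_x, rate_y, seed=1):
    """Simulate n pairs (x, y) and their ratio x / y.

    Assumption for this sketch: both margins are exponential, i.e. gamma
    with shape 1, so the quantile function has a closed form.
    """
    rng = random.Random(seed)
    eps = 1e-12  # keep uniforms strictly inside (0, 1)
    xs, ys = [], []
    for _ in range(n):
        u = min(max(rng.random(), eps), 1.0 - eps)
        w = min(max(rng.random(), eps), 1.0 - eps)
        v = min(max(frank_conditional_sample(u, w, theta), eps), 1.0 - eps)
        xs.append(-math.log(1.0 - u) / rate_x)  # exponential quantile
        ys.append(-math.log(1.0 - v) / rate_y)
    ratios = [x / y for x, y in zip(xs, ys)]
    return xs, ys, ratios


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    xs, ys, ratios = simulate_ratio(2000, theta=-8.0, rate_x=1.0, rate_y=2.0)
    print("correlation of components:", round(pearson(xs, ys), 3))
    print("median ratio:", round(sorted(ratios)[len(ratios) // 2], 3))
```

With a negative `theta` the simulated components are negatively correlated, which is the case the abstract identifies as not covered by earlier approaches. General gamma margins would need a gamma quantile function, which the standard library does not provide.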


11:20am - 11:40am

Simplifying complex models: deselection for boosting distributional copula regression

Annika Strömer1, Nadja Klein2, Christian Staerk1, Hannah Klinkhammer1, Andreas Mayr1

1Department of Medical Biometrics, Informatics and Epidemiology, University Hospital Bonn, Germany; 2Chair of Uncertainty Quantification and Statistical Learning, Research Center Trustworthy Data Science and Security (UA Ruhr) and Department of Statistics (Technische Universität Dortmund)

Boosting distributional copula regression is a useful and flexible tool to jointly model multivariate outcomes, in which all parameters of the joint response distribution are related to covariates via additive predictors. Estimating and selecting the model through model-based boosting provides several useful features, such as the ability to model high-dimensional data situations. Additionally, boosting can incorporate data-driven variable selection simultaneously for all parameters of the marginal distributions as well as the association parameter of the copula. However, as known from univariate (distributional) regression models, the algorithm tends to select too many variables, particularly in low-dimensional settings (p < n). In these situations, the algorithm exhibits slow overfitting behaviour, which leads to the inclusion of many variables of only minor importance and thus to a large model that is difficult to interpret.

To counteract this behaviour, we adapt a recent deselection approach for statistical boosting to multivariate (copula) regression models to deselect base-learners with only a negligible impact on the overall performance of the model.

In a simulation study, we evaluate the performance of our deselection approach and additionally compare it to well-known methods to enhance variable selection such as stability selection and probing. All approaches effectively reduce the number of false positives. Probing has the shortest runtime but yields a lower predictive performance than the classical boosted model. Stability selection and our deselection approach achieve a predictive performance similar to that of the classical approach, but stability selection has the longest computation time, which renders it infeasible for high-dimensional data.

Furthermore, we illustrate our deselection approach on high-dimensional genomic cohort data from the UK Biobank by modelling the joint genetic predisposition of two continuous phenotypes. Both outcomes are not only non-Gaussian but also have an association that differs depending on the observed predictor variables, which justifies the need for a distributional copula regression model. Our results suggest that the approach is able to reduce the model complexity (thereby improving interpretability) while still achieving comparable predictive performance.
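The deselection idea in this abstract can be sketched on a much simpler model. The example below is an illustration under stated assumptions, not the authors' copula implementation: it uses univariate componentwise L2 boosting with linear base-learners, attributes to each base-learner the risk reduction it achieved, and deselects base-learners whose share of the total risk reduction falls below a threshold. The threshold `tau`, the step length `nu`, the simulated data and all function names are hypothetical.

```python
import random


def boost(X, y, cols, n_iter=100, nu=0.1):
    """Componentwise L2 boosting with one linear base-learner per covariate.

    X[j] is the j-th covariate column. Returns the coefficients and the
    risk reduction (drop in residual sum of squares) attributed to each
    base-learner over all iterations in which it was selected.
    """
    offset = sum(y) / len(y)
    resid = [yi - offset for yi in y]
    coef = {j: 0.0 for j in cols}
    risk_red = {j: 0.0 for j in cols}
    for _ in range(n_iter):
        rss_old = sum(r * r for r in resid)
        best = None
        for j in cols:
            sxx = sum(v * v for v in X[j])
            beta = sum(v * r for v, r in zip(X[j], resid)) / sxx
            gain = beta * beta * sxx  # RSS drop of a full least-squares fit
            if best is None or gain > best[0]:
                best = (gain, j, beta)
        _, j, beta = best
        # Update only the best-fitting base-learner, shrunken by nu.
        resid = [r - nu * beta * v for v, r in zip(X[j], resid)]
        coef[j] += nu * beta
        risk_red[j] += rss_old - sum(r * r for r in resid)
    return coef, risk_red


def deselect(risk_red, tau=0.01):
    """Keep base-learners contributing at least tau of the total risk reduction."""
    total = sum(risk_red.values())
    return sorted(j for j, r in risk_red.items() if r >= tau * total)


def make_data(n=200, p=10, seed=7):
    """Toy data: only the first two of p covariates are informative."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [2.0 * X[0][i] - 1.5 * X[1][i] + rng.gauss(0, 1) for i in range(n)]
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    coef, risk_red = boost(X, y, cols=list(range(len(X))))
    kept = deselect(risk_red)
    coef_refit, _ = boost(X, y, cols=kept)  # boost again with the kept set
    print("initially selected:", sorted(j for j, c in coef.items() if c != 0.0))
    print("kept after deselection:", kept)
```

In the multivariate copula setting of the abstract, the same bookkeeping would presumably run over the base-learners of every distribution parameter, including the association parameter; that extension is not shown here.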



11:40am - 12:00pm

Random graphical model of microbiome interactions in related environments

Veronica Vinciotti1, Ernst Wit2, Francisco Richter2

1University of Trento, Italy; 2Università Svizzera italiana, Switzerland

Multivariate data are typically collected under different environments, such as different biological conditions or time points. The interest is often to discover the dependencies between the variables that are specific to each environment as well as structural similarities between the environments. We propose a computational approach for the joint inference of graphical models from different environments. A random graph generative model is introduced to capture relatedness at the structural level across the different environments. In addition, the model allows for the inclusion of external covariates at both the node and interaction levels, further adapting to the richness and complexity of high dimensional data from many application areas. We consider closely the inference of microbiota systems from metagenomic data for a number of body sites.



12:00pm - 12:20pm

Methods of Model selection for models with common parameters

Onur Gül, Kirsten Schorning

TU Dortmund, Germany

The analysis of gene-expression data leads to a high-dimensional statistical problem where thousands of concentration-response data sets have to be analysed. For instance, the Valproic acid (VPA) data set contains information about the concentration-response relationship of more than 20,000 genes. Fitting a separate non-linear model to each of these concentration-response data sets leads to a complex model with many parameters and a corresponding high-dimensional estimator with high variance.

If we assume that some genes behave similarly and that the corresponding concentration-response data can be fitted by non-linear models with common parameters, the number of unknown parameters can be reduced substantially. In particular, it might be reasonable that the concentrations at which 50% of the maximum effect is achieved (EC_{50}) are at least similar for some genes, so that these parameters can be assumed to be the same across the considered non-linear models. This assumption reduces the variance of the lower-dimensional parameter estimator but also introduces a bias, because the parameters assumed to be shared are only similar, not identical.

In this talk, we answer the question under which circumstances the less complex model with the additional assumption of common parameters should be used instead of the complex model where all genes are considered separately. More precisely, we derive asymptotic properties of the estimators in each of the models in order to calculate the asymptotic mean squared errors. Based on these asymptotic results, we derive a model selection criterion which selects the model (with common parameters) leading to the smallest mean squared error.

We show in a simulation study that the derived model selection criterion performs well in comparison to other common selection criteria. Moreover, we apply the developed model selection criterion to the VPA data set in order to estimate the EC_{50}.
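The bias-variance trade-off behind this criterion can be made concrete with a toy calculation. The sketch below is not the authors' criterion for non-linear concentration-response models: it assumes the simplest possible case, two groups of n observations each with a common standard deviation `sigma`, where the parameter of interest is a group mean and the two true means differ by `delta`. All names and values are hypothetical.

```python
def mse_separate(sigma, n):
    """MSE of a group's own sample mean: unbiased, variance sigma^2 / n."""
    return sigma ** 2 / n


def mse_common(sigma, n, delta):
    """MSE of the pooled mean of both groups as an estimator of one group's mean.

    Pooling halves the variance but introduces a bias of delta / 2,
    because the two true means are only similar, not identical.
    """
    variance = sigma ** 2 / (2 * n)
    bias = delta / 2
    return variance + bias ** 2


def choose_model(sigma, n, delta):
    """Select the model with the smaller mean squared error."""
    if mse_common(sigma, n, delta) < mse_separate(sigma, n):
        return "common"
    return "separate"


if __name__ == "__main__":
    for delta in (0.2, 1.0):
        print(delta, choose_model(sigma=1.0, n=10, delta=delta))
```

In this toy setting the common-parameter model wins exactly when delta^2 < 2 * sigma^2 / n, so sharing a parameter pays off for small differences and small samples. In practice `delta` and `sigma` are unknown and would have to be estimated.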



12:20pm - 12:40pm

Comparing statistical methods for analyzing longitudinally measured ordinal outcomes in rare disease settings.

Martin Geroldinger1,2, Johan Verbeeck3, Konstantin E. Thiel1,2, Geert Molenberghs3,4, Arne C. Bathke5, Martin Laimer6, Georg Zimmermann1,2

1Team Biostatistics and Big Medical Data, IDA Lab Salzburg, Paracelsus Medical University Salzburg, Austria; 2Department of Research and Innovation, Paracelsus Medical University, Salzburg, Austria; 3Data Science Institute (DSI), Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Hasselt University, Belgium; 4Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), KULeuven, Belgium; 5Intelligent Data Analytics (IDA) Lab Salzburg, Department of Artificial Intelligence and Human Interfaces, Faculty of Digital and Analytical Sciences, Paris Lodron University of Salzburg, Austria; 6Department of Dermatology and Allergology, Paracelsus Medical University, Salzburg, Austria

Ordinal data in a repeated measures design of a cross-over study for rare diseases usually do not allow for the use of standard parametric methods. Hence, nonparametric methods should be considered instead. Determination of an appropriate nonparametric approach is likewise challenging, as only limited simulation studies for complex trial designs with very small sample sizes exist. Referring to a cross-over trial for the genodermatosis epidermolysis bullosa, a rank-based approach using the R package nparLD and different generalized pairwise comparisons (GPC) methods were assessed in a neutral comparative simulation study. The results revealed no single best method for this particular design, since a trade-off became apparent between achieving high power, accounting for period effects, and controlling for missing data. Specifically, nparLD as well as unmatched GPC approaches did not address cross-over aspects, and the univariate GPC variants partly ignored longitudinal information. The matched GPC approaches, on the other hand, took the cross-over effect into account in the sense of incorporating the within-subject association. Overall, the prioritized unmatched GPC method achieved the highest power in the simulation scenarios, although this may be due to the specified prioritization. The rank-based approach yielded good power even at a sample size of N=6, while the matched GPC method could not control the type I error. Together with the results from extensive simulation studies using binary and count outcome data, our findings will add to the development of recommendations and educational materials which will be disseminated in the statistical as well as in the clinical scientific community. This should improve the accuracy of methodological approaches in clinical research on rare diseases.

(This research has been conducted within the framework of the EBStatMax project, which is funded by the European Joint Programme on Rare Diseases, EU Horizon 2020 grant no. 825575. The authors gratefully acknowledge the support of the WISS 2025 project 'IDA-Lab Salzburg' (20204-WISS/225/197-2019 and 20102-F1901166-KZP).)
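For readers unfamiliar with generalized pairwise comparisons, the prioritized unmatched variant mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the procedure evaluated in the study: it assumes that each subject contributes a tuple of ordinal outcomes listed in priority order, that higher values are better, and that there are no missing values. The function names and the toy data are hypothetical.

```python
def compare(t, c):
    """Return +1 (win), -1 (loss) or 0 (tie) for a treated/control pair.

    Outcomes are compared in priority order; a lower-priority outcome is
    consulted only if all higher-priority outcomes are tied.
    """
    for a, b in zip(t, c):
        if a > b:
            return 1
        if a < b:
            return -1
    return 0


def net_benefit(treated, control):
    """Unmatched GPC: (wins - losses) over all treated-control pairs."""
    wins = losses = 0
    for t in treated:
        for c in control:
            s = compare(t, c)
            if s == 1:
                wins += 1
            elif s == -1:
                losses += 1
    return (wins - losses) / (len(treated) * len(control))


if __name__ == "__main__":
    treated = [(3, 1), (2, 2)]
    control = [(2, 0), (2, 2), (3, 0)]
    print(net_benefit(treated, control))  # 4 wins, 1 loss, 1 tie -> 0.5
```

A matched variant would instead compare each subject's outcomes under the two treatments, which is how the within-subject association of a cross-over design enters; that variant is not shown here.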



 
Conference: CEN 2023