Conference Agenda

Overview and details of the sessions of this conference.

Session Overview
Session
S60: Variable selection
Time:
Thursday, 07/Sept/2023:
8:30am - 10:10am

Session Chair: Hans Ulrich Burger
Session Chair: Fred Sorenson
Location: Lecture Room U1.101 (hybrid)


Presentations
8:30am - 8:50am

Do we need different variable selection procedures depending on the goal of the statistical model?

Theresa Ullmann, Georg Heinze, Daniela Dunkler

Medical University of Vienna, Center for Medical Data Science, Section for Clinical Biometrics, Vienna, Austria

Data-driven variable selection is frequently performed in statistical modeling, i.e., when modeling the associations between an outcome and multiple independent variables. Variable selection may improve the interpretability, parsimony and/or predictive accuracy of a model (Heinze et al. 2018, doi:10.1002/bimj.201700067). Many different methods for variable selection have been proposed, such as selection based on significance criteria (e.g., backward elimination), or methods based on penalized likelihoods (e.g., the LASSO).
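The two families of methods named above can be illustrated with a minimal, hedged sketch (not part of the abstract): synthetic data, significance-based backward elimination via statsmodels, and the LASSO via scikit-learn. All variable names and the threshold alpha = 0.157 (roughly AIC-like for 1 d.f.) are illustrative assumptions.

    # Illustrative sketch only: backward elimination vs. LASSO on synthetic data.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(1)
    n, p = 200, 10
    X = pd.DataFrame(rng.normal(size=(n, p)), columns=[f"x{j}" for j in range(p)])
    y = 1.5 * X["x0"] - 1.0 * X["x1"] + rng.normal(size=n)  # only x0 and x1 carry signal

    def backward_elimination(X, y, alpha=0.157):
        """Repeatedly drop the least significant variable until all p-values < alpha."""
        selected = list(X.columns)
        while selected:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            pvals = fit.pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] < alpha:
                break
            selected.remove(worst)
        return selected

    print("Backward elimination:", backward_elimination(X, y))

    lasso = LassoCV(cv=10).fit(X, y)  # penalized-likelihood selection
    print("LASSO:", [c for c, b in zip(X.columns, lasso.coef_) if b != 0])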

Less attention has been given to the fact that the specific purpose of variable selection depends on the goal of modeling. Shmueli (2010, doi:10.1214/10-STS330) distinguished between three main types of statistical models: descriptive, explanatory, and predictive. In descriptive modeling, researchers aim to describe the relationship between the outcome and the independent variables in a parsimonious manner. Here, variable selection may help to generate simple and interpretable models. In explanatory modeling, researchers are interested in estimating the causal effect of a specific explanatory variable, often an intervention, on the outcome adjusted for confounders. The confounders are typically chosen a priori based on domain expertise (e.g., with the help of directed acyclic graphs). Still, researchers might expect data-driven variable selection to increase the precision of the effect estimate by eliminating confounders with negligible association with the outcome. Finally, in predictive modeling, the main goal is to predict the outcome as accurately as possible. Here, variable selection may help to remove noise and thus reduce the prediction error. Sometimes, modeling has multiple goals, e.g., to find a descriptive model that is also suitable for prediction. In such situations, variable selection must serve multiple purposes.

In this talk, we will first discuss variable selection in the context of different modeling goals. We will then present the results of a simulation study in which we evaluated different variable selection methods, including backward elimination, the LASSO, and others. Multivariable data are simulated based on real-world data from the National Health and Nutrition Examination Survey (NHANES), and different sample sizes and values of R² are considered. We evaluate the results according to various performance criteria, e.g., the effect of selection on the bias and variance of the coefficient estimates, the effect on the prediction error, and the selection rate of the ‘true’ model. In interpreting the results, we place particular focus on which estimands and performance criteria are most relevant for which modeling goals. For example, in explanatory modeling, the effect estimator is the main estimand, whereas in descriptive modeling, a particular focus is on selecting the ‘true’ model, or at least not missing the most relevant variables. For any method, performance of every kind is strongly associated with sample size and the underlying R². Consequently, the choice of a variable selection method should take into account knowledge and assumptions about these main drivers of performance, as well as the modeling goal. Our talk encourages data analysts to think carefully about their modeling goals before planning their analysis.
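As a hedged illustration of how such performance criteria can be computed (a toy setup with independent normal predictors, not the authors' NHANES-based simulation design; the LASSO stands in for the full set of methods compared in the talk):

    # Toy simulation sketch: bias/variance of estimates, prediction error, and
    # 'true model' selection rate for LASSO selection at a given n and R^2.
    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(7)
    p = 8
    beta = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    true_set = set(np.flatnonzero(beta))

    def simulate(n, r2, reps=200):
        est, pred_err, hit_true = [], [], 0
        for _ in range(reps):
            X = rng.normal(size=(n, p))
            signal = X @ beta
            sigma = np.sqrt(signal.var() * (1 - r2) / r2)  # calibrate noise to target R^2
            y = signal + rng.normal(scale=sigma, size=n)
            fit = LassoCV(cv=5).fit(X, y)
            est.append(fit.coef_)
            hit_true += set(np.flatnonzero(fit.coef_)) == true_set
            X_new = rng.normal(size=(n, p))                # independent test data
            y_new = X_new @ beta + rng.normal(scale=sigma, size=n)
            pred_err.append(np.mean((y_new - fit.predict(X_new)) ** 2))
        est = np.array(est)
        return {"bias": est.mean(axis=0) - beta,
                "variance": est.var(axis=0),
                "prediction_error": float(np.mean(pred_err)),
                "true_model_rate": hit_true / reps}

    print(simulate(n=100, r2=0.3))
    print(simulate(n=500, r2=0.7))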

This work was supported through the Austrian Science Fund FWF [project I-4739-B].



8:50am - 9:10am

A neutral comparison of algorithms to minimize L0 penalties for high-dimensional variable selection

Florian Frommlet

Medical University Vienna, Austria

Variable selection methods based on L0 penalties have excellent theoretical properties for selecting sparse models in a high-dimensional setting. There exist modifications of the BIC which control either the family-wise error rate (mBIC) or the false discovery rate (mBIC2) with respect to which regressors are selected into a model. However, minimizing an L0 penalty is a mixed-integer problem, which is known to be NP-hard and therefore becomes computationally challenging with increasing numbers of regressor variables. The last few years have seen real progress in developing new algorithms to minimize L0 penalties. Simulation studies covering a wide range of scenarios inspired by genetic association studies, as well as a real-data example concerned with eQTL mapping, are used to compare the performance of some of these algorithms. The study results in a clear recommendation of which algorithms to use in practice.
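For readers unfamiliar with the notation, the criteria referred to here are of the generic L0-penalized form (a standard formulation sketched for orientation, not taken from the talk):

    \hat{\beta} \;=\; \arg\min_{\beta}\,\Big\{ -2\log L(\beta) \;+\; \mathrm{pen}(k) \Big\},
    \qquad k = \|\beta\|_0 = \#\{\, j : \beta_j \neq 0 \,\},

where pen(k) = k log n gives the classical BIC, while mBIC and mBIC2 inflate the penalty with a term that grows with the number of candidate regressors p (more strongly for mBIC, yielding family-wise error rate control; less strongly for mBIC2, yielding false discovery rate control). Because the penalty depends only on the number of nonzero coefficients, the minimization is combinatorial, which is the source of the computational difficulty noted above.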



9:10am - 9:30am

Effects of Influential Points and Sample Size on the Selection and Replicability of Multivariable Fractional Polynomial Models

Willi Sauerbrei, Edwin Kipruto

Medical Center - University of Freiburg, Germany

Background: The multivariable fractional polynomial (MFP) approach combines variable selection using backward elimination with a function selection procedure (FSP) for fractional polynomial (FP) functions. It is a relatively simple approach which can be easily understood without advanced training in statistical modelling. For continuous variables, a closed test procedure is used to decide between no effect, a linear function, an FP1 function, or an FP2 function. Influential points (IPs) and small sample sizes can both have a strong impact on the selected functions and the MFP model.
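As background on the notation (standard FP definitions, not quoted from the abstract): for a positive continuous variable x, the candidate functions are

    \mathrm{FP1}:\;\beta_1 x^{p_1}, \qquad
    \mathrm{FP2}:\;\beta_1 x^{p_1} + \beta_2 x^{p_2}, \qquad
    p_1, p_2 \in \{-2, -1, -0.5, 0, 0.5, 1, 2, 3\},

where x^0 is read as log x and a repeated power p_1 = p_2 = p gives \beta_1 x^{p} + \beta_2 x^{p}\log x. The closed test proceeds from the most complex candidate downward: the best FP2 is tested against the null model (4 d.f.), then against the linear function (3 d.f.), then against the best FP1 (2 d.f.), stopping at the first non-significant step.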

Methods: We used simulated data with six continuous and four categorical predictors to illustrate approaches which can help to identify IPs that influence function selection and the MFP model. The approaches use leave-one-out and leave-two-out analyses as well as two related techniques for a multivariable assessment. Using eight subsamples, we also investigated the effects of sample size and assessed model replicability, the latter by using three non-overlapping subsamples of the same size. For better illustration, a structured profile was used to provide an overview of all analyses conducted.
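The leave-one-out part of this strategy can be sketched as follows (a hedged illustration only: select_model below is a generic placeholder using the LASSO, not the MFP function selection used in the talk, and the leave-two-out and multivariable assessments are omitted):

    # Sketch of the leave-one-out idea: refit the selection procedure with each
    # observation removed and flag observations whose removal changes the result.
    import numpy as np
    from sklearn.linear_model import LassoCV

    def select_model(X, y):
        coef = LassoCV(cv=5).fit(X, y).coef_
        return frozenset(np.flatnonzero(coef))          # indices of selected variables

    def leave_one_out_influence(X, y):
        full = select_model(X, y)
        influential = []
        for i in range(len(y)):
            keep = np.arange(len(y)) != i
            if select_model(X[keep], y[keep]) != full:  # selection changed without obs i
                influential.append(i)
        return full, influential

    rng = np.random.default_rng(3)
    X = rng.normal(size=(120, 6))
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=120)
    y[0] += 15                                          # plant one gross outlier
    print(leave_one_out_influence(X, y))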

Results: The results showed that one or more IPs can drive the functions and models selected. In addition, with small sample size, MFP was not able to detect some non-linear functions and the selected model differed substantially from the true underlying model. However, when the sample size was relatively large and regression diagnostics were carefully conducted, MFP selected functions or models that were similar to the underlying true model.

Conclusions: For smaller sample sizes, IPs and low power are important reasons why the MFP approach may not be able to identify the underlying functional relationships for continuous variables, and selected models might differ substantially from the true model. However, for larger sample sizes, a carefully conducted MFP analysis is often a suitable way to select a multivariable regression model which includes continuous variables. In such cases, MFP can be the preferred approach for deriving a multivariable descriptive model.



9:30am - 9:50am

Post-estimation shrinkage in full and selected linear regression models in low-dimensional data revisited

Edwin Kipruto, Willi Sauerbrei

Medical Center University of Freiburg, Germany

The fit of a regression model to new data is often worse than its fit to the original data due to overfitting. Analysts often employ variable selection techniques when developing a regression model, which can lead to biased estimates. To address overfitting and reduce the bias of estimates induced by variable selection, shrinkage methods have been proposed. Selected variables whose true effects are small are prone to selection bias and can benefit from shrinkage, while variables with large effects generally require little or no shrinkage. Post-estimation shrinkage is a two-step alternative to penalized regression methods that does not rely on optimizing a criterion under specific constraints and can easily be applied to generalized linear models and to regression models for survival data. In the context of a full model, and aiming to derive a good predictor, global shrinkage was proposed; for selected models it was extended to parameterwise shrinkage factors (PWSF). Van Houwelingen and Sauerbrei (2013) conducted a simulation study comparing these two approaches with the Lasso, but only for a few scenarios with a moderate to large signal-to-noise ratio (SNR) and low correlation.
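A rough sketch of one common cross-validation-based construction of these factors (my reading of the standard approach in a plain linear model; the authors' exact setup may differ, and all names below are illustrative):

    # Global and parameterwise post-estimation shrinkage factors via cross-validation.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    def shrinkage_factors(X, y, n_splits=10):
        n, p = X.shape
        contrib = np.zeros((n, p))                    # cross-validated x_ij * beta_j(-i)
        for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
            fit = LinearRegression().fit(X[train], y[train])
            contrib[test] = X[test] * fit.coef_
        eta = contrib.sum(axis=1)                     # cross-validated linear predictor
        c_global = LinearRegression().fit(eta.reshape(-1, 1), y).coef_[0]
        c_pw = LinearRegression().fit(contrib, y).coef_   # one factor per parameter
        return c_global, c_pw

    rng = np.random.default_rng(5)
    X = rng.normal(size=(150, 5))
    y = X @ np.array([1.0, 0.4, 0.1, 0.0, 0.0]) + rng.normal(size=150)
    beta_full = LinearRegression().fit(X, y).coef_
    c_global, c_pw = shrinkage_factors(X, y)
    print("globally shrunken:", c_global * beta_full)
    print("parameterwise shrunken:", c_pw * beta_full)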

Within the framework of a classical linear regression model, we conducted a simulation study with a much broader scope, specifically concerning the amount of information in the data. We assessed whether post-estimation shrinkage methods can improve full and selected models and compared the results with ridge (in full models) and Lasso (in selected models). We also proposed a modified version of PWSF called nonnegative PWSF (NPWSF) to address the weaknesses of PWSF in full models. We investigated prediction errors, bias of estimates, and model sparsity. The results indicate that the performance of methods is influenced by the amount of information in the data, and none of the methods performed best in all scenarios. Post-estimation shrinkage methods can improve the prediction accuracy of both full and selected models and reduce the bias of regression estimates for selected variables.

In full models, PWSF generally performed poorly, while global shrinkage performed similarly to NPWSF in low SNR. However, in moderate to high SNR, NPWSF outperformed global shrinkage. In addition, NPWSF performed better than ridge in low correlation with moderate to high SNR. In selected models, all post-estimation shrinkage methods performed similarly, with global shrinkage being slightly inferior. Lasso outperformed all post-estimation shrinkage methods in low SNR and high correlation but was inferior in low correlation with high SNR.

Our study suggests that, provided the amount of information is not too small, NPWSF is more effective than global shrinkage in improving the prediction performance of both full and selected models. However, in high correlation or very low SNR, penalized methods appear to outperform post-estimation shrinkage methods.

van Houwelingen, H. C., and Sauerbrei, W. (2013). Cross-validation, shrinkage and variable selection in linear regression revisited.



9:50am - 10:10am

High-Dimensional Variable Selection for Competing Risks with Cooperative Penalized Regression

Lukas Burk (1,2,3), Andreas Bender (2,3), Marvin N. Wright (1)

(1) Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen; (2) Department of Statistics, LMU Munich; (3) Munich Center for Machine Learning, LMU Munich

Variable selection is an important step in the analysis of high-dimensional omics data, yet there are limited options for survival outcomes in the presence of competing risks. Commonly employed penalized Cox regression considers each event type separately through cause-specific models, neglecting possibly shared information between them.

We adapt the feature-weighted elastic net (fwelnet), a generalization of the elastic net algorithm, to survival outcomes and competing risks. For two causes, our proposed algorithm fits two alternating cause-specific fwelnet models, where each model receives the coefficient vector of the complementary model as prior information. We dub this “cooperative penalized regression”, as this approach enables the modeling of competing risks data with cause-specific models while accounting for shared information between causes. Predictors that are shrunken towards zero in the model for the first cause will receive larger penalization weights in the model for the second cause and vice versa. Through multiple iterations, this process ensures a stronger penalization of uninformative predictors in both models.
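A deliberately simplified sketch of the alternating idea only (plain linear Lasso models stand in for the cause-specific fwelnet survival models, and the mapping from coefficients to penalty factors is an assumption for illustration, not the authors' fwelnet weighting):

    # Two Lasso 'cause-specific' models exchange penalty factors across iterations.
    import numpy as np
    from sklearn.linear_model import Lasso

    def weighted_lasso(X, y, penalty_factors, alpha=0.1):
        # Per-feature penalty factors via column rescaling: penalizing w_j*|b_j|
        # equals a plain Lasso on X[:, j] / w_j with coefficients rescaled back.
        Xs = X / penalty_factors
        return Lasso(alpha=alpha).fit(Xs, y).coef_ / penalty_factors

    def cooperative_lasso(X, y1, y2, n_iter=5, alpha=0.1):
        p = X.shape[1]
        b1, b2 = np.zeros(p), np.zeros(p)
        for _ in range(n_iter):
            pf1 = 1.0 / (1.0 + np.abs(b2))   # assumed mapping: signal found by the
            pf2 = 1.0 / (1.0 + np.abs(b1))   #  other cause lowers the penalty factor
            b1 = weighted_lasso(X, y1, pf1, alpha)
            b2 = weighted_lasso(X, y2, pf2, alpha)
        return b1, b2

    rng = np.random.default_rng(11)
    X = rng.normal(size=(200, 30))
    y1 = X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=200)    # x0 is shared between causes
    y2 = 0.9 * X[:, 0] + 0.7 * X[:, 2] + rng.normal(size=200)
    b1, b2 = cooperative_lasso(X, y1, y2)
    print(np.flatnonzero(b1), np.flatnonzero(b2))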

We demonstrate our method’s variable selection capabilities on simulated genomics data and real-world bladder cancer microarray data. We evaluate selection performance using the positive predictive value (PPV) and false positive rate (FPR) for the correct selection of informative features. The benchmark compares results with cause-specific penalized Cox regression, random survival forests, and likelihood-boosted Cox regression (CoxBoost). Results show cooperative penalized regression to yield higher PPV and lower FPR in settings where mutual information is present, which indicates that our approach is more effective at selecting informative features, while being less likely to select uninformative features. In settings where no mutual information is present, variable selection performance is similar to cause-specific penalized Cox regression.
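For reference (standard definitions, not from the abstract), with TP, FP, TN, and FN counting selected versus truly informative features,

    \mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.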


