Conference Agenda

Session Overview
Session: S52: Random forests
Time: Wednesday, 06/Sept/2023, 10:40am - 12:20pm

Session Chair: Anne-Laure Boulesteix
Session Chair: Lilla Di Scala
Location: Lecture Room U1.141 hybrid


Presentations
10:40am - 11:00am

Challenge in distinguishing important from informative variables in random forest prediction models

Césaire J. K. Fouodo, Silke Szymczak

Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

Random forest (RF) is a well-performing prediction method for high-dimensional data and enables the selection of predictor variables using variable importance measures. The Actual Impurity Reduction (AIR) measure is a computationally efficient and unbiased RF importance measure. Although many RF variable selection procedures are based on importance measures, how to interpret the resulting measurements in relation to the structure of the data and the prediction model is rarely questioned. In most cases, the importance of each variable is interpreted as its ability to improve the model's prediction performance and is therefore analyzed independently of the proportion of available predictors that are associated with the response variable. However, a large proportion of predictors in a dataset being associated with the response variable does not necessarily mean that all of them are important for building a predictive model.

We propose to distinguish important from informative predictor variables. A predictor variable is called informative if it is associated with the response variable. An important variable is an informative variable that substantially improves the model's prediction performance. Therefore, an informative variable can be unimportant if it does not significantly improve the predictive model. Such an unimportant informative predictor variable may be interpreted as a noise variable, although it is associated with the response variable.

We used simulation studies to demonstrate the effects of the proportion of informative variables on the estimated AIR importance of RF. We simulated datasets with non-informative noise variables and different proportions of non-correlated informative predictor variables.
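The simulation design described above can be sketched in a few lines. This is only an illustration, not the authors' code: the function name `simulate_dataset`, the standard-normal predictors, the linear response and the common effect size are all assumptions that the abstract does not specify.

```python
import random


def simulate_dataset(n_samples, n_predictors, prop_informative, effect=1.0, seed=0):
    """Simulate uncorrelated predictors, a fraction of which drive the response.

    Distributions, the linear model and the effect size are illustrative
    assumptions; the abstract does not state them.
    """
    rng = random.Random(seed)
    n_informative = round(prop_informative * n_predictors)
    # Independent standard-normal predictors (uncorrelated by construction).
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_predictors)]
         for _ in range(n_samples)]
    # The first n_informative columns affect the response; the rest are noise.
    y = [effect * sum(row[:n_informative]) + rng.gauss(0.0, 1.0) for row in X]
    informative = list(range(n_informative))
    return X, y, informative


# Datasets with an increasing proportion of informative predictors.
datasets = {p: simulate_dataset(100, 50, p, seed=1) for p in (0.1, 0.3, 0.5)}
```

Each simulated dataset could then be passed to an RF implementation of choice to compare the estimated importance of the informative columns across proportions.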

Our results show that estimated AIR decreases when the proportion of informative variables in the dataset increases. We explain why this decrease in the estimated importance can strongly affect variable selection testing procedures. Finally, we expect this study to improve the interpretation of RF variable importance measures.



11:00am - 11:20am

Identifying different tree types based on clustering in random forests

Björn-Hergen Laabs, Lea Louisa Kronziel, Ana Westenberger, Inke R. König

Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

As machine learning algorithms grow ever more popular, methods for opening these black boxes become increasingly important. In the case of random forests (RF), most representative trees (MRTs) have shown great potential to facilitate the interpretation of complex tree ensembles. The idea of MRTs is that the complete RF is represented by a single selected tree (S-MRT) or a single artificially generated tree (A-MRT). Because RFs have a complex structure, a single MRT is unlikely to capture the structure of all trees in the ensemble, especially when the trees in the RF are very diverse. This could, for instance, be the case with latent subgroups in the data, leading to trees that are specific to different subgroups.

Therefore, we propose a two-step procedure that first clusters the trees within an RF into different types of trees using a standard clustering algorithm and then generates a single MRT for each of the resulting clusters of trees. Thus, we end up with a small ensemble of clustered MRTs (C-MRTs) that is better able to cover the diversity of the complete RF. Depending on the method used to obtain each single MRT, this leads to either clustered selected MRTs (CS-MRTs) or clustered artificial MRTs (CA-MRTs).
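The two-step procedure can be sketched as follows for the selected-tree variant. This is a minimal illustration under several assumptions: a pairwise tree distance matrix `dist` is taken as given, average-linkage agglomerative clustering stands in for the unspecified "standard" clustering algorithm, and the representative of a cluster is taken to be the tree with the smallest mean distance to the cluster's members. All function names are hypothetical and are not the timbR API.

```python
def cluster_trees(dist, n_clusters):
    """Step 1: average-linkage agglomerative clustering on tree distances."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def select_representative(dist, members):
    """Step 2: pick the tree with the smallest mean distance within its cluster."""
    return min(members,
               key=lambda i: sum(dist[i][j] for j in members) / len(members))


def clustered_selected_mrts(dist, n_clusters):
    """Return one selected representative tree index per cluster."""
    return [select_representative(dist, members)
            for members in cluster_trees(dist, n_clusters)]


# Toy example: five "trees" placed on a line, forming two obvious groups.
positions = [0.0, 1.0, 2.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
representatives = clustered_selected_mrts(toy_dist, n_clusters=2)
```

In practice the number of clusters and the tree distance measure are design choices that the sketch leaves open.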

In an extensive simulation study, we will compare CA-MRTs and CS-MRTs with the previously described S-MRTs and A-MRTs regarding their prediction performance, their ability to condense the information of the ensemble, and their coverage of the meaningful predictors. We simulate a standard setting with fixed main and interaction effects in high-dimensional data, to show that C-MRTs are not inferior to standard MRTs, as well as a setting where latent subgroups are present in the training data, which should lead to more diverse trees in the RF. In the latter setting, C-MRTs should clearly outperform standard MRTs.

Additionally, we apply all methods to a genetic data set of X-linked dystonia-parkinsonism (XDP) and discuss the resulting MRTs with regard to recent results on genetic modifiers of age at onset in XDP.

Finally, we will add the new methods to our existing R package timbR (https://github.com/imbs-hl/timbR).



11:20am - 11:40am

Evaluation of network-guided random forest for disease gene discovery

Jianchang Hu, Silke Szymczak

Universität zu Lübeck, Germany

Identification of biomarkers associated with complex diseases can improve patient risk prediction and foster understanding of underlying molecular pathomechanisms. Gene network information is believed to be beneficial for disease module and pathway identification. We investigate the performance of a network-guided random forest (RF) in which the network information is summarized into a sampling probability for each predictor variable, which is then used in the construction of the RF. The identification of important genes is based on standard variable importance measures from RF. In the simulation study, we simulate synthetic RNA-Seq data along with the underlying network structure using the R package SeqNet. Our results suggest that network-guided RF does not provide better disease prediction than the standard RF. In terms of disease gene discovery, when causal genes are randomly distributed within the network, network information only worsens gene selection, but if they form disease module(s), network-guided RF identifies causal genes more accurately. We also find that when disease status is independent of the genes in the given network, spurious gene selection results can occur when using network information, especially for hub genes. Two TCGA breast cancer datasets (microarray and RNA-Seq) with 283 and 284 patients, respectively, along with protein-protein interaction network information from the STRING database, are investigated to identify genes related to progesterone receptor (PR) status. Both datasets include 193 PR-positive patients. Standard and network-guided RFs both identify the core genes, including PGR and ESR1, in both datasets. In addition, network-guided RF further identifies the gene EGFR from the ESR-mediated signaling pathway and the gene AR from the gene expression (transcription) pathway; both pathways are PGR-related. This demonstrates the potential gains in disease module and pathway identification from utilizing network information for complex diseases.
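To make the idea of summarizing a network into sampling probabilities concrete, here is a minimal sketch. The abstract does not say how the probabilities are derived, so the degree-based weighting below (probability proportional to degree plus one) is purely an assumption, as are the function names and the toy gene labels. Split candidates are then drawn without replacement using exponent-keyed weighted sampling.

```python
import random


def sampling_probabilities(edges, genes):
    """Turn a gene network into per-gene sampling probabilities.

    Assumption: probability proportional to (degree + 1), so that genes
    without any edge keep a nonzero chance of being sampled.
    """
    degree = {g: 0 for g in genes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total = sum(d + 1 for d in degree.values())
    return {g: (degree[g] + 1) / total for g in genes}


def sample_split_candidates(probs, mtry, rng):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys)."""
    # Each gene gets the key u ** (1 / weight); the mtry largest keys win.
    keyed = [(rng.random() ** (1.0 / p), g) for g, p in probs.items()]
    keyed.sort(reverse=True)
    return [g for _, g in keyed[:mtry]]


# Hypothetical toy network: g1 is a hub, g4 is isolated.
genes = ["g1", "g2", "g3", "g4"]
edges = [("g1", "g2"), ("g1", "g3")]
probs = sampling_probabilities(edges, genes)
candidates = sample_split_candidates(probs, mtry=2, rng=random.Random(0))
```

Under this weighting hub genes are offered as split candidates more often, which is one way to see why spurious selection of hub genes could arise when the network carries no disease signal.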



11:40am - 12:00pm

Random Survival Forests for Competing Events: A Subdistribution-Based Approach

Charlotte Behning¹, Alexander Bigerl², Marvin Wright³, Moritz Berger¹, Matthias C. Schmid¹

¹Institute of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn; ²DICE Group, Department of Computer Science, Paderborn University, Paderborn, Germany; ³Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany

Random Survival Forests (RSF) can be applied to many time-to-event research questions, and are particularly useful in situations where the relationship between the independent variables and the event of interest is rather complex. However, in many clinical settings, the occurrence of the event of interest is affected by competing events, which means that a patient can experience an outcome other than the event of interest. Neglecting the competing events (i.e., treating them as censoring) will typically result in biased estimates of the cumulative incidence function (CIF). A popular approach for dealing with competing events is Fine & Gray’s subdistribution hazard model, which estimates the CIF by fitting a single-event model defined on a subdistribution time scale. Here, we integrate concepts from the subdistribution hazard modeling approach into the RSF: we utilize the central feature of RSF, the creation of multiple decision trees, each of which is trained on a random subset of the data. In each tree, the competing event time is replaced by an imputed, possibly right-censored subdistribution time, and split rules for single-event RSF are applied. The predictions from the individual trees are then combined to obtain a final prediction. The performance of our proposed method is illustrated by a simulation study.
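The recoding onto the subdistribution time scale can be illustrated with a simplified sketch. It assumes that every subject's potential censoring time is known (for example, an administrative end of follow-up), which sidesteps the imputation step described in the abstract; in the general case that time would have to be drawn from an estimated censoring distribution. The function name and the event coding are hypothetical.

```python
def to_subdistribution(times, events, censor_times, event_of_interest=1):
    """Recode competing-risks data as single-event data.

    events: 0 = censored, event_of_interest = event of interest,
            any other value = competing event.
    censor_times: potential censoring time per subject, ASSUMED known here;
                  the approach in the abstract imputes it instead.
    """
    sub_times, sub_status = [], []
    for t, e, c in zip(times, events, censor_times):
        if e == event_of_interest:
            # Event of interest: time and status are unchanged.
            sub_times.append(t)
            sub_status.append(1)
        elif e == 0:
            # Ordinary censoring: unchanged.
            sub_times.append(t)
            sub_status.append(0)
        else:
            # Competing event: the subject stays at risk on the
            # subdistribution scale until its censoring time.
            sub_times.append(max(t, c))
            sub_status.append(0)
    return sub_times, sub_status


sub_times, sub_status = to_subdistribution(
    times=[2.0, 3.0, 4.0],
    events=[1, 2, 0],
    censor_times=[5.0, 5.0, 4.0],
)
```

The recoded data have a single event type, so split rules for single-event survival trees can be applied to them directly.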



12:00pm - 12:20pm

Generative modeling of epidemiological data using adversarial random forests

Jan Kapar¹,², Kathrin Günther¹, David S. Watson³, Marvin N. Wright¹,²,⁴

¹Leibniz Institute for Prevention Research and Epidemiology - BIPS; ²University of Bremen; ³King’s College London; ⁴University of Copenhagen

Generative modeling holds great potential for epidemiological data as it opens the door to applications like realistic imputation of missing data, data augmentation for enhancing predictive performance, and privacy-preserving data analysis. However, while deep learning algorithms such as variational autoencoders (VAEs) and generative adversarial networks (GANs) have shown ground-breaking results in generating realistic synthetic image, audio and text data during the last decade, these methods often struggle to produce high-quality synthetic tabular data. Further, deep learning algorithms are notoriously data-hungry and require extensive tuning.

We present the concept of adversarial random forests (ARFs), a method based on unsupervised random forests that shows promising results for tabular data with both continuous and categorical features. Unlike many deep learning methods, ARFs perform well without expensive hyperparameter tuning and often show good results even on comparatively small datasets. Training time for ARFs is considerably shorter than for state-of-the-art deep learning models for tabular data.
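The abstract does not detail how an unsupervised random forest is set up. As a rough illustration, one common construction, which to our understanding is also the starting point of ARFs, creates naive synthetic data by sampling each feature independently and then trains a forest to tell real rows from synthetic ones. The sketch below only builds that classification task; the function names are hypothetical, and the forest training and the iterative adversarial refinement are omitted.

```python
import random


def naive_synthetic(rows, rng):
    """Shuffle each column independently to break dependencies between features."""
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(r) for r in zip(*columns)]


def discrimination_task(rows, seed=0):
    """Build real-vs-synthetic classification data for one round."""
    rng = random.Random(seed)
    synthetic = naive_synthetic(rows, rng)
    X = [list(r) for r in rows] + synthetic
    # Label 1 marks original rows, label 0 marks synthetic rows.
    y = [1] * len(rows) + [0] * len(synthetic)
    return X, y


# Mixed continuous and categorical toy data (hypothetical values).
toy_rows = [[1.2, "a"], [2.5, "b"], [3.1, "c"]]
X, y = discrimination_task(toy_rows)
```

Shuffling columns preserves each feature's marginal distribution, for continuous and categorical features alike, so a classifier can only separate the two classes by exploiting dependencies between features.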

To evaluate the utility of synthetic data created with ARFs in real-world epidemiological applications, we replicate statistical analyses from previously published studies based on the German National Cohort (NAKO) dataset. We demonstrate that ARFs are capable of successfully learning the underlying structures of the data, so that the results of descriptive, inferential and predictive tasks performed on ARF-synthesized data are comparable to the results obtained on the original data and excel in comparison with state-of-the-art deep learning models.



Conference: CEN 2023