8:30am - 8:50am
Over-optimism in gene set analysis: How does the choice of methods and parameters influence the detection of differentially enriched gene sets?
Milena Wünsch1,3, Christina Nießl1, Ludwig Christian Hinske2, Anne-Laure Boulesteix1
1Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Munich, Germany; 2Institute for Digital Medicine, University Hospital of Augsburg, Augsburg, Germany; 3Munich Center for Machine Learning (MCML), Munich, Germany
Gene set analysis, a popular approach for analyzing high-throughput gene expression data, aims to identify sets of related genes that show significantly enriched or depleted expression patterns between two contrasting conditions. In addition to the multitude of methods available for this task, the user is typically left with many options when creating the required input and specifying the internal parameters of the chosen method. This flexibility not only makes it difficult to get a clear overview of all steps required to conduct gene set analysis but may also entice users to produce the most favorable results in a “trial-and-error” manner. While this procedure may seem natural at first glance, it can be viewed as a form of “cherry-picking” and cause an over-optimistic bias in the results. Since the method and its underlying parameters are excessively tailored to the given gene expression dataset, the results may not be replicable with a different dataset, leading to a loss of validity – a problem that has attracted considerable attention in the context of classical hypothesis testing. In this talk, we aim to raise awareness of this type of over-optimism in the more complex context of gene set analysis. First, we give an overview of the general theoretical background of gene set analysis and summarize the methodology behind seven popular methods classified as Over-Representation Analysis or Functional Class Scoring. Second, we discuss the practical aspects of applying these methods, which are implemented either in popular R packages, such as clusterProfiler and GOSeq, or in web-based applications, such as GSEA. Finally, to address the problem of over-optimism, we mimic a hypothetical researcher who systematically selects among the available options with the goal of optimizing the results. More precisely, we perform this optimization for three metrics, each within two real gene expression datasets frequently used in benchmarking. In addition to optimizing these metrics for the true sample labels of the gene expression datasets, we repeat the procedure for ten randomly generated permutations of the sample labels. Our study suggests that for most gene set analysis methods, the options left to the user can lead to a particularly high variability in the number of differentially enriched gene sets as well as in the ranking of the gene sets in the corresponding results. This underlines the risk of selective reporting and over-optimistic results in the context of gene set analysis.
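For context, the Over-Representation Analysis class of methods mentioned above boils down to a hypergeometric (one-sided Fisher) test per gene set. A minimal sketch in Python, with purely illustrative counts (all numbers below are hypothetical):

    # Minimal sketch of the hypergeometric test underlying Over-Representation
    # Analysis (ORA); the counts are made up for illustration.
    from scipy.stats import hypergeom

    universe = 20000   # number of genes in the background universe
    de_genes = 400     # number of differentially expressed (DE) genes
    set_size = 150     # number of universe genes belonging to the gene set
    overlap = 12       # DE genes that fall into the gene set

    # P(X >= overlap) for X ~ Hypergeom(universe, set_size, de_genes)
    p_value = hypergeom.sf(overlap - 1, universe, set_size, de_genes)
    print(f"ORA p-value for this gene set: {p_value:.3g}")

Even in this reduced form, choices such as the definition of the background universe or the cutoff used to declare genes differentially expressed change the resulting p-value, which is exactly the kind of user-controlled flexibility discussed above.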
8:50am - 9:10am
Maximum Test Method for the Wilcoxon-Mann-Whitney Test in High-Dimensional Designs
Lukas Mödl, Frank Konietschke
Institut für Biometrie und Klinische Epidemiologie, Charité Berlin, Germany
The statistical comparison of two multivariate samples is a frequent task, e.g., in biomarker analysis. Parametric and nonparametric multivariate analysis of variance (MANOVA) procedures are well established for the analysis of such data. Which method to use depends on the scales of the endpoints and on whether the assumption of a parametric multivariate distribution is meaningful. However, in case of a significant outcome, MANOVA methods can only provide the information that the treatments (conditions) differ in at least one of the endpoints; they cannot locate the endpoint(s) responsible. Multiple contrast tests in the form of maximum tests, on the contrary, provide local test results and thus the information of interest.
The maximum test method controls the error rate by comparing the value of the largest contrast in magnitude to the (1-α)-equicoordinate quantile of the joint distribution of all considered contrasts. The advantage of this approach over existing and commonly used methods that control the multiple type-I error rate, such as Bonferroni, Holm, or Hochberg, is that it is appealingly simple, yet has sufficient power to detect a significant difference in high-dimensional designs, and does not make strong assumptions (such as MTP2) about the joint distribution of test statistics. Furthermore, the computation of simultaneous confidence intervals is possible. The challenge, however, is that the joint distribution of the test statistics used must be known in order to implement the method.
In this talk, we develop a simultaneous maximum Wilcoxon-Mann-Whitney test for the analysis of multivariate data in two independent samples. We consider both low- and high-dimensional designs. We derive the (asymptotic) joint distribution of the test statistics and propose different bootstrap approximations for small sample sizes. We investigate their quality in extensive simulation studies. It turns out that the methods control the multiple type-I error rate well, even in high-dimensional designs with small sample sizes. A real data set illustrates the application.
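To make the idea concrete, the following Python sketch approximates the (1-α)-equicoordinate critical value of a maximum Wilcoxon-Mann-Whitney test by permuting the group labels jointly across endpoints. This is a simplified stand-in for, not a reproduction of, the bootstrap approximations developed in the talk; data and sample sizes are made up.

    # Sketch of a maximum Wilcoxon-Mann-Whitney test across d endpoints, with the
    # critical value obtained by permuting group labels jointly over all endpoints.
    import numpy as np

    rng = np.random.default_rng(0)
    n1, n2, d = 15, 15, 50
    x = rng.normal(size=(n1, d))           # sample 1
    y = rng.normal(size=(n2, d))           # sample 2
    y[:, 0] += 1.0                         # shift in endpoint 0 only

    def max_wmw(x, y):
        # standardized WMW statistic per endpoint (no ties assumed); return the maximum
        n1, n2 = len(x), len(y)
        z = np.vstack([x, y])
        ranks = z.argsort(axis=0).argsort(axis=0) + 1.0   # ranks within each endpoint
        r1 = ranks[:n1].sum(axis=0)
        u = r1 - n1 * (n1 + 1) / 2                        # Mann-Whitney U for sample 1
        mean = n1 * n2 / 2
        sd = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
        return np.abs((u - mean) / sd).max()

    t_obs = max_wmw(x, y)
    z = np.vstack([x, y])
    t_perm = []
    for _ in range(2000):                  # permute labels, keeping endpoints together
        idx = rng.permutation(n1 + n2)
        t_perm.append(max_wmw(z[idx[:n1]], z[idx[n1:]]))
    crit = np.quantile(t_perm, 0.95)       # approximate (1 - alpha)-equicoordinate quantile
    print(t_obs, crit, t_obs > crit)

Because the labels are permuted jointly, the correlation between endpoints is preserved under the resampling, which is what makes the maximum-type critical value less conservative than a Bonferroni bound.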
9:10am - 9:30am
Deriving interpretable thresholds for Variable Importance in Random Forests by permutation
Hannes Buchner1, Laura Schlieker1, Maria Blanco1, Tim Mueller1, Armin Ott2, Roman Hornung3,4
1Staburo GmbH, Aschauer Str. 26a, 81549 München, Germany; 2Roche Diagnostics GmbH, MMDHA, Nonnenwald 2, 82377 Penzberg, Germany; 3Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich; 4Munich Center for Machine Learning (MCML)
In the context of clinical research, and in particular precision medicine, the identification of predictive or prognostic biomarkers is of utmost importance. Especially when dealing with high-dimensional data, discriminating between informative and uninformative variables plays a crucial role. Machine learning approaches, and Random Forests in particular, are promising in this situation, as the variable importance of a Random Forest can serve as decision guidance for the identification of potentially relevant variables.
Many different approaches to Random Forest variable importance have been proposed and evaluated (e.g., Degenhardt et al. 2019, Speiser et al. 2019). One of these algorithms is the well-performing Boruta method (Kursa and Rudnicki 2010), which adds permuted - and thus uninformative - versions of each variable (so-called shadow variables) to the set of predictors.
We propose a variation of the Boruta method that is independent of simulation runs and that compares the variable importance of each covariate directly with that of its permuted version. In addition, in this method, the uninformative versions are generated by permuting the rows of the dataset, which preserves the relationships between the original variables. We aim to evaluate the relevance of the variables based on different criteria, e.g., the proportion of positive differences in paired VIMPs, the mean of the shadow VIMPs, and the distance between the paired distributions.
We evaluate our method on real data sets of varying sizes and compare its performance to that of the Boruta algorithm.
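To illustrate the shadow-variable idea described above, the following Python sketch (using scikit-learn's Random Forest and simulated data, not the authors' implementation) permutes the rows of the whole predictor matrix once, so that the relationships between the shadow variables are preserved, and compares each variable's importance with that of its shadow:

    # Sketch: shadow variables obtained by permuting the rows of the whole predictor
    # matrix (preserving correlations among predictors, breaking their link to the
    # outcome), followed by a paired comparison of variable importances.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    n, p = 200, 10
    X = rng.normal(size=(n, p))
    # only the first two predictors are informative in this simulated example
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

    X_shadow = X[rng.permutation(n)]         # one row permutation applied to all columns
    X_aug = np.hstack([X, X_shadow])         # original + shadow predictors

    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_aug, y)
    vimp = rf.feature_importances_
    orig, shadow = vimp[:p], vimp[p:]
    for j in range(p):
        flag = "informative?" if orig[j] > shadow[j] else ""
        print(f"X{j}: VIMP={orig[j]:.3f}  shadow VIMP={shadow[j]:.3f}  {flag}")

In practice one would repeat this over several permutations and forests and aggregate the paired comparisons using criteria such as those listed above.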
9:30am - 9:50am
Mind your zeros: accurate p-value approximation in permutation testing with applications in microbiome data analysis
Stefanie Peschel1,2, Martin Depner3, Erika von Mutius3,4,5,6, Anne-Laure Boulesteix2,7, Christian L Müller1,2,8,9
1Department of Statistics, LMU München, Munich, Germany; 2Munich Center for Machine Learning, Munich, Germany; 3Institute of Asthma and Allergy Prevention, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany; 4Department of Pediatric Allergology, Dr Von Hauner Children’s Hospital, LMU München, Munich, Germany; 5Comprehensive Pneumology Center Munich (CPC-M), Munich, Germany; 6German Center for Lung Research (DZL), Munich, Germany; 7Institute for Medical Information Processing, Biometry and Epidemiology, LMU München, Munich, Germany; 8Institute of Computational Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany; 9Center for Computational Mathematics, Flatiron Institute, New York, USA
Permutation procedures are common practice in statistical hypothesis testing when distributional assumptions about the considered test statistic are not met or unknown. With a small number of permutations, p-values may be either zero [1] or too large to remain significant after adjustment for multiple testing. However, in certain settings, achieving a sufficient number of permutations to obtain accurate p-values is often not feasible. For example, in biomedical studies, the high dimensionality of the data or the use of complex statistical inference methods can make even a single test computationally expensive. A popular heuristic solution to this problem is to approximate extreme p-values by fitting a Generalized Pareto Distribution (GPD) to the tail of the distribution of the permutation test statistics [2]. In practice, however, an estimated negative shape parameter in the GPD combined with extreme observed test statistics can again lead to zero p-values, making subsequent multiple testing problematic.
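For reference, the following Python sketch shows the standard GPD tail approximation described in [2], together with the pseudo-count estimate of [1]. It illustrates the baseline approach only, not the constrained variant proposed in this work; the permutation statistics are simulated.

    # Sketch of the standard GPD tail approximation for permutation p-values [2];
    # the constrained fit proposed in the talk is not reproduced here.
    import numpy as np
    from scipy.stats import genpareto

    rng = np.random.default_rng(2)
    perm_stats = rng.normal(size=1000)   # stand-in for permutation test statistics
    t_obs = 4.2                          # observed statistic, more extreme than all permutations

    m = len(perm_stats)
    b = np.sum(perm_stats >= t_obs)
    p_pseudo = (b + 1) / (m + 1)         # Phipson & Smyth [1]: never exactly zero

    # fit a GPD to the exceedances over a tail threshold (here: the 90th percentile)
    u = np.quantile(perm_stats, 0.90)
    exceedances = perm_stats[perm_stats > u] - u
    shape, _, scale = genpareto.fit(exceedances, floc=0)
    p_gpd = (len(exceedances) / m) * genpareto.sf(t_obs - u, shape, loc=0, scale=scale)
    # if the fitted shape is negative and t_obs lies beyond the implied upper bound,
    # p_gpd can still be exactly zero; this is the issue addressed in the talk

    print(f"pseudo-count p = {p_pseudo:.4f},  GPD-approximated p = {p_gpd:.2e}")
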
Here, we propose a complete workflow for accurate and reliable p-value approximation in permutation testing and multiple testing correction. Our framework includes a new method that fits a constrained GPD that strictly avoids zero p-values. We also address the well-known problem of defining an optimal tail threshold for GPD fitting [3] and propose new threshold selection approaches using goodness-of-fit tests. In a multiple testing setting, adjusting the approximated p-values for multiplicity is an essential final step. For this purpose, we introduce a resampling-based False Discovery Rate (FDR) correction procedure that uses the estimated permutation p-values instead of the usual test statistics.
We conduct an extensive simulation study based on the two-sample t-test that demonstrates that our proposed p-value approximation workflow has considerably higher accuracy compared to existing methods. We also illustrate the real-world relevance of our framework in the context of host-associated gut microbiome data analysis, including differential abundance and differential association testing.
Our computational p-value approximation framework, including precise fitting of GPD parameters, tail threshold detection, and multiple testing adjustment, will be made available in the open-source R package permAprox on GitHub and CRAN.
References:
[1] Phipson, Belinda, and Gordon K. Smyth. "Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn." Statistical Applications in Genetics and Molecular Biology 9.1 (2010).
[2] Knijnenburg, Theo A., et al. "Fewer permutations, more accurate P-values." Bioinformatics 25.12 (2009): i161-i168.
[3] Langousis, Andreas, et al. "Threshold detection for the generalized Pareto distribution: Review of representative methods and application to the NOAA NCDC daily rainfall database." Water Resources Research 52.4 (2016): 2659-2681.
9:50am - 10:10am
A novel approach to Function-on-Scalar Regression (FoSR) for the analysis of Periodic Time-Series
Konrad Neumann
Charité, Germany
Analysis of periodic time series plays an important role in research dealing with data from wearables such as smart watches or accelerometer devices. Function-on-scalar regression (FoSR) is a popular method for analysing such data ([1] and [2]). FoSR is a family of multivariate regression models that describe the association of covariates with a time series as the response. In this talk, a novel approach to FoSR is presented that follows the ideas of classical least squares analysis of the general linear model. In contrast to the classical approach, the components of the response vector may now be points in an arbitrary Hilbert space H, such as the L2 space. The coefficient functions and their least squares estimates are then points in a predefined finite-dimensional subspace H' of H. Choosing H' carefully leads to a two-step multiple testing procedure that bounds the familywise type-I error rate. Only mild assumptions are required for this version of FoSR. Furthermore, the classical approach leads to explicit formulae for the coefficient function estimates and the test statistics. An example from [3] will illustrate the method.
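In symbols, a minimal version of such a model (with notation chosen here only for illustration; the exact formulation in the talk may differ) is

    Y_i(t) = \sum_{j=1}^{p} x_{ij}\,\beta_j(t) + \varepsilon_i(t),
    \qquad \beta_j \in H' = \mathrm{span}\{\varphi_1,\dots,\varphi_K\} \subset H = L^2,

where, collecting the basis coefficients of the projected responses in an n x K matrix C and those of the coefficient functions in a p x K matrix B, the least squares estimate takes the familiar closed form \hat{B} = (X^\top X)^{-1} X^\top C.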
References
[1] Goldsmith, J., Liu, X., Jacobson, J. S. & Rundle, A. New Insights into Activity Patterns in Children, Found Using Functional Data Analyses. Med. Sci. Sports Exerc. 48, 1723–1729 (2016).
[2] Xiao, L. et al. Quantifying the lifetime circadian rhythm of physical activity: A covariate-dependent functional approach. Biostatistics 16, 352–367 (2015).
[3] Rackoll, T., Neumann, K., Passmann, S., Grittner, U., Külzow, N., Ladenbauer, J., Flöel, A. Applying time series analyses on continuous accelerometry data – A clinical example in older adults with and without cognitive impairment. PLoS ONE 16(5) (2021).