8:30am - 8:50am DESpace: a novel analysis framework to discover spatially variable genes
Peiying Cai1, Mark D Robinson1, Simone Tiberi1,2
1University of Zurich, Switzerland; 2University of Bologna, Italy
Background
Spatially resolved transcriptomics (SRT) technologies measure gene expression profiles while retaining the spatial information of the tissue. They have led to the release of novel methods that take advantage of the joint availability of mRNA abundance and spatial information. Notably, several computational tools have been developed to identify spatially variable genes (SVGs), i.e., genes whose expression profiles vary across the tissue. Nonetheless, current approaches for SVG detection present some limitations; in particular:
i) most methods are computationally intensive;
ii) biological replicates are not allowed;
iii) information about known spatial structures (usually) cannot be incorporated;
iv) testing cannot be performed on specific regions of interest (e.g., white matter in brain cortex).
Methodology
We propose DESpace, an intuitive framework for identifying SVGs based on differential testing across spatial clusters. These clusters represent spatially neighbouring cells with similar expression profiles, and can be obtained via spatial clustering tools (e.g., BayesSpace, StLearn, Giotto and PRECAST), or via pathologists’ annotations. We use these clusters as a proxy for the actual spatial information. We then employ edgeR, a popular tool for differential expression analyses, to perform differential testing across spatial clusters. Intuitively, if the mRNA abundance of a gene is significantly associated with the spatial clusters, then it varies across the tissue, which indicates an SVG.
Clearly, our framework relies on spatial clusters being available and summarizing the main spatial features of the data. Nonetheless, even in the absence of pre-computed annotations, spatially resolved clustering tools allow generating clusters that accurately summarize the spatial structure of gene expression.
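The idea of testing a gene's expression against spatial cluster labels can be illustrated with a minimal sketch. It is not the DESpace implementation: DESpace relies on edgeR, whereas this toy example swaps in a permutation test on a one-way ANOVA F statistic of log-transformed counts. The function names, the toy counts and the cluster labels are all hypothetical.

```python
import math
import random


def f_statistic(values, labels):
    """One-way ANOVA F statistic of `values` grouped by `labels`."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return math.inf if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def cluster_permutation_pvalue(counts, clusters, n_perm=999, seed=0):
    """Permutation p-value for association between one gene and cluster labels.

    Stand-in for the edgeR test used by DESpace (assumption for illustration).
    """
    rng = random.Random(seed)
    values = [math.log1p(c) for c in counts]
    observed = f_statistic(values, clusters)
    shuffled = list(clusters)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between expression and space
        if f_statistic(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 30 spots split into three spatial clusters.
clusters = ["A"] * 10 + ["B"] * 10 + ["C"] * 10
patterned_gene = ([20 + i for i in range(10)]
                  + [2 + i % 3 for i in range(10)]
                  + [1 + i % 2 for i in range(10)])
flat_gene = [5, 6, 7, 5, 6, 7, 5, 6, 7, 5] * 3  # same profile in every cluster

p_pattern = cluster_permutation_pvalue(patterned_gene, clusters)
p_flat = cluster_permutation_pvalue(flat_gene, clusters)
```

A small p-value for a gene suggests that its abundance differs between clusters, which is the signal the framework interprets as spatial variability.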
Additionally, DESpace presents some unique features compared to currently available SVG tools; in fact, our framework:
i) can model multiple samples, reducing the uncertainty that characterizes inference performed from individual samples, and identifying genes with coherent spatial patterns across biological replicates;
ii) allows identifying the key areas of the tissue affected by SVGs, testing if the average expression in a particular region of interest (e.g., cancer tissue) is significantly higher or lower than the average expression of the remaining tissue (e.g., non-cancer tissue), hence enabling scientists to investigate changes in mRNA abundance in specific areas which may be of particular interest.
Finally, our method is flexible, and can take any type of SRT data as input.
Benchmarking
We performed extensive benchmarks of our approach and various competitors (MERINGUE, nnSVG, SpaGCN, SPARK, SPARK-X, SpatialDE, SpatialDE2, and trendsceek). In particular, starting from three real spatial omics datasets as anchor data, we generated various semi-simulated datasets, with a wide variety of spatial patterns. Our approach displays well-calibrated false discovery rates, and a higher true positive rate than all competitors considered. Furthermore, when analyzing real data, the genes identified by DESpace are more coherent across replicates than those detected by other SVG methods.
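For readers unfamiliar with the two metrics, the sketch below shows how a true positive rate and a false discovery rate can be computed on semi-simulated data where the ground truth is known. It assumes Benjamini-Hochberg adjustment and a 0.05 threshold, neither of which is stated in the abstract; the p-values and truth labels are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def tpr_and_fdr(adjusted, is_true_svg, threshold=0.05):
    """True positive rate and false discovery rate at a given threshold."""
    called = [q <= threshold for q in adjusted]
    tp = sum(c and t for c, t in zip(called, is_true_svg))
    fp = sum(c and not t for c, t in zip(called, is_true_svg))
    n_true = sum(is_true_svg)
    tpr = tp / n_true if n_true else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return tpr, fdr


# Hypothetical p-values for eight genes; the first four are simulated SVGs.
pvalues = [0.001, 0.004, 0.03, 0.2, 0.012, 0.6, 0.9, 0.045]
truth = [True, True, True, True, False, False, False, False]

tpr, fdr = tpr_and_fdr(benjamini_hochberg(pvalues), truth)
```

In this toy case three genes are called, two of which are true SVGs, so the true positive rate is 2/4 and the false discovery rate is 1/3.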
Availability
DESpace is implemented as an R package, currently available on GitHub, and is accompanied by an example usage vignette: https://github.com/peicai/DESpace
DESpace was also submitted to Bioconductor, where it should appear in a few weeks.
A pre-print (in preparation) will follow in the coming weeks.
8:50am - 9:10am Pathway analysis for multinomial phenotypes
Md. Kamruzzaman1, Taesung Park2
1Seoul National University, Korea, Republic of (South Korea); 2Seoul National University, Korea, Republic of (South Korea)
Many statistical methods for pathway analysis have been used to identify novel pathways from biomarkers associated with a certain disease. However, most of these methods are based on single-pathway analysis and do not consider multiple pathways simultaneously. Since pathways are highly correlated, analyses of multiple pathways suffer from this correlation. Furthermore, existing methods mainly focus on continuous, count, and binary phenotypes. In this study, we propose a novel pathway analysis method, HisCoM-Categ, for multinomial phenotypes such as obesity level observed as normal, overweight, and obese. HisCoM-Categ takes into account the hierarchical structure of biomarkers and pathways, as well as the correlations among pathways. In a simulation study, HisCoM-Categ was shown to have higher power than other existing methods. In addition, HisCoM-Categ was applied to various types of omics data. This application demonstrated that HisCoM-Categ successfully identified well-known pathways that are associated with multinomial phenotypes.
9:10am - 9:30am Multi-omics data integration: Does more mean better for predictive modeling? A large-scale benchmark study
Yingxia Li1, Ulrich Mansmann1, Roman Hornung1,2
1Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich; 2Munich Center for Machine Learning (MCML)
Predictive modeling based on multi-omics data, that is, several types of omics data available for the same patients, has the potential to outperform single-omics predictive modeling. Previous research on using multi-omics data for prediction has focused on combining many types of data. However, collecting many omics data types is complex and costly, which is why it would be beneficial to collect only those omics data types that contribute to improving predictive performance. It is, however, unclear which combinations of omics data types are most effective and which types can generally be omitted without compromising predictive performance.
We compared the predictive performance of all 31 possible combinations of five genomic data types using different prediction methods applied to 14 cancer datasets with survival outcome. The data types considered were mRNA, miRNA, methylation, mutation, and copy number variation data. Clinical data were included and prioritized in each prediction model. To investigate the stability of the results, bootstrap analysis was performed at the level of the included datasets.
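The figure of 31 follows from counting the non-empty subsets of five data types (2^5 - 1). A short sketch of how such a design could be enumerated is given below; the labels, the helper name and the way the clinical block is attached are illustrative assumptions, not the authors' code.

```python
from itertools import combinations

# Labels for the five omics data types named in the abstract.
OMICS_TYPES = ["mRNA", "miRNA", "methylation", "mutation", "CNV"]


def all_nonempty_combinations(items):
    """Every non-empty subset of `items`, ordered by subset size."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


combos = all_nonempty_combinations(OMICS_TYPES)

# Clinical data enter every model, so each design gets a clinical block first
# (hypothetical representation of "included in each prediction model").
feature_blocks = [("clinical",) + combo for combo in combos]
```

Each entry of `feature_blocks` would then correspond to one model to be fitted and evaluated per dataset and prediction method.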
Contrary to our expectations, combining larger numbers of omics data types tended to degrade predictive performance. Instead, using only mRNA data or a combination of mRNA and miRNA data was sufficient in most cases. Although the number of datasets included in our study is comparatively large, it is still limited, which is why our results must be interpreted with caution. Nevertheless, they strongly suggest that integrating many omics data types in a multi-omics prediction context may be counterproductive.
9:30am - 9:50am Boosting interaction tree stumps for modeling gene–gene and gene–environment interactions
Michael Lau1,2, Tamara Schikowski2, Holger Schwender1
1Mathematical Institute, Heinrich Heine University, Düsseldorf, Germany; 2IUF – Leibniz Research Institute for Environmental Medicine, Düsseldorf, Germany
The development of complex phenotypes often depends not only on isolated genetic and environmental risk factors but also on their interplay. These phenomena are known as gene–gene (GxG) and gene–environment (GxE) interactions. A GxG interaction is present if the effect of participating loci depends on the presence of other participating loci, while a GxE interaction is defined as different susceptibilities to an environmental risk factor depending on the genotype. Classical procedures for modeling phenotypes based on genetic risk factors such as SNPs (single nucleotide polymorphisms) either depend on simplifying assumptions such as linearity in generalized linear models or produce non-interpretable black-box models such as random forests or deep neural networks.
To overcome these drawbacks, we propose a statistical learning method called BITS (boosting interaction tree stumps) that aims at fitting simple-to-read linear models that incorporate GxG and GxE interactions. In every boosting iteration, tree stumps are fitted that, instead of the usual split on a single input variable, may split on interactions of the input variables. To avoid unnecessarily complex models, these interaction tree stumps are regularized to discourage long interactions, and the resulting model is pruned and transformed into a linear model using the elastic net. GxE interactions are incorporated by including the environmental variable and potential interactions with the identified terms.
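To make the notion of a stump that splits on an interaction concrete, here is a simplified sketch of a single boosting step on binary inputs. It is not the BITS algorithm: the conjunction-only terms, the greedy term growth, the squared-error criterion and the additive length penalty are all assumptions made for brevity, and every name is hypothetical.

```python
def sse_of_split(indicator, residuals):
    """Squared error left after predicting each side of the split by its mean."""
    total = 0.0
    for side in (0, 1):
        group = [r for z, r in zip(indicator, residuals) if z == side]
        if group:
            mean = sum(group) / len(group)
            total += sum((r - mean) ** 2 for r in group)
    return total


def term_indicator(X, term):
    """1 where every variable in `term` equals 1, else 0 (a conjunction)."""
    return [int(all(row[j] for j in term)) for row in X]


def fit_interaction_stump(X, residuals, max_length=2, penalty=0.0):
    """Greedily grow one interaction term to split on.

    Each added variable must reduce the squared error by more than `penalty`,
    which discourages long interactions.
    """
    n_vars = len(X[0])
    term = ()
    best = sse_of_split([0] * len(X), residuals)  # error of a constant fit
    for _ in range(max_length):
        candidates = [
            (sse_of_split(term_indicator(X, term + (j,)), residuals), term + (j,))
            for j in range(n_vars)
            if j not in term
        ]
        score, candidate = min(candidates)
        if score + penalty >= best:
            break
        term, best = candidate, score
    return term, best


# Toy data: all combinations of three binary SNP indicators; the outcome is 1
# only when variables 0 and 2 are both 1 (a pure two-way interaction term).
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [float(row[0] and row[2]) for row in X]

term, error = fit_interaction_stump(X, y)
```

On this toy outcome the stump recovers the pair (0, 2) with zero remaining error, whereas a large `penalty` leaves the term empty.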
In contrast to many related methods, the computational complexity of BITS scales linearly with the number of input variables such that BITS is also suited for high-dimensional tasks.
In a simulation study, it is shown that BITS outperforms existing methods regarding the predictive ability on unseen data. Moreover, multisplitting is employed for statistically testing GxG and GxE interactions. The simulations also show that BITS controls the type I error rate for detecting GxG and GxE interactions, that true underlying terms are often identified, and that GxE interactions are detected with high power. Furthermore, BITS and related methods are applied and compared in a real data application analyzing data from a German cohort study.
9:50am - 10:10am Testing for associations in genomic data with distances and kernels: From unconditional to conditional settings
Fernando Castro-Prado1,2, Wenceslao Gonzalez-Manteiga1, Javier Costas2, Dominic Edelmann3
1University of Santiago de Compostela, Spain; 2Health Research Institute of Santiago de Compostela, Spain; 3German Cancer Research Centre, Heidelberg, Germany
Distance covariance is an association measure that characterises general statistical independence (not only the linear one) between random vectors on arbitrary metric spaces (not only Euclidean ones). It is dual to the Hilbert–Schmidt independence criterion, popular in the machine learning community. With the toolbox of either of the two schools (i.e., strong negative type distances or characteristic kernels, respectively), it is possible to provide meaningful insight into the analysis of data from genome-wide association studies. We briefly introduce some of our work in which we apply these techniques to the search for genetic variants with significant marginal effects on a phenotypical trait of interest, and to the detection of gene-gene interactions. At this point, we wonder what happens when we try to test for such associations conditioning on an environmental covariate of interest. This leads to the conditional version of distance covariance, adapted to the particular geometry that we define to account for the structure of our genetic data. We show some theoretical properties of the resulting test statistic and we explore the performance of our methodology with simulations and a real data example.
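As a reminder of what the unconditional statistic looks like, the sketch below computes the standard sample (V-statistic) squared distance covariance from double-centred pairwise distance matrices. It covers only the classical unconditional case, not the conditional version or the genetic geometry discussed in the talk; the function names and toy data are hypothetical, and the distance functions are assumed to be symmetric.

```python
def centred_distances(points, dist):
    """Double-centred matrix of pairwise distances under a symmetric `dist`."""
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # By symmetry, column means equal row means.
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_covariance_sq(x, y, dist_x, dist_y):
    """Sample squared distance covariance of paired observations x and y."""
    n = len(x)
    a = centred_distances(x, dist_x)
    b = centred_distances(y, dist_y)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2


def absolute_difference(u, v):
    return abs(u - v)


# Toy example: y depends on x only through x squared, so the ordinary
# covariance vanishes while the distance covariance does not.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]

mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
pearson_cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / len(x)
dcov_sq = distance_covariance_sq(x, y, absolute_difference, absolute_difference)
dcov_sq_constant = distance_covariance_sq(
    x, [3] * len(x), absolute_difference, absolute_difference
)
```

Because the distance is passed in as a function, a non-Euclidean metric (for example one defined on genotypes) could be supplied in place of the absolute difference.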