Conference Agenda

Overview and details of the sessions of this conference.

 
 
Session Overview
Session
S69: Analysis of omics data II
Time:
Thursday, 07/Sept/2023:
10:40am - 12:20pm

Session Chair: Tomasz Burzykowski
Session Chair: Cornelia Dunger-Baldauf
Location: Seminar Room U1.195 hybrid


Presentations
10:40am - 11:00am

High-dimensional graphical models varying with multiple external covariates

Louis Dijkstra, Ronja Foraita

Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany

High-dimensional networks play a key role in understanding complex relationships. These relationships are often dynamic in nature and can change with multiple external factors (e.g., time and case-control status). Methods for estimating graphical models are often restricted to static graphs or graphs that can change with a single covariate (e.g., time). We propose a novel class of graphical models, the covariate-varying network (CVN), that can change with multiple external covariates. This extension sounds trivial at first; however, it poses serious conceptual and computational challenges.

In order to introduce sparsity, we apply an L1 penalty to the precision matrices of the m ≥ 2 graphs we want to estimate. These graphs often show a level of similarity (i.e., the graphs are ‘smooth’). This smoothness is modelled using a ‘meta-graph’ with m nodes, each corresponding to one of the graphs to be estimated. The (weighted) adjacency matrix of the meta-graph represents the strength with which similarity is enforced between the m graphs.
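In schematic form (our notation, not necessarily the authors'), the resulting objective couples m graphical-lasso problems through a fused penalty weighted by the meta-graph adjacency W = (w_ij):

\[
\hat{\Theta}_1,\dots,\hat{\Theta}_m \;=\; \mathop{\arg\min}_{\Theta_1,\dots,\Theta_m \succ 0} \;\sum_{i=1}^{m}\Big[\operatorname{tr}(S_i\Theta_i) - \log\det\Theta_i + \lambda_1\,\lVert\Theta_i\rVert_1\Big] \;+\; \lambda_2 \sum_{i<j} w_{ij}\,\lVert\Theta_i - \Theta_j\rVert_1 ,
\]

where S_i denotes the sample covariance matrix for the i-th combination of external covariates, and λ1 and λ2 control sparsity and smoothness, respectively.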

The resulting optimization problem is solved by employing an alternating direction method of multipliers (ADMM). One update step in the resulting ADMM requires repeatedly solving a ‘weighted fused signal approximator’ problem, which, to the best of our knowledge, had not been solved before. We do this by reformulating it as a generalized LASSO problem and solving it with an ADMM developed specifically for this task.
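As a rough illustration (a minimal NumPy sketch of a textbook generalized-LASSO ADMM, not the authors' specialized implementation), the weighted fused signal approximator min_β ½‖y − β‖² + Σ_k w_k |(Dβ)_k| can be solved as follows, with D a difference matrix encoding which entries are fused and w the corresponding weights:

```python
# Minimal NumPy sketch of an ADMM for the generalized-LASSO form of the
# weighted fused signal approximator; illustrative only, not the authors' algorithm.
import numpy as np

def weighted_fused_approximator(y, D, w, rho=1.0, n_iter=500, tol=1e-8):
    """y: observed signal; D: difference matrix (one row per fusion);
    w: non-negative weight for each row of D."""
    n, m = len(y), D.shape[0]
    beta = y.copy()
    z = D @ beta
    u = np.zeros(m)
    A = np.eye(n) + rho * (D.T @ D)          # system matrix of every beta-update
    for _ in range(n_iter):
        beta_new = np.linalg.solve(A, y + rho * D.T @ (z - u))
        Db = D @ beta_new
        v = Db + u
        z = np.sign(v) * np.maximum(np.abs(v) - w / rho, 0.0)   # weighted soft-thresholding
        u = u + Db - z
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```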

We test our method in a simulation study and demonstrate its applicability by analyzing the dependence structure of gene expression within the p53 pathway of head and neck squamous cell carcinoma patients, a dataset from The Cancer Genome Atlas (TCGA; https://www.cancer.gov/tcga), with tumor stage and tumor site as external covariates.



11:00am - 11:20am

Detecting interactions in high-dimensional data using cross leverage scores

Sven Teschke1,2, Katja Ickstadt1, Alexander Munteanu1, Tamara Schikowski2

1TU Dortmund, Germany; 2IUF Düsseldorf, Germany

We are developing a variable selection method for regression models with big data in the context of genetics. In particular, we want to detect important interactions between variables. The method is intended for investigating the influence of SNPs and their interactions on health outcomes, which is a p ≫ n problem.

Motivated by Parry et al. (2021), we use the so-called cross leverage scores to detect interactions of variables while maintaining interpretability. The big advantage is that this method does not require considering each possible interaction between variables individually, which would be very time-consuming even for a moderately large amount of data. In a simulation study, we show that these cross leverage scores are directly linked to the importance of a variable in the sense of an interaction effect.

Furthermore, we extend the detection of interactions via cross leverage scores to very large data sets, as they are common in genetics. The key idea is to divide the data set into subsets of variables (batches). Successively, for each batch we store the (predefined) q most important variables, compare them to those selected from the previous batch, store the combined q most important variables and discard the rest. After all batches have been analyzed, we obtain the q most important variables of the whole data set. Thus, we avoid complex and time-consuming computations on high-dimensional matrices by performing the computations only for small batches of the partitioned data set, which is much less costly. We also compare this approach to existing approximation methods for calculating cross leverage scores (Drineas et al. (2012)).

We evaluate these methods in simulation studies and on a real data set, the SALIA study (Study on the Influence of Air Pollution on Lung function, Inflammation and Aging) (Schikowski et al. (2005)). This study investigates the influence of air pollution on lung function, inflammatory responses and aging processes in elderly women from the Ruhr area. Since we are particularly interested in genetic data, we consider n = 517 women from this study. In addition to data on various environmental factors, data on more than 7 million SNPs are available for these women. We explore the influence of both SNP-SNP interactions and SNP-environment interactions on various health outcomes.
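A minimal sketch of this batch-wise screening idea is given below (illustrative code with hypothetical names; the stand-in scoring function is not the actual cross leverage score computation of Parry et al. (2021)):

```python
# Sketch of the batch-wise screening idea (illustrative, not the authors' code).
import numpy as np

def importance_scores(X, y):
    """Stand-in for the cross leverage scores of Parry et al. (2021); here, for
    illustration only, the absolute correlation of each column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)

def select_top_q(X, y, q, batch_size):
    """Scan the p columns of X in batches, always keeping only the q currently most
    important variables, so no computation ever involves all p variables at once."""
    p = X.shape[1]
    kept = np.array([], dtype=int)                       # indices of retained variables
    for start in range(0, p, batch_size):
        batch = np.arange(start, min(start + batch_size, p))
        candidates = np.concatenate([kept, batch])
        scores = importance_scores(X[:, candidates], y)
        kept = candidates[np.argsort(scores)[::-1][:q]]  # survivors of this round
    return kept
```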

Drineas P., Magdon-Ismail M., Mahoney M.W., Woodruff D.P. (2012). Fast approximation of matrix coherence and statistical leverage. J Machine Learning Research 13, 3475-3506, doi: 10.5555/2503308.2503352.

Parry, K., Geppert, L., Munteanu, A., Ickstadt, K. (2021). Cross-Leverage Scores for Selecting Subsets of Explanatory Variables. arXiv e-prints, abs/2109.08399, https://arxiv.org/abs/2109.08399.

Schikowski, T., Sugiri, D., Ranft, U. et al. (2005). Long-term air pollution exposure and living close to busy roads are associated with COPD in women. Respir Res 6, 152. https://doi.org/10.1186/1465-9921-6-152



11:20am - 11:40am

The impact of missing SNPs in the calculation of polygenic scores

Hanna C. B. Brudermann, Inke R. König

Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Campus Lübeck, Lübeck, Germany

Polygenic scores (PGS) aggregate the information of many genome-wide markers – mostly single nucleotide polymorphisms (SNPs) – to estimate the genetic susceptibility of a person to a specific phenotype. Over the last few years, guidelines on how to construct PGS have been published (Choi et al. 2020), and the Polygenic Score Catalog (Lambert et al. 2021) is a free resource to screen and download PGS.

When applying a previously published PGS to new data, often not all markers that are part of the PGS are available, and those that are available differ in their quality. If, for example, quality-controlled genotyped and imputed data are used, SNPs fall into different categories of availability: directly genotyped or imputed, of high or low quality, or entirely unavailable.

However, knowledge of the impact of various types and degrees of missing markers on the performance of available PGS is limited.

In 2018, Chagnon et al. investigated the influence of missing markers on the calculation of a PGS. They compared the gold standard score based on all genotypes with scores resulting from two different strategies. In the first, the missing markers are omitted in the calculation of the score; and in the second, the missing genotypes are replaced with the genotypes of a proxy SNP with a predefined linkage disequilibrium. The resulting scores were compared with the gold standard regarding correlation and AUC among other PGS quality measures. The results showed that the use of a proxy SNP is generally better than omitting the marker but that attention has to be paid if the missing marker has a relatively high effect size.
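For illustration, the two strategies can be sketched as follows (hypothetical minimal code, not that of Chagnon et al. (2018); the input structures `weights`, `dosages` and `proxies` are assumptions of this sketch):

```python
# Illustrative sketch of the two strategies compared by Chagnon et al. (2018); not their code.
def polygenic_score(weights, dosages, proxies=None, strategy="omit"):
    """weights: {snp_id: effect size}; dosages: {snp_id: allele count 0/1/2};
    proxies: {snp_id: proxy snp_id in LD with it}; strategy: 'omit' or 'proxy'."""
    score = 0.0
    for snp, beta in weights.items():
        if snp in dosages:                                   # marker available
            score += beta * dosages[snp]
        elif (strategy == "proxy" and proxies is not None
              and snp in proxies and proxies[snp] in dosages):
            score += beta * dosages[proxies[snp]]            # genotype of a proxy SNP in LD
        # under strategy == "omit", missing markers simply drop out of the sum
    return score
```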

In this work, a comprehensive simulation study is performed to extend the methods of Chagnon et al. (2018) in a practically important way. Given that imputed data are typically used in practice, we now consider not only genotyped but also imputed SNPs of different imputation quality. Therefore, for a specific missing marker, one can theoretically choose between a proxy SNP and an imputed one. Nevertheless, it is common for many markers to be missing again after post-imputation quality control.

The first results show that the findings of Chagnon et al. (2018) hold for those SNPs that are still missing after imputation. Also, SNPs with a high info score (≥ 0.9) after imputation show similar behavior to very good proxy SNPs (r2 ≥ 0.9), while imputed SNPs with an info score < 0.1 still behave like good proxy SNPs (0.6 ≤ r2 < 0.8) for low frequencies of missing genotypes (< 20%) and worse than good proxy SNPs for higher frequencies of missing genotypes.

Using both strategies of Chagnon et al. (2018) to work around missing markers and combining them with the use of imputed markers, we investigate the impact of different degrees of missing markers. From this, a guideline for the practical use of PGS can be derived.

Literature:

  • Lambert et al. 2021 Nat Genet 53:420-5; doi: 10.1038/s41588-021-00783-5.
  • Choi et al. 2020 Nat Prot 15:2759-72; doi: 10.1038/s41596-020-0353-1.
  • Chagnon et al. 2018 PLoS One 13(7); doi: 10.1371/journal.pone.0200630.


11:40am - 12:00pm

Pre-processing and quality control of whole genome sequencing data: a case study using 9000 samples from the GENESIS-HD study

Raphael O. Betschart1, Domingo Aguilera-Garcia2, Hugo Babel1, Stefan Blankenberg1,3,4, Linlin Guo3, Holger Moch2, Dagmar Seidl2, Felix Thalén1, Alexandre Thiéry1, Raphael Twerenbold3,4, Tanja Zeller3,4, Martin Zoche2, Andreas Ziegler1,3,5,6

1Cardio-CARE, Medizincampus Davos, Switzerland; 2Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland; 3University Center of Cardiovascular Science and Department of Cardiology, University Heart and Vascular Center, University Medical Center Eppendorf, Hamburg, Germany; 4German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany; 5Swiss Institute of Bioinformatics, Lausanne, Switzerland; 6School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa

Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before association analyses between phenotypes and genotypes can be conducted, extensive pre-processing and quality control (QC) of the raw sequence data need to be performed. This case study describes the pre-processing pipeline and QC framework we selected for the GENEtic SequencIng Study Hamburg-Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on a single Illumina NovaSeq 6000 with an average coverage of 35x, using a PCR-free protocol and unique dual indices (UDI). For QC, one genome in a bottle (GIAB) trio was sequenced in tetraplicate, and one GIAB sample was successfully sequenced 70 times in different runs.

In this presentation, we illustrate the application of important QC metrics to the data at the different pre-processing stages. We provide empirical data for the compression of raw data using the novel original read archive (ORA). Our results show that the most important quality metrics for sample filtering were ancestry, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and low coverage. The compression ratio of the raw files using ORA was 5:1, and the compression time was linear with respect to genome coverage. In summary, pre-processing, joint calling, and QC of large WGS studies are feasible in reasonable time, and efficient QC procedures are readily available.
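A schematic sketch of such sample-level filtering is given below (column names and thresholds are illustrative assumptions of this sketch, not the study's actual cut-offs):

```python
# Schematic sample-level QC filter; column names and thresholds are illustrative only.
import pandas as pd

def flag_samples(qc: pd.DataFrame) -> pd.DataFrame:
    """qc: one row per sample with the (hypothetical) QC metric columns used below."""
    flags = pd.DataFrame(index=qc.index)
    flags["ancestry_outlier"]   = ~qc["ancestry_match"]           # inferred vs. reported ancestry
    flags["contaminated"]       = qc["contamination"] > 0.03      # cross-sample contamination estimate
    flags["het_hom_outlier"]    = (qc["het_hom_ratio"] - qc["het_hom_ratio"].median()).abs() > 0.15
    flags["unexpected_related"] = qc["max_kinship"] > 0.177       # closer than ~2nd degree
    flags["low_coverage"]       = qc["mean_coverage"] < 30        # study targeted ~35x on average
    flags["exclude"]            = flags.any(axis=1)
    return flags
```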



12:00pm - 12:20pm

More than meets the eye: Dimension reduction and temporal patterns in time-series single-cell RNA-sequencing data

Maren Hackenberg, Laia Canal Guitart, Harald Binder

Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center -- University of Freiburg, Germany

Generating single-cell RNA-sequencing (scRNA-seq) data at several time points, e.g., during a developmental process, promises insights into mechanisms controlling cellular differentiation at the level of individual cells. As there is no one-to-one correspondence between cells at different time points, a first step in a typical analysis workflow is to reduce dimensionality to visually inspect temporal patterns. Here, one implicitly assumes that the resulting low-dimensional manifold captures the central gene expression dynamics of interest. Yet, commonly used techniques are not specifically designed to do so, and their representations do not necessarily coincide with the one that best reflects the actual underlying dynamics.

We thus investigate how visual representations of different temporal patterns in time-series scRNA-seq data depend on the choice of dimension reduction, considering principal component analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE), uniform manifold approximation and projection (UMAP) and single-cell variational inference (scVI), a popular deep learning-based approach.

To characterize the approaches in a controlled setting, we create an artificial time series from a snapshot scRNA-seq dataset by simulating an underlying low-dimensional developmental process and generating corresponding high-dimensional gene expression data. Specifically, we apply a specific dimension reduction approach (say, t-SNE) to the snapshot data and transform the low-dimensional representation according to biologically meaningful temporal patterns, e.g., dividing cell clusters during a differentiation process. We train a deep learning model to generate synthetic high-dimensional gene expression profiles corresponding to the simulated pattern at each time point, and apply the different dimension reduction approaches to the high-dimensional time-series data to compare how well they reflect the underlying temporal pattern introduced in, e.g., t-SNE space.
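A condensed sketch of this comparison loop is given below (using scikit-learn and umap-learn; scVI is omitted for brevity, and the agreement metric, the Spearman correlation of pairwise cell distances, is our illustrative choice, not necessarily the one used in this work):

```python
# Sketch of the comparison loop; scVI (scvi-tools) is omitted for brevity and the
# agreement metric is an illustrative choice, not necessarily the one used in this work.
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

def embedding_agreement(Z_ref, Z):
    """Spearman correlation between pairwise cell distances in the simulated
    low-dimensional pattern (Z_ref) and in a recovered embedding (Z)."""
    return spearmanr(pdist(Z_ref), pdist(Z))[0]

reducers = {
    "PCA":   lambda X: PCA(n_components=2).fit_transform(X),
    "t-SNE": lambda X: TSNE(n_components=2).fit_transform(X),
    "UMAP":  lambda X: umap.UMAP(n_components=2).fit_transform(X),
}

def compare(X_timeseries, Z_pattern):
    """X_timeseries: synthetic high-dimensional expression profiles (cells x genes);
    Z_pattern: the simulated low-dimensional pattern they were generated from."""
    return {name: embedding_agreement(Z_pattern, reduce(X_timeseries))
            for name, reduce in reducers.items()}
```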

We thus characterize each technique's perspective on a specific temporal pattern, both with respect to the underlying representation in which the pattern was introduced and with respect to the pattern itself. The results illustrate how the choice of dimension reduction approach can dramatically alter, i.e., distort, the temporal structure. To alleviate such problems, we provide directions for designing dimension reduction techniques that explicitly respect temporal structure.



 