Conference Agenda

Overview and details of the sessions of this conference.

Session Overview
Session
Poster 1: Poster Speed Session 1
Time:
Monday, 04/Sept/2023:
10:00am - 10:30am

Session Chair: Kristina Weber
Location: Lecture Room U1.111 (hybrid, with Zoom)


Presentations

Statistical approaches to the design and analysis of patient preference studies.

Byron Jones, Maria Costa

Novartis, Switzerland

Incorporating patient views into drug development and post-marketing decisions is becoming increasingly important to regulatory agencies, Health Technology Assessment (HTA) bodies and the pharmaceutical industry [1,2,3]. One important contributor to the totality of patient evidence needed by all of the aforementioned is a quantitative patient preference study.

This presentation will describe the design and analysis of one major type of patient preference study: the Discrete Choice Experiment (DCE). In a DCE, patients work through a series of choice tasks; in each task they are shown profiles of treatments, devices or symptoms and asked to choose the profile they prefer. The basic structure of the design of a DCE is similar to that of a fractional factorial design. Statistical models that are typically used to analyse the preferences obtained in a DCE are, in increasing order of complexity, the multinomial logistic model (MNL), the random parameters logistic model (RPL) and the hierarchical Bayesian model (HB). Optimal designs for a DCE depend on the model chosen and can be obtained using theoretical results or computer search algorithms. A brief description of the different approaches will be given.
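
For readers who want to try the basic MNL analysis, the short R sketch below fits a multinomial (conditional) logit to simulated choice data via survival::clogit; the attributes, effect sizes and design are invented for illustration and are not taken from the studies discussed in this abstract.

```r
# Minimal sketch (simulated DCE data, invented attributes and preference weights):
# the MNL can be estimated as a conditional logit with one stratum per choice task.
library(survival)

set.seed(1)
n_resp <- 100; n_tasks <- 8; n_alts <- 2
d <- expand.grid(alt = 1:n_alts, task = 1:n_tasks, id = 1:n_resp)
d$task_id  <- interaction(d$id, d$task)              # one stratum per choice task
d$efficacy <- rbinom(nrow(d), 1, 0.5)                # hypothetical attribute levels
d$side_eff <- rbinom(nrow(d), 1, 0.5)

# utility = 1.0*efficacy - 0.8*side_eff + Gumbel error; the highest-utility profile is chosen
u <- 1.0 * d$efficacy - 0.8 * d$side_eff - log(-log(runif(nrow(d))))
d$choice <- ave(u, d$task_id, FUN = function(x) as.integer(x == max(x)))

fit <- clogit(choice ~ efficacy + side_eff + strata(task_id), data = d)
summary(fit)
```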

The analysis of a DCE will be described using a small case study based on a published DCE [4] that collected preferences from patients with COPD. The results from the different models will be compared and some useful graphical displays of the results will be given.

References

1. CDER's Patient-Focused Drug Development. https://www.fda.gov/drugs/development-approval-process-drugs/cder-patient-focused-drug-development

2. Patient-focused drug development: ICH reflection paper (EMA/FDA presentation, M. Bonelli, EMA). https://www.ema.europa.eu/en/documents/presentation/presentation-ema/fda-patient-focused-drug-development-ich-reflection-paper-mbonelli-ema_en.pdf

3. Shared decision making. NICE guideline NG197, published 17 June 2021. www.nice.org.uk/guidance/ng197

4. Cook, N.S., Criner, G.J., Burgel, P.-R., Mycock, M., Gardener, T., Mellor, P., Hallworth, P., Sully, K., Tatlock, S., Klein, B., Jones, B., Le Rouzic, O., Adams, K., Phillips, K., McKevitt, M., Toyama, K. and Gutzwiller, F. (2022). People living with moderate-to-severe COPD prefer improvement of daily symptoms over the improvement of exacerbations: a multicountry patient preference study. ERJ Open Research, 01 Apr 2022, 8(2): 00686-2021. DOI: 10.1183/23120541.00686-2021



From Desirability Towards Bayesian Design Space

Martin Otava

Janssen Pharmaceutical Companies of Johnson & Johnson, Czechia

The design space answers a basic scientific question: across the experimental space, which settings lead to a high probability of achieving a certain quality threshold? [1] For example, in a pharmaceutical manufacturing context: the probability that a chemical synthesis will result in a yield above 97%.

The traditional setup to answer such a question is frequentist modelling of the available data from a designed experiment combined with a desirability function: a function of the quality level whose value reflects our desire for that level [2]. The simplest version for the yield example would be a step function that is zero below 97% and one for any value above 97%. Naturally, the argument arises that 96.9% could also be rather acceptable, and the desirability framework allows us to reflect that by specifying a less steep function, e.g. zero below 90% and then increasing linearly from 90% to 97%, or an even smoother function.

However, there are two main issues with such an approach. Firstly, the function is often chosen rather arbitrarily in practice, as it can be unclear which functional shape is appropriate. Secondly, it attempts to quantify precisely the desire for a particular quality level, while the evaluation is based solely on average predictions, ignoring any uncertainty in the estimated relationship between predictors and response.

The frequentist design space checks whether the model predictions at each experimental setting are above the pre-specified quality threshold, providing an estimate of the probability of interest. However, such an estimate is still typically based only on the residual error and the average prediction. As simulating from confidence intervals on the model parameters is incompatible with the frequentist framework, adding uncertainty on the effect sizes of the various predictors is a cumbersome matter.

The Bayesian framework allows the uncertainty in all parameters to be addressed directly through posterior simulation [4]. The posterior distribution of the response can be used to determine the probability of compliance with specifications for a given setting of the covariates. Alternatively, the probability itself can be seen as the quantity of interest and calculated from an analytical closed-form solution at each iteration, yielding a posterior distribution of the probability itself.
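
As a minimal illustration of this posterior-simulation idea (simulated data and assumed covariates temp and time, not the case study from the talk), one could compute the posterior probability of exceeding the 97% yield specification over a grid of settings, e.g. with rstanarm:

```r
# Sketch only: Bayesian design space as the posterior probability of meeting
# the specification (yield > 97%) at each point of the experimental grid.
library(rstanarm)

set.seed(2)
dat <- data.frame(temp = runif(40, 20, 60), time = runif(40, 1, 5))
dat$yield <- 90 + 0.10 * dat$temp + 1.2 * dat$time + rnorm(40, sd = 1)

fit <- stan_glm(yield ~ temp + time, data = dat, family = gaussian(), refresh = 0)

grid <- expand.grid(temp = seq(20, 60, by = 5), time = seq(1, 5, by = 0.5))
pp   <- posterior_predict(fit, newdata = grid)     # posterior draws x grid points

grid$p_spec <- colMeans(pp > 97)                   # P(yield > 97%) per setting
head(grid[order(-grid$p_spec), ])
```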

In this presentation, we look at the advantages and disadvantages of both the desirability and the Bayesian design space frameworks, including their implementation and interpretation challenges. We will demonstrate the flexibility of the Bayesian framework and emphasize the need to clearly identify beforehand what the ultimate metric of interest is. A case study from pharmaceutical manufacturing will be shown (with simulated data), but the applicability of the demonstrated principles is broader: the interpretation of probabilistic statements in the Bayesian framework in comparison to their frequentist counterparts, the correct use of Bayesian results, and the worth of the more complicated framework in simple cases.

References:

[1] Harrington Jr. E. C. (1965). The Desirability Function. Industrial Quality Control, 21(10): 494-498.

[2] Del Castillo E., Montgomery D. C., McCarville D. R. (1996). Modified Desirability Functions for Multiresponse Optimization. Journal of Quality Technology, 28(3): 337-345.

[3] Lebrun P., Boulanger B., Debrus B., Lambert P., Hubert P. (2013). A Bayesian Design Space for Analytical Methods Based on Multivariate Models and Prediction. Journal of Biopharmaceutical Statistics, 23(6): 1330-1351.



Fast and Precise Survival Testing for Genome-Wide Association Studies

Tong Yu1, Axel Benner1, Dominic Edelmann1,2

1German Cancer Research Center, Heidelberg, Germany; 2NCT Trial Center, Heidelberg, Germany

Genome-wide association studies (GWAS) involve testing millions of single-nucleotide polymorphisms for their association with survival responses. To adjust for multiple testing, the genome-wide significance level of α = 5×10⁻⁸ is commonly used. However, standard survival statistics based on the Cox model, such as the Wald or the Score test, cannot reliably control the type I error rate at this level. On the other hand, more reliable alternatives, such as the Firth correction, are computationally expensive, making them impractical for large datasets.

We compared the type I error rate, power, and runtime of various Cox model-based survival tests, including the Score test, Wald test, Likelihood ratio test, Firth correction, and a saddle-point approximation based Score test (SPACOX) using simulations and real data from the UK Biobank. Our findings reveal that the Wald and Score tests are highly anti-conservative for low minor allele frequencies (MAFs) and/or event rates, whereas SPACOX is substantially conservative in some settings. Furthermore, these tests exhibit different behavior depending on the direction of the effect. Except for score test-based procedures, the runtime of all tests is prohibitively high, particularly for the Firth correction.

To address this challenge, we propose a fast and precise testing procedure for GWAS based on prescreening via an extremely efficient version of the Score test, followed by testing of the screened subset of genes using the Firth correction or Likelihood ratio test. We demonstrate the performance of our test using simulations and real data from the UK Biobank. Our method provides a practical and accurate alternative for GWAS that can be applied to large datasets.
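
A rough R sketch of the two-stage idea (screen every SNP with the Cox score test, then re-test the screened subset with the likelihood ratio test) is given below; it uses simulated genotypes and survival times, and it is not the authors' optimized implementation.

```r
# Illustrative two-stage testing on simulated data (invented effect sizes and MAFs).
library(survival)

set.seed(3)
n <- 2000; n_snp <- 200
maf  <- c(0.2, 0.1, 0.05, runif(n_snp - 3, 0.01, 0.3))
snps <- sapply(maf, function(p) rbinom(n, 2, p))
beta <- c(rep(0.8, 3), rep(0, n_snp - 3))          # three truly associated SNPs
lp   <- drop(snps %*% beta)
time   <- rexp(n, rate = 0.05 * exp(lp))
status <- rbinom(n, 1, 0.7)

# stage 1: score test for every SNP
score_p <- apply(snps, 2, function(g)
  summary(coxph(Surv(time, status) ~ g))$sctest[["pvalue"]])

# stage 2: likelihood ratio test only for the screened subset
keep  <- which(score_p < 1e-3)                     # lenient screening threshold
lrt_p <- vapply(keep, function(j)
  summary(coxph(Surv(time, status) ~ snps[, j]))$logtest[["pvalue"]], numeric(1))
data.frame(snp = keep, score_p = score_p[keep], lrt_p = lrt_p)
```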



Improved protein quantification by using bipartite peptide-protein graphs

Karin Schork1,2,3, Michael Turewicz1,2,4,5, Julian Uszkoreit1,2,6, Jörg Rahnenführer3, Martin Eisenacher1,2

1Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, Germany; 2Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, Bochum, Germany; 3Department of Statistics, TU Dortmund University, Dortmund, Germany; 4Current address: Institute for Clinical Biochemistry and Pathobiochemistry, German Diabetes Center (DDZ), Leibniz Center for Diabetes Research at the Heinrich Heine University Düsseldorf, Düsseldorf, Germany; 5Current address: German Center for Diabetes Research (DZD), Partner Düsseldorf, München-Neuherberg, Germany; 6Current address: Universitätsklinikum Düsseldorf, Düsseldorf, Germany

Introduction:

In bottom-up proteomics, proteins are enzymatically digested into peptides (smaller amino acid chains), often using the enzyme trypsin, before measurement with mass spectrometry (MS). As a consequence, it is peptides rather than proteins that are identified and quantified directly from the MS measurements. Quantification of proteins from this peptide-level data remains a challenge, especially due to the occurrence of shared peptides, which could originate from multiple different protein sequences.

The relationship between proteins and their corresponding peptides can be represented by bipartite graphs. In this data structure, there are two types of nodes (peptides and proteins). Each edge connects a peptide node with a protein node if and only if the peptide could originate from a tryptic digestion of the protein. The aim of this study (Schork et al., 2022, PLOS ONE) is to characterize and structure the different types of graphs that occur and to compare them between different data sets. Furthermore, we want to show how this knowledge can aid relative protein quantification. Our focus is especially on gaining quantitative information about proteins with only shared peptides, as they are neglected by many current algorithms.

Methods:

We construct bipartite peptide-protein graphs using quantified peptides from three measured data sets, as well as all theoretically possible peptides from the corresponding protein sequence databases. The structure and characteristics of the occurring graphs are compared between data sets as well as between database (theoretical) and quantitative level.

Additionally, we developed and applied a method that calculates protein ratios from peptide ratios by making use of the bipartite graph structures. For each peptide node, an equation is formed based on the bipartite graph structures and the measured peptide ratios. Protein ratios are estimated by using an optimization method to find solutions with a minimal error term. Special focus lies on the proteins with only shared peptides, which often lead to a range of optimal solutions instead of a point estimate.
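
The following toy R sketch illustrates the general idea in a deliberately simplified form that is not the authors' exact formulation: peptide log-ratios are written as weighted combinations of the protein log-ratios of their adjacent protein nodes, and the protein ratios are obtained by minimizing the squared error of these peptide equations.

```r
# Toy example: estimate protein log2 ratios from peptide log2 ratios
# via least squares on the bipartite peptide-protein incidence matrix.
A <- rbind(c(1, 0, 0),    # peptide unique to protein 1
           c(1, 1, 0),    # peptide shared by proteins 1 and 2
           c(0, 1, 1),    # peptide shared by proteins 2 and 3
           c(0, 0, 1))    # peptide unique to protein 3
W <- A / rowSums(A)                       # equal-share weights per peptide node
b <- c(1.0, 0.6, -0.2, -0.9)              # measured peptide log2 ratios (invented)

fit <- optim(par = rep(0, ncol(A)),       # protein log2 ratios to be estimated
             fn  = function(x) sum((W %*% x - b)^2),
             method = "BFGS")
fit$par
```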

Results:

When comparing the graphs from the theoretical peptides to the measured ones, two opposing effects can be observed. On the one hand, the graphs based on measured peptides are on average smaller and less complex compared to graphs using all theoretically possible peptides. On the other hand, the proportion of protein nodes without unique peptides, which are a complicated case for protein quantification, is considerably larger for measured data. Additionally, the proportion of graphs containing at least one protein node with only shared peptides rises, when going from database to quantitative level.

Conclusion:

Large differences between the structures of bipartite peptide-protein graphs have been observed between database and quantitative level as well as between the three analyzed species. In the three analyzed measured data sets, the proportion of protein nodes without unique peptides was 6.3% (yeast), 46.6% (mouse) and 55.0% (human), respectively. Especially for these proteins, using information from the bipartite graph structures for protein quantification is beneficial.



Missing values and inconclusive results in diagnostic studies – a scoping review of methods

Katharina Stahlmann1, Johannes B. Reitsma2, Antonia Zapf1

1Institute of Medical Biometry and Epidemiology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; 2Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands

Introduction:

Inappropriate handling of missing values can lead to biased results in any type of research. In diagnostic accuracy studies, this can have severe consequences, as many patients may be misdiagnosed and assigned to the wrong treatment. Nonetheless, most diagnostic studies either exclude missing values and inconclusive results in the index test or reference standard from the analysis or apply simple methods that result in biased accuracy estimates. This may be due to a lack of availability or awareness of appropriate methods. Therefore, the aim of this scoping review was to provide an overview of strategies to handle missing values and inconclusive results in the reference standard or index test in diagnostic accuracy studies.

Methods:

We conducted a systematic literature search in MEDLINE, Cochrane Library, and Web of Science up to April 2022 to identify methodological articles proposing methods for handling missing values and inconclusive results in diagnostic studies. Additionally, reference lists and Google Scholar citations were searched. Besides methodological studies, we also searched for studies applying the proposed methods.

Results:

Of 110 articles included in this review, most (n=67) addressed missing values in the reference standard. A further 15 and 12 articles proposed methods for handling missing values in the index test alone and in both the index test and the reference standard, respectively. Lastly, 17 articles presented methods for handling inconclusive results. Methods for missing values in the index test and for inconclusive results encompass imputation, frequentist and Bayesian likelihood, model-based, and latent class methods. Most of these methods can be applied under a missing (completely) at random assumption, but only a few also incorporate missing not at random assumptions. While methods for missing values in the reference standard are regularly applied in practice, this is not the case for methods addressing missing values and inconclusive results in the index test.

Discussion:

Missing values and inconclusive results in the index test are commonly not adequately addressed in diagnostic studies despite the availability of various methods based on different assumptions. This may be due in part to the lack of programming code, R packages, or Shiny apps, which would facilitate their application. Our comprehensive overview and description of available methods may be a first step towards raising further awareness of these methods and enhancing their application. Nevertheless, future research is needed to compare the performance of these methods under different conditions to give valid and robust recommendations for their usage in various diagnostic accuracy research scenarios. Within our project, we are currently working on a simulation study to compare the identified methods in order to make recommendations regarding their application.



Evaluating statistical matching methods for treatment effects after different aortic valve replacement surgeries

Veronika Anna Waldorf, Eva Herrmann

Goethe University Frankfurt, Germany

For a better comparison of treatment effects, matching methods are commonly used in observational studies. Besides propensity score matching, regression models and inverse probability weighting schemes are often implemented.

In our simulation study the settings mimic patient groups of the German Aortic Valve Registry (GARY). Those groups represent the two most common aortic valve replacement surgeries: minimal invasive transcatheter aortic valve implantation (TAVI) and surgical aortic valve replacement (SAVR).

Patients presenting with different survival-affecting health parameters will receive different treatments suitable to their varying operability. Therefore, the patient population within SAVR is mostly in better health condition compared to TAVI [1].

Even though statistics have shown a superiority of SAVR surgeries in comparison to TAVI, this superiority cannot be taken as fact due to the heterogeneous group populations.

In our study setting we performed different propensity score matching methods as well as a weighting scheme and regression model in R using the packages MatchIt, twang and survival [2-4], aiming to decrease the imbalance of both groups and bias in comparing clinical outcome.

The treatment effect was measured with time-to-event analysis. To further assess the usability of these methods, we simulated different group sizes from 500 up to 20,000 patients, with 500 simulation runs for each sample size.
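
As a small, self-contained illustration of this workflow (simulated covariates rather than GARY data; variable names invented), propensity score matching with MatchIt can be combined with a Cox model on the matched sample:

```r
# Sketch: nearest-neighbour propensity score matching followed by a Cox model.
library(MatchIt)
library(survival)

set.seed(4)
n  <- 1000
df <- data.frame(age = rnorm(n, 75, 8), frailty = rnorm(n))
ps <- plogis(-8 + 0.1 * df$age + 0.8 * df$frailty)      # sicker patients tend to TAVI
df$tavi   <- rbinom(n, 1, ps)
df$time   <- rexp(n, rate = 0.02 * exp(0.5 * df$frailty))
df$status <- rbinom(n, 1, 0.8)

m  <- matchit(tavi ~ age + frailty, data = df, method = "nearest")
md <- match.data(m)                                      # matched sample with weights
coxph(Surv(time, status) ~ tavi, data = md, weights = weights)
```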

References

1. Hamm CW et al. The German Aortic Valve Registry (GARY): in-hospital outcome. European Heart Journal. 2014; 35: 1588–1598.

2. Ho DE et al. MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software. 2011; 42: 1–28.

3. Ridgeway G et al. twang: Toolkit for Weighting and Analysis of Nonequivalent Groups. R package version 1.4-9.5. 2016.

4. Therneau T. survival: A Package for Survival Analysis in R. R package version 3.5-3. 2023.



GUiMeta - A new graphical user interface for meta-analyses

Rejane Golbach

Goethe University Frankfurt, Germany

The popularity of meta-analyses is rising, which is reflected in the steadily growing number of publications. Many theoretical models are being newly and further developed while at the same time easy-to-learn tools for performing meta-analyses become increasingly important. Several menu-based programs but also R packages are available for conducting meta-analyses. Nevertheless, one of the major difficulties remains the summarization of the data, i.e., performing imputations, which is hardly supported by most programs. The R shiny [1] application GUiMeta provides a solution to this difficulty with a new interface for meta-analyses.

GUiMeta guides users through data entry, data analysis based on state-of-the-art R packages, and interpretation of the results supported by meaningful graphs and statistically substantiated results. One of GUiMeta’s major strengths is the adaptive data table, which is provided for documenting effect sizes from systematic reviews and structuring the heterogeneous data from various studies.

[1] Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, Allen J, McPherson J, Dipert A, Borges B (2022). _shiny: Web Application Framework for R_. R package version 1.7.2



Impact of dose modifications on the assessment of the dose-tumor size relationship in oncology dose-ranging clinical trials

Francois Mercier1, Maia-Iulia Muresan2, Georgios Kazantzidis1, Daniel Sabanes-Bove1, Ulrich Beyer1

1F. Hoffmann-La Roche, Switzerland; 2University of Geneva

Changes in tumor size are among the most closely considered endpoints in early clinical trials with solid tumor patients. Indeed, longitudinal data on the sum of lesion diameters (SLD) of target lesions provide insights into the extent and duration of response to treatment. Nonlinear mixed-effect models, referred to as tumor growth inhibition (TGI) models, have proved to capture these data accurately, and they have been used to support dose recommendation and prediction of survival. These models can be fitted with relatively sparse data, and therefore could also be used to inform the dose-response relationship based on phase 1a and/or phase 1b study data. In phase 1 studies, patients are exposed to a range of doses to assess safety/toxicity. However, early insights on efficacy are also instrumental in defining the recommended phase 2 dose. In these trials, it is not unusual for participants to omit doses, or to receive doses lower than planned, in order to mitigate toxicity events.

Using simulations, we evaluate the impact of dose modifications on the ability to characterize the dose-response relationship using TGI models. Various scenarios are considered where doses are either reduced or omitted, in trials of sample size ranging from 3 to 10 patients per dose-cohort, and with a proportion of patients impacted by dose modifications ranging from 10% to 50%. In each case, a TGI dose response is fitted to the simulated data. Simulation outputs are expressed in terms of bias and (im)precision of the estimated dose-response model when compared to the hypothetical scenario of no dose modification.
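
To make the simulation setup concrete, the sketch below simulates SLD profiles from a simplified exponential-kinetics TGI-type model in which a fraction of patients has doses halved or omitted; all parameter values are invented and the model is deliberately simpler than the TGI models used in the study.

```r
# Sketch of the dose-modification scenario (toy model and parameters).
set.seed(7)
doses <- rep(c(10, 30, 100), each = 6)             # planned dose per patient
weeks <- seq(0, 24, by = 6)
p_mod <- 0.3                                       # share of patients with a modification

sld <- t(sapply(doses, function(d) {
  d_recv <- if (runif(1) < p_mod) d * sample(c(0, 0.5), 1) else d  # omit or halve
  50 * (exp(-0.002 * d_recv * weeks) + exp(0.02 * weeks) - 1) +
    rnorm(length(weeks), sd = 2)
}))
matplot(weeks, t(sld), type = "l", ylab = "SLD (mm)", xlab = "week")
```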

We draw conclusions on the minimum conditions required to study the dose-effect on patients' tumor burden in phase 1 dose-escalation studies.



Simulation based sample size calculation in two-way fixed effects designs including interactions and repeated measures

Louis Rodrigue Macias, Silke Szymczak

Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

Block designs to compare quantitative outcomes between groups of subjects assigned to a combination of j levels of factor A and k levels of factor B are widely used in animal studies. Existing tools for sample size calculation in this setting require the effect size, expressed as partial η², Cohen's f or f², among others, to be specified. Cohen [1] provides guidelines on the interpretation of f values with respect to the variability of within-group means. These examples have been interpreted as rules of thumb for small, medium and large effect sizes. However, determining the f values corresponding to expected group means in a j by k design is not intuitive. This is especially the case when there is an interaction in the factors' effects. Moreover, when one of the factors is a repeated measure, within-subject correlation must be considered. Finally, sample size estimation for non-parametric tests is not easily implemented. For these reasons, we have implemented several R functions that perform these tasks by simulating the planned experimental setting for both independent and repeated measurements.

The sample size required for power 1−β at type I error α is calculated by estimating the proportion of iterations in which the p-value of interest (for the effect of factor A, factor B, or their interaction) is smaller than α. The simulation input is a j by k matrix of expected means, a matrix of standard deviations (SD) with the same dimensions, and the group sample size, which may be a single integer in the case of a balanced design or a matrix if the design is unbalanced. A helper function is available to create the means and SD matrices. Input for this function includes a reference mean, which would typically correspond to the expected mean in the group with j = k = 1, the expected change from the reference mean for each factor level on the multiplicative scale, and the expected deviation from linear effects due to interaction.
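
A stripped-down version of such a simulation-based power function for the interaction term (with an invented effect pattern; this is not the package's actual interface) could look as follows:

```r
# Sketch: simulated power for the A:B interaction in a j x k between-subject design.
sim_power <- function(means, sds, n, alpha = 0.05, n_sim = 1000) {
  j <- nrow(means); k <- ncol(means)
  p <- replicate(n_sim, {
    d <- expand.grid(rep = 1:n, A = factor(1:j), B = factor(1:k))
    idx <- cbind(as.integer(d$A), as.integer(d$B))
    d$y <- rnorm(nrow(d), mean = means[idx], sd = sds[idx])
    anova(aov(y ~ A * B, data = d))[3, "Pr(>F)"]   # row 3 is the A:B interaction
  })
  mean(p < alpha)                                  # estimated power
}

means <- rbind(c(10, 12, 14),                      # factor A, level 1
               c(10, 13, 18))                      # level 2: non-additive pattern
sds   <- matrix(2, nrow = 2, ncol = 3)
sim_power(means, sds, n = 8, n_sim = 500)
```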

Sample size requirements were compared to those obtained for ANOVA with G*Power [2]. However, our implementation has the advantage of allowing unbalanced designs and of also estimating power for non-parametric tests. Computation time is < 5 minutes for 1000 iterations in a 2 by 3 design on a 3.9 GHz processor in the independent-measurements case and ~2 hours for a 6 by 2 repeated-measurements design, depending on the sample size space explored. We plan to provide these functions as an R package that can be part of a series of useful tools when planning a two-way fixed effects study.

  1. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

  2. Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175-191.



A slight modification of the experimental design can cause a substantial knowledge gain in non-clinical studies with female rats

Monika Brüning1, Bernd Baier2, Bernd-Wolfgang Igl1

1Global Biostatistics and Data Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Germany; 2Nonclinical Drug Safety, Boehringer Ingelheim Pharma GmbH & Co. KG, Germany

In the pharmaceutical industry, various non-clinical studies are performed to secure the safety of potential drug candidates for use in humans. To address potential effects on patients' fertility, so-called "Fertility and Early Embryonic Development" (FEED) animal studies are conducted. Therein, the detection of possible effects of the test item on the regular estrous cycle of about 4 days in female rats is a relevant endpoint, which is usually assessed by vaginal swabs and cytological analyses of the estrous stage before and during treatment.

As estrous cyclicity causes substantial changes in the general activity level of female rats, we compared the estrous cycle data with body weight changes. Body weight is a highly relevant and informative endpoint in toxicological in vivo studies, usually taken at least two times per week in rodents. To increase the time resolution during the estrous cycle, we took daily body weights.

We have now analyzed the possibility of estimating relevant information on the female reproductive status by simply weighing the animals daily. This has several positive consequences for the conducting laboratory and is also of interest from an animal welfare perspective. In addition, it might not only be relevant in dedicated FEED studies but could be applied more generally in studies on female rats: with simple body weight measurements, the complex hormonal and behavioral changes could be monitored, allowing for a better interpretation of data on an individual and population level.



Prediction intervals for overdispersed Poisson data and their application to historical controls

Max Menssen

Leibniz Universität Hannover, Germany

Toxicological studies are a class of biological trials aimed at evaluating the toxicological properties of chemical compounds in model organisms. Typical toxicological studies comprise an untreated control group and several cohorts treated with the compound of interest.

If the same type of study is run several times, using the same model organism and the same experimental setup, the knowledge about the baseline reaction obtained from the (historical) untreated control groups grows with every new trial. Hence, most guidelines from the OECD require the verification of the current control group based on historical control data. This can be done by the application of prediction intervals.

In several study types, such as the Ames assay, the endpoint of interest consists of counted observations (e.g. the number of revertant bacteria colonies per Petri dish). In this case, the historical control data can be modelled as overdispersed Poisson. Hence, an asymptotic prediction interval that allows for overdispersion is proposed. Furthermore, it will be demonstrated how to use bootstrap calibration in order to enhance the interval's small-sample properties.
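
The flavour of such an interval can be conveyed by a rough base-R sketch (a simplified asymptotic calculation, not the predint implementation; the historical counts are invented):

```r
# Sketch: asymptotic prediction interval for one future control count
# under quasi-Poisson overdispersion.
pi_quasi_pois <- function(hist_counts, alpha = 0.05) {
  h       <- length(hist_counts)
  lambda  <- mean(hist_counts)
  phi     <- max(1, var(hist_counts) / lambda)     # overdispersion estimate
  se_pred <- sqrt(phi * lambda * (1 + 1 / h))      # future observation + estimation error
  z       <- qnorm(1 - alpha / 2)
  c(lower = max(0, lambda - z * se_pred), upper = lambda + z * se_pred)
}

# e.g. historical numbers of revertant colonies per plate in vehicle controls
pi_quasi_pois(c(18, 25, 21, 30, 17, 22, 26, 19))
```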

The proposed methodology is implemented in the R package predint and its application will be demonstrated based on a real life data set.



Minimum Volume Confidence Set Optimality for Simultaneous Confidence Bands for Percentiles in Linear Regression

Lingjiao Wang1, Yang Han1, Wei Liu2, Frank Bretz3

1Department of Mathematics, University of Manchester, UK; 2School of Mathematical Sciences & Southampton Statistical Sciences Research Institute, University of Southampton, UK; 3Novartis Pharma AG, Basel, Switzerland

Simultaneous confidence bands for a percentile line in linear regression have been considered by several authors, and the average width of a simultaneous confidence band has been widely used as a criterion for the comparison of different confidence bands. In this work, exact symmetric and asymmetric simultaneous confidence bands over finite covariate intervals are considered, and the area of the confidence set that corresponds to a confidence band is used as the criterion for the comparison. The optimal simultaneous confidence band is found under the minimum area confidence set (MACS) or minimum volume confidence set (MVCS) criterion. The area of the corresponding confidence sets for asymmetric simultaneous confidence bands is uniformly smaller, and can be very substantially smaller, than that for the corresponding exact symmetric simultaneous confidence bands. Therefore, asymmetric simultaneous confidence bands should always be used under the MACS criterion. A real data example is included for illustration.



Use of clinical tolerance limits for assessing agreement

Patrick Taffé

University of Lausanne (UNIL), Switzerland

Bland & Altman's limits of agreement (LoA) approach is one of the most widely used statistical methods to assess the agreement between two measurement methods (for example, between two biomarkers). This methodology, however, does not directly assess the level of agreement, and it is up to the investigator to decide whether or not disagreement is too high for the two methods to be deemed interchangeable.

To more directly quantify the level of agreement, Lin et al. (2002) have proposed the concept of « coverage probability », which is the probability that the absolute difference between the two measurements made on the same subject is less than a pre-defined level. Their methodology, however, does not take into account that the level of agreement might depend on the value of the true latent trait. In addition, homoscedastic measurement errors are implicitly assumed (an often too strong assumption) and the presence of a possible bias is not assessed. For these reasons, Stevens et al. (2017, 2018) have extended this methodology to allow the coverage probability to depend on the value of the latent trait, as well as on the amount of bias, and called their extended agreement concept « probability of agreement ».

In this study, we have further extended this methodology by relaxing the strong parametric assumptions regarding the distribution of the latent trait and by developing inference methods that allow the computation of both pointwise and simultaneous confidence bands. Our methodology requires repeated measurements for at least one of the two measurement methods and accommodates heteroscedastic measurement errors. It often performs very well even with only one measurement from one of the two measurement methods and at least 5 repeated measurements from the other. It circumvents some of the deficiencies of LoA and provides a more direct assessment of the agreement level.
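
A toy simulation conveys the basic quantity of interest (simulated measurements and an arbitrary clinical tolerance limit delta, not the authors' estimator):

```r
# Sketch: empirical probability of agreement P(|y1 - y2| < delta) by latent trait level.
set.seed(5)
trait <- runif(5000, 50, 150)                     # latent true values
y1 <- trait + rnorm(5000, sd = 2 + 0.02 * trait)  # method 1, heteroscedastic errors
y2 <- 3 + 0.98 * trait + rnorm(5000, sd = 3)      # method 2, with bias

delta <- 8                                        # clinical tolerance limit
bins  <- cut(trait, breaks = seq(50, 150, by = 20))
tapply(abs(y1 - y2) < delta, bins, mean)          # agreement probability per bin
```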

References

  1. Lin L, Hedayat AS, Sinha B, and Yang M. Statistical methods in assessing agreement: models, issues, and tools. JASA 2002; 97: 257-270.
  2. Stevens NT, Steiner SH, and MacKay RJ. Assessing agreement between two measurement systems: an alternative to the limits of agreement approach. Stat Meth Med Res 2017; 26: 2487-2504.
  3. Stevens NT, Steiner SH, and MacKay RJ. Comparing heteroscedastic measurement systems with the probability of agreement. Stat Meth Med Res 2018; 27: 3420-3435.
  4. Taffé P. Use of clinical tolerance limits for assessing agreement. Stat Meth Med Res 2023; 32: 195-206.


Decomposition of Explained Variation in the Linear Mixed Model

Nicholas Schreck, Manuel Wiesenfarth

DKFZ Heidelberg, Germany

The concepts of variance decomposition and explained variation are the basis of relevance assessments of factors in ANOVA, and lead to the definition of the widely applied coefficient of determination in the linear model. In the linear mixed model, the assessment and comparison of the dispersion relevance of explanatory variables associated with fixed and random effects still remains an important open practical problem. To fill this gap, our contribution is two-fold. Firstly, based on the restricted maximum likelihood equations in the variance components form of the linear mixed model, we prove a proper decomposition of the sum of squares of the dependent variable into unbiased estimators of interpretable estimands of explained variation. Our result leads us to propose a natural extension of the well-known adjusted coefficient of determination to the linear mixed model. Secondly, we allocate the novel unbiased estimators of explained variation to specific contributions of covariates associated with fixed and random effects within a single model fit. These parameter-wise explained variations constitute easily interpretable quantities, assessing dispersion relevance of covariates associated with both fixed and random effects on a common scale, and thus allowing for a covariate ranking. Our approach is made readily available in the user-friendly R package "explainedVariation" and its usefulness is illustrated in public datasets.



Text classification to automate abstract screening using machine learning

Johannes A. Vey1, Samuel Zimmermann1, Maximilian Pilz1,2

1Institute of Medical Biometry, University of Heidelberg, Germany; 2Department Optimization, Fraunhofer Institute for Industrial Mathematics (ITWM), Germany

Systematic reviews synthesize all available evidence on a specific research question. A paramount task in this is the comprehensive literature search, which should be as extensive as possible to identify all relevant studies and reduce the risk of reporting bias. The identified studies need to be screened according to defined inclusion criteria to address the research question. As a consequence, screening the identified studies is time-consuming, resource-intensive and tedious for all researchers involved. In the first stage of this process, the title-abstract screening (TIAB), the abstracts of all initially identified studies are screened and classified regarding their inclusion in or exclusion from full-text screening. Conventionally, this is accomplished by two independent human reviewers. In recent years, there has been some research on automating the literature search and screening processes [1,2].

We present a semi-automated approach to TIAB screening using natural language processing (NLP) and machine learning (ML) based classification that was applied within a systematic review project on the reduction of surgical site infection incidence in elective colorectal resections.

In total, 4460 identified abstracts were randomly split into a training (1/3) and a test set (2/3). The titles and abstracts of the publications were processed with methods of NLP to transform the plain language into numerical matrices. Based on the processed training data, variable selection using Elastic Net regularized regression was performed. Subsequently, different ML algorithms (Elastic Net, Support Vector Machine, Random Forest, and Light Gradient Boosting Machine) were trained using 5-fold cross-validation and grid search for the respective tuning parameters. The AUC value was used as the optimization criterion and the decision of the two human reviewers was used as the reference. The algorithms were evaluated on the test set.
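
A miniature version of this pipeline (a tiny invented corpus and a plain bag-of-words representation, purely to show the mechanics; real training requires hundreds of labelled abstracts) could be written as:

```r
# Sketch: bag-of-words features plus an elastic net classifier for include/exclude decisions.
library(glmnet)

texts  <- c("randomized trial colorectal resection infection",
            "case report of rare skin condition",
            "surgical site infection prevention bundle colorectal",
            "narrative review of unrelated topic",
            "elective colorectal surgery wound infection incidence",
            "animal model of cardiac ischemia")
labels <- c(1, 0, 1, 0, 1, 0)                     # 1 = include for full-text screening

tokens <- strsplit(tolower(texts), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(vapply(tokens, function(tk) table(factor(tk, levels = vocab)),
                   numeric(length(vocab))))       # document-term matrix

fit <- glmnet(dtm, labels, family = "binomial", alpha = 0.5)   # elastic net
predict(fit, dtm, s = min(fit$lambda), type = "response")      # fitted probabilities
```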

The Random Forest showed the highest performance in the test set (AUC: 96%). Choosing a cut-off to avoid missing any relevant abstract (n=136) resulted in only 755 false positives (FP rate: 26.5%). Conversely, 2089 abstracts were correctly classified for exclusion (FN rate: 0%). Further investigations were done on the minimal number of abstracts needed to validly train the Random Forest. In our case study, the manual TIAB screening workload for the second reviewer could be reduced by about 70%.

We propose an approach where a ML model can replace one human reviewer after being trained on a sufficient number of abstracts. The second reviewer only needs to get involved in cases of discrepancies between the decision of the first reviewer and the classification model.

Our ML-based text classification approach proved to be powerful, adaptable, and it considerably reduced human workload in creating systematic reviews.

  1. O’Mara-Eves, A., Thomas, J., McNaught, J. et al. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev 4, 5 (2015).
  2. Lange, T, Schwarzer, G, Datzmann, T, Binder, H. Machine learning for identifying relevant publications in updates of systematic reviews of diagnostic test studies. Res Syn Meth. 2021; 12: 506– 515.


New formulae for the meta-analysis of odds ratios from observational studies with classed continuous exposure

Reinhard Vonthein

Universität zu Lübeck, Germany

Observational studies with continuous exposure often report dichotomous outcomes by exposure class, where exposure classes are delimited by quantiles, e.g. quartiles. The obvious meta-analysis, namely mixed logistic regression, will rely on exposure values that are representative of these classes. Such representatives could be chosen as the middle of the class in terms of exposure or of probability, like the odd octiles, which require the assumption of a distribution. They could be estimated means of that distribution truncated at the quantiles or, as a partly distribution-free method, be calculated by the "trapezoidal rule", i.e. as the expectation of a truncated distribution whose density is a straight line connecting the normal density values at the quantiles.

First, some formulae are presented to estimate moments of the latent exposure distribution from quantiles, complementing formulae published before (Wan et al. 2014). Then, formulae are given for the method-of-moments estimation of means of truncated distributions. After that, the different options to choose representative values are discussed. Finally, they are compared in a simulation study. The scenarios of the simulation study are inspired by a real application with an OR of 1.4 estimated from 12 articles, the practical problems of which will be presented at GMDS 2023. Although some articles reported clearly lognormal exposure, misspecification is included, so that the merit of the partly distribution-free method becomes clearer.

Only the trapezoidal rule and the estimation of means of truncated distributions give representatives that are confined to the class they should represent. When data were generated according to the assumed distribution, the distribution-free method had a higher standard error, but it was more robust under misspecification.
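
For the fully parametric option, the representative value of each quantile class is the mean of the correspondingly truncated distribution; a short R sketch for a normal exposure (invented mean and SD) is:

```r
# Sketch: means of a normal distribution truncated at the quartile boundaries,
# used as representative exposure values for the four classes.
trunc_norm_mean <- function(a, b, mu, sigma) {
  alpha <- (a - mu) / sigma; beta <- (b - mu) / sigma
  mu + sigma * (dnorm(alpha) - dnorm(beta)) / (pnorm(beta) - pnorm(alpha))
}

mu <- 5; sigma <- 2
q  <- qnorm(c(0, 0.25, 0.5, 0.75, 1), mu, sigma)   # quartile boundaries (incl. +-Inf)
mapply(trunc_norm_mean, a = head(q, -1), b = tail(q, -1),
       MoreArgs = list(mu = mu, sigma = sigma))
```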

Wan X, Wang W, Liu J, Tong T. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Medical Research Methodology 2014; 14:135. http://www.biomedcentral.com/1471-2288/14/135



How to statistically model biologic interactions

Carolin Malsch

University of Greifswald, Germany

Even after decades of statistical modeling, there is still no clear concept of how to assess interaction effects of a set of binary factors on a binary response variable. The reason for this seems to be a lack of clarity about how the absence of biologic interaction is modeled.

Biological interaction between two risk factors is often understood as either a deviation from additivity of the absolute effects of two (or more) factors, or non-zero coefficients for interaction terms in binary logistic regression. Both approaches are incorrect.

The mathematically adequate concept for modeling biological (non-)interaction in the given context is stochastic (in-)dependence. Hence, strategies and software recommendations provided in the literature to date are misleading and need correction.

This misunderstanding also affects how logistic regression analysis, the most common approach to model the joint effect of two or more factors on a binary response variable in health research, is conducted in practice. In most cases, only main effects are estimated in the regression function while interaction terms are omitted completely. Only sometimes is a selection of interaction terms taken into account with the aim of assessing biologic interaction.

In the binary logistic regression model, interaction terms do not reflect biological interactions in general. For example, they are inevitably needed to model stochastic, and thus biologic, independence. On the other hand, coefficients of interaction terms take the value zero when a special type of stochastic dependence is present, namely when the conditional probability under the presence of multiple factors is described properly by the logistic regression function including only main effects. This form of dependency is always of synergistic character. However, such a strong assumption is valid only in the rarest of cases.
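
A small worked example (my own numbers, using one common formalisation of independent action rather than necessarily the author's exact definition) shows why a main-effects-only logistic model cannot represent independence:

```r
# Under independent action, P(no event | A and B) =
#   P(no event | A only) * P(no event | B only) / P(no event | neither).
# The resulting joint log-odds differ from what a main-effects-only logistic model
# implies, so an interaction term is needed to represent this form of independence.
p00 <- 0.05                      # baseline risk
p10 <- 0.20; p01 <- 0.30         # risk with factor A only / factor B only
p11 <- 1 - (1 - p10) * (1 - p01) / (1 - p00)   # joint risk under independent action

logit <- function(p) log(p / (1 - p))
logit(p11)                                          # actual joint log-odds: about -0.36
logit(p00) + (logit(p10) - logit(p00)) +            # log-odds implied by a
             (logit(p01) - logit(p00))              # main-effects-only model: about 0.71
```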

Omitting interaction terms in the logistic regression model leads to severely biased estimates and easily causes misleading interpretation. This is particularly worrying given that results from studies in epidemiology, health services and public health eventually affect clinical and public health recommendations.

To resolve these problems, this contribution seeks to clarify (a) how biologic interactions in a data set are correctly assessed using stochastic (in-)dependence, (b) why interaction terms in a binary regression model must be considered in the regression function and (c) which value they take on in case of absence of biologic interaction.

The related theory is presented and demonstrated on examples. Further, other approaches to assess biologic interactions from data are critically discussed.



Alternative allocation ratios leading to more cost-efficient biosimilar trial designs

Natalia Krivtsova, Rachid El Galta, Jessie Wang, Arne Ring

Hexal AG, Germany

Objective

Balanced randomization ratios are standard in clinical trials (e.g. in a parallel two-group design, balanced randomization leads to the lowest total sample size), but they are not mandatory. The costs of some biologic treatments used as comparators in biosimilar trials are very high. Exploring different allocation ratios could lead to lower total trial costs while maintaining the same study power.

Methods

Good cost estimates for drug products and study conduct are required to initiate the calculation. We separate the per-patient study conduct costs (Cost_Conduct) from the non-patient-related costs (Cost_Non-patient), with the total cost of a Phase III trial calculated as:

Cost_total = N_Test * Cost_Test + N_Ref * Cost_Ref + (N_Test + N_Ref) * Cost_Conduct + Cost_Non-patient

When the costs of the two drug treatments (Cost_Test and Cost_Ref) are very different, the patient numbers N_Test and N_Ref can be adapted while maintaining the same study power. The objective of this research is to identify the optimal allocation ratio R = N_Test / N_Ref that leads to the smallest total cost.

In biosimilar trials, equivalence testing is performed using the two one-sided tests (TOST) procedure. In the examples below, the primary endpoint is an efficacy endpoint (i.e. the response rate) for the Phase III trial and a PK endpoint (i.e. AUCinf) for the Phase I trial, respectively.
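
The Phase III optimization can be sketched in R with a crude normal-approximation TOST sample size for two proportions (equal true rates assumed; costs in kEUR as in the example below). The numbers will not exactly reproduce the abstract's figures, but the shape of the cost curve and the location of the optimum are similar.

```r
# Sketch: total trial cost as a function of the allocation ratio R = N_test / N_ref.
tost_n_ref <- function(R, p = 0.7, margin = 0.10, alpha = 0.05, power = 0.9) {
  z <- qnorm(1 - alpha) + qnorm(1 - (1 - power) / 2)          # TOST, zero true difference
  ceiling(z^2 * (p * (1 - p) / R + p * (1 - p)) / margin^2)   # reference-arm size
}

total_cost <- function(R, cost_test = 5, cost_ref = 50,
                       cost_conduct = 40, cost_fixed = 25000) {   # kEUR
  n_ref  <- tost_n_ref(R)
  n_test <- ceiling(R * n_ref)
  n_test * cost_test + n_ref * cost_ref +
    (n_test + n_ref) * cost_conduct + cost_fixed
}

Rs    <- seq(0.5, 3, by = 0.05)
costs <- sapply(Rs, total_cost)
Rs[which.min(costs)]                     # cost-optimal allocation ratio (around 1.4)
```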

For Phase I, where there are two reference drugs and two allocation ratios, optimal allocation ratios for 3-way similarity were calculated in R by:

  • determining the required sample sizes for a range of allocation ratios using simulations,
  • evaluating the cost function for each and
  • finding the allocation ratios yielding the lowest cost.

Results

The method is illustrated using fictive, but realistic numerical examples of Phase I and Phase III trials.

Phase III trial assumptions included: power of 90%, equivalence margin of 10% with an expected response rate in the reference arm of 70%, Cost_Test = 5 kEUR, Cost_Ref = 50 kEUR, Cost_Conduct = 40 kEUR and Cost_Non-patient = 25 mEUR.

In a balanced design, 1214 subjects were required (leading to a total cost of 107 mEUR), and the optimal R = 1.4 in the cost-optimized design led to a total trial cost reduction of 2.34 mEUR. The more practical R = 1.5 (3:2 allocation ratio) required 1265 (759, 506) subjects (a 4% increase) and led to a total trial cost reduction of 2.25 mEUR (2.1%).

Phase I trial assumptions included: power of 90%, CV of 35%, Cost_Test = 3 kEUR, Cost_RefEU = 30 kEUR, Cost_RefUS = 60 kEUR, Cost_Conduct = 20 kEUR and Cost_Non-patient = 20 mEUR.

In a balanced design, 279 subjects were required (29.2 mEUR total cost); the cost-optimized design with a 3:2:2 randomization ratio required 280 subjects (120, 80, 80), just one subject more, with a total trial cost reduction of 1.07 mEUR (3.7%).

Discussion

A fixed optimized randomization ratio does not require any adjustment of the analysis methods. It is advantageous in terms of increasing the sponsor's own safety database (as the number of patients on the biosimilar drug is larger). There will be additional considerations regarding the practicality of the implementation and the acceptance by patients. For Phase III, it will also mean a slightly prolonged trial duration (due to the 4% increase in total sample size).



Exploring the spatiotemporal relationship between air pollution and meteorological conditions in Baden-Württemberg (Germany)

Leona Hoffmann1, Lorenza Gilardi2, Marie-Therese Schmitz3, Thilo Ebertseder2, Michael Bittner2, Sabine Wüst2, Matthias Schmid3, Jörn Rittweger1,4

1Institute of Aerospace Medicine, German Aerospace Center (DLR), Cologne, Germany; 2German Remote Sensing Data Center, German Aerospace Center (DLR), Oberpfaffenhofen, Germany; 3Institut of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Bonn, Germany; 4Department of Pediatrics and Adolescent Medicine, University Hospital Cologne, Cologne, Germany

Background: In the epidemiological analysis of health data as the response variable for environmental stressors, a key question is to understand the interdependencies between environmental variables and to decide which variables to include in the statistical model. Various meteorological (temperature, ultraviolet radiation, precipitation, and vapor pressure) and air pollution variables (O3, NO2, PM2.5, and PM10) are available at the daily level for Baden-Württemberg (Germany). This federal state covers both urban and rural areas.

Methods: A spatial and temporal analysis of the internal relationships is performed using a) cross-correlations, both on the grand ensemble of data and on subsets, and b) Local Indicators of Spatial Association (LISA).
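
As a pointer to the mechanics of step b), a minimal LISA computation with the spdep package might look as follows (random coordinates and a synthetic NO2 surface, not the Baden-Württemberg data):

```r
# Sketch: Local Indicators of Spatial Association via spdep::localmoran().
library(spdep)

set.seed(6)
coords <- cbind(runif(200, 0, 100), runif(200, 0, 100))   # synthetic station locations
no2    <- 20 + 0.2 * coords[, 1] + rnorm(200, sd = 3)     # spatial trend plus noise

nb   <- knn2nb(knearneigh(coords, k = 5))    # 5-nearest-neighbour graph
lw   <- nb2listw(nb, style = "W")            # row-standardised spatial weights
lisa <- localmoran(no2, lw)                  # local Moran's I for every location
head(lisa)
```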

Results: Meteorological and air pollution variables are strongly correlated between and among themselves, with specific seasonal and spatial features. For example, nitrogen dioxide and ozone are strongly interdependent, and the Pearson correlation coefficient varies with time: in January there is a negative correlation of -0.84, whereas in April the correlation coefficient is -0.47, in July 0.45, and in October -0.54. For ozone and nitrogen dioxide, a shift of the correlation direction as a function of temperature and UV radiation can be observed, confirmed by cross-correlation. Spatially, NO2, PM2.5, and PM10 concentrations are significantly higher in urban than in rural regions; for O3, this effect is reversed. This is also confirmed by the LISA analysis, in which distinct hot and cold spots of the different environmental stressors could be identified. In addition, a linear regression analysis suggests that PM10 variation is almost entirely explained by PM2.5, and vapor pressure by temperature.

Conclusion: The results found are generally compatible with the expected dependencies. Thus, our investigation demonstrates that there are variables with similar temporal and spatial characteristics that should be adequately addressed in analyses of health and environmental stressors. Simplification strategies could e.g. discard redundant variables such as PM10 when PM2.5 is available. However, a reduction to one single variable is not helpful due to the complex relationships between meteorological and air pollution variables.



Psborrow: an R package for complex, innovative trial designs using Bayesian dynamic borrowing

Matthew Secrest2, Isaac Gravestock1, Jiawen Zhu2, Herb Pang2, Daniel Sabanes Bove1

1F. Hoffman-La Roche, Switzerland; 2Genentech, South San Francisco, CA, USA

While the randomized controlled trial (RCT) comparing experimental and control arms remains the gold standard for evaluating the efficacy of a novel therapy, one may want to leverage relevant existing external control data to inform the study outcome. For instance, in certain indications such as rare diseases, it may be difficult to enroll sufficient patients to adequately power an RCT. External control data can help increase study power in these situations. External control data can also benefit trials by shortening trial duration or enabling more subjects to receive the experimental therapy. However, analysis of external control data can also introduce bias in the event that the RCT control arm and external control arm are incomparable (e.g., because of confounding, different standards of care, etc). One method for incorporating external control data that mitigates bias is Bayesian dynamic borrowing (BDB). In BDB, information from the external control arm is borrowed to the extent that the external and RCT control arms have similar outcomes.

The implementation of BDB is computationally involved and typically requires Markov chain Monte Carlo (MCMC) sampling methods. To overcome these technical barriers and accelerate the adoption of BDB, we developed the open-source R package ‘psborrow2’. The package has two main goals: First, ‘psborrow2’ provides a user-friendly interface for analyzing data with BDB without the need for the user to compile an MCMC sampler. Second, ‘psborrow2’ provides a framework for conducting simulation studies to evaluate the impact of different trial and BDB parameters (e.g., sample size, covariates) on study power, type I error, and other operating characteristics.

'psborrow2' is an open-source package hosted on GitHub (github.com/Genentech/psborrow2) which is freely available for public use. One important focus of 'psborrow2' development has been modular functions and classes to simplify the user experience and, importantly, to promote collaboration with the broader statistical community. New methods are easily incorporated into the package, and users are encouraged to consider contributing. The package is designed to be accessible to people working in clinical trial design or analysis who have some familiarity with hybrid control concepts.



Implementation science: what, why, how, and some statistical challenges

Nathalie Barbier, David Lawrence, Cormelia Dunger-Baldauf, Andrew Bean

Novartis, Switzerland

Interventions and evidence-based practices that are poorly implemented often do not produce expected health benefits.

Implementation science (IS) is the scientific study of methods and strategies that facilitate the uptake of evidence-based practice and research into regular use by practitioners and policymakers. The goal of IS is not to establish the health impact of a clinical innovation e.g. a new treatment, but rather to identify and address the factors that affect its uptake into routine use.

Different non-standard study designs are often used to provide evidence on the impact of strategies to improve uptake. This poster will outline some of these designs and the statistical challenges they may entail, for example if no control group is used, as well as causal inference issues and challenges with missing values.



Directed acyclic graphs for preclinical research

Collazo Anja

Berlin Institute of Health, Germany

A central goal of preclinical trials is to reduce the uncertainty about the direction and magnitude of an effect that is causally attributed to an exposure or intervention of interest. Yet, without high internal validity in estimating outcomes, the experimental results either remain scientifically inconclusive, wasting laboratory animals and resources, or lead to erroneous decision-making, potentially harming patients. Thus, experiments with low scientific rigor raise major ethical concerns. Directed acyclic graphs (DAGs) are visual representations of simplified, key components of a hypothesized causal structure and are widely used in observational epidemiological research to clarify types and mechanisms of bias. Here, we introduce directed acyclic graphs as a tool to improve bias detection in preclinical experiments and argue that causal models can help scientific communication and transparency in preclinical research. We present examples of biased effect estimates from preclinical research expressed in DAGs, such as confounding and collider stratification. We show how DAGs can inform the choice of study design, the selection of the smallest subset of variables for measurement, and the analysis strategy.
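
As a small illustration of how such a hypothesized structure can be encoded and queried (generic variable names invented for this sketch, using the dagitty R package):

```r
# Sketch: a preclinical-style DAG with a confounder and a collider.
library(dagitty)

g <- dagitty("dag {
  Treatment -> Outcome
  LitterEffects -> Treatment
  LitterEffects -> Outcome
  Treatment -> BodyWeight
  Outcome -> BodyWeight
}")

# minimal adjustment set for the Treatment -> Outcome effect: { LitterEffects };
# conditioning on BodyWeight (a collider) would instead open a biasing path
adjustmentSets(g, exposure = "Treatment", outcome = "Outcome")
impliedConditionalIndependencies(g)
```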



 