Multi-armed bandit with endogenous learning and queueing: An application to split liver transplantation
Yanhan Tang, Andrew Li, Alan Scheller-Wolf, Sridhar Tayur
Carnegie Mellon University, United States of America
We enhance the multi-armed bandit model by considering endogenously non-stationary rewards: specifically, rewards that are parametric functions of policy histories (learning). We further incorporate queueing costs, fairness, and arm correlation. We propose the L-UCB, FL-UCB, and QFL-UCB algorithms to solve our model, prove their logarithmic regret, and apply them to split liver transplantation.
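As an illustrative sketch only (the abstract does not specify the L-UCB update rule), the core idea of endogenously non-stationary rewards can be toyed with by making each arm's mean reward a function of that arm's own pull count, inside an otherwise standard UCB loop. All function names and the reward model below are assumptions for illustration, not the authors' algorithm.

```python
import math
import random

def ucb_with_learning(reward_fns, horizon, seed=0):
    """Toy UCB bandit where arm a's mean reward is reward_fns[a](n),
    with n the number of times arm a has been pulled so far -- a stand-in
    for rewards that are parametric functions of the policy history."""
    rng = random.Random(seed)
    k = len(reward_fns)
    counts = [0] * k   # pulls per arm
    means = [0.0] * k  # running empirical means
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:   # initialize: pull each arm once
            arm = t - 1
        else:        # standard UCB1 index on the (drifting) empirical means
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        # endogenous non-stationarity: reward depends on this arm's pull count
        r = reward_fns[arm](counts[arm]) + rng.gauss(0, 0.1)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        total += r
    return counts, total

# Example: arm 1 starts worse but improves with practice (a learning curve).
counts, total = ucb_with_learning(
    [lambda n: 0.5, lambda n: min(0.9, 0.3 + 0.02 * n)], horizon=500)
```

Note that plain UCB1 is not designed for such drifting rewards, which is precisely the gap the proposed L-UCB family addresses; the sketch only shows the setting, not the fix.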
Bandits with Time-to-Event Outcomes
Arielle Elissa Anderer1, John Silberholz2, Hamsa Bastani1
1The Wharton School, University of Pennsylvania, United States of America; 2University of Michigan, United States of America
We adapt online learning techniques to scenarios with time-to-event data, where there is a delay between choosing an arm and observing feedback, and that delay is endogenous to the quality of the arm. We propose a multi-armed bandit algorithm with a Cox proportional hazards estimator, prove guarantees on the regret under this algorithm, and analyze its performance on a dataset of metastatic breast cancer clinical trials, comparing it to that of other adaptive allocation schemes.
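A minimal sketch of the delayed, endogenous-feedback setting (omitting the Cox proportional hazards machinery and patient covariates, which are the paper's actual contribution): each pull starts an exponential "event clock," and the observed event time only becomes available to the estimator after that delay. The index policy and rate parameters below are illustrative assumptions.

```python
import heapq
import math
import random

def delayed_feedback_bandit(rates, horizon, seed=0):
    """Toy bandit with time-to-event feedback. Pulling arm a draws an
    event time x ~ Exp(rates[a]); the observation arrives only at time
    t + x, so better arms (longer survival) also delay their own feedback."""
    rng = random.Random(seed)
    k = len(rates)
    counts = [0] * k    # completed (observed) events per arm
    sums = [0.0] * k    # sum of observed event times per arm
    pending = []        # min-heap of (arrival_time, arm, event_time)
    for t in range(horizon):
        # incorporate all feedback that has arrived by time t
        while pending and pending[0][0] <= t:
            _, a, x = heapq.heappop(pending)
            counts[a] += 1
            sums[a] += x
        # optimistic index on mean event time (unobserved arms first)
        def index(a):
            if counts[a] == 0:
                return float('inf')
            return sums[a] / counts[a] + math.sqrt(2 * math.log(t + 1) / counts[a])
        arm = max(range(k), key=index)
        x = rng.expovariate(rates[arm])
        heapq.heappush(pending, (t + x, arm, x))
    return counts

completed = delayed_feedback_bandit([1.0, 0.2], horizon=500)
```

The key difficulty the sketch makes visible: at any decision time, the best arm tends to have the most still-censored observations, which is why a censoring-aware estimator such as Cox proportional hazards is needed rather than naive sample means.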
Targeted interventions for TB treatment adherence via reinforcement learning
Jackie Baek1, Justin Boutilier2, Vivek Farias1, Jonas Oddur Jonasson1
1Massachusetts Institute of Technology; 2University of Wisconsin-Madison
Lack of treatment adherence is a significant barrier to reducing the global disease burden of tuberculosis (TB). We study the design of targeted interventions for a treatment adherence support platform running in Kenya, whose goal is to help patients on TB treatment. We show empirically that there is substantial heterogeneity in the treatment effects of interventions, and we devise a novel online learning policy based on Thompson Sampling that significantly outperforms the currently employed policy.
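For reference, the basic posterior-sampling loop underlying Thompson Sampling can be sketched with a Beta-Bernoulli model, where each "arm" is an intervention type and the binary reward is whether the patient adhered afterwards. The per-patient heterogeneity that the paper exploits is omitted here; the success probabilities are illustrative assumptions.

```python
import random

def thompson_sampling(success_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling: sample a success rate from each
    arm's posterior, play the arm with the highest sample, then update
    that arm's Beta(alpha, beta) posterior with the binary outcome."""
    rng = random.Random(seed)
    k = len(success_probs)
    alpha = [1] * k  # Beta prior: 1 pseudo-success per arm
    beta = [1] * k   # and 1 pseudo-failure per arm
    successes = 0
    for _ in range(horizon):
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        if rng.random() < success_probs[arm]:  # simulated adherence outcome
            alpha[arm] += 1
            successes += 1
        else:
            beta[arm] += 1
    return successes
```

Because exploration arises from posterior sampling rather than explicit bonuses, Thompson Sampling extends naturally to the contextual models needed to capture heterogeneous treatment effects across patients.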