Active Learning for Stochastic Contextual Linear Bandits
Artificial Intelligence and Statistics (AISTATS) | NeurIPS Workshop on Aligning Reinforcement Learning Experimentalists and Theorists | Oral at Stanford Causal Science Conference, 2026
Prior algorithms for stochastic contextual bandits strategically sample actions but naively (passively) sample contexts from the underlying context distribution. However, in many practical societal scenarios, including survey research, clinical trials, and education, practitioners can actively sample or recruit contexts based on prior knowledge about them. Despite this potential for active learning in healthcare and education settings, strategic context sampling in stochastic contextual bandits is underexplored. We propose an algorithm that learns a near-optimal policy by strategically sampling rewards of context-action pairs. We prove that it enjoys improved instance-dependent guarantees and demonstrate empirically that it reduces the number of samples needed to learn a near-optimal policy for personalized blood-thinner dosage recommendations.