# Stochastic Bandits with Context Distributions


Johannes Kirschner, Department of Computer Science, ETH Zurich (jkirschner@inf.ethz.ch)
Andreas Krause, Department of Computer Science, ETH Zurich (krausea@ethz.ch)

## Abstract

We introduce a stochastic contextual bandit model where at each time step the environment chooses a distribution over a context set and samples the context from this distribution. The learner observes only the context distribution, while the exact context realization remains hidden. This allows for a broad range of applications where the context is stochastic or where the learner needs to predict the context. We adapt the UCB algorithm to this setting and show that it achieves an order-optimal high-probability bound on the cumulative regret for linear and kernelized reward functions. Our results strictly generalize previous work in the sense that both our model and the algorithm reduce to the standard setting when the environment chooses only Dirac delta distributions and therefore provides the exact context to the learner. We further analyze a variant where the learner observes the realized context after choosing the action. Finally, we demonstrate the proposed method on synthetic and real-world datasets.

## 1 Introduction

In the contextual bandit model, a learner interacts with an environment over several rounds. At the beginning of each round, the environment provides a context, and in turn the learner chooses an action, which leads to an a priori unknown reward. The learner's goal is to choose actions that maximize the cumulative reward and, eventually, to compete with the best mapping from context observations to actions. This model creates a dilemma of exploration and exploitation, as the learner needs to balance exploratory actions to estimate the environment's reward function against exploitative actions that maximize the total return. Contextual bandit algorithms have been used successfully in many applications, including online advertising, recommender systems, and experimental design.

The contextual bandit model, as usually studied in the literature, assumes that the context is observed exactly. This is not always the case in applications, for instance when the context is itself a noisy measurement or the output of a forecasting mechanism; an example of such a context is a weather or stock-market prediction. In other cases, such as recommender systems, privacy constraints can restrict access to certain user features, but we may instead be able to infer a distribution over them. To allow for uncertainty in the context, we consider a setting where the environment provides a distribution over the context set. The exact context is assumed to be a sample from this distribution, but it remains hidden from the learner. Such a model, to the best of our knowledge, has not been discussed in the literature before. Not knowing the context realization makes the learning problem more difficult, because the learner needs to estimate the reward function from noisy observations and without knowing the exact context that generated the reward. Our setting recovers the classical contextual bandit setting when the context distribution is a Dirac delta distribution. We also analyze a natural variant of the problem where the exact context is observed after the player has chosen the action. This allows for different applications, where at the time of decision the context needs to be predicted (e.g. weather conditions), but once the reward is obtained, the exact context can be measured.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We focus on the setting where the reward function is linear in terms of action-context feature vectors. For this case, we leverage the UCB algorithm on a specifically designed bandit instance without feature uncertainty to recover an $O(d\sqrt{T})$ high-probability bound on the cumulative regret. Our analysis includes a practical variant of the algorithm that requires only sampling access to the context distributions provided by the environment. We also extend our results to the kernelized setting, where the reward function is contained in a known reproducing kernel Hilbert space (RKHS).

For this case, we highlight an interesting connection to distributional risk minimization, and we show that the natural estimator for the reward function is based on so-called kernel mean embeddings. We discuss related work in Section 6.

## 2 Stochastic Bandits with Context Distributions

We formally define the setting of stochastic bandits with context distributions, as outlined in the introduction. Let $\mathcal{X}$ be a set of actions and $\mathcal{C}$ a context set. The environment is defined by a fixed but unknown reward function $f : \mathcal{X} \times \mathcal{C} \to \mathbb{R}$. At iteration $t \in \mathbb{N}$, the environment chooses a distribution $\mu_t \in \mathcal{P}(\mathcal{C})$ over the context set and samples a context realization $c_t \sim \mu_t$. The learner observes only $\mu_t$ but not $c_t$, and then chooses an action $x_t \in \mathcal{X}$. We allow an adaptive adversary to choose the context distribution, that is, $\mu_t$ may depend in an arbitrary way on the previous choices of the learner up to time $t$. Given the learner's choice $x_t$, the environment provides a reward $y_t = f(x_t, c_t) + \epsilon_t$, where $\epsilon_t$ is $\sigma$-subgaussian, additive noise. The learner's goal is to maximize the cumulative reward $\sum_{t=1}^{T} f(x_t, c_t)$, or equivalently, to minimize the cumulative regret

$$R_T = \sum_{t=1}^{T} f(x_t^*, c_t) - f(x_t, c_t) \qquad (1)$$

where $x_t^* = \arg\max_{x \in \mathcal{X}} \mathbb{E}_{c \sim \mu_t}[f(x, c)]$ is the best action provided that we know $f$ and $\mu_t$, but not $c_t$. Note that this way we compete with the best possible mapping $\psi : \mathcal{P}(\mathcal{C}) \to \mathcal{X}$ from the observed context distribution to actions, which maximizes the expected reward $\sum_{t=1}^{T} \mathbb{E}_{c_t \sim \mu_t}\big[f(\psi(\mu_t), c_t) \,\big|\, \mathcal{F}_{t-1}, \mu_t\big]$, where $\mathcal{F}_t = \{(x_s, \mu_s, y_s)\}_{s=1}^{t}$ is the filtration that contains all information available at the end of round $t$.
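Since the benchmark action $x_t^*$ maximizes an expectation over the context distribution, it can be approximated with only sampling access to $\mu_t$. The following sketch is our own illustration (the function names are hypothetical, not from the paper): it estimates $\mathbb{E}_{c \sim \mu}[f(x, c)]$ by Monte Carlo and picks the maximizing action.

```python
import random

def best_action_from_samples(actions, f, sample_context, n_samples=1000):
    """Approximate x* = argmax_x E_{c ~ mu}[f(x, c)] given only samples from mu."""
    def mean_reward(x):
        return sum(f(x, sample_context()) for _ in range(n_samples)) / n_samples
    return max(actions, key=mean_reward)

# Toy instance: binary context with mu = Bernoulli(0.6),
# f(0, c) = c and f(1, c) = 1 - c.
random.seed(0)
f = lambda x, c: c if x == 0 else 1 - c
sample = lambda: 1 if random.random() < 0.6 else 0
# Action 0 has expected reward 0.6, action 1 only 0.4.
print(best_action_from_samples([0, 1], f, sample))
```

With enough samples, the Monte Carlo estimate separates the two expected rewards reliably, mirroring the "sampling access" variant of the algorithm mentioned in the introduction.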

It is natural to ask whether it is possible to compete with the stronger baseline that chooses actions given the context realization $c_t$, i.e. $x_t^* = \arg\max_{x \in \mathcal{X}} f(x, c_t)$. While this can be possible in special cases, a simple example shows that in general the learner would suffer $\Omega(T)$ regret. In particular, assume that $c_t \sim \mathrm{Bernoulli}(0.6)$ and $\mathcal{X} = \{0, 1\}$, and let $f(0, c) = c$ and $f(1, c) = 1 - c$. Clearly, any policy that does not know the realizations $c_t$ must incur $\Omega(T)$ regret when competing against this stronger baseline.
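To make the gap concrete, the following sketch (our illustration, not from the paper) simulates this Bernoulli instance. The distribution-optimal policy always plays $x = 0$ (expected reward $0.6$), while the hindsight baseline that sees $c_t$ earns reward $1$ every round, so the regret grows linearly at rate about $0.4$ per round.

```python
import random

def f(x, c):
    # Reward function from the example: f(0, c) = c, f(1, c) = 1 - c.
    return c if x == 0 else 1 - c

random.seed(0)
T = 10_000
regret = 0.0
for _ in range(T):
    c = 1 if random.random() < 0.6 else 0  # c_t ~ Bernoulli(0.6)
    # Best action knowing only the distribution: E[f(0,c)] = 0.6 > E[f(1,c)] = 0.4.
    x_dist = 0
    # Hindsight baseline observes c_t and picks the better arm (reward 1).
    x_hind = 0 if c == 1 else 1
    regret += f(x_hind, c) - f(x_dist, c)

print(regret / T)  # per-round gap, close to 0.4
```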

From now on, we focus on linearly parameterized reward functions $f(x, c) = \langle \phi_{x,c}, \theta \rangle$ with given feature vectors $\phi_{x,c} \in \mathbb{R}^d$ for $x \in \mathcal{X}$ and $c \in \mathcal{C}$, and an unknown parameter $\theta \in \mathbb{R}^d$. This setup is commonly referred to as the linear bandit setting. For the analysis we require the standard boundedness assumptions $\|\phi_{x,c}\|_2 \le 1$ and $\|\theta\|_2 \le 1$, where the bounds are set to 1 for the sake of simplicity. In Section 4.2, we further consider a variant of the problem where the learner observes $c_t$ after taking the decision $x_t$. This simplifies the estimation problem, because we have data $\{(x_t, c_t, y_t)\}$ with the exact context $c_t$ available, just as in the standard setting. The exploration problem, however, remains subtle, since at the time of decision the learner still knows only $\mu_t$ and not $c_t$.
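In the observed-context variant, the history $\{(x_t, c_t, y_t)\}$ contains exact features, so $\theta$ can be estimated by standard regularized least squares. The sketch below is our own illustration of that estimator under the boundedness assumptions above, on simulated data (not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 500
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)  # enforce ||theta||_2 <= 1

# Simulated history: phi_{x_t, c_t} becomes known once c_t is observed.
Phi = rng.normal(size=(T, d)) / np.sqrt(d)   # features with norm roughly 1
y = Phi @ theta + 0.1 * rng.normal(size=T)   # y_t = <phi, theta> + noise

# Ridge estimate: theta_hat = (Phi^T Phi + lam * I)^{-1} Phi^T y
lam = 1.0
theta_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
print(np.linalg.norm(theta_hat - theta))  # small estimation error
```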

In Section 4.3, we extend our algorithm and analysis to kernelized bandits, where $f \in \mathcal{H}$ is contained in a reproducing kernel Hilbert space $\mathcal{H}$.

## 3 Background

We briefly review standard results from the linear contextual bandit literature and the upper
