# Strategizing against No-regret Learners


Yuan Deng (Duke University, ericdy@cs.duke.edu) · Jon Schneider (Google Research) · Balasubramanian Sivan (Google Research)

**Abstract.** How should a player who repeatedly plays a game against a no-regret learner strategize to maximize his utility? We study this question and show that under some mild assumptions, the player can always guarantee himself a utility of at least what he would get in a Stackelberg equilibrium of the game. When the no-regret learner has only two actions, we show that the player cannot get any higher utility than the Stackelberg equilibrium utility. But when the no-regret learner has more than two actions and plays a mean-based no-regret strategy, we show that the player can get strictly more than the Stackelberg equilibrium utility. We provide a characterization of the optimal game-play for the player against a mean-based no-regret learner as a solution to a control problem. When the no-regret learner's strategy also guarantees him no swap regret, we show that the player cannot get anything higher than the Stackelberg equilibrium utility.

## 1 Introduction

Consider a two-player bimatrix game with a finite number of actions for each player, repeated over T rounds. When playing a repeated game, a widely adopted strategy is to employ a no-regret learning algorithm: a strategy that guarantees the player that, in hindsight, no single action played throughout the game would have performed significantly better. Knowing that one of the players (the learner) is playing a no-regret learning strategy, what is the optimal gameplay for the other player (the optimizer)? This question is the focus of our work. If this were a single-shot strategic game where learning is not relevant, a (pure or mixed strategy) Nash equilibrium is a reasonable prediction of the game's outcome. In the T-round game with learning, can the optimizer guarantee himself a per-round utility of at least what he could get in the single-shot game? Is it possible to get significantly more utility than this? Does this utility depend on the learner's specific choice of learning algorithm? What gameplay should the optimizer adopt to achieve maximal utility? None of these questions are straightforward, and indeed none of them have unconditional answers.

**Our results.** Central to our results is the idea of the Stackelberg equilibrium of the underlying game. The Stackelberg variant of our game is a single-shot two-stage game where the optimizer is the first player and can publicly commit to a mixed strategy; the learner then best responds to this strategy. The Stackelberg equilibrium is the resulting equilibrium of this game when both players play optimally. Note

that the optimizer's utility in the Stackelberg equilibrium is always weakly larger than his utility in any (pure or mixed strategy) Nash equilibrium, and is often strictly larger. Let V be the utility of the optimizer in the Stackelberg equilibrium. With some mild assumptions on the game, we show that the optimizer can always guarantee himself a utility of at least (V - ε)T - o(T) in T rounds for any ε > 0, irrespective of the learning algorithm used by the learner, as long as it has the no-regret guarantee (Theorem 4). This means that if one of the players is a learner, the other player can already profit over the Nash equilibrium regardless of the specifics of the learning algorithm employed or the structure of the game.

*33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.*

Further, if any one of the following conditions is true:

1. the game is a constant-sum game,
2. the learner's no-regret algorithm has the stronger guarantee of no swap regret (see Section 2),
3. the learner has only two possible actions in the game,

then the optimizer cannot get a utility higher than VT + o(T) (see Theorems 5, 6, and 7). If the learner employs a learning algorithm from a natural class of algorithms called mean-based learning algorithms [Braverman et al., 2018] (see Section 2), which includes popular no-regret algorithms such as the Multiplicative Weights algorithm, the Follow-the-Perturbed-Leader algorithm, and the EXP3 algorithm, we show that there exist games where the optimizer can guarantee himself a utility of V'T - o(T) for some V' > V (see Theorem 8). We note the contrast between the cases of 2 and 3 actions for the learner: in the 2-action case, even if the learner plays a mean-based strategy, the optimizer cannot get anything more than VT + o(T) (Theorem 7),

whereas with 3 actions, there are games where he is able to guarantee a linearly higher utility. Given this possibility of exceeding the Stackelberg utility, our final result is on the nature and structure of the utility-optimal gameplay for the optimizer against a learner that employs a mean-based strategy. First, we give a crisp characterization of the optimizer's asymptotically optimal algorithm as the solution to a control problem (see Section 4.2) in N dimensions, where N is the number of actions for the learner. This characterization is predicated on the fact that, for a mean-based learner, just knowing the cumulative historical utilities of each of the learner's actions is essentially enough information to accurately predict the learner's next action. These N cumulative utilities thus form an N-dimensional "state" for the learner, which the optimizer can manipulate via his choice of action. We then proceed to make multiple observations that simplify the solution space for this control problem. We leave computing or characterizing the optimal solution to this control problem as a very interesting open question, and we further provide one conjecture of a potential characterization.
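To make the state dynamics concrete, the following is a minimal simulation of the N-dimensional "state" described above: the learner's cumulative utility vector, which the optimizer's action choices steer. The payoff matrices and the follow-the-leader learner here are hypothetical illustrations, not the paper's construction.

```python
import numpy as np

# Hypothetical 2x3 bimatrix game (rows: optimizer actions, cols: learner actions).
# These payoff matrices are illustrative only.
U_opt = np.array([[1.0, 0.0, 0.3],
                  [0.0, 1.0, 0.6]])   # optimizer's utility
U_lrn = np.array([[0.2, 0.9, 0.5],
                  [0.8, 0.1, 0.5]])   # learner's utility

T = 1000
# N-dimensional state: cumulative historical utility of each learner action.
state = np.zeros(U_lrn.shape[1])
total_opt_utility = 0.0

rng = np.random.default_rng(0)
for t in range(T):
    # A simple mean-based learner (follow-the-leader) plays an action whose
    # cumulative historical utility is maximal.
    learner_action = int(np.argmax(state))
    # The optimizer picks an action; this choice moves the state vector.
    optimizer_action = int(rng.integers(U_opt.shape[0]))
    total_opt_utility += U_opt[optimizer_action, learner_action]
    # Every learner action's counterfactual utility accrues (full information).
    state += U_lrn[optimizer_action, :]

print(state, total_opt_utility)
```

The optimizer's control problem is then to choose the sequence of actions (here drawn at random for illustration) that steers this state trajectory so as to maximize his own cumulative utility.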

**Comparison to prior work.** The very recent work of Braverman et al. [2018] is the closest to ours. They study the specific 2-player game of an auction between a single seller and a single buyer. The main difference from Braverman et al. [2018] is that they consider a Bayesian setting where the buyer's type is drawn from a distribution, whereas there is no Bayesian element in our setting. Beyond that, the seller's choice of auction represents his action, and the buyer's bid represents her action. They show that regardless of the specific algorithm used by the buyer, as long as the buyer plays a no-regret learning algorithm, the seller can always earn at least the optimal revenue of a single-shot auction. Our Theorem 4 is a direct generalization of this result to arbitrary games without any structure. Further, Braverman et al. [2018] show that there exist no-regret strategies for the buyer that guarantee that the seller cannot get anything better than the single-shot optimal revenue. Our Theorems 5, 6, and 7 are both generalizations and refinements of this result, as they pinpoint both the exact learner strategies and the kinds of games that prevent the optimizer from going beyond the Stackelberg utility. Braverman et al. [2018] show that when the buyer plays a mean-based strategy, the seller can design an auction that guarantees him revenue beyond the per-round optimal auction revenue. Our control problem can be seen as a rough parallel and generalization of this result.

**Other related work.** The first no
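For reference, the Multiplicative Weights algorithm, named above as a canonical mean-based no-regret algorithm, can be sketched as follows. The reward sequence is a hypothetical stand-in for the utilities induced by the optimizer's play; note that the action distribution depends on history only through each action's cumulative utility, which is the mean-based property.

```python
import numpy as np

def multiplicative_weights(rewards, eta=0.05):
    """Run Multiplicative Weights over a (T, N) array of per-action rewards in [0, 1].

    The play distribution is a function of cumulative rewards alone,
    illustrating the mean-based property; regret against the best fixed
    action grows sublinearly in T for a suitable step size eta.
    """
    T, N = rewards.shape
    cumulative = np.zeros(N)
    realized = 0.0
    for t in range(T):
        # Exponential weighting of cumulative utilities (shifted for stability).
        weights = np.exp(eta * (cumulative - cumulative.max()))
        probs = weights / weights.sum()
        realized += probs @ rewards[t]   # expected reward this round
        cumulative += rewards[t]         # full-information update
    regret = cumulative.max() - realized  # regret vs. best fixed action
    return realized, regret

rng = np.random.default_rng(1)
rewards = rng.uniform(size=(2000, 3))
realized, regret = multiplicative_weights(rewards)
print(realized, regret / 2000)  # per-round regret shrinks as T grows
```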
