# Superposition of many models into one


Brian Cheung (Redwood Center, BAIR, UC Berkeley) bcheung@berkeley.edu · Alex Terekhov (Redwood Center, UC Berkeley) aterekhov@berkeley.edu · Yubei Chen (Redwood Center, BAIR, UC Berkeley) yubeic@berkeley.edu · Pulkit Agrawal (BAIR, UC Berkeley) pulkitag@berkeley.edu · Bruno Olshausen (Redwood Center, BAIR, UC Berkeley) baolshausen@berkeley.edu

## Abstract

We present a method for storing multiple models within a single set of parameters. Models can coexist in superposition and still be retrieved individually. In experiments with neural networks, we show that a surprisingly large number of models can be effectively stored within a single parameter instance. Furthermore, each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition. This approach may be viewed as the online complement of compression: rather than reducing the size of a network after training, we make use of the unrealized capacity of a network during training.

## 1 Introduction

While connectionist models have enjoyed a resurgence of interest in the artificial intelligence community, it is well known that deep neural networks are over-parameterized and a majority of the weights can be pruned after training [7, 20, 3, 8, 9, 1]. Such pruned neural networks achieve accuracies similar to the original network but with far fewer parameters. However, it has not been possible to exploit this redundancy to train a neural network with fewer parameters from scratch to achieve accuracies similar to its over-parameterized counterpart.

In this work we show that it is possible to partially exploit the excess capacity present in neural network models during training by learning multiple tasks. Suppose that a neural network with L parameters achieves the desired accuracy on a single task. We outline a method for training a single neural network with L parameters to simultaneously perform K different tasks, thereby effectively requiring O(L/K) parameters per task. While we learn a separate set of parameters W(k), k ∈ {1, ..., K}, for each of the K tasks, these parameters are stored in superposition with each other, thus requiring approximately the same number of parameters as a model for a single task. The task-specific models can be accessed using task-specific "context" information C(k) that dynamically "routes" an input towards a specific model retrieved from this superposition. The model parameters W can therefore be thought of as a "memory" and the contexts C(k) as "keys" that are used to access the specific parameters W(k) required for a task. Such an interpretation is inspired by Kanerva's work on hetero-associative memory [4].

Because the parameters for different tasks exist in superposition with each other and are constantly changing during training, it is possible that these individual parameters interfere with each other and thereby cause a loss in performance on individual tasks. We show that under the mild assumption that the input data is intrinsically low-dimensional relative to its ambient space (e.g. natural images lie on a much lower-dimensional subspace than their representation as individual pixels with RGB values), it is possible to choose contexts that minimize such interference. The proposed method has wide-ranging applications, such as training a neural network in memory-constrained environments, online learning of multiple tasks, and overcoming catastrophic forgetting.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Left: parameters for different models w(1), w(2) and w(3) for different tasks are stored in superposition with each other in w. Right: to prevent interference between (A) a similar set of parameter vectors w(s), s ∈ {1, 2, 3}, we (B) store these parameters after rotating the weights into nearly orthogonal parts of the space using task-dependent context information C⁻¹(s). An appropriate choice of C(s) ensures that we can (C) retrieve w(k) by the operation w C(k) in a manner such that the w(s), for s ≠ k, remain nearly orthogonal, reducing interference during learning.

**Application to Catastrophic Forgetting:** Online learning and sequential training of multiple tasks have traditionally posed a challenge for neural networks. If the distribution of inputs (e.g. changes in appearance from day to night) or the distribution of output labels (e.g. changes in the task) shifts over time, then training on the most recent data leads to poor performance on data encountered earlier. This problem is known as catastrophic forgetting [12, 15, 2]. One way to deal with this issue is to maintain a memory of all the data and train using batches constructed by uniformly and randomly sampling data from this memory (replay buffers [14]). However, in memory-constrained settings this solution is not viable. Some works train a separate network (or sub-parts of a network) for each separate task [17, 19, 11]. Another strategy is to selectively update weights that do not play a critical role on previous tasks, using a variety of criteria such as the Fisher information between tasks [5], learning an attention mask to decide which weights to change [10, 18], and other criteria [22]. However, these methods prevent re-use of weights in the future, thereby intrinsically limiting the capacity of the network to learn future tasks, and they increase computational cost. Furthermore, for every new task, one additional variable per weight parameter must be stored, indicating whether that weight can be modified in the future (i.e. L new parameters per task).

We propose a radically different way of using the same set of parameters in a neural network to perform multiple tasks. We store the weights for different tasks in superposition with each other and do not explicitly constrain how any specific weight parameter changes within the superposition. Furthermore, we need to store substantially fewer additional variables per new task (one additional variable per task for one variant of our method; Section 2.1). We demonstrate the efficacy of our approach of learning via parameter superposition in two separate online image-classification settings: (a) a time-varying input data distribution and (b) a time-varying output label distribution. With parameter superposition, it is possible to overcome catastrophic forgetting on the permuting MNIST task [5], on continuously changing input distributions in the rotating MNIST and rotating fashion MNIST tasks, and on changing output labels in the incremental CIFAR dataset [16].

## 2 Parameter Superposition

The intuition behind Parameter Superposition (PSP) as a method to store many models simultaneously in one set of parameters stems from analyzing the fundamental operation performed in all neural networks: multiplying the inputs x ∈ ℝ^N by a weight matrix W ∈ ℝ^(M×N) to compute features (y = Wx). Over-parameterization of a network essentially implies that only a small subspace spanned by the rows of W in ℝ^N is relevant for the task. Let W_1, W_2, ..., W_K be the sets of parameters required for each of the K tasks. If only a small subspace in ℝ^N is required by each W_k, it should be possible
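The store-and-retrieve mechanism sketched in Figure 1 can be illustrated with random ±1 contexts, one of the context choices a binding scheme like this admits. The sketch below is ours, not code from the paper: each task's weight vector is bound to its context by an element-wise product before summing, and multiplying the summed vector by a task's context (its own inverse, since c·c = 1) recovers that task's weights plus residual interference from the other tasks. The variable names and the output-error check are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 10_000, 5  # parameter dimension, number of tasks

# One weight vector per task and one random +/-1 context per task.
weights = rng.standard_normal((K, N))
contexts = rng.choice([-1.0, 1.0], size=(K, N))

# Store: bind each task's weights to its context (element-wise product)
# and sum everything into a single parameter vector w.
w = (weights * contexts).sum(axis=0)

# Retrieve task k: a +/-1 context is its own inverse, so multiplying by
# contexts[k] recovers weights[k] plus the other tasks' weights scrambled
# by mismatched contexts.
k = 2
w_hat = w * contexts[k]

# The residual behaves like zero-mean noise: on an input aligned with the
# task's weights (a stand-in for data the model actually fits), the
# retrieved model's output stays close to the true model's output.
x = weights[k]
y_true = weights[k] @ x
y_hat = w_hat @ x
rel_err = abs(y_hat - y_true) / abs(y_true)
print(f"relative output error: {rel_err:.3f}")  # small, roughly sqrt(K/N)
```

Raw parameter recovery is noisy, but the error in the computed feature shrinks as N grows relative to K, which is the sense in which the superposed models avoid interfering with each other.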
