ADAPTIVE CRITIC FOR SIGMA-PI NETWORKS

Citation
R. S. Neville and T. J. Stonham, ADAPTIVE CRITIC FOR SIGMA-PI NETWORKS, Neural Networks, 9(4), 1996, pp. 603-625
Citations number
52
Subject Categories
Mathematical Methods, Biology & Medicine; Computer Sciences, Special Topics; Computer Science, Artificial Intelligence; Neurosciences; Physics, Applied
Journal title
Neural Networks
ISSN journal
08936080
Volume
9
Issue
4
Year of publication
1996
Pages
603 - 625
Database
ISI
SICI code
0893-6080(1996)9:4<603:ACFSN>2.0.ZU;2-3
Abstract
This article presents an investigation which studied how training of sigma-pi networks with the associative reward-penalty (A(R-P)) regime may be enhanced by using two networks in parallel. The technique uses what has been termed an unsupervised "adaptive critic element" (ACE) to give critical advice to the supervised sigma-pi network. We utilise the conventions that the sigma-pi neuron model uses (i.e., quantisation of variables) to obtain an implementation we term the "quantised adaptive critic", which is hardware realisable. The associative reward-penalty training regime either rewards, r = 1, the neural network by incrementing the weights of the net by a delta term times a learning rate, alpha, or penalises, r = 0, the neural network by decrementing the weights by an inverse delta term times the product of the learning rate and a penalty coefficient, alpha x lambda(rp). Our initial research, utilising a "bounded" reward signal, r is an element of {0,...,1}, found that the critic provides advisory information to the sigma-pi net which augments its training efficiency. This led us to develop an extension to the adaptive critic and associative reward-penalty methodologies, utilising an "unbounded" reward signal, r is an element of {-1,...,2}, which permits penalisation of a net even when the penalty coefficient, lambda(rp), is set to zero, lambda(rp) = 0. One should note that with the standard associative reward-penalty methodology the net is normally only penalised if the penalty coefficient is non-zero (i.e., 0 < lambda(rp) <= 1). One of the enigmas of associative reward-penalty (A(R-P)) training is that it broadcasts sparse information, in the form of an instantaneous binary reward signal, that is only dependent on the present output error. Here we put forward ACE and A(R-P) methodologies for sigma-pi nets, which are based on tracing the frequency of "stimuli" occurrence, and then using this to derive a prediction of the reinforcement. The predictions are then used to derive a reinforcement signal which uses temporal information. Hence one may use more precise information to enable more efficient training. Copyright (C) 1996 Elsevier Science Ltd
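To make the training scheme sketched in the abstract concrete, the following is a minimal illustrative sketch in Python of a classical associative reward-penalty unit paired with an ACE-style critic that predicts reinforcement from stimulus traces. It follows the generic Barto-style A(R-P) and ACE formulations rather than the paper's quantised sigma-pi implementation; the names (ARPUnit, ACE, alpha, lambda_rp, beta, gamma, decay) and the specific update forms are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)

class ARPUnit:
    """Stochastic binary unit trained with an associative reward-penalty rule
    (generic form; the paper's quantised sigma-pi units are not reproduced)."""
    def __init__(self, n_inputs, alpha=0.1, lambda_rp=0.01):
        self.w = np.zeros(n_inputs)
        self.alpha = alpha          # learning rate (alpha in the abstract)
        self.lambda_rp = lambda_rp  # penalty coefficient (lambda(rp))

    def forward(self, x):
        p = 1.0 / (1.0 + np.exp(-self.w @ x))   # firing probability
        y = float(rng.random() < p)             # stochastic binary output
        return y, p

    def update(self, x, y, p, r):
        # r = 1 rewards: push p towards the emitted output y, scaled by alpha.
        # r = 0 penalises: push p towards 1 - y, scaled by alpha * lambda_rp.
        delta = r * (y - p) + self.lambda_rp * (1.0 - r) * (1.0 - y - p)
        self.w += self.alpha * delta * x

class ACE:
    """Adaptive critic element: keeps decaying traces of stimuli, learns a
    prediction of reinforcement, and emits an internal reinforcement signal
    that carries temporal information."""
    def __init__(self, n_inputs, beta=0.1, gamma=0.95, decay=0.9):
        self.v = np.zeros(n_inputs)      # prediction weights
        self.trace = np.zeros(n_inputs)  # stimulus eligibility traces
        self.beta, self.gamma, self.decay = beta, gamma, decay
        self.prev_prediction = 0.0

    def critique(self, x, r):
        prediction = float(self.v @ x)
        # Internal reinforcement: external reward plus the change in prediction.
        r_hat = r + self.gamma * prediction - self.prev_prediction
        self.trace = self.decay * self.trace + (1.0 - self.decay) * x
        self.v += self.beta * r_hat * self.trace
        self.prev_prediction = prediction
        return r_hat

In such a setup the critic's r_hat would replace the raw binary reward before the A(R-P) update; clipping r_hat into {0,...,1} would correspond to the abstract's "bounded" signal, while allowing it to range over {-1,...,2} would correspond to the "unbounded" extension that can penalise the net even with lambda(rp) = 0. How the paper maps r_hat onto these ranges is not specified here, so this pairing should be read as an assumption.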