This article presents an investigation of how training of sigma-pi networks under the associative reward-penalty (A(R-P)) regime may be enhanced by using two networks in parallel. The technique uses what has been termed an unsupervised "adaptive critic element" (ACE) to give critical advice to the supervised sigma-pi network. We utilise the conventions of the sigma-pi neuron model (i.e., quantisation of variables) to obtain an implementation, termed the "quantised adaptive critic", which is hardware realisable. The associative reward-penalty training regime either rewards the neural network, r = 1, by incrementing the weights of the net by a delta term times a learning rate, alpha, or penalises it, r = 0, by decrementing the weights by an inverse delta term times the product of the learning rate and a penalty coefficient, alpha x lambda_rp.
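A minimal sketch of this update rule for a single stochastic binary unit follows; the variable names (w, x, y, p) and the default parameter values are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def a_rp_update(w, x, y, p, r, alpha=0.1, lambda_rp=0.05):
    """One associative reward-penalty (A(R-P)) weight update.

    w: weight vector;  x: input vector;
    y: stochastic binary output (0 or 1);
    p: firing probability that produced y;
    r: binary reward signal (1 = reward, 0 = penalty).
    """
    if r == 1:
        # Reward: the delta term moves the firing probability
        # towards the output that was actually emitted.
        delta = y - p
        w = w + alpha * delta * x
    else:
        # Penalty: the "inverse" delta term moves the probability
        # towards the opposite output, scaled down by lambda_rp.
        inverse_delta = (1 - y) - p
        w = w + alpha * lambda_rp * inverse_delta * x
    return w
```

Note that with lambda_rp = 0 the penalty branch leaves the weights unchanged, which is the point of contrast with the unbounded reward scheme described below.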
Our initial research, utilising a "bounded" reward signal, r ∈ {0, 1}, found that the critic provides advisory information to the sigma-pi net which augments its training efficiency. This led us to develop an extension to the adaptive critic and associative reward-penalty methodologies, utilising an "unbounded" reward signal, r ∈ {-1, ..., 2}, which permits penalisation of a net even when the penalty coefficient is set to zero, lambda_rp = 0. One should note that with the standard associative reward-penalty methodology the net is normally only penalised if the penalty coefficient is non-zero (i.e., 0 < lambda_rp <= 1).
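One way to realise such a scheme, sketched under our own assumption that the unbounded reward directly scales the delta term (the paper's exact rule may differ), is:

```python
def a_rp_update_unbounded(w, x, y, p, r, alpha=0.1, lambda_rp=0.0):
    """A(R-P)-style update with an unbounded reward r in {-1, ..., 2}.

    Assumed form: r scales the delta term directly, so a negative r
    decrements the weights, i.e. penalises the net, even when
    lambda_rp == 0. The classical lambda_rp penalty path is retained.
    """
    delta = y - p
    inverse_delta = (1 - y) - p
    # With r < 0 the first term already acts as a penalty.
    w = w + alpha * (r * delta + lambda_rp * inverse_delta) * x
    return w
```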
One of the enigmas of associative reward-penalty (A(R-P)) training is that it broadcasts sparse information, in the form of an instantaneous binary reward signal, that depends only on the present output error. Here we put forward ACE and A(R-P) methodologies for sigma-pi nets which are based on tracing the frequency of "stimuli" occurrence and then using this to derive a prediction of the reinforcement. The predictions are then used to derive a reinforcement signal which incorporates temporal information. Hence one may use more precise information to enable more efficient training.
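The abstract does not give the quantised update equations; as a point of reference, here is a sketch of the classic adaptive critic element of Barto, Sutton and Anderson (1983), on which this work builds, with trace, learning-rate and discount parameters chosen by us for illustration:

```python
class AdaptiveCriticElement:
    """Sketch of a classic ACE; the quantised, sigma-pi-specific
    details of the paper are omitted."""

    def __init__(self, n_inputs, beta=0.2, gamma=0.95, decay=0.8):
        self.v = [0.0] * n_inputs      # prediction weights
        self.trace = [0.0] * n_inputs  # decaying trace of stimulus occurrence
        self.prev_prediction = 0.0
        self.beta, self.gamma, self.decay = beta, gamma, decay

    def step(self, x, r):
        """x: binary stimulus vector, r: external reward.
        Returns the internal (temporally informed) reinforcement."""
        prediction = sum(vi * xi for vi, xi in zip(self.v, x))
        # Internal reinforcement compares successive predictions.
        r_hat = r + self.gamma * prediction - self.prev_prediction
        # Update prediction weights along the existing stimulus traces.
        self.v = [vi + self.beta * r_hat * ti
                  for vi, ti in zip(self.v, self.trace)]
        # Decay the trace and fold in the current stimuli, recording
        # how recently/frequently each stimulus has occurred.
        self.trace = [self.decay * ti + (1 - self.decay) * xi
                      for ti, xi in zip(self.trace, x)]
        self.prev_prediction = prediction
        return r_hat
```

The decaying trace records the frequency of stimulus occurrence, and the difference of successive predictions converts the instantaneous reward into a reinforcement signal, r_hat, that carries temporal information.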
Copyright (C) 1996 Elsevier Science Ltd