It has recently been shown that gradient-descent learning algorithms for recurrent neural networks can perform poorly on tasks that involve long-term dependencies, i.e., those problems for which the desired output depends on inputs presented at times far in the past. We show that the long-term dependencies problem is lessened for a class of architectures called Nonlinear AutoRegressive models with eXogenous inputs (NARX) recurrent neural networks, which have powerful representational capabilities. We have previously reported that gradient descent learning can be more effective in NARX networks than in recurrent neural network architectures that have "hidden states" on problems including grammatical inference and nonlinear system identification. Typically, the network converges much faster and generalizes better than other networks. The results in this paper are consistent with this phenomenon. We present some experimental results which show that NARX networks can often retain information for two to three times as long as conventional recurrent neural networks. We show that although NARX networks do not circumvent the problem of long-term dependencies, they can greatly improve performance on long-term dependency problems. We also describe in detail some of the assumptions regarding what it means to latch information robustly and suggest possible ways to loosen these assumptions.
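For readers unfamiliar with the architecture, a minimal sketch of the NARX recurrence may help fix ideas; the symbols n_u and n_y (the input and output delay orders) and the map Psi are generic placeholders, not notation taken from this abstract. A NARX network computes its output from a tapped delay line of past inputs and past outputs rather than from an internal hidden state:

    y(t) = Psi( u(t), u(t-1), ..., u(t-n_u), y(t-1), y(t-2), ..., y(t-n_y) )

Here u(t) is the exogenous input, y(t) is the network output fed back through the output delays, and Psi is the nonlinear mapping realized by a feedforward network. The output delays provide direct paths to states several steps in the past, which is the structural feature the abstract credits with easing gradient-based learning of long-term dependencies.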