We study on-line generalized linear regression with multidimensional outputs, i.e., neural networks with multiple output nodes but no hidden nodes. At the output layer we allow transfer functions, such as the softmax function, that depend on the linear activations of all the output neurons. The weight vectors used to produce the linear activations are represented indirectly: we maintain separate parameter vectors and obtain each weight vector by applying a particular parameterization function to the corresponding parameter vector. Upon seeing new examples, the parameter vectors are updated additively, as in the usual gradient descent update. However, by using a nonlinear parameterization function between the parameter vectors and the weight vectors, the resulting update of the weight vectors can be made quite different from a true gradient descent update. To analyze such updates, we define a notion of a matching loss function and apply it both to the transfer function and to the parameterization function. The loss function that matches the transfer function is used to measure the goodness of the algorithm's predictions. The loss function that matches the parameterization function serves both as a measure of divergence between models, motivating the update rule of the algorithm, and as a measure of progress in analyzing the algorithm's performance relative to an arbitrary fixed model. As a result, we obtain a unified treatment that generalizes earlier results for the gradient descent and exponentiated gradient algorithms to multidimensional outputs, including multiclass logistic regression.
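As a schematic illustration only (the notation below is not fixed by the text above: $F$ denotes a convex potential whose gradient $\phi = \nabla F$ is the transfer function, $\psi$ the parameterization function, $\theta_t$ the parameters, $W_t = \psi(\theta_t)$ the resulting weights, $x_t$ an instance, $y_t$ the desired output, and $\eta > 0$ a learning rate), the matching loss of a transfer function can be written as a Bregman divergence between activation vectors, and the additive update subtracts from the parameters $\eta$ times the gradient of the loss taken with respect to the weights:
\[
  L_\phi\bigl(\phi(a),\phi(\hat a)\bigr) \;=\; F(\hat a) - F(a) - (\hat a - a)^{\top}\phi(a),
\]
\[
  \hat y_t = \phi(W_t x_t), \quad W_t = \psi(\theta_t), \qquad
  \theta_{t+1} \;=\; \theta_t - \eta\,(\hat y_t - y_t)\,x_t^{\top},
\]
where the last expression uses the fact that the gradient of the matching loss with respect to the activations is simply $\hat y_t - y_t$. In this sketch, taking $\psi$ to be the identity recovers ordinary gradient descent, while a componentwise exponential $\psi$ yields exponentiated-gradient-style multiplicative updates on the weights; for the softmax transfer function, the matching loss is the relative entropy between desired and predicted outputs.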