
Nodes in neighboring layers are connected with weights \(w_{ij}\), which are the network parameters.

A commonly used activation function is the Sigmoid function: \(f(\color{input}x\color{black}) = \frac{1}{1+e^{-\color{input}x}}\).
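As a concrete sketch (the Python below is my own illustration, not part of this page), the Sigmoid can be implemented directly:

```python
import math

def sigmoid(x):
    """Sigmoid activation: f(x) = 1 / (1 + e^(-x)), mapping any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))
```

It equals 0.5 at \(x = 0\) and saturates toward 0 or 1 for large negative or positive inputs.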

To measure how far we are from the goal, we use an error function \(E\). A commonly used error function is \(E(\color{output}y_{output}\color{black},\color{output}y_{target}\color{black}) = \frac{1}{2}(\color{output}y_{output}\color{black} - \color{output}y_{target}\color{black})^2 \).
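A minimal Python sketch of this error function (the function name is my own):

```python
def squared_error(y_output, y_target):
    """Half squared error: E(y_output, y_target) = 1/2 * (y_output - y_target)^2."""
    return 0.5 * (y_output - y_target) ** 2
```

The factor of \(\frac{1}{2}\) makes the derivative with respect to \(y_{output}\) simply \(y_{output} - y_{target}\).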

For consistency, we treat the input as a node like any other, but without an activation function, so its output equals its input, i.e. \( \color{output}y_1 \color{black} = \color{input} x_{input} \).

In the forward pass, each node \(j\) first computes its total input as the weighted sum of the outputs of its incoming nodes plus a bias:

$$ \color{input} x_j \color{black} = \sum_{i\in in(j)} w_{ij}\color{output} y_i\color{black} + b_j$$

It then applies the activation function to produce its output:

$$ \color{output} y \color{black} = f(\color{input} x \color{black})$$
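The two formulas above can be sketched for a single node as follows (Python, with names of my own choosing):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_forward(inputs, weights, bias):
    """Forward pass for one node: total input x = sum_i w_i * y_i + b,
    then output y = f(x) with the Sigmoid activation."""
    x = sum(w * y for w, y in zip(weights, inputs)) + bias
    return sigmoid(x)
```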


Once we have the error derivatives, we can update the weights using a simple update rule:

$$w_{ij} = w_{ij} - \alpha \color{dweight}\frac{dE}{dw_{ij}}$$

where \(\alpha\) is a positive constant, referred to as the learning rate, which we need to fine-tune empirically.
[Note] The update rule is intuitive: if the error goes down when the weight increases (\(\color{dweight}\frac{dE}{dw_{ij}}\color{black} < 0\)),
we increase the weight; conversely, if the error goes up when the weight increases (\(\color{dweight}\frac{dE}{dw_{ij}} \color{black} > 0\)),
we decrease the weight.
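A one-line sketch of this update rule (Python; the default `alpha` is arbitrary and would be tuned empirically):

```python
def update_weight(w, dE_dw, alpha=0.1):
    """Gradient-descent step: move the weight against the error gradient."""
    return w - alpha * dE_dw
```

With a negative derivative the weight increases; with a positive one it decreases, exactly as the note describes.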

During the backward pass, for each node we compute the error derivative with respect to two quantities:

- the total input of the node, \(\color{dinput}\frac{dE}{dx}\), and
- the output of the node, \(\color{doutput}\frac{dE}{dy}\).

The backward pass starts at the output, where, from the error function above, the derivative with respect to the output is:

$$ \color{doutput} \frac{\partial E}{\partial y_{output}} \color{black} = \color{output} y_{output} \color{black} - \color{output} y_{target}$$

$$\color{dinput} \frac{\partial E}{\partial x} \color{black} = \frac{dy}{dx}\color{doutput}\frac{\partial E}{\partial y} \color{black} = \frac{d}{dx}f(\color{input}x\color{black})\color{doutput}\frac{\partial E}{\partial y}$$

where \(\frac{d}{dx}f(\color{input}x\color{black}) = f(\color{input}x\color{black})(1 - f(\color{input}x\color{black}))\) when
\(f(\color{input}x\color{black})\) is the Sigmoid activation function.
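This closed-form derivative is easy to check numerically; the sketch below (my own, in Python) compares it against a centered finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Closed form: f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Centered finite-difference approximation of f'(x) at x = 0.7.
h = 1e-6
numeric = (sigmoid(0.7 + h) - sigmoid(0.7 - h)) / (2.0 * h)
```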
Knowing the error derivative with respect to the total input, we obtain the error derivative with respect to each incoming weight:

$$\color{dweight} \frac{\partial E}{\partial w_{ij}} \color{black} = \frac{\partial x_j}{\partial w_{ij}} \color{dinput}\frac{\partial E}{\partial x_j} \color{black} = \color{output}y_i \color{dinput} \frac{\partial E}{\partial x_j}$$

Finally, summing over all outgoing connections of node \(i\), we obtain the error derivative with respect to its output:

$$ \color{doutput} \frac{\partial E}{\partial y_i} \color{black} = \sum_{j\in out(i)} \frac{\partial x_j}{\partial y_i} \color{dinput} \frac{\partial E}{\partial x_j} \color{black} = \sum_{j\in out(i)} w_{ij} \color{dinput} \frac{\partial E}{\partial x_j}$$
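Putting the three backward-pass formulas together for a single Sigmoid node (a sketch under my own naming, not code from this page):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_node(inputs, weights, x, dE_dy):
    """Given dE/dy at a Sigmoid node with total input x, return:
    dE/dx, the list of dE/dw_ij for each incoming weight, and the
    per-edge terms w_ij * dE/dx_j that are summed into each dE/dy_i."""
    y = sigmoid(x)
    dE_dx = y * (1.0 - y) * dE_dy                 # dE/dx = f'(x) * dE/dy
    dE_dw = [y_i * dE_dx for y_i in inputs]       # dE/dw_ij = y_i * dE/dx_j
    dE_dy_terms = [w * dE_dx for w in weights]    # w_ij * dE/dx_j per edge
    return dE_dx, dE_dw, dE_dy_terms
```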
