Multilayer Perceptrons

Adding a “hidden” layer of perceptrons allows the model to find solutions to linearly inseparable problems. An example of this architecture is shown below.

The input to the hidden layer is the same as it was before, as is the the activation function.

$$h_j = \sum_i w_{ij} \times x_i + b_j$$

$$a_j = sigmoid(h_j)$$

The input to the output layer is the output of the hidden layer.

$$o_k = \sum_j w_{jk} \times a_j + b_j$$

$$a_k = sigmoid(o_k)$$

Stacking more and more layers helps the network learn more patterns. This is where the term “deep learning” comes from.

Neural Network Mathematics

The following notation will be used for the weights: $w_{ij}$ where $i$ denotes input units and $j$ denotes the hidden units.

For example, the following image shows are network with input units labeled $x_1$, $x_2$, and $x_3$, and hidden nodes labeled $h_1$ and $h_2$.

The weights need to be stored in a matrix, indexed as $w_{ij}$. Each row in the matrix corresponds to the weights leading out of a single input unit, and each column corresponds to the weights leading into a single hidden unit.

To initialize these weights we need the shape of the matrix. If we have a 2D array called features containing the input data:

n_records, n_inputs = features.shape
n_hidden = 2
weights_input_to_hidden = np.random.normal(0, n_inputs**-0.5, size=(n_inputs, n_hidden))

The result is a 2D array (or, “matrix”) named weights_input_to_hidden with dimensions n_inputs by n_hidden. Recall that the input to a hidden unit is the sum of all the inputs multiplied by the hidden unit’s weights. So, for each hidden unit, $h_j$, we use linear algebra to calculate the following sum.

$$h_j=\sum_i w_{ij} x_i$$

For the architecture described above, we’re multiplying the input (a row vector) by the weights (a matrix). This is accomplished by taking the dot product of the inputs with each column in the weights matrix. The following is for the input to the first hidden unit, $j=1$, which is associated to the first column of the weights matrix.

$$h_1=x_1 w_{11} + x_2 w_{21} + x_3 w_{31}$$

The sum of products for the second hidden layer output would be similar, except using the dot product of the inputs with the second column, and so on. Using Numpy, this calculation can be run for all the inputs and outputs simultaneously:

hidden_inputs =, weights_input_to_hidden)

Note that the weights matrix could also be defined as follows, with dimensions n_hidden by n_inputs, then multiply by a inputs when set up as a column vector. Note that the weight indices no longer match up with the labels used in earlier diagrams. In matrix notation, the row index always precedes the column index.

The critical thing to remember with matrix multiplication is that the dimensions must match. For example, three columns in the input vector, and three rows in the weights matrix. Here’s the rule:

  • If multiplying an array from the left, the array must have the same number of elements as there are rows in the matrix.
  • If multiplying the matrix from the left, the number of columns in the matrix must equal the number of elements in the array on the right.