Let's learn perceptron the noob way - Part 2

Posted on November 12, 2017
Last Edited on November 14, 2017
Tags: programming machine-learning perceptron

Previously, we have used the weights that were already solved. So, this time we will be formulating on how those weights are actually updated to an optimal configurations.


In the previous post, we have used the weights that were already available from a trained system. But we don’t get such stable weights always. And randomly guessing or doing brute-force solution is pretty much expensive (computationally).

Here, we just have 2 weights for which brute force might work. But that’s just a childish imagination - having just few weights.

In real case scenario, we have many weights. I repeat many. So, what we do is start from random values of weights and use certain process to find our way to the weights that seem good enough for the system.

Random weights

>> synapses = 2*np.random.random((2, 1)) - 1
       [ 0.43981246]])

Using the random weights, we try the prediction.

>> y = np.dot(X_train, synapses)
array([[ 0.        ],
       [ 0.43981246],
>> z = sigmoid(y)
array([[ 0.5       ],
       [ 0.60821434],
       [ 0.33796224],
       [ 0.44211669]])

Oh shit! This is not good.

Don’t worry if the prediction is mayhem. We have a technique to learn the weights through error in the prediction.


What we know is what the output should be for each training example from Y_train. [Core gist of supervised learning].

So, we have a metric to know how much off the predicted value is from what is expected (as from Y_train).

One of the ways for calculating the error is just using the difference:

error = target - prediction

In machine learning world, we call this a cost function

>> errors = Y_train - z
array([[ 0.5       ],


Using the error, we can know how much we should add/subtract to the corresponding weight in order to approach the target value.

  1. If the error is positive, we have to add the error to the weight by that much amount.
  2. If the error is negative, we have to subtract the error from the weight by that much amount.
new_weight -> old_weight + error

This is vaguely the rule for updating the weight


wi = wi + error

However, the error factor isn’t solely responsible for the weight. If error is alone responsible for the update, the learning is pretty much very slow. Each weight is also contributed by the corresponding input it has connection to. So, in some ways inputs do influence the corresponding weights.

The intuition behind this input coming into the play is relatable.

Say you have a metal rod. In normal condition, when you touch the rod it doesn’t really have influence on your reaction.
When you touch a hot rod, you immediately withdraw you hand. So, in some ways the synapse is wired to respond to the hotness or coldness of the rod itself. The material from which rod is made can be considered as weight here.


your_reaction -> (coldness/hotness of rod) + (rod's material)

This intuition is just a naive one which I have thought of.

Remember this the whole time or on every machine learing processes:

Inputs affect the outputs.

Back to square one - Let’s update the weight again.

wi = wi + error * input


w1  ->  w1 + error * x1
w2  ->  w2 + error * x2

For each training set (X_train, Y_train):

w1  -> w1 + error1 * x11 + error2 * x12 + ...     ==> w1 + (error1 + error2 +...) * x1j
w2  -> w2 + error1 * x21 + error2 * x22 + ...     ==> w2 + (error1 + error2 +...) * x2j  

j   ->  individual training instance
x11, x12, ...    ->  input x1 at each training instance 'j'
x21, x22, ...    ->  input x2 at each training instance 'j'


wi  -> wi + average_error * xi

Learning rate

So far so good, heh?

Yes. But no. Here’s the problem. If we did follow the above rule, then we run into one specific problem.

We could just keep on oscillating here and there. At one time the error might be positive, at another time it might be negative. The weights might just be like a pendulum where the values oscillate here and there.

So, we introduce a parameter (Hyperparameter) that directly affects this behaviour. One of such parameters is learning rate that tells by how much factor the weights have to be updated to get to the optimal configuration.

There are two things that matter:

  1. If learning rate is too big, we might miss the optimal configuration. We might be oscillating here and there.
  2. If learning rate is too small, it might take eternity to reach the desired weights.

Learning rate is generally represented by Greek letter called eta.

>> eta = 0.01

Perceptron Learning Rule (Finally)

wi  ->  wi + eta * average_error * xi
>> delta = np.dot( X_train.T, errors)
array([[ 1.21992107],
       [ 0.94966897]])
>> synapses +=  eta * delta
       [ 0.53477936]])

After Several Iterations

Let’s run the update rule for many iterations and see the weights.

for i in range(3000):
    y = np.dot(X_train, synapses)
    z = sigmoid(y)
    errors = Y_train - z
    delta = np.dot(X_train.T, errors)
    synapses += eta * delta
>> synapses
array([[ 3.38947881],
       [ 3.50387352]])

>> y = np.dot(X_train, synapses)
array([[ 0.        ],
       [ 3.50387352],
       [ 3.38947881],
       [ 6.89335232]])

>> z = sigmoid(y)
array([[ 0.5       ],
       [ 0.97079778],
       [ 0.9673741 ],
       [ 0.99898652]])

Now that’s better. Phew!!!

That Output to Zero Inputs seems out of place

Yes. For a perceptron, if all the inputs are zero the activation gives an output of a constant 0.5.
This doesn’t look quite good, right? - Of course it doesn’t.

To handle such inputs, the perceptron has a concept called bias. A bias is nothing but (yet another) connection that has constant input of 1. We update the bias like any other weights like above. A bias, in fact, is just shifting the activation function here or there.

Final Note

The optimial weight I have been mentioning is just one of the configurations of weights for which the model accurately predicts the output. However it is not guaranteed that the system/model converges to a global optmial configuration. This is the intuition behind Gradient Descent

The intuition behind perceptron update rule, in fact, can be mathematically derived using Gradient Descent. And that my friend, is a topic for some other day.

Food For Thought

Perceptron is nothing more than a way to construct a linear boundary between the datasets for classificaiton.

Try the perceptron for XOR gate and see if you get what linearity is.