Catching AI with its pants down: Implement an Artificial Neuron from Scratch
We will strip the mighty, massively hyped, highly dignified AI of its clothes, and bring its innermost details down to earth!
Prologue
This is part 4 of the blog series, Catching AI with its pants down. This blog series aims to explore the inner workings of neural networks and show how to build a standard feedforward neural network from scratch.
In this part, we will implement, from scratch, all the equations that we derived in the previous parts.
| Parts | The complete index for Catching AI with its pants down |
|---|---|
| Pant 1 | Some Musings About AI |
| Pant 2 | Understand an Artificial Neuron from Scratch |
| Pant 3 | Optimize an Artificial Neuron from Scratch |
| Pant 4 | Implement an Artificial Neuron from Scratch |
| Pant 5 | Understand a Neural Network from Scratch |
| Pant 6 | Optimize a Neural Network from Scratch |
| Pant 7 | Implement a Neural Network from Scratch |
| Pant 8 | Demonstration of the Models in Action |
Code Implementation: an artificial neuron
All the code will be in Python, using its object-oriented paradigm wherever possible (but I won’t bother with getters and setters for the most part). We will primarily use the NumPy library because its operations are very efficient for linear algebra computations involving arrays.
This implementation does not take advantage of parallel computing, so your GPU won’t make things any faster. But it does take advantage of NumPy’s superb optimization for computations with multidimensional arrays. Therefore, Python loops are avoided as much as possible in the code, which is why we went through all that work to have everything as tensors.
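As a quick aside, here is a small standalone sketch of the kind of difference this makes (the array sizes are made up; this is not part of the Neuron class):
import time
import numpy as np

# Made-up sizes: 500 features, 2000 samples.
rng = np.random.RandomState(0)
w = rng.random((1, 500))
X = rng.random((500, 2000))

start = time.perf_counter()
z_fast = np.matmul(w, X)  # vectorized linear combination, as in the forward pass
print("NumPy matmul:", round(time.perf_counter() - start, 4), "s")

start = time.perf_counter()
z_slow = [sum(w[0, i] * X[i, j] for i in range(500)) for j in range(2000)]  # pure-Python loops
print("Python loops:", round(time.perf_counter() - start, 4), "s")

print(np.allclose(z_fast[0], z_slow))  # same numbers, very different speed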
We will also not implement any concurrent computing (so no multithreading of any sort) other than whatever may have been baked into NumPy. Most deep learning libraries include concurrent and parallel computing capabilities, as well as automatic differentiation. None of those are really needed for a single artificial neuron, but they are absolutely priceless when training a network of neurons (a.k.a. a neural network).
Constructor
We begin by implementing our constructor, where we initialize all our data members (also using it as an opportunity to lay them all out).
import numpy as np

class Neuron:
    def __init__(self, X, Y):
        # Full dataset: X has shape (features, samples), Y has shape (1, samples).
        self.X = X
        self.Y = Y
        # The batch sampled from the dataset at each training iteration.
        self.X_batch = None
        self.Y_batch = None
        # Forward-pass quantities: activation a and preactivation z.
        self.a = None
        self.z = None
        # Parameters: weights w and bias b.
        self.w = None
        self.b = None
        # Backward-pass gradients.
        self.dAdZ = None
        self.dJdA = None
        self.dJdZ = None
        self.dJdW = None
        self.dJdB = None
We don’t really need to access the entire dataset (X and Y) during instantiation; we could have chosen to initialize self.X and self.Y later. All we actually need is the shape of X, because it gives us the number of features, which we use when we initialize our parameters. However, I chose to have both self.X and self.Y initialized at instantiation for the sake of it, so this is certainly an opportunity for some refactoring to improve the code.
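For instance, here is a minimal sketch of one possible refactoring, where the constructor takes only the number of features and the dataset is passed to train instead. This is a hypothetical alternative, not the class used in this series:
import numpy as np

class LeanNeuron:
    # Hypothetical refactored constructor: store and use only what
    # parameter initialization actually needs (the number of features).
    def __init__(self, num_features):
        prng = np.random.RandomState(seed=11)
        self.w = prng.random(size=(1, num_features)) * 0.01
        self.b = np.zeros(shape=(1, 1))

    def train(self, X, Y, num_iterations, learning_rate, batch_size):
        # The dataset would be supplied here rather than at instantiation.
        pass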
Parameter initialization
Next, we implement a method for parameter initialization. We keep it plain: the weights get small random values, and the bias starts at zero.
    def _initialize_parameters(self, random_seed=11):
        prng = np.random.RandomState(seed=random_seed)
        n = self.X.shape[0]  # number of features
        self.w = prng.random(size=(1, n)) * 0.01  # small random weights
        self.b = np.zeros(shape=(1, 1))           # bias starts at zero
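As a quick sanity check, here is a standalone snippet (with a made-up feature count) showing the shapes this initialization produces:
import numpy as np

n = 3  # made-up number of features
prng = np.random.RandomState(seed=11)
w = prng.random(size=(1, n)) * 0.01
b = np.zeros(shape=(1, 1))
print(w.shape, b.shape)  # (1, 3) (1, 1)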
Forward pass
The forward pass can be broken into two steps: first, a linear combination of the parameters and the data-point values, which gives the preactivation; then, passing the preactivation through an activation function, which gives the activation.
The equations for the forward pass are (see part 2):
\[\vec{z}=\vec{w}\mathbf{X}+b\]
\[\vec{a}=f\left(\vec{z}\right)\]
    def _forward(self):
        self.z = np.matmul(self.w, self.X_batch) + self.b  # linear combination (preactivation)
        self.a = self._logistic(self.z)                    # activation
Notice that I used self.X_batch instead of self.X, because we perform our calculations on batches of samples from the dataset. We will initialize self.X_batch during training (i.e. inside the train method).
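To make the shapes concrete, here is a standalone sketch with made-up sizes: with w of shape (1, n) and X_batch of shape (n, m), both z and a come out with shape (1, m), i.e. one value per sample in the batch.
import numpy as np

n, m = 3, 4  # made up: 3 features, a batch of 4 samples
w = np.random.RandomState(0).random(size=(1, n)) * 0.01
b = np.zeros(shape=(1, 1))
X_batch = np.random.RandomState(1).random(size=(n, m))

z = np.matmul(w, X_batch) + b  # (1, n) @ (n, m) + (1, 1) -> (1, m)
a = 1 / (1 + np.exp(-z))       # logistic activation, also (1, m)
print(z.shape, a.shape)        # (1, 4) (1, 4)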
Activation function
Next, we implement our activation function. We will only do the logistic for this model of an artificial neuron (see part 2). Check out the deep neural network code for some other activation functions.
\[\vec{a}=f\left(\vec{z}\right)=\frac{1}{1+e^{-\vec{z}}}\]
    def _logistic(self, z):
        a = 1 / (1 + np.exp(-z))
        return a
We will also implement the derivative of the activation function, which in our case is the logistic function (see part 3). But note that we invoke this method only during the backward pass, not the forward pass. Presenting it here, next to the code for the forward pass, is just a matter of personal taste.
\[f'\left(\vec{z}\right)=\vec{a}\odot\left(1-\vec{a}\right)\]
    def _logistic_gradient(self, a):
        dAdZ = a * (1 - a)
        return dAdZ
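If you want to convince yourself that this derivative is correct, a quick finite-difference check does the trick. This is a standalone sketch, not part of the Neuron class:
import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-4, 4, 9)
a = logistic(z)
analytic = a * (1 - a)                                         # a ⊙ (1 - a)
eps = 1e-6
numeric = (logistic(z + eps) - logistic(z - eps)) / (2 * eps)  # central difference
print(np.allclose(analytic, numeric, atol=1e-8))               # True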
Calculation of Cost
Next would come the method for computing the cost. I didn’t implement it for the artificial neuron; instead I did it for the main thing, the deep neural network code (the blog post for it is coming soon).
Note that you don’t actually need the cost itself for the training process, only the cost gradients; the cost is just there to tell us how the training is progressing. This is the equation we would implement:
\[J=-\frac{1}{m}\bullet\sum_{i=1}^{m}{y_i\cdot\log{(a_i)}+(1-y_i)\cdot\log{(1-a_i)}}\]
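For completeness, here is a hedged sketch of what such a method might look like if you chose to add it to the Neuron class in this post. It assumes the attribute names used above and is not part of the final code:
    def _compute_cost(self):
        # Binary cross-entropy over the current batch.
        # Assumes self.a and self.Y_batch both have shape (1, m).
        m = self.Y_batch.shape[1]
        cost = -(1 / m) * np.sum(
            self.Y_batch * np.log(self.a) + (1 - self.Y_batch) * np.log(1 - self.a)
        )
        return cost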
Backward pass
Now we will optimize our parameters in such a way that our loss decreases. To do this, we first compute the cost gradient \(\frac{\partial J}{\partial\vec{w}}\), which we showed in part 3:
\[\frac{\partial J}{\partial\vec{w}}=\frac{\partial J}{\partial\vec{z}}\frac{\partial\vec{z}}{\partial\vec{w}}=\frac{\partial J}{\partial\vec{z}}\mathbf{X}^T=\frac{\partial J}{\partial\vec{a}}\odot\frac{\partial\vec{a}}{\partial\vec{z}}\mathbf{X}^T=\frac{\partial J}{\partial\vec{a}}\odot f'(\vec{z})\mathbf{X}^T\]
For a logistic loss function and a logistic activation function, we have:
\[\frac{\partial J}{\partial\vec{w}}=-\frac{1}{m}\bullet\left(\frac{\vec{y}}{\vec{a}}-\frac{1-\vec{y}}{1-\vec{a}}\right)\odot(\vec{a}\odot\left(1-\vec{a}\right))\mathbf{X}^T\]
We could directly implement the above equation, but I chose to implement it in stages, with each gradient computed at each stage. This will make it a little easier to swap in other activation functions and loss functions if you ever choose to do so in the future.
So, we implement the following equations step by step:
\[\frac{\partial\vec{a}}{\partial\vec{z}}:=f'\left(\vec{z}\right)=\vec{a}\odot\left(1-\vec{a}\right)\]
\[\frac{\partial J}{\partial\vec{a}}=-\frac{1}{m}\bullet\left(\frac{\vec{y}}{\vec{a}}-\frac{1-\vec{y}}{1-\vec{a}}\right)\]
\[\frac{\partial J}{\partial\vec{z}}=\frac{\partial J}{\partial\vec{a}}\odot\frac{\partial\vec{a}}{\partial\vec{z}}\]
\[\frac{\partial J}{\partial\vec{w}}=\frac{\partial J}{\partial\vec{z}}\mathbf{X}^T\]
The cost gradient for the bias is:
\[\frac{\partial J}{\partial b}=\sum_{j=1}^{m}\left(\frac{\partial J}{\partial\vec{z}}\right)_j\]
As we showed in part 3, we can also choose to use this equation instead:
\[\frac{\partial J}{\partial b}=\frac{\partial J}{\partial\vec{z}}\ \frac{\partial\vec{z}}{\partial b}\]
where $ \frac{\partial\vec{z}}{\partial b} $ is an $ m $-by-$ 1 $ vector of ones (i.e. it has the same shape as $ \vec{z}^T $).
Both equations, implemented as self.dJdB = np.sum(self.dJdZ, axis=1) and self.dJdB = np.matmul(self.dJdZ, np.ones(self.z.T.shape)), produce the same result. We will use the former.
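Here is a tiny standalone check of that equivalence, with made-up numbers (the two versions hold the same value, though with shapes (1,) and (1, 1) respectively):
import numpy as np

dJdZ = np.array([[0.2, -0.5, 0.1]])            # pretend dJ/dz for a batch of 3
via_sum = np.sum(dJdZ, axis=1)                 # shape (1,)
via_matmul = np.matmul(dJdZ, np.ones((3, 1)))  # shape (1, 1)
print(via_sum, via_matmul)                     # both hold the value -0.2
print(np.isclose(via_sum[0], via_matmul[0, 0]))  # True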
    def _backward(self):
        m = self.X_batch.shape[1]  # batch size
        self.dAdZ = self._logistic_gradient(self.a)
        self.dJdA = -(1 / m) * ((self.Y_batch / self.a) - ((1 - self.Y_batch) / (1 - self.a)))
        self.dJdZ = self.dAdZ * self.dJdA
        self.dJdW = np.matmul(self.dJdZ, self.X_batch.T)
        self.dJdB = np.sum(self.dJdZ, axis=1)
Update parameters via gradient descent
Next, we update each parameter using gradient descent:
\[w_{new}=w_{old}-\gamma\frac{\partial J}{\partial w_{old}}\] \[b_{new}=b_{old}-\gamma\frac{\partial J}{\partial b_{old}}\]
Where $ \gamma $ is the learning rate (a.k.a. step size). It's a hyperparameter, meaning that it is a variable you directly set and control.
Note that $ \frac{\partial J}{\partial w_{old}} $ is simply the $ \frac{\partial J}{\partial w} $ that we just calculated, and the same is true for $ \frac{\partial J}{\partial b_{old}} $.
With this, we’ve completed one iteration of training. We repeat this as many times as we want. Eventually, we expect to end up with an artificial neuron that has learned the underlying relationship between the features and the target.
    def _update_parameters_via_gradient_descent(self, learning_rate):
        self.w = self.w - learning_rate * self.dJdW
        self.b = self.b - learning_rate * self.dJdB
Training
The training process is as follows:
- Randomly initialize our parameters.
- Run one iteration of training, which involves:
  - Sampling a batch from our dataset.
  - Running the forward pass (i.e. moving the data forward through the neuron).
  - Running the backward pass to calculate our cost gradients.
  - Running gradient descent (which is technically part of the backward pass), which uses the cost gradients to update the parameters.
- Repeat step 2 until we reach the specified number of iterations.
Therefore we combine the code snippets accordingly:
    def train(self, num_iterations, learning_rate, batch_size, random_seed=11):
        print("Training begins...")
        self._initialize_parameters(random_seed=random_seed)
        prng = np.random.RandomState(seed=random_seed)
        for i in range(0, num_iterations):
            # Sample a batch of columns (samples) without replacement.
            random_indices = prng.choice(self.Y.shape[1], (batch_size,), replace=False)
            self.Y_batch = self.Y[:, random_indices]
            self.X_batch = self.X[:, random_indices]
            self._forward()
            self._backward()
            self._update_parameters_via_gradient_descent(learning_rate=learning_rate)
        print("Training Complete!")
We have three hyperparameters we can use to tune the training process: number of iterations, learning rate, and batch size.
Evaluation of trained artificial neuron
And finally, we implement methods for evaluating the neuron, including methods for computing accuracy and precision. These are pretty straightforward.
    def _compute_accuracy(self):
        if np.isnan(self.a).all():
            print("Caution: All the activations are null values.")
            return None
        Y_pred = np.where(self.a > 0.5, 1, 0)  # threshold the activations at 0.5
        Y_true = self.Y_batch
        accuracy = np.average(np.where(Y_true == Y_pred, 1, 0))
        return accuracy

    def _compute_precision(self):
        if np.isnan(self.a).all():
            print("Caution: All the activations are null values.")
            return None
        Y_true = self.Y_batch
        Y_pred = np.where(self.a > 0.5, 1, 0)
        pred_positives_mask = (Y_pred == 1)
        # Fraction of predicted positives that are true positives.
        precision = np.average(np.where(Y_pred[pred_positives_mask] == Y_true[pred_positives_mask], 1, 0))
        return precision
We bundle the two methods under one method for evaluating the model:
    def evaluate(self, X, Y, metric="accuracy"):
        _available_performance_metrics = ["accuracy", "precision"]
        metric = metric.lower()
        if metric not in _available_performance_metrics:
            raise ValueError(f"metric must be one of {_available_performance_metrics}")
        self.X_batch = X
        self.Y_batch = Y
        self._forward()
        if metric == "accuracy":
            score = self._compute_accuracy()
        if metric == "precision":
            score = self._compute_precision()
        return score
I decided to get a little cheeky and throw ValueError when an invalid string is passed to metric, a formal parameter of the method evaluate.
I also decided to print a warning message if all my activations are NaNs (i.e. null values). From my experience, these can occur when the computations cause an arithmetic overflow or underflow.
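If you wanted to harden this code against overflow in the logistic function, one common approach is to split the computation by the sign of z. This is a hedged standalone sketch, not part of the code in this post:
import numpy as np

def stable_logistic(z):
    # For z >= 0, 1 / (1 + exp(-z)) never overflows; for z < 0, use the
    # algebraically equivalent exp(z) / (1 + exp(z)) so exp() never sees
    # a large positive argument.
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1 / (1 + np.exp(-z[pos]))
    expz = np.exp(z[~pos])
    out[~pos] = expz / (1 + expz)
    return out

print(stable_logistic(np.array([-1000.0, 0.0, 1000.0])))  # [0.  0.5 1. ]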
All the code
You can find the entire code, along with the code for deep neural network (the writeup for it is coming soon) and demonstrations using it to tackle real public research datasets, in this GitHub repo.
The version as of the end of March 2020 is repeated here for your convenience:
import numpy as np

np.seterr(over="warn", under="warn")  # warn for overflows and underflows.

class Neuron:
    def __init__(self, X, Y):
        # Full dataset: X has shape (features, samples), Y has shape (1, samples).
        self.X = X
        self.Y = Y
        # The batch sampled from the dataset at each training iteration.
        self.X_batch = None
        self.Y_batch = None
        # Forward-pass quantities: activation a and preactivation z.
        self.a = None
        self.z = None
        # Parameters: weights w and bias b.
        self.w = None
        self.b = None
        # Backward-pass gradients.
        self.dAdZ = None
        self.dJdA = None
        self.dJdZ = None
        self.dJdW = None
        self.dJdB = None

    def _logistic(self, z):
        a = 1 / (1 + np.exp(-z))
        return a

    def _logistic_gradient(self, a):
        dAdZ = a * (1 - a)
        return dAdZ

    def _forward(self):
        self.z = np.matmul(self.w, self.X_batch) + self.b
        self.a = self._logistic(self.z)

    def _backward(self):
        m = self.X_batch.shape[1]  # batch size
        self.dAdZ = self._logistic_gradient(self.a)
        self.dJdA = -(1 / m) * ((self.Y_batch / self.a) - ((1 - self.Y_batch) / (1 - self.a)))
        self.dJdZ = self.dAdZ * self.dJdA
        self.dJdW = np.matmul(self.dJdZ, self.X_batch.T)
        self.dJdB = np.sum(self.dJdZ, axis=1)

    def _update_parameters_via_gradient_descent(self, learning_rate):
        self.w = self.w - learning_rate * self.dJdW
        self.b = self.b - learning_rate * self.dJdB

    def _initialize_parameters(self, random_seed=11):
        prng = np.random.RandomState(seed=random_seed)
        n = self.X.shape[0]  # number of features
        self.w = prng.random(size=(1, n)) * 0.01
        self.b = np.zeros(shape=(1, 1))

    def _compute_accuracy(self):
        if np.isnan(self.a).all():
            print("Caution: All the activations are null values.")
            return None
        Y_pred = np.where(self.a > 0.5, 1, 0)
        Y_true = self.Y_batch
        accuracy = np.average(np.where(Y_true == Y_pred, 1, 0))
        return accuracy

    def _compute_precision(self):
        if np.isnan(self.a).all():
            print("Caution: All the activations are null values.")
            return None
        Y_true = self.Y_batch
        Y_pred = np.where(self.a > 0.5, 1, 0)
        pred_positives_mask = (Y_pred == 1)
        precision = np.average(np.where(Y_pred[pred_positives_mask] == Y_true[pred_positives_mask], 1, 0))
        return precision

    def train(self, num_iterations, learning_rate, batch_size, random_seed=11):
        print("Training begins...")
        self._initialize_parameters(random_seed=random_seed)
        prng = np.random.RandomState(seed=random_seed)
        for i in range(0, num_iterations):
            random_indices = prng.choice(self.Y.shape[1], (batch_size,), replace=False)
            self.Y_batch = self.Y[:, random_indices]
            self.X_batch = self.X[:, random_indices]
            self._forward()
            self._backward()
            self._update_parameters_via_gradient_descent(learning_rate=learning_rate)
        print("Training Complete!")

    def evaluate(self, X, Y, metric="accuracy"):
        _available_performance_metrics = ["accuracy", "precision"]
        metric = metric.lower()
        if metric not in _available_performance_metrics:
            raise ValueError(f"metric must be one of {_available_performance_metrics}")
        self.X_batch = X
        self.Y_batch = Y
        self._forward()
        if metric == "accuracy":
            score = self._compute_accuracy()
        if metric == "precision":
            score = self._compute_precision()
        return score
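To close, here is a hedged end-to-end usage sketch on synthetic data; the dataset, its shapes, and the hyperparameter values are all made up for illustration:
# Synthetic binary-classification data: 5 features, 1000 samples,
# arranged as (features, samples) and (1, samples), as the class expects.
rng = np.random.RandomState(0)
X = rng.random((5, 1000))
true_w = np.array([[2.0, -1.0, 0.5, 1.5, -2.0]])
Y = (np.matmul(true_w, X) > 0.5).astype(int)  # shape (1, 1000)

neuron = Neuron(X, Y)
neuron.train(num_iterations=2000, learning_rate=0.5, batch_size=64)
print("accuracy:", neuron.evaluate(X, Y, metric="accuracy"))
print("precision:", neuron.evaluate(X, Y, metric="precision"))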
See you in the next article!