California Housing - Data Exploration

Taking a lot of inspiration from this Kaggle kernel by Pedro Marcelino, I will go through roughly the same steps using the classic California Housing price dataset in order to practice using Seaborn and doing data exploration in Python.

Secondly, this notebook serves as a proof of concept for generating a Markdown version with jupyter nbconvert --to markdown notebook.ipynb so that it can be posted to my Jekyll blog.
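
As a rough illustration of the kind of steps involved (a minimal sketch, not necessarily the exact code from the notebook), here is how the dataset can be loaded via scikit-learn's built-in California Housing loader and explored with Seaborn:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the dataset as a pandas DataFrame (features plus the MedHouseVal target).
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Distribution of the target variable.
sns.histplot(df["MedHouseVal"], kde=True)
plt.show()

# Correlation heatmap of all numeric variables.
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```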

More …

Learning Optimisation

A long post covering Exponentially Weighted Averages, Bias Correction, Gradient Descent with Momentum, RMSprop, the Adam optimisation technique, and Learning Rate Decay.

It covers part of the second week's material of the Improving Deep Neural Networks Coursera course, and like my other course-notes posts it is mainly my notes from the lectures, rephrased in language that makes sense to me and trying to answer the questions the lectures left me with.

One nice thing is that I far too seldom get to write LaTeX equations, and this post is full of them.
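
To give a taste of one of the topics, here is a minimal sketch (my own function name, not code from the post) of an exponentially weighted average with bias correction:

```python
import numpy as np

def ewa_with_bias_correction(values, beta=0.9):
    """Exponentially weighted average v_t = beta * v_{t-1} + (1 - beta) * theta_t,
    with bias correction v_t / (1 - beta**t) to fix the low estimates early on."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))
    return np.array(corrected)
```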

More …

Bias and Variance

Terminology

Bias

The model has strong constraints/assumptions about what it should look like, which means it ends up underfitting the actual data. An obvious example is a linear model trying to fit clearly non-linear data; the sketch after the solutions list below illustrates this.

Solutions

  • Try adding more hidden layers or units
  • Increase the number of training iterations or tune the learning rate
  • Experiment with the network architecture
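
A minimal sketch of the linear-on-non-linear example (assuming scikit-learn; the data and degrees are made up for illustration), showing how adding capacity removes the bias:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Clearly non-linear data: y = x^2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

# A plain linear model underfits (high bias) ...
linear = LinearRegression().fit(X, y)
print("linear R^2:", linear.score(X, y))

# ... while relaxing the constraint (here, polynomial features) fits much better.
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("quadratic R^2:", quadratic.score(X, y))
```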

Variance

The model is too free to adapt itself exactly to the training data, which means it ends up overfitting: some of the outlier data points probably won't be good predictors for another data set.

Solutions

  • Train on more data
  • Regularisation (see the sketch after this list)
  • Experiment with the network architecture
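
As a hedged illustration (assuming Keras; the layer sizes and rates here are arbitrary), two common ways to add regularisation and reduce variance are an L2 weight penalty and dropout:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # L2 penalty discourages large weights in the hidden layer.
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    # Dropout randomly switches off units during training.
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```
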
More …

Activation Functions

Just some notes for myself going through the Deep Learning specialisation on Coursera.

Tanh: Good for hidden layers, as an output with mean 0 makes learning easier for the next layer.

Sigmoid: Good for the output layer in binary classification, since its output between 0 and 1 maps onto a certainty between 0% and 100%.

ReLU: Both of the previous functions can slow down gradient descent when the slope of the derivative nears 0. To get around this, the rectified linear unit (ReLU) is popular; it is the most common activation function.

Leaky ReLU: A version that avoids the slope being zero for negative values of z. It has the same advantages as regular ReLU, but is not used as much in practice.
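
For reference, a minimal NumPy sketch of the four activation functions above:

```python
import numpy as np

def tanh(z):
    # Output in (-1, 1) with mean around 0.
    return np.tanh(z)

def sigmoid(z):
    # Output in (0, 1), usable as a probability.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Zero for negative z, identity for positive z.
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for z < 0 keeps the gradient from being exactly zero.
    return np.where(z > 0, z, alpha * z)
```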