Linear Regression with Gradient Descent in Clojure
This blog post describes how I tackled the first exercise assignment in Andrew Ng’s machine learning course on Coursera. We will be going through univariate linear regression (linear regression with a single feature), and using gradient descent to find the parameter values that give the most accurate estimates.
My solution is hosted here. You may wish to check out the code, start a REPL and evaluate the various expressions as we go through them. This is one of the reasons why I’m using Clojure to learn machine learning: the ability to interact with the code to understand it is second to none.
Our data looks like the following;
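Only the first two rows are shown below; these are the two examples referenced later in the post.

```
6.1101,17.592
5.5277,9.1302
```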
The first column is the population of some city, and the second column is the profit a food truck can make in that city. The problem we are trying to solve is: given an arbitrary city’s population, predict how much profit can be made there.
The first thing we need to do is read the data from the file.
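Something along these lines will do (a minimal sketch; ex1data1.txt is the data file from the course exercise, and the parsing details are kept deliberately simple here):

```clojure
(require '[clojure.string :as str])

;; Read each comma-separated line and parse both columns into doubles.
(def data
  (->> (slurp "ex1data1.txt")
       str/split-lines
       (map #(str/split % #","))
       (mapv (fn [[population profit]]
               [(Double/parseDouble population)
                (Double/parseDouble profit)]))))
```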
Let’s start defining some values;
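A sketch of the definitions we’ll need later; the learning rate and iteration count are the ones used in the course exercise.

```clojure
(def alpha 0.01)      ;; learning rate
(def iterations 1500) ;; number of gradient descent steps

;; Split the parsed data into the input values (population)
;; and the output values (profit).
(def xs (mapv first data))
(def ys (mapv second data))
```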
Our final goal is to find the best values for theta, the parameters that give the most accurate predictions from our algorithm.
We need to provide our algorithm with a matrix, X, instead of the vector xs that we currently have. We will build the X matrix as follows;
We’ll be making use of the core.matrix library for our matrix operations.
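A sketch of what that looks like; the helper name add-ones-column is just illustrative.

```clojure
(require '[clojure.core.matrix :as matrix])

;; Turn each input value x into the row [1 x], giving us a matrix
;; with a leading column of ones.
(defn add-ones-column [xs]
  (matrix/matrix (mapv (fn [x] [1.0 x]) xs)))

(def X (add-ones-column xs))
;; => [[1.0 6.1101]
;;     [1.0 5.5277]
;;     ...]
```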
The above code snippet takes our input vector, xs (the first two values of which are 6.1101 and 5.5277), and converts it into a matrix whose first column is set to all ones. The ones column is needed for finding our value of theta: it lets us multiply the matrices X and theta. Without the ones column, the matrix sizes mismatch.
Implementing gradient descent
Our main gradient descent algorithm looks like the following;
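Something along these lines (the parameter name remaining-iterations is just illustrative, and cost-derivative is defined further below):

```clojure
(declare cost-derivative)

;; Keep improving theta until we run out of iterations,
;; then return the final theta.
(defn gradient-descent [X y theta remaining-iterations]
  (if (zero? remaining-iterations)
    theta
    (recur X
           y
           (matrix/sub theta (cost-derivative X y theta))
           (dec remaining-iterations))))
```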
We recursively call gradient descent until the remaining iterations reach zero. During each recursive call we update theta, so that after each iteration it becomes more accurate and gives better results when it is used later on. We update theta by subtracting the derivative of the cost function from the theta vector, by doing (matrix/sub theta (cost-derivative X y theta)).
To calculate the cost derivative, we define the following function;
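A sketch of it, assuming the learning rate alpha is already folded into the derivative so that the update above stays a plain subtraction:

```clojure
(declare predict)

;; (alpha / m) * X^T (X * theta - y): the gradient of the cost
;; function, scaled by the learning rate.
(defn cost-derivative [X y theta]
  (let [m      (matrix/row-count X)
        errors (matrix/sub (predict X theta) y)]
    (matrix/scale (matrix/mmul (matrix/transpose X) errors)
                  (/ alpha m))))
```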
The first important step here is to see how accurate our current guess of theta is when we apply it to the real input features, X, and compare the result against the real output vector, y. Our prediction function, which takes in our input features and theta, looks like the following;
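It is essentially a single matrix multiplication:

```clojure
;; Multiply X by theta to get a predicted profit for every row in X.
(defn predict [X theta]
  (matrix/mmul X theta))
```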
This is why we needed to have the ones column; otherwise this call would fail, as the number of columns in X wouldn’t match the number of rows in theta.
So, taking just the first two rows of our data as an example, suppose we have X defined as;
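```clojure
[[1 6.1101]
 [1 5.5277]]
```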
And theta defined as [0 0]. Then multiplying X by theta gives us a new vector with the value [0 0]. If we compare [0 0] with the actual values of y, [17.592 9.1302], we can see our guess of theta is not producing the right predictions for y. We then subtract the value of the cost-derivative function from the current value of theta, in this case [0 0]. The next time we multiply X by theta, the result will be a more accurate prediction of the output.
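Putting it all together, training and then using theta looks something like this (a sketch; the starting theta of [0 0] and the iteration count come from the course exercise, and the population value 7.0 is just an example input):

```clojure
;; Train: start from theta = [0 0] and iterate.
(def trained-theta
  (gradient-descent X (matrix/matrix ys) [0 0] iterations))

;; Predict the profit for a city given its population.
(predict [1 7.0] trained-theta)
```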