In our last post on supervised learning, we looked at how an agent can learn a function f(x) = y that best fits your data. Let’s see how to do this in Python, starting with a simpler example: linear regression, where you have lots of data points and want to fit a linear function, i.e. y = ax + b. You want a function that looks something like this:
Single-Variable-Data: Training Your Data
We first import the necessary libraries and load the data, which you can download from Kaggle. The data describes the increase of CO2 in the atmosphere as a function of time. Before we have done anything with the data, the plot looks like this:
We now want to find a linear relationship in our data points, and we start by splitting our data into training data and test data.
```python
import numpy as np
import pandas as pd
from sklearn import linear_model
import sklearn.metrics as sm
import matplotlib.pyplot as plt
import datetime

# Load the CO2 measurements
df = pd.read_excel('c02.xlsx')
x = df[['Decimal Date']]
y = df['Carbon Dioxide (ppm)']

# Plot the raw data: date on the x-axis, CO2 on the y-axis
plt.scatter(x, y)
plt.xlabel('Date')
plt.ylabel('Carbon Dioxide (ppm)')
plt.show()

# Train and test split
from sklearn.model_selection import train_test_split

# Splitting x and y into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=3)
```
We split our data into 80% for training and 20% for test. In the last post we wrote about 80% for training data, 10% cross validation and 10% for test. For simplicity, we’re just splitting it 80/20.
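If you do want the 80/10/10 split mentioned above, one possible approach is to call train_test_split twice. The sketch below uses made-up data rather than the CO2 file, and the variable names (x_demo, x_val and so on) are only illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, one feature (not the CO2 data set).
rng = np.random.default_rng(0)
x_demo = rng.uniform(1960, 2020, size=(100, 1))
y_demo = 1.5 * x_demo[:, 0] - 2600 + rng.normal(0, 2, size=100)

# First cut: keep 80% for training, hold out the remaining 20%.
x_tr, x_hold, y_tr, y_hold = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=3)

# Second cut: split the held-out 20% in half,
# giving 10% for validation and 10% for the final test.
x_val, x_te, y_val, y_te = train_test_split(
    x_hold, y_hold, test_size=0.5, random_state=3)

print(len(x_tr), len(x_val), len(x_te))  # 80 10 10
```

The validation part would then be used for tuning the model, and the test part only for the final check.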
We create a linear regression object
```python
# Create the linear regression object
regressor = linear_model.LinearRegression()
```
And train the model using the training sets
```python
# Train the model using the training sets
regressor.fit(x_train, y_train)
```
We then use the trained model to predict the output for the test set
```python
# Predict the output
y_test_pred = regressor.predict(x_test)
```
After we’ve trained the model, we can compare our predicted values, i.e. the regression line, with the points from our real test data.
```python
plt.scatter(x_test, y_test, color='green')
plt.plot(x_test, y_test_pred, color='black', linewidth=4)
plt.show()
```
Now, lastly, let’s see how accurate our prediction is:
```python
# Compute performance metrics
print("Linear regressor performance:")
print("Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2))
print("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2))
print("Explained variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))
# For a regressor, score() returns the same R2 value
print("Score:", regressor.score(x_test, y_test))
```
[Inspiration from Artificial Intelligence with Python – Prateek Joshi]
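To get a feeling for what these metrics measure, it can help to compute a few of them by hand. The following is a small sketch with made-up numbers (not values from the CO2 data) that uses only NumPy:

```python
import numpy as np

# Small made-up example values (not taken from the CO2 data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true - y_pred

# Mean absolute error: average size of the errors.
mae = np.mean(np.abs(errors))

# Mean squared error: average of the squared errors (punishes large errors more).
mse = np.mean(errors ** 2)

# Median absolute error: the "typical" error, robust to outliers.
medae = np.median(np.abs(errors))

# R2: 1 minus (squared error of the model / squared error of always predicting the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("Median AE:", medae)
print("R2:", r2)
```

An R2 of 1 means a perfect fit, while 0 means the model is no better than always predicting the mean of y.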
We just saw an example of linear regression using just one input variable, meaning y = ax+b. But what about when your output variable is based on many inputs? From our last post on supervised learning we learned that we could predict a target label, y, through a feature vector of different parameters, x1, x2, x3, … xN. We see this as a matrix, such as:
So, to create a linear regression through a multi-variable input, we have the equation:
y = β0 + β1x1 + β2x2 + β3x3 + … + βNxN
The β values are the coefficients for the different input values that we use. β0 is called the intercept, which is a constant: the expected value of y when all the x values are 0.
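To make the equation a bit more concrete, here is a minimal sketch of how the β values can be found with ordinary least squares in plain NumPy. The data is synthetic and the "true" coefficients are chosen by us purely for illustration; this is one way to solve the problem, not necessarily what scikit-learn does internally:

```python
import numpy as np

# Synthetic data: 200 samples with 3 input variables (x1, x2, x3).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

# Assumed "true" model for the demo: y = 4 + 2*x1 - 1*x2 + 0.5*x3 (no noise).
true_beta0 = 4.0
true_betas = np.array([2.0, -1.0, 0.5])
y = true_beta0 + X @ true_betas

# Add a column of ones so that the intercept beta0 is estimated as well.
X1 = np.column_stack([np.ones(len(X)), X])

# Least squares solution: beta[0] is the intercept, beta[1:] are the coefficients.
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("Intercept beta0:", round(beta[0], 3))
print("Coefficients beta1..beta3:", np.round(beta[1:], 3))
```

Because the demo data contains no noise, the estimated values should match the ones we chose up to rounding.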
Let’s try to calculate house prices through another dataset from Kaggle (https://www.kaggle.com/harlfoxem/housesalesprediction). If you want, you can download the csv file and try this out for yourself.
Our target label (y) will be “price”, and the feature vector (the xn’s) will be the parameters in the table below:
We start by reading the data in the csv file:
```python
import numpy as np
import pandas as pd
from sklearn import linear_model
import sklearn.metrics as sm
import matplotlib.pyplot as plt
import datetime

df = pd.read_csv('kc_house_data.csv')
```
We set the feature vector and the target label:
```python
x = df[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view',
        'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'zipcode', 'lat', 'long']]
y = df['price']
```
We now split the data, x and y, into training and test sets
```python
from sklearn.model_selection import train_test_split

# Splitting x and y into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=2)
```
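test_size is the fraction of the data that is held out as test data (here 10%). train_test_split shuffles the data before splitting, so we don’t just pick the first 90% as training set and the last 10% as test set. random_state is the seed for that shuffle: with a fixed value we get the same split every time we run the code.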
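The effect of random_state can be checked with a tiny example. This sketch uses a made-up array of 20 numbers instead of the house data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up "data set": the numbers 0..19.
data = np.arange(20)

# Same random_state -> exactly the same shuffled split both times.
a_train, a_test = train_test_split(data, test_size=0.1, random_state=2)
b_train, b_test = train_test_split(data, test_size=0.1, random_state=2)
print(np.array_equal(a_test, b_test))  # True

# With shuffle=False nothing is randomized:
# the first 90% becomes the training set and the last 10% the test set.
c_train, c_test = train_test_split(data, test_size=0.1, shuffle=False)
print(c_test)  # [18 19]
```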
We continue with creating a linear regression model and train it using our training sets.
```python
# Create the linear regression object
regressor = linear_model.LinearRegression()

# Train the model using the training sets
regressor.fit(x_train, y_train)
```
So how accurate is our model? Let’s find the score of it
```python
print("Score: ")
print(regressor.score(x_test, y_test))
```
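The output is about 0.73. For a regression model, score returns the coefficient of determination (R²), so our model explains roughly 73% of the variation in the test prices. We continue by predicting the prices for the test set, so that we can compare them with the real prices.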
```python
y_test_pred = regressor.predict(x_test)
```
Another important measure of how good our prediction model is, is the Root Mean Square Error (RMSE) between our predicted prices and the real prices in the test set. The RMSE has the same unit as the target, here dollars.
```python
print("RMSE: ")
print(np.sqrt(sm.mean_squared_error(y_test, y_test_pred)))
```
which gives us an RMSE of 198541. We’ll come back to this number and compare it after we’ve optimized our prediction. To test how good our model is, we try to predict a house price and see how well our model performs. We type in the exact same values as rows 0 and 5 in the CSV file (see the table in the beginning), so that we can see how close we come to the real prices, which are $221900 and $1225000.
```python
print("predicting")
# Feature order:
# x = df[['bedrooms','bathrooms','sqft_living','sqft_lot','floors','waterfront','view',
#         'condition','grade','sqft_above','sqft_basement','yr_built','zipcode','lat','long']]
print(regressor.predict([[3, 1, 1180, 5650, 1, 0, 0, 3, 7, 1180, 0, 1955, 98178, 47.5112, -122.257]]))
print(regressor.predict([[4, 4.5, 5420, 101930, 1, 0, 0, 3, 11, 3890, 1530, 2001, 98053, 47.6561, -122.005]]))
```
The output is $228723 and $1295636 for rows 0 and 5. Not too far off, but still not close enough for us to be satisfied with our model. But what is it that we’ve actually learned? Remember this equation:
y = β0 + β1x1 + β2x2 + β3x3 + … + βNxN
Our model has calculated the values for the different βi where x1 is ‘bedrooms’, x2 is ‘bathrooms’ and so on. You can easily obtain the βi coefficients by
```python
# Print the intercept beta0
print("Intercept: ")
print(regressor.intercept_)

# Print the coefficients betai
print("Coefficient for bedrooms, bathrooms, etc: ")
print(regressor.coef_)
```
And we get the output
Coefficient for bedrooms, bathrooms, etc:
[-3.70902089e+04 4.20694517e+04 1.16672135e+02 -7.20347187e-02
5.04516544e+03 6.03889441e+05 5.37348568e+04 2.33734952e+04
9.80743511e+04 7.41957984e+01 4.24763368e+01 -2.69914816e+03
-6.04677130e+02 6.09100008e+05 -2.05009761e+05]
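These numbers are all the model is: a prediction is just the intercept plus each coefficient multiplied by its input value. The sketch below checks this on a small synthetic data set (three made-up features, not the house data), so that it can be run without the CSV file:

```python
import numpy as np
from sklearn import linear_model

# Synthetic data with three made-up features (not the house data).
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(50, 3))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.1, size=50)

model = linear_model.LinearRegression().fit(X, y)

# Take one sample and apply y = beta0 + beta1*x1 + beta2*x2 + beta3*x3 by hand.
sample = X[:1]
by_hand = model.intercept_ + sample @ model.coef_

# Compare with what predict() returns for the same sample.
by_model = model.predict(sample)

print(by_hand, by_model)
print(np.allclose(by_hand, by_model))
```

The same check should work for the house model by using regressor.intercept_ and regressor.coef_ with one row of the 15 features.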
But with a score of only 0.73, let’s try to improve our prediction model through something called gradient boosting regression.
Gradient boosting regression is not something that can be explained quickly and easily, so reader beware: this will be a simplification where we only look at the fundamentals, with a minimum of the math behind the algorithm. We’ll make a separate blog post about gradient boosting later, but if you can’t wait, we recommend the links listed at the end of the simplified explanation below for more information.
Simplified Explanation of Gradient Boosting
Before we jump straight to Gradient boosting, let’s look at ensembles and boosting.
An ensemble is a combination of simple individual models that together create a more powerful new model. Machine learning boosting is a method for creating ensembles of classifiers or regressors incrementally, by first fitting an individual model, such as a decision tree or linear regression, to the data. The second model is then made to predict the cases where the first model performed poorly. The boosting process is repeated many times, with the aim of improving the shortcomings of all the previous models.
As you see in the graphs above, the first model is a “dummy regressor”, which is just a constant, such as the mean of the output values. Then we calculate something called residuals, which are the differences between the desired values and the predicted values. With this information, we calculate a second model, which is a small decision tree that tries to model these residuals. This process is then repeated many times, where you calculate the latest residuals and train another model to try to fix the errors.
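The process described above can be sketched in a few lines. This is a simplified illustration on synthetic data, not the implementation scikit-learn uses; the learning rate, the number of rounds and the tree depth are values we picked for the demo:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve (made up for this demo).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1   # how much of each new tree we add (assumed value)
n_rounds = 50         # how many small trees we train (assumed value)

# Step 1: the "dummy regressor" is just the mean of the output values.
prediction = np.full_like(y, y.mean())
mse_start = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # Step 2: residuals = desired values minus current predictions.
    residuals = y - prediction

    # Step 3: a small decision tree tries to model these residuals.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    trees.append(tree)

    # Step 4: add a fraction of the tree's output to the running prediction.
    prediction += learning_rate * tree.predict(X)

mse_end = np.mean((y - prediction) ** 2)
print("MSE with only the mean:", mse_start)
print("MSE after boosting:    ", mse_end)
```

Each round removes a little more of the remaining error on the training data, which is why the second number should come out smaller than the first.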
Finally, gradient boosting is a type of machine learning boosting that relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. The boosted ensemble is optimized by minimizing a loss function. We start by making a set of predictions ŷ[i]. We then measure the “error” in our predictions, J(y, ŷ), through the squared error (which is the mean square error up to a constant factor):
J(y, ŷ) = ∑ (y[i] – ŷ[i])²
We adjust the prediction, ŷ, to reduce the error through:
ŷ[i] = ŷ[i] + α*f(i)
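where α is the learning rate, and f(i) is approximately the negative gradient of J with respect to the prediction ŷ[i], so that each step moves the prediction in the direction that reduces the error:
f(i) ≈ –∇J(y, ŷ)
For the squared error, this negative gradient is proportional to the residual y[i] – ŷ[i], which is why each new model is trained on the residuals. So, in our prediction model, each new model estimates the negative gradient of the loss, f(i), and gradient descent takes a sequence of such steps to reduce the error.
[https://www.displayr.com/gradient-boosting-the-coolest-kid-on-the-machine-learning-block/]
[http://www.cse.chalmers.se/~richajo/dit865/files/gb_explainer.pdf]
[https://en.wikipedia.org/wiki/Gradient_boosting]
[https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d]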
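The update rule can be tried out on a handful of numbers. In this sketch we assume that f(i) is taken as the negative gradient of the squared error with respect to each prediction, so that adding α·f(i) lowers the error; the values of y, ŷ and α are made up:

```python
import numpy as np

# Made-up true values and current predictions.
y = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

def loss(y, y_hat):
    """Squared error J(y, y_hat): sum of squared differences."""
    return np.sum((y - y_hat) ** 2)

# Gradient of J with respect to each prediction: dJ/dy_hat[i] = -2 * (y[i] - y_hat[i]).
grad = -2.0 * (y - y_hat)

# f(i) is the negative gradient, which is proportional to the residual y[i] - y_hat[i].
f = -grad

alpha = 0.1  # learning rate (assumed value)
y_hat_new = y_hat + alpha * f

print("Residuals:      ", y - y_hat)
print("Loss before step:", loss(y, y_hat))
print("Loss after step: ", loss(y, y_hat_new))
```

One step moves every prediction a little towards its true value, so the loss goes down. In gradient boosting the step is not applied directly like this; instead a small tree is trained to approximate f(i), so that the same correction can also be applied to new data.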
Gradient Boosting Regression in Python
So, we try again, but this time using a gradient boosting model.
```python
from sklearn import ensemble

# 'squared_error' is the least-squares loss; older scikit-learn versions
# called it 'ls', a name that recent releases no longer accept.
clf = ensemble.GradientBoostingRegressor(n_estimators=500, max_depth=5, min_samples_split=2,
                                         learning_rate=0.1, loss='squared_error')
clf.fit(x_train, y_train)
y_test_pred2 = clf.predict(x_test)
```
In our gradient boosting model we used 500 trees, each with a maximum depth of 5. We then print the new score of our prediction:
```python
print("New Score: ")
print(clf.score(x_test, y_test))

print("RMSE - new:")
print(np.sqrt(sm.mean_squared_error(y_test, y_test_pred2)))

print("Predicting house value: ")
print(clf.predict([[3, 1, 1180, 5650, 1, 0, 0, 3, 7, 1180, 0, 1955, 98178, 47.5112, -122.257]]))
print(clf.predict([[4, 4.5, 5420, 101930, 1, 0, 0, 3, 11, 3890, 1530, 2001, 98053, 47.6561, -122.005]]))
```
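If you are curious how the number of trees affects the result, GradientBoostingRegressor has a staged_predict method that returns the predictions after each added tree. The sketch below uses synthetic data and smaller settings (200 trees, depth 3) chosen only to keep the demo fast; the numbers it prints say nothing about the house data:

```python
import numpy as np
import sklearn.metrics as sm
from sklearn import ensemble
from sklearn.model_selection import train_test_split

# Synthetic regression problem with three made-up features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = X[:, 0] * X[:, 1] + 5 * np.sin(X[:, 2]) + rng.normal(0, 0.5, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=2)

gbr = ensemble.GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                         learning_rate=0.1, random_state=0)
gbr.fit(X_tr, y_tr)

# staged_predict yields the test predictions after 1 tree, 2 trees, ..., 200 trees.
rmse_per_stage = [np.sqrt(sm.mean_squared_error(y_te, pred))
                  for pred in gbr.staged_predict(X_te)]

best_n = int(np.argmin(rmse_per_stage)) + 1
print("RMSE after 1 tree:   ", rmse_per_stage[0])
print("RMSE after 200 trees:", rmse_per_stage[-1])
print("Lowest test RMSE at", best_n, "trees")
```

Plotting rmse_per_stage against the number of trees is a simple way to see whether adding more trees still helps.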
Which gives us a score of 0.916 and a new RMSE of 110994, which is far lower than the earlier value of 198541. And finally, we try to predict the prices of the houses in rows 0 and 5, and see that our predictions give us prices of:
- $228,723 and $1,295,636
Compared to the real prices:
- $221,900 and $1,225,000
This is a huge improvement, where we went from a score of 0.73 to 0.916.
So there you have it, you can now pack your bags immediately and move to King County to get the best deal on a house. Fun fact: according to the internet, Seattle (which is in King County) is ranked as the 4th most hipster city in the world. So supervised learning might get you one of these neighbors.