Regression works by establishing a relationship between variables in the data that represent characteristics (known as the features) of the thing being observed, and the variable we’re trying to predict (known as the label).
To train the model, we start with a data sample containing the features, as well as known values for the label. We'll then split this data sample into two subsets: a training dataset to which we'll fit the model, and a validation (or test) dataset that we'll hold back so we can evaluate the trained model by comparing its predictions with the known label values.
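For illustration only, a tiny made-up sample with two feature columns and a numeric label, split with the same train_test_split function we'll use on the real data later, might look like this:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array([[3, 120], [2, 80], [4, 200], [3, 150], [5, 300], [1, 60]])   # feature values (made up)
y = np.array([250, 180, 420, 310, 600, 120])                               # known label values (made up)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33)    # hold back a third for validation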
The use of historic data with known label values to train a model makes regression an example of supervised machine learning. There are lots of machine learning algorithms for supervised learning, and they can be broadly divided into two types:
Regression algorithms: algorithms that predict a y value that is a numeric value, such as the price of a house or the number of sales transactions.
Classification algorithms: algorithms that predict to which category, or class, an observation belongs. The y value in a classification model is a vector of probability values between 0 and 1, one for each class, indicating the probability of the observation belonging to each class.
When we compare a predicted value (ŷ) with an observed value (y), we refer to the difference between them as the residuals. We can summarize the residuals for all of the validation data predictions to calculate the overall loss in the model as a measure of its predictive performance. There are multiple ways to calculate this loss; later in this section we'll use the mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination (R²).
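To make those metrics concrete, here's a minimal sketch, using made-up numbers, of how residuals, MSE, RMSE, and R² are computed from observed and predicted values:
import numpy as np
y = np.array([100, 150, 200, 250])        # observed label values (made up)
y_hat = np.array([110, 140, 190, 270])    # predicted values (made up)
residuals = y - y_hat
mse = np.mean(residuals ** 2)             # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error
r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)   # coefficient of determination
print(mse, rmse, r2)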
Load the dataset, clean up the data, identify the features and labels, and plot different graphs to identify relationships:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/daily-bike-share.csv
bike_data = pd.read_csv('daily-bike-share.csv')
# add a new column named day to the dataframe
bike_data['day'] = pd.DatetimeIndex(bike_data['dteday']).day
numeric_features = ['temp', 'atemp', 'hum', 'windspeed']
bike_data[numeric_features + ['rentals']].describe()
# Plot a histogram for each numeric feature
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = bike_data[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
    plt.show()
# plot a bar plot for each categorical feature count
categorical_features = ['season','mnth','holiday','weekday','workingday','weathersit', 'day']
for col in categorical_features:
    counts = bike_data[col].value_counts().sort_index()
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    counts.plot.bar(ax = ax, color='steelblue')
    ax.set_title(col + ' counts')
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    plt.show()
# create scatter plots that show the intersection of feature and label values.
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = bike_data[col]
    label = bike_data['rentals']
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('Bike Rentals')
    ax.set_title('rentals vs ' + col + ' - correlation: ' + str(correlation))
    plt.show()
# plot a boxplot for the label by each categorical feature
for col in categorical_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    bike_data.boxplot(column = 'rentals', by = col, ax = ax)
    ax.set_title('Label by ' + col)
    ax.set_ylabel("Bike Rentals")
    plt.show()
Now that we’ve explored the data, it’s time to use it to train a regression model that uses the features we’ve identified as potentially predictive to predict the rentals label. The first thing we need to do is to separate the features we want to use to train the model from the label we want it to predict. To randomly split the data, we’ll use the train_test_split
function in the scikit-learn library. This library is one of the most widely used machine learning packages for Python.
from sklearn.model_selection import train_test_split
# Separate features and labels
X, y = bike_data[['season','mnth', 'holiday','weekday','workingday','weathersit','temp', 'atemp', 'hum', 'windspeed']].values, bike_data['rentals'].values
print('Features:',X[:10], '\nLabels:', y[:10], sep='\n')
# Split the data 70%-30% into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
Now we have the following four datasets: X_train and y_train (the feature values and labels we'll use to train the model), and X_test and y_test (the feature values and labels we'll hold back to validate the trained model).
In Scikit-Learn, training algorithms are encapsulated in estimators, and in this case we’ll use the LinearRegression estimator to train a linear regression model.
# Train the model
from sklearn.linear_model import LinearRegression
# Fit a linear regression model on the training set
model = LinearRegression().fit(X_train, y_train)
print (model)
We can quantify the residuals by calculating a number of commonly used evaluation metrics.
from sklearn.metrics import mean_squared_error, r2_score
# Get predictions for the test data and calculate the evaluation metrics
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
Let’s try training our regression model by using a Lasso algorithm. We can do this by just changing the estimator in the training code.
from sklearn.linear_model import Lasso
# Fit a lasso model on the training set
model = Lasso().fit(X_train, y_train)
print (model, "\n")
Let’s train a Decision Tree regression model using the bike rental data.
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text
# Train the model
model = DecisionTreeRegressor().fit(X_train, y_train)
print (model, "\n")
# Visualize the model tree
tree = export_text(model)
print(tree)
Let’s try a Random Forest model, which applies an averaging function to multiple Decision Tree models for a better overall model.
from sklearn.ensemble import RandomForestRegressor
# Train the model
model = RandomForestRegressor().fit(X_train, y_train)
print (model, "\n")
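As an aside, the averaging idea behind a random forest can be sketched by hand (this is a simplification; scikit-learn's implementation also subsamples features): train several independent trees on bootstrap samples of the training data and average their predictions:
from sklearn.tree import DecisionTreeRegressor
import numpy as np
rng = np.random.default_rng(0)
tree_predictions = []
for i in range(10):
    sample = rng.integers(0, len(X_train), len(X_train))        # bootstrap sample of row indices
    tree = DecisionTreeRegressor().fit(X_train[sample], y_train[sample])
    tree_predictions.append(tree.predict(X_test))
averaged = np.mean(tree_predictions, axis=0)                     # ensemble prediction is the average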
For good measure, let’s also try a boosting ensemble algorithm. We’ll use a Gradient Boosting estimator, which like a Random Forest algorithm builds multiple trees, but instead of building them all independently and taking the average result, each tree is built on the outputs of the previous one in an attempt to incrementally reduce the loss (error) in the model.
# Train the model
from sklearn.ensemble import GradientBoostingRegressor
# Fit a Gradient Boosting model on the training set
model = GradientBoostingRegressor().fit(X_train, y_train)
print (model, "\n")
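To illustrate the boosting idea described above (a rough sketch, not how the scikit-learn estimator is implemented internally): with squared-error loss, each shallow tree is fit to the residuals of the predictions made so far, and the predictions are then nudged by a fraction of what that tree learned:
from sklearn.tree import DecisionTreeRegressor
import numpy as np
learning_rate = 0.5
prediction = np.full(y_train.shape, y_train.mean())              # start from a constant prediction
for i in range(10):
    residuals = y_train - prediction                              # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X_train, residuals)
    prediction += learning_rate * tree.predict(X_train)           # each new tree corrects the previous ones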
For any of the previous examples, we can evaluate the model using the test data and plot the predicted values against the actual values.
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
Hyperparameters are values that change the way the model is fit during training. The learning rate, for example, is a hyperparameter that sets how much a model is adjusted during each training cycle. A high learning rate means a model can be trained faster, but if it's too high, the adjustments can be so large that the model is never finely tuned and its predictions are not optimal. In machine learning, the term parameters refers to values that can be determined from the data; values that you specify to affect the behavior of a training algorithm are more correctly referred to as hyperparameters.
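For example, assuming the X_train/X_test split created earlier, a quick (purely illustrative) way to see the effect of a single hyperparameter is to train the same estimator with a few different learning_rate values and compare a metric such as R²:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
for lr in (0.05, 0.1, 0.5, 1.0):
    m = GradientBoostingRegressor(learning_rate=lr, random_state=0).fit(X_train, y_train)
    print('learning_rate =', lr, ' R2 =', r2_score(y_test, m.predict(X_test)))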
Preprocessing refers to changes you make to your data before it is passed to the model (cleaning your dataset). It can also include changing the format of your data, so it’s easier for the model to use.
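As a small illustration of the two kinds of preprocessing we'll apply later (scaling numeric features and one-hot encoding categorical features), here's what they do to a few sample values:
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
temps = np.array([[0.2], [0.3], [0.5], [0.9]])                 # a numeric column (made-up values)
print(StandardScaler().fit_transform(temps).ravel())            # rescaled to mean 0, unit variance
seasons = np.array([[1], [2], [3], [1]])                        # a categorical column (made-up values)
print(OneHotEncoder().fit_transform(seasons).toarray())         # one binary column per category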
Scikit-Learn provides a way to tune hyperparameters by trying multiple combinations and finding the best result for a given performance metric. Let's use a grid search approach to try combinations from a grid of possible values for the learning_rate and n_estimators hyperparameters of the GradientBoostingRegressor estimator.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score
# Use a Gradient Boosting algorithm
alg = GradientBoostingRegressor()
# Try these hyperparameter values
params = {
    'learning_rate': [0.1, 0.5, 1.0],
    'n_estimators' : [50, 100, 150]
}
# Find the best hyperparameter combination to optimize the R2 metric
score = make_scorer(r2_score)
gridsearch = GridSearchCV(alg, params, scoring=score, cv=3, return_train_score=True)
gridsearch.fit(X_train, y_train)
print("Best parameter combination:", gridsearch.best_params_, "\n")
# Get the best model
model=gridsearch.best_estimator_
print(model, "\n")
We trained a model with data that was loaded straight from a source file, with only moderately successful results. In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it, for example by scaling the numeric features so they're on a similar scale, and by encoding the categorical features as numeric values.
To apply these preprocessing transformations to the bike rental data, we'll make use of a Scikit-Learn feature named pipelines. These enable us to define a set of preprocessing steps that end with an algorithm. You can then fit the entire pipeline to the data, so that the model encapsulates all of the preprocessing steps as well as the regression algorithm. This is useful, because when we want to use the model to predict values from new data, we need to apply the same transformations (based on the same statistical distributions and category encodings used with the training data).
# Train the model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
import numpy as np
# Define preprocessing for numeric columns (scale them)
numeric_features = [6,7,8,9]   # column indices of temp, atemp, hum and windspeed in X
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])
# Define preprocessing for categorical features (encode them)
categorical_features = [0,1,2,3,4,5]   # column indices of season, mnth, holiday, weekday, workingday and weathersit
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
# Create a preprocessing and training pipeline with a Gradient Boosting estimator
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', GradientBoostingRegressor())])
# Or use a different estimator in the pipeline (this definition replaces the one above)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', RandomForestRegressor())])
# fit the pipeline to train a random forest model on the training set
model = pipeline.fit(X_train, y_train)
print (model)
# Get predictions from the pipeline and display metrics
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions - Preprocessed')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
We can save the model and load it whenever we need it, and use it to predict labels for new data. This is often called scoring or inferencing. The model’s predict
method accepts an array of observations, so you can use it to generate multiple predictions as a batch.
import joblib
# Save the model as a pickle file
filename = './bike-share.pkl'
joblib.dump(model, filename)
# Load the model from the file
loaded_model = joblib.load(filename)
# Create a numpy array containing a new observation (for example tomorrow's seasonal and weather forecast information)
X_new = np.array([[1,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869]]).astype('float64')
print ('New sample: {}'.format(list(X_new[0])))
# Use the model to predict tomorrow's rentals
result = loaded_model.predict(X_new)
print('Prediction: {:.0f} rentals'.format(np.round(result[0])))
# An array of features based on five-day weather forecast
X_new = np.array([[0,1,1,0,0,1,0.344167,0.363625,0.805833,0.160446],
                  [0,1,0,1,0,1,0.363478,0.353739,0.696087,0.248539],
                  [0,1,0,2,0,1,0.196364,0.189405,0.437273,0.248309],
                  [0,1,0,3,0,1,0.2,0.212122,0.590435,0.160296],
                  [0,1,0,4,0,1,0.226957,0.22927,0.436957,0.1869]])
# Use the model to predict rentals
results = loaded_model.predict(X_new)
print('5-day rental predictions:')
for prediction in results:
    print(np.round(prediction))
You learned how regression can be used to create a machine learning model that predicts numeric values. You then used the scikit-learn framework in Python to train and evaluate a regression model.
While scikit-learn is a popular framework for writing code to train regression models, you can also create machine learning solutions for regression using the graphical tools in Microsoft Azure Machine Learning. You can learn more about no-code development of regression models using Azure Machine Learning in the Create a Regression Model with Azure Machine Learning designer module.
Source: Microsoft Learn