Wednesday, August 2, 2023

Machine Learning - Build A GradientBoostingRegressor Model

This post is going to demonstrate the entire process of building a machine learning model. Model deployment will not be covered in this blog.


About Model Selection

Model selection is a critical part of building a machine learning model. In practice, the candidate models need to be tested with sufficient data and properly evaluated. In this post, the Gradient Boosting Regressor is selected based on a comparison on paper. As an ensemble model, it builds the specified number of decision trees one after another, each fitted to the errors left by the previous ones, and combines their outputs into the final prediction. Decision trees also expose a small set of hyperparameters that are well suited to tuning with grid search. This model can probably achieve the expected result efficiently and effectively.
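
To illustrate how a boosted ensemble arrives at its prediction, here is a minimal sketch. It uses synthetic stand-in data (an assumption, not the mountain dataset) and the staged_predict method of GradientBoostingRegressor, which yields the ensemble's prediction after each added tree.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (an assumption; not the mountain dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 3))
y_demo = 10.0 * X_demo[:, 0] - 5.0 * X_demo[:, 2] + rng.normal(0.0, 0.5, size=200)

gbr_demo = GradientBoostingRegressor(n_estimators=50, random_state=123)
gbr_demo.fit(X_demo, y_demo)

# staged_predict yields the ensemble's prediction after each added tree.
staged = list(gbr_demo.staged_predict(X_demo))
stage_rmse = [float(np.sqrt(np.mean((y_demo - p) ** 2))) for p in staged]

print("trees fitted:", len(staged))
print("training RMSE after 1 tree:", round(stage_rmse[0], 3))
print("training RMSE after all trees:", round(stage_rmse[-1], 3))
```

The last stage is the same as what predict() returns, so the final output is the accumulated contribution of all the trees rather than the pick of a single best tree.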

This is a supervised learning exercise, since every sample comes with a known target temperature.

 

Data Preprocessing

The original temperature data was scraped from a mountain hiking website; the scraping process is not covered in this post. The temperature data, along with the coordinates of the mountains, is loaded into a predefined table in the database. We could say the data engineering part is almost done at this point.


Data Samples

Encoding is not required, as the model does not use any categorical features from the dataset. Other data transformations and conversions are out of scope as well. Considering that, once deployed, the model will accept coordinates and an altitude entered by the user to make a temperature prediction, we do not standardize the features in this exercise. As a reference, I tried standardized features and found that the model produces the same output as with the non-standardized ones. This is expected, because decision trees split on feature thresholds and are not affected by rescaling a feature.
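
The observation about standardization can be checked with a small sketch. The data below is synthetic (an assumption: made-up latitude, longitude and altitude ranges and a made-up temperature formula), and the variable names are only for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (assumed ranges, rounded like typical coordinates).
rng = np.random.default_rng(1)
n = 300
lat = np.round(rng.uniform(30.0, 45.0, n), 3)
lon = np.round(rng.uniform(130.0, 145.0, n), 3)
alt = np.round(rng.uniform(200.0, 3500.0, n), 0)
X_raw = np.column_stack([lat, lon, alt])
# Made-up target: colder at higher altitude and latitude, plus noise.
y_syn = 25.0 - 0.006 * alt - 0.8 * (lat - 30.0) + rng.normal(0.0, 1.0, n)

# Standardize the features to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X_raw)

# Same model settings, one fit on raw features and one on standardized ones.
model_raw = GradientBoostingRegressor(random_state=123).fit(X_raw, y_syn)
model_std = GradientBoostingRegressor(random_state=123).fit(X_std, y_syn)

pred_raw = model_raw.predict(X_raw)
pred_std = model_std.predict(X_std)
max_diff = float(np.max(np.abs(pred_raw - pred_std)))
print("largest difference between the two sets of predictions:", max_diff)
```

Standardization keeps the order of the values within each feature, so the trees end up with the same partitions and the two models give matching predictions.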

Needless to say, a few cleansing tasks must be done before the data is fed to the machine learning model, mainly missing data handling and column extraction. After examining the data, we found nulls in the temperature columns, which cannot simply be replaced with zero or a mean value. The rule is to exclude those rows from the dataset. Additionally, the dataset contains some columns that will not be used by the model, so they are removed as well. The features needed are Latitude, Longitude, and Altitude, and the target is Avg (average temperature). Avg is chosen instead of Max or Min because it reduces the effect of an extremely high or low temperature caused by particular weather at some point.

    #
    # drop rows that contain null data
    df = dforg.dropna()
    # sanity check: no nulls should remain, so this prints an empty DataFrame
    print(df[df["Avg"].isna()])
    #
    # extract the features and the target
    X = df[["Latitude", "Longitude", "Altitude"]]
    y = df["Avg"]
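
Note that dropna() without arguments drops a row if any column is null, including columns the model does not use. As a possible alternative, the sketch below restricts the check to the needed columns with the subset parameter. The rows are hypothetical toy values, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows using the column names from this post.
toy = pd.DataFrame({
    "Latitude": [35.36, 36.29, 35.67, 36.10],
    "Longitude": [138.73, 137.65, 138.24, 137.50],
    "Altitude": [3776, 3190, 3193, 2500],
    "Max": [5.0, np.nan, 6.5, 9.0],
    "Avg": [1.0, 2.5, np.nan, 6.0],
    "Min": [-3.0, -1.0, -0.5, np.nan],
})

needed = ["Latitude", "Longitude", "Altitude", "Avg"]

# Drops every row that has a null in any column.
all_cols = toy.dropna()
# Drops only the rows that have a null in a column the model needs.
only_needed = toy.dropna(subset=needed)

print("rows kept by dropna():", len(all_cols))
print("rows kept by dropna(subset=needed):", len(only_needed))
```

Which variant is appropriate depends on the data. The subset form keeps more rows when the nulls sit in columns such as Max or Min.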

 

Dataset Preparation

To split the data, we simply use train_test_split from sklearn.model_selection, specifying the proportion of test data (30% here) and a fixed random state so that the split is reproducible.

    #
    # divide the dataset into training dataset and test dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123) 


Train the Model

After the data is prepared, we create an instance of GradientBoostingRegressor and then call its fit() method to train the model.

    #
    # create the model
    gbrd = GradientBoostingRegressor(random_state=123)
    #
    # train the model
    gbrd.fit(X_train, y_train)

 

Make a Prediction

We pass the test dataset to the fitted model and get the estimated temperatures in return. The predictions will be evaluated and visualized in later steps.

    #
    # make a prediction
    y_pred = gbrd.predict(X_test)


Model Evaluation

It's time to see how well the model performs. The evaluation uses two metrics: the score returned by the model and the root mean squared error (RMSE). The score computed by the model is in fact the coefficient of determination (R2). In general, the higher the score, the better the model, but we also need to consider whether the model is over-fitting.

    #
    # evaluate the model
    scored = gbrd.score(X_test, y_test)
    #
    # save the predicted and actual temperatures
    tmp_pred = y_pred.reshape(np.size(y_pred), 1)
    tmp_test = y_test.values.reshape(np.size(y_test), 1)
    tempd = pd.DataFrame(np.hstack([tmp_test, tmp_pred]))
    rmsed_test = np.sqrt(mean_squared_error(y_test, y_pred))
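
To make the two metrics concrete, the sketch below computes R2 and RMSE by hand and compares them with the scikit-learn helpers. The two small arrays are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([1.0, 3.5, 6.0, 8.5, 12.0])
y_hat = np.array([1.5, 3.0, 6.5, 8.0, 11.0])

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE = square root of the mean squared error
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print("R2   manual:", r2_manual, " sklearn:", r2_score(y_true, y_hat))
print("RMSE manual:", rmse_manual, " sklearn:", np.sqrt(mean_squared_error(y_true, y_hat)))
```

R2 is unitless, while RMSE is expressed in the unit of the target (degrees of temperature here), so the two are best read together.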

  

Hyperparameter Optimization

A combination of grid search and cross-validation is employed to search for the best parameters. The result may vary according to the parameter grid we pass to GridSearchCV, so the optimization is an iterative process that continues until we find a practical set of values. In this example, we first use the default parameters for the estimator, then try the following parameter grid. Note that the dataset has only three features, so max_features values above 3 cannot make more features available to a split.

    parameters = {
        'n_estimators' : [3, 5, 10, 30, 50, 100],
        'max_features' : [1, 3, 5, 10],
        'random_state' : [123],
        'min_samples_split' : [3, 5, 10, 30, 50],
        'max_depth' : [3, 5, 10, 30, 50]
    }

As the graph of the results before and after optimization shows, the score goes up a little and the predicted values are more closely correlated with the actual ones. During fitting, the random state is fixed, as shown in the code. An unfixed random state was also tested, but the fixed one turned out to perform better.

best params: {'max_depth': 3, 'max_features': 3, 'min_samples_split': 3, 'n_estimators': 30, 'random_state': 123}
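
For readers who want to try the search on their own, here is a reduced sketch of the same idea. The data is synthetic and the grid is deliberately small so that it runs quickly. Both are assumptions for illustration, not the settings used to produce the result above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the mountain dataset).
rng = np.random.default_rng(2)
X_gs = rng.uniform(0.0, 1.0, size=(200, 3))
y_gs = 10.0 * X_gs[:, 0] - 5.0 * X_gs[:, 2] + rng.normal(0.0, 0.5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_gs, y_gs, test_size=0.3, random_state=123)

# A deliberately small grid: 2 * 2 * 2 * 1 = 8 candidates.
small_grid = {
    "n_estimators": [10, 30],
    "max_depth": [2, 3],
    "max_features": [1, 3],
    "random_state": [123],
}

# Each candidate is scored with 5-fold cross-validation on the training data.
search = GridSearchCV(GradientBoostingRegressor(), param_grid=small_grid, cv=5)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("mean CV R2 of the best candidate:", search.best_score_)
print("R2 on the held-out test data:", search.score(X_te, y_te))
```

With the default refit=True, GridSearchCV retrains the best candidate on the whole training set, which is why the search object can be used for predict() and score() directly.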

  

Visualization

A BI tool can be used for visualization. In data analysis, the estimated results can be fed back into the BI tool in many ways: for example, the results can be written to a database table that the tool reads, or the BI tool can integrate with the Python program through an engine. Here we use matplotlib to visualize the results. The embedded code transforms the data into a graph easily and cost-effectively.



Predicted vs Actual

    #
    # visualize the results
    plts.use('ggplot')
    fig, ax = plt.subplots()
    #
    # default parameters
    ax.scatter(tempd.iloc[:, 0], tempd.iloc[:, 1], color='darkblue', label='default')
    #
    # optimized parameters
    ax.scatter(tempb.iloc[:, 0], tempb.iloc[:, 1], marker='x', color='crimson', label='optimized')
    #
    # reference
    xtmp = np.array([np.min(tempd.iloc[:, 1]), np.max(tempd.iloc[:, 1])])
    ytmp = xtmp.copy()
    ax.plot(xtmp, ytmp, label='reference', color='gray')
    #
    # show scores and RMSE values
    strd = 'score: ' + str(scored) + '   rmse: ' + str(rmsed_test)
    strb = 'score: ' + str(scoreb) + '   rmse: ' + str(rmseb_test)
    ax.text(xtmp.min(), ytmp.max(), strd, color='darkblue')
    ax.text(xtmp.min(), ytmp.max()-1,strb, color='crimson')
    #
    # graphical setting
    ax.legend(loc='lower right')
    ax.set_xlabel('actual temperature')
    ax.set_ylabel('predicted temperature')
    fig.suptitle('mountain temperature prediction')

    plt.show()

 

Save the Model

The trained model with the optimized hyperparameters (here the fitted GridSearchCV object, which wraps the best estimator) is saved as a binary file. The pickle library is well suited to this.

    with open('MountTempModel.pkl', mode='wb') as f:
        pickle.dump(gbrb, f)

To deploy the model, we can call pickle.load to restore it. We'll discuss that in another post.
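
As a quick check of the save and restore round trip, the sketch below pickles a small model to a temporary file and loads it back. The model, the data and the file name DemoModel.pkl are stand-ins chosen for illustration, not the ones from this post.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data and a small model (assumptions for illustration).
rng = np.random.default_rng(3)
X_pk = rng.uniform(0.0, 1.0, size=(100, 3))
y_pk = 10.0 * X_pk[:, 0] + rng.normal(0.0, 0.5, size=100)
model = GradientBoostingRegressor(n_estimators=20, random_state=123).fit(X_pk, y_pk)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "DemoModel.pkl")  # hypothetical file name
    # save the fitted model
    with open(path, mode="wb") as f:
        pickle.dump(model, f)
    # restore it from the file
    with open(path, mode="rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.array_equal(model.predict(X_pk), restored.predict(X_pk))
print("restored model reproduces the predictions:", same)
```

Two things to keep in mind: only unpickle files from a source you trust, and load the model with the same scikit-learn version that was used to save it.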

 

A Note

If we only look at the scores, the model seems to perform well for this particular case, and the optimization finds a better estimator, as we expected. Later, I used the model to predict temperatures for the training data and calculated the RMSE as well. That RMSE is far smaller than the RMSE on the test data, so we can conclude that the model is actually over-fitting.
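
The check described above can be sketched as follows. The data is synthetic and noisy (an assumption, not the mountain dataset), so the numbers only illustrate the pattern, not the actual result.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic noisy data (an assumption; not the mountain dataset).
rng = np.random.default_rng(4)
X_of = rng.uniform(0.0, 1.0, size=(300, 3))
y_of = 10.0 * X_of[:, 0] - 5.0 * X_of[:, 2] + rng.normal(0.0, 1.0, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_of, y_of, test_size=0.3, random_state=123)

model_of = GradientBoostingRegressor(random_state=123).fit(X_tr, y_tr)

# RMSE on the data the model was trained on vs. data it has never seen.
rmse_train = float(np.sqrt(mean_squared_error(y_tr, model_of.predict(X_tr))))
rmse_test = float(np.sqrt(mean_squared_error(y_te, model_of.predict(X_te))))

print("RMSE on training data:", round(rmse_train, 3))
print("RMSE on test data:    ", round(rmse_test, 3))
```

A training RMSE well below the test RMSE is the typical sign of over-fitting. Common remedies to try are fewer or shallower trees, a lower learning rate, or more data.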

I am thinking of writing a post that builds and compares a series of selected models for a given dataset, so that their performance can be visualized on the same graph. Hopefully I can streamline the process so that we simply feed in the input dataset and get the visual evaluation results for all those models. (Please refer to Machine Learning - Build And Compare Regression Models)



The complete Python program is attached below.

import pandas as pd
import numpy as np
import pickle
from datetime import datetime
import oracledb
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import matplotlib.style as plts
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

def PredictMountTemp():
    #
    # load the dataset from the database
    sqlstr = 'select * from mount_temp'
    columns = ['No', 'Mountain', 'Latitude', 'Longitude', 'Altitude', 'Max', 'Avg', 'Min', 'Prefecture']

    try:
        with oracledb.connect(user="test", password='1234', dsn="localhost/xepdb1") as conn:
            with conn.cursor() as cursor:
                dforg = pd.DataFrame(cursor.execute(sqlstr), columns=columns)
    except oracledb.Error as e:
        print(f'Failed to fetch data from the database ({str(e)})')
        return

    dforg = dforg.set_index('No')

    #
    # drop null data
    df = dforg.dropna()
    print(df[df["Avg"].isna()])

    #
    # extract the features and the target
    X = df[["Latitude", "Longitude", "Altitude"]]
    y = df["Avg"]

    print(pd.concat([X, y], axis=1))

    #
    # divide the dataset into training dataset and test dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

    #
    # create the model
    gbrd = GradientBoostingRegressor(random_state=123)
    #
    # train the model
    gbrd.fit(X_train, y_train)
    #
    # make a prediction
    y_pred = gbrd.predict(X_test)
    #
    # evaluate the model
    scored = gbrd.score(X_test, y_test)
    #
    # save the predicted and actual temperatures
    tmp_pred = y_pred.reshape(np.size(y_pred), 1)
    tmp_test = y_test.values.reshape(np.size(y_test), 1)
    tempd = pd.DataFrame(np.hstack([tmp_test, tmp_pred]))
    rmsed_test = np.sqrt(mean_squared_error(y_test, y_pred))

    #
    # using Grid Search to find the best estimator
    starttime = datetime.now()
    print('Grid search, starting from: ', starttime.isoformat())

    parameters = {
        'n_estimators' : [3, 5, 10, 30, 50, 100],
        'max_features' : [1, 3, 5, 10],
        'random_state' : [123],
        'min_samples_split' : [3, 5, 10, 30, 50],
        'max_depth' : [3, 5, 10, 30, 50]
    }
    gbrb = GridSearchCV(estimator=GradientBoostingRegressor(), param_grid=parameters, cv=10)
    gbrb.fit(X_train, y_train)
    #
    # time consumed
    endtime = datetime.now()
    print('Grid search, ending at: ', endtime.isoformat())
    print('Time consumed for optimization: ', (endtime-starttime))
    #
    # best estimator
    print('Best params: {0}'.format(gbrb.best_params_))
    print('Best estimator: {0}'.format(gbrb.best_estimator_))

    y_pred = gbrb.predict(X_test)

    scoreb = gbrb.score(X_test, y_test)
    #
    # save the predicted and actual temperatures
    tmp_pred = y_pred.reshape(np.size(y_pred), 1)
    tmp_test = y_test.values.reshape(np.size(y_test), 1)
    tempb = pd.DataFrame(np.hstack([tmp_test, tmp_pred]))
    rmseb_test = np.sqrt(mean_squared_error(y_test, y_pred))

    #
    # visualize the results
    plts.use('ggplot')
    fig, ax = plt.subplots()
    #
    # default parameters
    ax.scatter(tempd.iloc[:, 0], tempd.iloc[:, 1], color='darkblue', label='default')
    #
    # optimized parameters
    ax.scatter(tempb.iloc[:, 0], tempb.iloc[:, 1], marker='x', color='crimson', label='optimized')
    #
    # reference
    xtmp = np.array([np.min(tempd.iloc[:, 1]), np.max(tempd.iloc[:, 1])])
    ytmp = xtmp.copy()
    ax.plot(xtmp, ytmp, label='reference', color='gray')
    #
    # show scores and RMSE values
    strd = 'score: ' + str(scored) + '   rmse: ' + str(rmsed_test)
    strb = 'score: ' + str(scoreb) + '   rmse: ' + str(rmseb_test)
    ax.text(xtmp.min(), ytmp.max(), strd, color='darkblue')
    ax.text(xtmp.min(), ytmp.max()-1,strb, color='crimson')
    #
    # graphical setting
    ax.legend(loc='lower right')
    ax.set_xlabel('actual temperature')
    ax.set_ylabel('predicted temperature')
    fig.suptitle('mountain temperature prediction')

    plt.show()
    #
    # save the trained model
    with open('MountTempModel.pkl', mode='wb') as f:
        pickle.dump(gbrb, f)
