Friday, September 1, 2023

Machine Learning - Build and Compare Regression Models


This post follows Build And Compare Classification Models. This time we build and compare a set of regression models.

The program uses the same mechanism. A factory class named RegressorFactory handles instantiating a model, fitting it, predicting with it, and evaluating it.

The function CompareRegressionModels() implements the workflow.

Besides the score reported by each model itself (the coefficient of determination, R²), root mean squared error (RMSE) is used as a second indicator to evaluate the models. Both appear on the comparison graph.

Each solid circle on the chart represents a model, and its score is displayed next to the circle. The X axis shows the RMSE on the training data and the Y axis shows the RMSE on the test data.
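The two indicators can be reproduced in a few lines. This is a minimal sketch on synthetic data, assuming a plain LinearRegression model; the sample size, feature count, and noise level are illustrative choices, not values taken from the post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data; sizes and noise level are illustrative assumptions.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

model = LinearRegression().fit(X_train, y_train)

# score() of a scikit-learn regressor returns R^2 on the given data,
# which is the same value r2_score() computes from the predictions.
score = model.score(X_test, y_test)
r2 = r2_score(y_test, model.predict(X_test))

# RMSE is the square root of the mean squared error.
rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(f"score={score:.3f} rmse_train={rmse_train:.3f} rmse_test={rmse_test:.3f}")
```

A model whose circle sits close to the diagonal has similar errors on both datasets; a circle far above the diagonal has a much larger test error than training error.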


make_regression Dataset


Another example, using the California Housing dataset.


California Housing



Models to Be Compared

The regression models are also maintained in the table ModelList, with Category set to Regression, as shown in the screenshot below.

The PreCalc indicator is set to 1 if the model requires polynomial features; otherwise it is NULL.

Each pair of Parameter and Value defines one parameter passed to the model when it is created. The table structure lets you add any number of parameters. If the same parameter is specified more than once, the first one is picked and passed to the model.
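The "first one wins" rule for duplicate parameters can be illustrated with a small pandas sketch. The rows below are made-up sample data, and first_value() is a hypothetical helper written for this illustration; it is not part of RegressorFactory.

```python
import pandas as pd

# Hypothetical rows mirroring ModelList; 'alpha' is listed twice for Ridge.
df_model = pd.DataFrame(
    [
        ("Ridge", "Regression", 1, "alpha", "0.1"),
        ("Ridge", "Regression", 1, "alpha", "0.5"),
        ("Ridge", "Regression", 1, "random_state", "123"),
    ],
    columns=["Name", "Category", "PreCalc", "Parameter", "Value"],
)


def first_value(df, model_name, parameter, cast, default):
    """Return the first Value stored for (model_name, parameter), or default."""
    mask = (df.Name == model_name) & (df.Parameter == parameter)
    values = df.loc[mask, "Value"].tolist()
    return cast(values[0]) if values else default


alpha = first_value(df_model, "Ridge", "alpha", float, 1.0)
random_state = first_value(df_model, "Ridge", "random_state", int, 0)
max_iter = first_value(df_model, "Ridge", "max_iter", int, None)

print(alpha, random_state, max_iter)
```

Values are stored as text in the table, so each one is cast to the type the model expects, and a parameter that is absent falls back to its default.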

 


Dataset

The dataset should be loaded into the table DatasetReg. Make sure the table contains only the feature columns and the label column, with the label column placed last.

The sample dataset is generated with make_regression().
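A dataset with this layout could be produced as in the sketch below. The sample size, the number of features, the noise level, and the column names F1..F4 and Label are assumptions made for illustration; the post does not state the values actually used.

```python
import pandas as pd
from sklearn.datasets import make_regression

# Illustrative sizes; the real dataset in the post may differ.
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=123)

# Feature columns first, label column last, as the program expects.
df = pd.DataFrame(X, columns=[f"F{i + 1}" for i in range(X.shape[1])])
df["Label"] = y

# The program splits features and label by position, not by name.
features, label = df.iloc[:, 0:-1], df.iloc[:, -1]
print(df.head())
```

Because the split is positional, any extra column (an ID, for example) would be treated as a feature, which is why the table should hold only features and the label.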

 

make_regression Dataset



Main Flow

The main flow is realized in function CompareRegressionModels(), shown in the diagram below. 

One thing needs attention: some models in this program, such as Lasso, Polynomial and Ridge, are configured (PreCalc = 1) to take polynomially transformed input instead of the plain features. So, after standardizing the input data, we call PolynomialFeatures() to build the polynomial transformer, pass the standardized data through it, and feed the output to that group of models. As a result, the flow for these models is slightly different.
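The pre-calculation step can be sketched as follows, on a tiny hand-made array. The numbers are arbitrary; degree=3 and include_bias=False match the defaults used in the source code in the appendix.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Tiny made-up feature matrices with two features each.
x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 35.0], [4.0, 40.0]])
x_test = np.array([[2.5, 25.0], [5.0, 45.0]])

# Fit the scaler on the training data only, then reuse it for the test data.
sc = StandardScaler()
x_train_std = sc.fit_transform(x_train)
x_test_std = sc.transform(x_test)

# Expand the standardized features into polynomial terms up to degree 3.
poly = PolynomialFeatures(degree=3, include_bias=False)
x_train_ply = poly.fit_transform(x_train_std)
x_test_ply = poly.transform(x_test_std)

# Two input features become nine polynomial features.
print(x_train_ply.shape, x_test_ply.shape)
```

Models with PreCalc = 1 receive x_train_ply and x_test_ply, while all other models receive the standardized data directly.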




How to Add a Model?

To add more models, insert the corresponding records into the table ModelList using SQL scripts or any database tool. Make sure the new model is also implemented in the class RegressorFactory; otherwise, you will get a warning saying the model is not implemented as of now.


insert into modellist values('Multiple', 'Regression', null, '', '');
insert into modellist values('Polynomial', 'Regression', 1, '', '');
insert into modellist values('Ridge', 'Regression', 1, 'alpha', '0.1');
insert into modellist values('Ridge', 'Regression', 1, 'random_state', '123');
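On the Python side, the new model also needs a branch in the factory. The sketch below shows one possible way to keep that step small, using a lookup table instead of an if/elif chain. It is a hypothetical alternative, not the RegressorFactory from the appendix, and ElasticNet is only an example of a model that might be added.

```python
from sklearn.linear_model import ElasticNet, LinearRegression, Ridge

# Hypothetical registry: model name -> (constructor, {parameter: cast}).
REGISTRY = {
    "Multiple": (LinearRegression, {}),
    "Ridge": (Ridge, {"alpha": float, "random_state": int}),
    # Supporting a new model means adding one entry like this.
    "ElasticNet": (ElasticNet, {"alpha": float, "l1_ratio": float}),
}


def new_regressor(name, raw_params):
    """Return (model, desc); model is None when the name is not registered."""
    if name not in REGISTRY:
        return None, f"{name} is not implemented as of now"
    cls, casts = REGISTRY[name]
    # Keep only known parameters and cast the text values from the table.
    kwargs = {k: casts[k](v) for k, v in raw_params.items() if k in casts}
    return cls(**kwargs), f"{name}({kwargs})"


model, desc = new_regressor("ElasticNet", {"alpha": "0.1", "l1_ratio": "0.7"})
missing, warning = new_regressor("Huber", {})
print(desc)
print(warning)
```

An unregistered name returns None together with the warning text, which mirrors how the program reports a model that exists in ModelList but not in the factory.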


The newly added models pop up on the graph.



make_regression Dataset


How to Switch the Dataset?

The program fetches data from table DatasetReg, so the data must be moved into DatasetReg. Here is an example for your reference.

- Create an external table named Dataset_Housing.

create table Dataset_Housing (
   MedInc number
  ,HouseAge number
  ,AveRooms number
  ,AveBedrms number
  ,Population number
  ,AveOccup number
  ,Latitude number
  ,Longitude number
  ,Price number
)
organization external
(
  type oracle_loader
  default directory externalfile
  access parameters
  (
    records delimited by newline
    nobadfile
    nologfile
    fields terminated by ','
  )
  location ('cal_housing.csv')
)
reject limit unlimited
;

- Drop table DatasetReg.

- Create table DatasetReg from Dataset_Housing.
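The last two steps could also be run from Python. The following is a sketch only: switch_dataset() is a hypothetical helper, and it assumes an open database cursor (for example one obtained from oracledb, as in the appendix) and that DatasetReg already exists, since dropping a missing table raises a database error.

```python
import re


def switch_dataset(cursor, source_table, target_table="DatasetReg"):
    """Recreate target_table as a copy of source_table using an open cursor."""
    for name in (source_table, target_table):
        # Table names cannot be bind variables, so accept plain identifiers only.
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
            raise ValueError(f"unexpected table name: {name!r}")
    cursor.execute(f"drop table {target_table}")
    cursor.execute(f"create table {target_table} as select * from {source_table}")


# Possible usage (requires a reachable database, so it is left commented out):
# with oracledb.connect(user=..., password=..., dsn=...) as conn:
#     with conn.cursor() as cursor:
#         switch_dataset(cursor, "Dataset_Housing")
```

Because the new table copies the column order of Dataset_Housing, Price stays in the last position and is treated as the label.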




Relook at Mountain Temperature Prediction

GradientBoostingRegressor was used for mountain temperature prediction in the post Machine Learning - Build A GradientBoostingRegressor Model, where it overfit the training dataset. This time we run all of these regression models on the same dataset.

As the evaluation graph shows, DecisionTreeRegressor, Polynomial, RandomForestRegressor, and Ridge also tend to overfit in this particular case.
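On the chart, overfitting shows up as a test RMSE well above the training RMSE. The sketch below reproduces that pattern on synthetic data, assuming an unrestricted DecisionTreeRegressor; it does not use the mountain temperature dataset, and the data sizes and noise level are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data; an unrestricted tree can memorize the training part.
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# A large positive gap places the model far above the diagonal on the chart.
gap = rmse_test - rmse_train
print(f"rmse_train={rmse_train:.3f} rmse_test={rmse_test:.3f} gap={gap:.3f}")
```

Limiting max_depth through the ModelList table is one way to check whether the gap shrinks for a tree-based model.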

 




Appendix: Source Code


CompareRegressionModels()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import oracledb
import logging

from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from modelfactory import RegressorFactory

def CompareRegressionModels():
    """
    The function instantiates a series of regression models pre-defined in a table,
    fits the models, computes the predictions for a specific dataset,
    and visualizes the results.

    :parameter:
    No parameters are required.

    :raise:
    No exception is propagated for database failures: an oracledb.Error is caught,
    logged, and the function returns early.

    :return:
    This function doesn't return a value.
    """
    # ------------------------------------------------------------------
    # configure a logger
    # ------------------------------------------------------------------
    # named log_format to avoid shadowing the built-in format()
    log_format = '%(asctime)s %(levelname)-10s [%(threadName)s] [%(module)s] [%(funcName)-30s] %(message)s'
    logger = logging.getLogger('modellogger')
    handler = logging.StreamHandler()
    fmt = logging.Formatter(log_format)
    handler.setFormatter(fmt=fmt)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.info('process starts')

    # ------------------------------------------------------------------
    # load the list of models and the dataset from the database
    # ------------------------------------------------------------------
    model_columns = ['Name', 'Category', 'PreCalc', 'Parameter', 'Value']
    sqlmodel = "select * from modellist where category='Regression'"
    sqldata = 'select * from datasetreg'

    try:
        with oracledb.connect(user="test", password='1234', dsn="localhost/xepdb1") as conn:
            with conn.cursor() as cursor:
                df_model = pd.DataFrame(cursor.execute(sqlmodel), columns=model_columns)
                df_data = pd.DataFrame(cursor.execute(sqldata))
    except oracledb.Error as e:
        logger.error(f'Failed to fetch data from the database ({str(e)})')
        return

    logger.debug(f"list of the models\n{df_model}")
    logger.debug(f"head of the dataset\n{df_data.head()}")

    # ------------------------------------------------------------------
    # data pre-processing
    # ------------------------------------------------------------------
    df_data = df_data.dropna()
    # visualize the input data
    # df_data.hist(bins=50)
    # plt.show()

    # separate the dataset into features and labels
    x, y = df_data.iloc[:, 0:-1], df_data.iloc[:, -1]
    logger.debug(f"the features\n{x}")
    logger.debug(f"the labels\n{y}")

    #
    # default parameters for dataset
    test_size = 0.3
    standardize = True
    random_state = 123
    # degree for polynomial features
    degree = 3

    # ------------------------------------------------------------------
    # prepare the training dataset and the test dataset and standardize
    # ------------------------------------------------------------------
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=random_state)

    # standardize the input data (fit on the training data only)
    if standardize:
        logger.info('Standardize the training and test datasets')
        sc = StandardScaler()
        x_train = sc.fit_transform(x_train)
        x_test = sc.transform(x_test)

    # ------------------------------------------------------------------
    # prepare polynomially transformed data if needed
    # ------------------------------------------------------------------
    if df_model.query('PreCalc == 1').size > 0:
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        x_train_ply = poly.fit_transform(x_train)
        # reuse the transformer fitted on the training data
        x_test_ply = poly.transform(x_test)

    # ------------------------------------------------------------------
    # create and fit and evaluate the models
    # ------------------------------------------------------------------
    model_list = df_model.Name.unique().tolist()
    model_list.sort()
    eval_res = []
    if len(model_list) > 0:
        mf = RegressorFactory()

        for name in model_list:
            # create a model
            model, desc = mf.newregressor(name, df_model)

            if model is not None:
                logger.info(desc)

                # execute fit, predict, evaluation
                # PreCalc is 1 for models fed with polynomial features, NULL otherwise
                if df_model.loc[df_model.Name == name, 'PreCalc'].unique()[0] == 1:
                    score_model, rmse_train, rmse_test = mf.execute(model, x_train_ply, y_train, x_test_ply, y_test)
                else:
                    score_model, rmse_train, rmse_test = mf.execute(model, x_train, y_train, x_test, y_test)

                eval_res.append([name, score_model, rmse_train, rmse_test])
                logger.info(f'  [Score: {score_model} RMSE (Training): {rmse_train} RMSE (Test): {rmse_test}]')
            else:
                logger.warning(desc)

                score_model, rmse_train, rmse_test = None, None, None
                logger.warning(f'  [Score: {score_model} RMSE (Training): {rmse_train} RMSE (Test): {rmse_test}]')

    # ------------------------------------------------------------------
    # visualize the evaluation results
    # ------------------------------------------------------------------
    if len(eval_res) > 0:
        logger.info('Visualize the evaluation results')

        columns = ['Model', 'Score', 'RMSE (Training)', 'RMSE (Test)']
        eval_res.sort()
        df = pd.DataFrame(eval_res, columns=columns)
        logger.debug(f"evaluation results\n{df}")

        plt.grid()
        #
        # reference line (y = x) spanning the range of the training RMSE
        rmse_train_col = df['RMSE (Training)']
        xr = np.array([round(rmse_train_col.min() - 0.05, 1), round(rmse_train_col.max() + 0.05, 1)])
        yr = xr.copy()
        plt.plot(xr, yr, color='#D2D5D1')
        sns.scatterplot(
            data=df,
            x='RMSE (Training)',
            y='RMSE (Test)',
            marker='o',
            hue=df['Model']
        )

        plt.plot()

        # place each score slightly up and to the right of its circle
        dev = (xr.max() - xr.min()) / 100.0
        for row in df.values:
            score = round(row[1], 2)
            text_x = row[2] + dev
            text_y = row[3] + dev
            plt.text(text_x, text_y, score)

        plt.legend(loc='best')
        plt.suptitle('Compare Regression Models')
        plt.show()

    else:
        logger.info('No evaluation results are available for visualization')

    logger.info('process finishes')


modelfactory.py

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

class RegressorFactory():
    __dtr = 'DecisionTreeRegressor'
    __gbr = 'GradientBoostingRegressor'
    __las = 'Lasso'
    __mtp = 'Multiple'
    __ply = 'Polynomial'
    __rfr = 'RandomForestRegressor'
    __rdg = 'Ridge'
    __svr = 'SVR(rbf)'
    __svl = 'SVR(linear)'

    @property
    def dtr(self):
        return self.__dtr

    @property
    def gbr(self):
        return self.__gbr

    @property
    def las(self):
        return self.__las

    @property
    def mtp(self):
        return self.__mtp

    @property
    def ply(self):
        return self.__ply

    @property
    def rfr(self):
        return self.__rfr

    @property
    def rdg(self):
        return self.__rdg

    @property
    def svr(self):
        return self.__svr

    @property
    def svl(self):
        return self.__svl

    def newregressor(self, name, params):
        """
        Acting as a factory method, creates a regression model with the parameters specified by params.

        :param name: str,
            Name of the model to be instantiated.

        :param params: pandas.DataFrame,
            Parameters passed to the model in the creation.
            Contains at least Name and Parameter and Value columns.

        :return:
            model: object, Instance of the model |
            desc: str, description of the model
        """
        # random_state for DecisionTreeRegressor, GradientBoostingRegressor
        #     RandomForestRegressor, Lasso, Ridge
        random_state = 0
        # max_depth for DecisionTreeRegressor
        max_depth = None
        # learning_rate for GradientBoostingRegressor
        learning_rate = 0.1
        # n_estimators for GradientBoostingRegressor, RandomForestRegressor
        n_estimators = 100
        # C for SVR
        c = 1.0
        # alpha for Lasso, Ridge
        alpha = 1.0

        # DecisionTreeRegressor
        if name == self.dtr:
            df_dtr = params.loc[params.Name == self.dtr, :]

            if len(df_dtr) > 0:
                if 'max_depth' in df_dtr.Parameter.unique():
                    max_depth = int(df_dtr.query("Parameter=='max_depth'").Value.tolist()[0])
                if 'random_state' in df_dtr.Parameter.unique():
                    random_state = int(df_dtr.query("Parameter=='random_state'").Value.tolist()[0])

            desc = f'{name}(max_depth={max_depth}, random_state={random_state})'

            model = DecisionTreeRegressor(max_depth=max_depth,
                                          random_state=random_state
                                          )
        # GradientBoostingRegressor
        elif name == self.gbr:
            df_gbr = params.loc[params.Name == self.gbr, :]

            if len(df_gbr) > 0:
                if 'random_state' in df_gbr.Parameter.unique():
                    random_state = int(df_gbr.query("Parameter=='random_state'").Value.tolist()[0])
                if 'learning_rate' in df_gbr.Parameter.unique():
                    learning_rate = float(df_gbr.query("Parameter=='learning_rate'").Value.tolist()[0])
                if 'n_estimators' in df_gbr.Parameter.unique():
                    n_estimators = int(df_gbr.query("Parameter=='n_estimators'").Value.tolist()[0])

            desc = f'{name}(random_state={random_state}, learning_rate={learning_rate}, n_estimators={n_estimators})'

            model = GradientBoostingRegressor(random_state=random_state,
                                              learning_rate=learning_rate,
                                              n_estimators=n_estimators
                                              )
        # Lasso
        elif name == self.las:
            df_las = params.loc[params.Name == self.las, :]

            if len(df_las) > 0:
                if 'alpha' in df_las.Parameter.unique():
                    alpha = float(df_las.query("Parameter=='alpha'").Value.tolist()[0])
                if 'random_state' in df_las.Parameter.unique():
                    random_state = int(df_las.query("Parameter=='random_state'").Value.tolist()[0])

            desc = f'{name}(alpha={alpha}, random_state={random_state})'

            model = Lasso(alpha=alpha,
                          random_state=random_state
                          )
        # Multiple
        elif name == self.mtp:
            df_mtp = params.loc[params.Name == self.mtp, :]

            if len(df_mtp.Parameter.unique()) > 0:
                pass

            desc = f'{name}()'

            model = LinearRegression()
        # Polynomial
        elif name == self.ply:
            df_ply = params.loc[params.Name == self.ply, :]

            if len(df_ply) > 0:
                pass

            desc = f'{name}()'

            model = LinearRegression()
        # RandomForestRegressor
        elif name == self.rfr:
            df_rfr = params.loc[params.Name == self.rfr, :]

            if len(df_rfr) > 0:
                if 'n_estimators' in df_rfr.Parameter.unique():
                    n_estimators = int(df_rfr.query("Parameter=='n_estimators'").Value.tolist()[0])
                if 'random_state' in df_rfr.Parameter.unique():
                    random_state = int(df_rfr.query("Parameter=='random_state'").Value.tolist()[0])

            desc = f'{name}(n_estimators={n_estimators}, random_state={random_state})'

            model = RandomForestRegressor(n_estimators=n_estimators,
                                          random_state=random_state
                                          )
        # Ridge
        elif name == self.rdg:
            df_rdg = params.loc[params.Name == self.rdg, :]

            if len(df_rdg) > 0:
                if 'alpha' in df_rdg.Parameter.unique():
                    alpha = float(df_rdg.query("Parameter=='alpha'").Value.tolist()[0])
                if 'random_state' in df_rdg.Parameter.unique():
                    random_state = int(df_rdg.query("Parameter=='random_state'").Value.tolist()[0])

            desc = f'{name}(alpha={alpha}, random_state={random_state})'

            model = Ridge(alpha=alpha,
                          random_state=random_state
                          )
        # SVR(rbf)
        elif name == self.svr:
            df_svr = params.loc[params.Name == self.svr, :]

            if len(df_svr) > 0:
                if 'C' in df_svr.Parameter.unique():
                    c = float(df_svr.query("Parameter=='C'").Value.tolist()[0])

            desc = f"{name}(C={c}, kernel='rbf')"

            model = SVR(C=c,
                        kernel='rbf'
                        )
        # SVR(linear)
        elif name == self.svl:
            df_svl = params.loc[params.Name == self.svl, :]

            if len(df_svl) > 0:
                if 'C' in df_svl.Parameter.unique():
                    c = float(df_svl.query("Parameter=='C'").Value.tolist()[0])

            desc = f"{name}(C={c}, kernel='linear')"

            model = SVR(C=c,
                        kernel='linear'
                        )
        # undefined
        else:
            desc = f'{name} is not implemented as of now'
            model = None

        return model, desc

    def execute(self, model, x_train, y_train, x_test, y_test):
        """
        Calls fit and predict and score on the model.
        Computes root mean squared errors (RMSE) for the training data and the test data

        :param model: instance of the model
        :param x_train: the training features
        :param y_train: the training labels
        :param x_test: the test features
        :param y_test: the test labels
        :return: score_model: float, score of the test data, |
            rmse_train: float, RMSE of the training data, |
            rmse_test: float, RMSE of the test data
        """
        score_model = None
        rmse_train = None
        rmse_test = None

        if model is not None:
            # learning
            model.fit(x_train, y_train)

            # predict
            y_test_pred = model.predict(x_test)
            y_train_pred = model.predict(x_train)

            # evaluate
            score_model = model.score(x_test, y_test)
            rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
            rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

        return score_model, rmse_train, rmse_test
