Friday, September 1, 2023

Machine Learning - Build and Compare Classification Models


When it comes to machine learning, there are a wide range of models available in the field. With no doubt, it will be a huge project in terms of efforts and time if someone tries to walk them through. So I've been thinking it may be a good idea if we get them work first with example Python codes. As to concept, underlying math, business scenarios, profits, concerns etc., we can pick up with more researches later when we work on a specific business case. The merit to do this is that we get to know, these models, probably just a little, and we can gain hands-on programming in the first place. It would lay a foundation for further development as needed. 

I happened to read a book, Python Statistics & Machine Learning Mastering Handbook, authored by Team Karupo. The book introduces a bunch of models with concise texts and examples covering supervised and unsupervised and deep learning estimators. 

Inspired by it, I planned to build a series of models at once that can be flexibly selected and visualize their performance on the same graph. Moreover, I would like to approach it from the engineering angle and make the process as simple as we feed the input dataset, then get the visual evaluation results, like the graph shown below. (Please be noted that the process of developing and tuning a model is not the subject we are going to address here. If you are interested in that, please check out Machine Learning - Build A GradientBoostingRegressor Model for more details. )



For each model, we'll do the prediction for both the test data and the training data, and calculate their  accuracy scores as well. Then display them on the chart where a model relates to a solid circle. X axis represents the accuracy score for the training data, whereas Y axis represents the accuracy score for the test data. 

At the same time, use the score() method provided by the model to get the score for this specific case,  texted right above the circle.



Models to Be Compared

Models to be compared are stored in the table ModelList, as shown below. 

Name field bears the name of a model. 

Category field defines which group the model belongs to, either Classification or Regression. 

Parameter and Value fields define parameters to be passed to the model in its creation. Parameter field holds the name of the parameter, and Value field carries the value associated with that parameter. You can add as many parameters as you need for a model. On the other hand, if you don't specify any parameters for a model, the model will take the default parameters.

PreCalc is an indicator showing if we need to carry out extra calculations on the input data before it is passed to a model. For example, we should use a polynomial to transform the data before feed it into Lasso regression model. We'll discuss it more in another blog Machine Learning - Build And Compare Regression Models.




Dataset

The dataset is kept in a table called DatasetCls, which consists of only the feature columns and the label column. Any descriptive columns have to be removed. Additionally, the label column has to appear last. DatasetCls can be either a normal table or an external table.

That is all, we don't have more rules for it. 


Dataset Generated by make_classification()



Main Flow

The main flow, illustrated on the following diagram, is implemented in  function CompareClassificationModels() that you can find in the latter section Appendix: Source Code.

Please be noted that we'll standardize the training data and the test data before feed them into the models.


Main Flow



How Are Models Created?

Model instantiation is implemented in class ClassifierFactory packaged in modelfactory.py. Please refer to Appendix: Source Code section.

The factory method newclassifier() creates and returns a model with specified parameters. The models defined inside the class are all the classifiers that this factory can produce so far. You can add or delete models based on your needs.

The method execute() will call fit() and predict() and score() on the model. Additionally, it will call  accuracy_score() to evaluate the model's performance on the given datasets.



How to Add a Model?

This is pretty straightforward. We can get it done by appending a record to the table ModelList. For example, we would like to add LogisticRegression to the list and specify random_state parameter at the same time. So we can execute the following SQL statement.

insert into modellist values('LogisticRegression', 'Classification', null, 'random_state', '123');



Then we re-run the program, as you can see, LogisticRegression appears on the graph. 




How to Switch the Dataset?

For example, we would like to use the wine dataset coming with sklearn. If we have the csv file on hand, we can create an external table using the SQL script below. If the script doesn't work in your environment, please double check if your csv file has the right encoding. 

create table datasetcls (
  fixed_acidity number,
  volatile_acidity number,
  citric_acid number,
  residual_sugar number,
  chlorides number,
  free_sulfur_dioxide number,
  total_sulfur_dioxide number,
  density number,
  pH number,
  sulphates number,
  alcohol number,
  quality number
)
organization external
(
  type oracle_loader
  default directory externalfile
  access parameters
  (
    records delimited by newline
    nobadfile
    nologfile
    fields terminated by ';'
  )
  location ('winequality-red.csv')
)
reject limit unlimited
;

Wine Dataset

We can do the same for Iris dataset.


Iris Dataset

For the dataset processing, we used the hard-coded parameters. Obviously there is more room for improvements. In response to needs in the field, surely we can add more customized functionalities.

Additionally, the similar work for the regression models will be undertaken and summarized in another blog. 



Reference

Python Statistics & Machine Learning Mastering Handbook  Team Karupo

Choosing the right estimator

Machine Learning - Build A GradientBoostingRegressor Model

Machine Learning - Build And Compare Regression Models



Appendix: Source Code


CompareClassificationModels()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import oracledb
import logging
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from modelfactory import ClassifierFactory


def CompareClassificationModels():
    """
    The function instantiates a series of classification models pre-defined in a table
    and fit the models and computes the predictions for a specific dataset
    and visualize the results.

    :parameter:
    No parameters are required.

    :raise:
    An oracledb.Error will be raised if anything goes wrong with the database.

    :return:
    This function doesn't return a value.
    """
    # ------------------------------------------------------------------
    # configure a logger
    # ------------------------------------------------------------------
    format = '%(asctime)s %(levelname)-10s [%(threadName)s] [%(module)s] [%(funcName)-30s] %(message)s'
    logger = logging.getLogger('modellogger')
    handler = logging.StreamHandler()
    fmt = logging.Formatter(format)
    handler.setFormatter(fmt=fmt)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.info('process starts')

    # ------------------------------------------------------------------
    # load the list of models and the dataset from the database
    # ------------------------------------------------------------------
    model_columns = ['Name', 'Category', 'PreCalc', 'Parameter', 'Value']
    sqlmodel = "select * from modellist where category='Classification'"
    sqldata = 'select * from datasetcls'

    try:
        with oracledb.connect(user="test", password='1234', dsn="localhost/xepdb1") as conn:
            with conn.cursor() as cursor:
                df_model = pd.DataFrame(cursor.execute(sqlmodel), columns=model_columns)
                df_data = pd.DataFrame(cursor.execute(sqldata))
    except oracledb.Error as e:
        logger.error(f'Failed to fetch data from the database ({str(e)})')
        return

    logger.debug(f"list of the models\n{df_model}")
    logger.debug(f"head of the dataset\n{df_data.head()}")

    # ------------------------------------------------------------------
    # data pre-processing
    # ------------------------------------------------------------------
    df_data = df_data.dropna()

    # separate the dataset into features and labels
    x, y = df_data.iloc[:, 0:-1], df_data.iloc[:, -1]
    logger.debug(f"the features\n{x}")
    logger.debug(f"the labels\n{y}")

    #
    # default parameters for dataset
    test_size = 0.3
    standardize = True
    random_state = 123

    # ------------------------------------------------------------------
    # prepare the training dataset and the test dataset, standardize
    # ------------------------------------------------------------------
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=random_state)

    if standardize:
        logger.info('Standardize the training and test datasets')
        sc = StandardScaler()
        x_train = sc.fit_transform(x_train)
        x_test = sc.transform(x_test)

    # ------------------------------------------------------------------
    # create and fit and evaluate the models
    # ------------------------------------------------------------------
    model_list = df_model.Name.unique().tolist()
    eval_res = []
    if len(model_list) > 0:

        model_list.sort()
        mf = ClassifierFactory()

        for name in model_list:
            # create a model

            model, desc = mf.newclassifier(name, df_model)

            if model is not None:
                logger.info(desc)

                # execute fit, predict, evaluation
                score_model, acc_train, acc_test = mf.execute(model, x_train, y_train, x_test, y_test)

                eval_res.append([name, score_model, acc_train, acc_test])
                logger.info(f'  [Score: {score_model} Accuracy (Training): {acc_train} Accuracy (Test): {acc_test}]')
            else:
                logger.warning(desc)
                score_model, acc_train, acc_test = None, None, None
                logger.warning(f'  [Score: {score_model} Accuracy (Training): {acc_train} Accuracy (Test): {acc_test}]')

    # ------------------------------------------------------------------
    # visualize the evaluation results
    # ------------------------------------------------------------------
    if len(eval_res) > 0:
        logger.info('Visualize the evaluation results')

        columns = ['Model', 'Score', 'Accuracy (Training)', 'Accuracy (Test)']
        eval_res.sort()
        df = pd.DataFrame(eval_res, columns=columns)

        logger.debug(f"evaluation results\n{df}")

        plt.grid(which='major')

        #
        # reference
        xr = np.array(
            [round(pd.DataFrame.min(df.iloc[:, 2:3]) - 0.05, 1), round(pd.DataFrame.max(df.iloc[:, 2:3]) + 0.05, 1)])
        yr = xr.copy()
        plt.plot(xr, yr, color='#D2D5D1')

        sns.scatterplot(
            data=df,
            x='Accuracy (Training)',
            y='Accuracy (Test)',
            marker='o',
            hue=df['Model']
        )

        plt.plot()

        dev = (xr.max() - xr.min()) / 100.0
        for eval in df.values:
            score = round(eval[1], 2)
            x = eval[2] + dev
            y = eval[3] + dev
            plt.text(x, y, score)

        plt.legend(loc='best')
        plt.suptitle('Compare Classification Models')
        plt.show()

    else:
        logger.info('No evaluation results are available for visualization')

    logger.info('process finishes')



modelfactory.py

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score


class ClassifierFactory():
    __dtc = 'DecisionTreeClassifier'
    __gbc = 'GradientBoostingClassifier'
    __lrc = 'LogisticRegression'
    __rfc = 'RandomForestClassifier'
    __svc = 'SVC'
    __svl = 'LinearSVC'

    @property
    def dtc(self):
        return self.__dtc

    @property
    def gbc(self):
        return self.__gbc

    @property
    def lrc(self):
        return self.__lrc

    @property
    def rfc(self):
        return self.__rfc

    @property
    def svc(self):
        return self.__svc

    @property
    def svl(self):
        return self.__svl

    def newclassifier(self, name, params):
        """
       Acting as a factory method, new a classification model with the parameters specified by params.
       :param name: str,
           Name of the model to be instantiated.
       :param params: pandas.DataFrame,
           Parameters passed to the model in the creation.
           Contains at least Name and Parameter and Value columns.
       :return:
           model: object, Instance of the model |
           desc: str, description of the model
       """

        # random_state for DecisionTreeClassifier, GradientBoostingClassifier
        #    RandomForestClassifier, SVC, LinearSVC

        random_state = 0
        # max_depth for DecisionTreeClassifier
        max_depth = None
        # learning_rate for GradientBoostingClassifier
        learning_rate = 0.2
        # n_estimators for RandomForestClassifier
        n_estimators = 100
        # max_iter for LinearSVC
        max_iter = 5000
    
        # DecisionTreeClassifier
        if name == self.dtc:
            df_dtc = params.loc[params.Name == self.dtc, :]
    
            if len(df_dtc) > 0:
                if 'max_depth' in df_dtc.Parameter.unique():
                    max_depth = int(df_dtc.query("Parameter=='max_depth'").Value.tolist()[0])
                if 'random_state' in df_dtc.Parameter.unique():
                    random_state = int(df_dtc.query("Parameter=='random_state'").Value.tolist()[0])
    
            desc = f'{name}(max_depth={max_depth}, random_state={random_state})'
    
            model = DecisionTreeClassifier(max_depth=max_depth,
                                           random_state=random_state
                                           )
        # GradientBoostingClassifier
        elif name == self.gbc:
            df_gbc = params.loc[params.Name == self.gbc, :]
    
            if len(df_gbc) > 0:
                if 'random_state' in df_gbc.Parameter.unique():
                    random_state = int(df_gbc.query("Parameter=='random_state'").Value.tolist()[0])
                if 'learning_rate' in df_gbc.Parameter.unique():
                    learning_rate = float(df_gbc.query("Parameter=='learning_rate'").Value.tolist()[0])
    
            desc = f'{name}(random_state={random_state}, learning_rate={learning_rate})'
    
            model = GradientBoostingClassifier(random_state=random_state,
                                               learning_rate=learning_rate
                                               )
        # LogisticRegression
        elif name == self.lrc:
            df_lrc = params.loc[params.Name == self.lrc, :]
    
            if len(df_lrc) > 0:
                if 'random_state' in df_lrc.Parameter.unique():
                    random_state = int(df_lrc.query("Parameter=='random_state'").Value.tolist()[0])
    
            desc = f'{name}(random_state={random_state})'
    
            model = LogisticRegression(random_state=random_state
                                       )
        # RandomForestClassifier
        elif name == self.rfc:
            df_rfc = params.loc[params.Name == self.rfc, :]
    
            if len(df_rfc) > 0:
                if 'n_estimators' in df_rfc.Parameter.unique():
                    n_estimators = int(df_rfc.query("Parameter=='n_estimators'").Value.tolist()[0])
                if 'random_state' in df_rfc.Parameter.unique():
                    random_state = int(df_rfc.query("Parameter=='random_state'").Value.tolist()[0])
    
            desc = f'{name}(n_estimators={n_estimators}, random_state={random_state})'
    
            model = RandomForestClassifier(n_estimators=n_estimators,
                                           random_state=random_state
                                           )
        # SVC
        elif name == self.svc:
            df_svc = params.loc[params.Name == self.svc, :]
    
            if len(df_svc) > 0:
                if 'random_state' in df_svc.Parameter.unique():
                    random_state = int(df_svc.query("Parameter=='random_state'").Value.tolist()[0])
    
            desc = f'{name}(random_state={random_state})'
    
            model = SVC(random_state=random_state
                        )
        # LinearSVC
        elif name == self.svl:
            df_svl = params.loc[params.Name == self.svl, :]
    
            if len(df_svl) > 0:
                if 'random_state' in df_svl.Parameter.unique():
                    random_state = int(df_svl.query("Parameter=='random_state'").Value.tolist()[0])
                if 'max_iter' in df_svl.Parameter.unique():
                    max_iter = int(df_svl.query("Parameter=='max_iter'").Value.tolist()[0])
    
            desc = f'{name}(random_state={random_state}, max_iter={max_iter})'
    
            model = LinearSVC(random_state=random_state,
                              max_iter=max_iter
                              )
        # Undefined
        else:
            desc = f'{name} is not implemented as of now'
            model = None
    
        return model, desc


def execute(self, model, x_train, y_train, x_test, y_test):
    """

    Calls fit and predict and score on the model.
    Computes accuracy score for the training data and the test data
    :param model: instance of the model
    :param x_train: the training features
    :param y_train: the training labels
    :param x_test: the test features
    :param y_test: the test labels
    :return: score_model: float, score of the test data |
        acc_train: float, accuracy score of the training data |
        acc_test: float, accuracy score of the test data
    """
    score_model = None
    acc_train = None

    acc_test = None

    if model is not None:
        # learning
        model.fit(x_train, y_train)

        # predict
        y_test_pred = model.predict(x_test)
        y_train_pred = model.predict(x_train)

        # evaluate
        score_model = model.score(x_test, y_test)
        acc_train = accuracy_score(y_train, y_train_pred)
        acc_test = accuracy_score(y_test, y_test_pred)

    return score_model, acc_train, acc_test

No comments:

Post a Comment

AWS - Build A Serverless Web App

 ‘Run your application without servers’. The idea presented by the cloud service providers is fascinating. Of course, an application runs on...