+tech Blog: Machine Learning - Build and Compare Classification Models

When it comes to machine learning, there are a wide range of models available in the field. With no doubt, it will be a huge project in terms of efforts and time if someone tries to walk them through. So I've been thinking it may be a good idea if we get them work first with example Python codes. As to concept, underlying math, business scenarios, profits, concerns etc., we can pick up with more researches later when we work on a specific business case. The merit to do this is that we get to know, these models, probably just a little, and we can gain hands-on programming in the first place. It would lay a foundation for further development as needed.

I happened to read a book, Python Statistics & Machine Learning Mastering Handbook, authored by Team Karupo. The book introduces a bunch of models with concise texts and examples covering supervised and unsupervised and deep learning estimators.

Inspired by it, I planned to build a series of models at once that can be flexibly selected and visualize their performance on the same graph. Moreover, I would like to approach it from the engineering angle and make the process as simple as we feed the input dataset, then get the visual evaluation results, like the graph shown below. (Please be noted that the process of developing and tuning a model is not the subject we are going to address here. If you are interested in that, please check out Machine Learning - Build A GradientBoostingRegressor Model for more details. )

For each model, we'll do the prediction for both the test data and the training data, and calculate their accuracy scores as well. Then display them on the chart where a model relates to a solid circle. X axis represents the accuracy score for the training data, whereas Y axis represents the accuracy score for the test data.

At the same time, use the score() method provided by the model to get the score for this specific case, texted right above the circle.

Models to Be Compared

Models to be compared are stored in the table ModelList, as shown below.

Name field bears the name of a model.

Category field defines which group the model belongs to, either Classification or Regression.

Parameter and Value fields define parameters to be passed to the model in its creation. Parameter field holds the name of the parameter, and Value field carries the value associated with that parameter. You can add as many parameters as you need for a model. On the other hand, if you don't specify any parameters for a model, the model will take the default parameters.

PreCalc is an indicator showing if we need to carry out extra calculations on the input data before it is passed to a model. For example, we should use a polynomial to transform the data before feed it into Lasso regression model. We'll discuss it more in another blog Machine Learning - Build And Compare Regression Models.

Dataset

The dataset is kept in a table called DatasetCls, which consists of only the feature columns and the label column. Any descriptive columns have to be removed. Additionally, the label column has to appear last. DatasetCls can be either a normal table or an external table.

That is all, we don't have more rules for it.

Dataset Generated by make_classification()

Main Flow

The main flow, illustrated on the following diagram, is implemented in function CompareClassificationModels() that you can find in the latter section Appendix: Source Code.

Please be noted that we'll standardize the training data and the test data before feed them into the models.

Main Flow

How Are Models Created?

Model instantiation is implemented in class ClassifierFactory packaged in modelfactory.py. Please refer to Appendix: Source Code section.

The factory method newclassifier() creates and returns a model with specified parameters. The models defined inside the class are all the classifiers that this factory can produce so far. You can add or delete models based on your needs.

The method execute() will call fit() and predict() and score() on the model. Additionally, it will call accuracy_score() to evaluate the model's performance on the given datasets.

How to Add a Model?

This is pretty straightforward. We can get it done by appending a record to the table ModelList. For example, we would like to add LogisticRegression to the list and specify random_state parameter at the same time. So we can execute the following SQL statement.

insert into modellist values('LogisticRegression', 'Classification', null, 'random_state', '123');

Then we re-run the program, as you can see, LogisticRegression appears on the graph.

How to Switch the Dataset?

For example, we would like to use the wine dataset coming with sklearn. If we have the csv file on hand, we can create an external table using the SQL script below. If the script doesn't work in your environment, please double check if your csv file has the right encoding.

create table datasetcls (
fixed_acidity number,
volatile_acidity number,
citric_acid number,
residual_sugar number,
chlorides number,
free_sulfur_dioxide number,
total_sulfur_dioxide number,
density number,
pH number,
sulphates number,
alcohol number,
quality number
)
organization external
(
type oracle_loader
default directory externalfile
access parameters
(
records delimited by newline
nobadfile
nologfile
fields terminated by ';'
)
location ('winequality-red.csv')
)
reject limit unlimited
;

Wine Dataset

We can do the same for Iris dataset.

Iris Dataset

For the dataset processing, we used the hard-coded parameters. Obviously there is more room for improvements. In response to needs in the field, surely we can add more customized functionalities.
Additionally, the similar work for the regression models will be undertaken and summarized in another blog.

Reference

Python Statistics & Machine Learning Mastering Handbook Team Karupo

Choosing the right estimator

Machine Learning - Build A GradientBoostingRegressor Model

Machine Learning - Build And Compare Regression Models

Appendix: Source Code

CompareClassificationModels()

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import oracledb

import logging

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from modelfactory import ClassifierFactory

def CompareClassificationModels():

"""

The function instantiates a series of classification models pre-defined in a table

and fit the models and computes the predictions for a specific dataset

and visualize the results.

:parameter:

No parameters are required.

:raise:

An oracledb.Error will be raised if anything goes wrong with the database.

:return:

This function doesn't return a value.

"""

# ------------------------------------------------------------------

# configure a logger

# ------------------------------------------------------------------

format = '%(asctime)s %(levelname)-10s [%(threadName)s] [%(module)s] [%(funcName)-30s] %(message)s'

logger = logging.getLogger('modellogger')

handler = logging.StreamHandler()

fmt = logging.Formatter(format)

handler.setFormatter(fmt=fmt)

logger.setLevel(logging.DEBUG)

logger.addHandler(handler)

logger.info('process starts')

# ------------------------------------------------------------------

# load the list of models and the dataset from the database

# ------------------------------------------------------------------

model_columns = ['Name', 'Category', 'PreCalc', 'Parameter', 'Value']

sqlmodel = "select * from modellist where category='Classification'"

sqldata = 'select * from datasetcls'

try:

with oracledb.connect(user="test", password='1234', dsn="localhost/xepdb1") as conn:

with conn.cursor() as cursor:

df_model = pd.DataFrame(cursor.execute(sqlmodel), columns=model_columns)

df_data = pd.DataFrame(cursor.execute(sqldata))

except oracledb.Error as e:

logger.error(f'Failed to fetch data from the database ({str(e)})')

return

logger.debug(f"list of the models\n{df_model}")

logger.debug(f"head of the dataset\n{df_data.head()}")

# ------------------------------------------------------------------

# data pre-processing

# ------------------------------------------------------------------

df_data = df_data.dropna()

# separate the dataset into features and labels

x, y = df_data.iloc[:, 0:-1], df_data.iloc[:, -1]

logger.debug(f"the features\n{x}")

logger.debug(f"the labels\n{y}")

# default parameters for dataset

test_size = 0.3

standardize = True

random_state = 123

# ------------------------------------------------------------------

# prepare the training dataset and the test dataset, standardize

# ------------------------------------------------------------------

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=random_state)

if standardize:

logger.info('Standardize the training and test datasets')

sc = StandardScaler()

x_train = sc.fit_transform(x_train)

x_test = sc.transform(x_test)

# ------------------------------------------------------------------

# create and fit and evaluate the models

# ------------------------------------------------------------------

model_list = df_model.Name.unique().tolist()

eval_res = []

if len(model_list) > 0:

model_list.sort()

mf = ClassifierFactory()

for name in model_list:

# create a model

model, desc = mf.newclassifier(name, df_model)

if model is not None:

logger.info(desc)

# execute fit, predict, evaluation

score_model, acc_train, acc_test = mf.execute(model, x_train, y_train, x_test, y_test)

eval_res.append([name, score_model, acc_train, acc_test])

logger.info(f' [Score: {score_model} Accuracy (Training): {acc_train} Accuracy (Test): {acc_test}]')

else:

logger.warning(desc)

score_model, acc_train, acc_test = None, None, None

logger.warning(f' [Score: {score_model} Accuracy (Training): {acc_train} Accuracy (Test): {acc_test}]')

# ------------------------------------------------------------------

# visualize the evaluation results

# ------------------------------------------------------------------

if len(eval_res) > 0:

logger.info('Visualize the evaluation results')

columns = ['Model', 'Score', 'Accuracy (Training)', 'Accuracy (Test)']

eval_res.sort()

df = pd.DataFrame(eval_res, columns=columns)

logger.debug(f"evaluation results\n{df}")

plt.grid(which='major')

# reference

xr = np.array(

[round(pd.DataFrame.min(df.iloc[:, 2:3]) - 0.05, 1), round(pd.DataFrame.max(df.iloc[:, 2:3]) + 0.05, 1)])

yr = xr.copy()

plt.plot(xr, yr, color='#D2D5D1')

sns.scatterplot(

data=df,

x='Accuracy (Training)',

y='Accuracy (Test)',

marker='o',

hue=df['Model']

)

plt.plot()

dev = (xr.max() - xr.min()) / 100.0

for eval in df.values:

score = round(eval[1], 2)

x = eval[2] + dev

y = eval[3] + dev

plt.text(x, y, score)

plt.legend(loc='best')

plt.suptitle('Compare Classification Models')

plt.show()

else:

logger.info('No evaluation results are available for visualization')

logger.info('process finishes')

modelfactory.py

import numpy as np

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import GradientBoostingClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier

from sklearn.svm import SVC

from sklearn.svm import LinearSVC

from sklearn.metrics import accuracy_score

class ClassifierFactory():

__dtc = 'DecisionTreeClassifier'

__gbc = 'GradientBoostingClassifier'

__lrc = 'LogisticRegression'

__rfc = 'RandomForestClassifier'

__svc = 'SVC'

__svl = 'LinearSVC'

@property

def dtc(self):

return self.__dtc

@property

def gbc(self):

return self.__gbc

@property

def lrc(self):

return self.__lrc

@property

def rfc(self):

return self.__rfc

@property

def svc(self):

return self.__svc

@property

def svl(self):

return self.__svl

def newclassifier(self, name, params):

"""

Acting as a factory method, new a classification model with the parameters specified by params.

:param name: str,

Name of the model to be instantiated.

:param params: pandas.DataFrame,

Parameters passed to the model in the creation.

Contains at least Name and Parameter and Value columns.

:return:

model: object, Instance of the model |

desc: str, description of the model

"""

# random_state for DecisionTreeClassifier, GradientBoostingClassifier

# RandomForestClassifier, SVC, LinearSVC

random_state = 0

# max_depth for DecisionTreeClassifier

max_depth = None

# learning_rate for GradientBoostingClassifier

learning_rate = 0.2

# n_estimators for RandomForestClassifier

n_estimators = 100

# max_iter for LinearSVC

max_iter = 5000

# DecisionTreeClassifier

if name == self.dtc:

df_dtc = params.loc[params.Name == self.dtc, :]

if len(df_dtc) > 0:

if 'max_depth' in df_dtc.Parameter.unique():

max_depth = int(df_dtc.query("Parameter=='max_depth'").Value.tolist()[0])

if 'random_state' in df_dtc.Parameter.unique():

random_state = int(df_dtc.query("Parameter=='random_state'").Value.tolist()[0])

desc = f'{name}(max_depth={max_depth}, random_state={random_state})'

model = DecisionTreeClassifier(max_depth=max_depth,

random_state=random_state

)

# GradientBoostingClassifier

elif name == self.gbc:

df_gbc = params.loc[params.Name == self.gbc, :]

if len(df_gbc) > 0:

if 'random_state' in df_gbc.Parameter.unique():

random_state = int(df_gbc.query("Parameter=='random_state'").Value.tolist()[0])

if 'learning_rate' in df_gbc.Parameter.unique():

learning_rate = float(df_gbc.query("Parameter=='learning_rate'").Value.tolist()[0])

desc = f'{name}(random_state={random_state}, learning_rate={learning_rate})'

model = GradientBoostingClassifier(random_state=random_state,

learning_rate=learning_rate

)

# LogisticRegression

elif name == self.lrc:

df_lrc = params.loc[params.Name == self.lrc, :]

if len(df_lrc) > 0:

if 'random_state' in df_lrc.Parameter.unique():

random_state = int(df_lrc.query("Parameter=='random_state'").Value.tolist()[0])

desc = f'{name}(random_state={random_state})'

model = LogisticRegression(random_state=random_state

)

# RandomForestClassifier

elif name == self.rfc:

df_rfc = params.loc[params.Name == self.rfc, :]

if len(df_rfc) > 0:

if 'n_estimators' in df_rfc.Parameter.unique():

n_estimators = int(df_rfc.query("Parameter=='n_estimators'").Value.tolist()[0])

if 'random_state' in df_rfc.Parameter.unique():

random_state = int(df_rfc.query("Parameter=='random_state'").Value.tolist()[0])

desc = f'{name}(n_estimators={n_estimators}, random_state={random_state})'

model = RandomForestClassifier(n_estimators=n_estimators,

random_state=random_state

)

# SVC

elif name == self.svc:

df_svc = params.loc[params.Name == self.svc, :]

if len(df_svc) > 0:

if 'random_state' in df_svc.Parameter.unique():

random_state = int(df_svc.query("Parameter=='random_state'").Value.tolist()[0])

desc = f'{name}(random_state={random_state})'

model = SVC(random_state=random_state

)

# LinearSVC

elif name == self.svl:

df_svl = params.loc[params.Name == self.svl, :]

if len(df_svl) > 0:

if 'random_state' in df_svl.Parameter.unique():

random_state = int(df_svl.query("Parameter=='random_state'").Value.tolist()[0])

if 'max_iter' in df_svl.Parameter.unique():

max_iter = int(df_svl.query("Parameter=='max_iter'").Value.tolist()[0])

desc = f'{name}(random_state={random_state}, max_iter={max_iter})'

model = LinearSVC(random_state=random_state,

max_iter=max_iter

)

# Undefined

else:

desc = f'{name} is not implemented as of now'

model = None

return model, desc

def execute(self, model, x_train, y_train, x_test, y_test):

"""

Calls fit and predict and score on the model.

Computes accuracy score for the training data and the test data

:param model: instance of the model

:param x_train: the training features

:param y_train: the training labels

:param x_test: the test features

:param y_test: the test labels

:return: score_model: float, score of the test data |

acc_train: float, accuracy score of the training data |

acc_test: float, accuracy score of the test data

"""

score_model = None

acc_train = None

acc_test = None

if model is not None:

# learning

model.fit(x_train, y_train)

# predict

y_test_pred = model.predict(x_test)

y_train_pred = model.predict(x_train)

# evaluate

score_model = model.score(x_test, y_test)

acc_train = accuracy_score(y_train, y_train_pred)

acc_test = accuracy_score(y_test, y_test_pred)

return score_model, acc_train, acc_test

+tech Blog

Friday, September 1, 2023

Machine Learning - Build and Compare Classification Models

Models to Be Compared

Dataset

Main Flow

How Are Models Created?

How to Add a Model?

How to Switch the Dataset?

Reference

Appendix: Source Code

No comments:

Post a Comment

Oracle - JOINs

Labels

Wikipedia