+tech Blog

To develop a machine learning model in response to a business case, we can readily think of the following process.

In general, we collect relevant business data, decide on features and labels if applicable, cleanse and transform the data, and select features if necessary. Then, we fit the model to the data and validate it. Accuracy is a common metric for a classification model, whereas the R-squared score is commonly used for a regression model. The trained model, which is supposed to have learned the patterns in the data, is used to predict new data. Finally, we deploy the model in the production environment. This is how business intelligence is gained.

Depending on the specific business case, the process, as well as the detailed tasks in each activity, may change in practice. Moreover, since tuning a model’s performance is iterative by nature, we may repeat the whole process or part of it. Nevertheless, keeping it in mind guides us through the development.

The Iris dataset, for example, is a classical classification case in machine learning. It was first published in a paper by Prof. Fisher in 1936, “The use of multiple measurements in taxonomic problems.” The data itself was gathered and organized by Prof. Anderson, who systematically examined Iris flowers in Quebec, Canada, to study evolutionary biology.

The dataset covers three Iris species: Setosa, Versicolor and Virginica. It consists of 150 samples, 50 for each species. Each sample comes with four features, sepal length, sepal width, petal length and petal width, all measured in centimeters.

From an engineering perspective, it can be summarized as follows.

- Features are on a similar scale, sharing the same physical unit;

- Number of samples, 150 > Number of features, 4;

- All values of the label are available, a supervised learning case;

- A multiclass classification, 3 types.

Since it is a well-prepared dataset, we won’t need to carry out data cleansing, missing data handling, feature categorizing, and so on.

However, how the features are distributed is something we shouldn’t overlook. Many scikit-learn estimators tend to behave better when each feature looks roughly like standard normally distributed data, that is, a Gaussian distribution with a mean of 0 and a standard deviation of 1, as shown in the graph below.

We can use Matplotlib, a powerful plotting library, to visualize the distribution of the Iris features. The graphs below are generated with its hist function.
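As a minimal sketch of how such histograms could be drawn, the snippet below loads Iris through scikit-learn's load_iris and plots one histogram per feature. The 2 x 2 layout, the figure size and bins=20 are illustrative assumptions, not the settings used for the original graphs.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, feature_names = iris.data, iris.feature_names

# One histogram per feature, laid out as a 2 x 2 grid (layout is an assumption).
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, column, name in zip(axes.ravel(), X.T, feature_names):
    ax.hist(column, bins=20)  # bins=20 is an arbitrary choice
    ax.set_title(name)
    ax.set_ylabel("count")
fig.tight_layout()
# Call fig.savefig("some_file.png") or switch backend and use plt.show() to view the result.
```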

It is hard to say that any of them is an ideal normal distribution. The two sepal distributions on the top are, to some extent, closer to the “bell shape”, though the left one is somewhat skewed. By contrast, the two petal distributions have separate peaks, more like a bimodal shape. As we move forward, we’ll try a few data transformations in an attempt to map them closer to a normal distribution. At the same time, we’d like to observe how this affects a model’s performance.

The data is clean, and our objective is to classify the instances based on the above features in this case. Next, would we immediately fit the model? We often do this. Wait. Let’s explore more aspects in developing a supervised model, from a broader perspective.

What if we get tens of features in a real business case? We may want to look into which features matter more, in other words, the importance of features. Yes, feature selection is a topic we are going to take on.

Feature Selection

First, let’s try univariate feature selection. This can be done with the SelectKBest class, which keeps the k highest-scoring features according to a univariate statistical test. For more information on SelectKBest, refer to: Feature selection.

As the chart shows, petal length gets the highest score, and petal width comes next.
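A minimal sketch of the univariate selection described above. The article does not say which score function was used, so f_classif (the ANOVA F-test, which is SelectKBest's default) and k=2 are assumptions here.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Score every feature independently against the label and keep the best two.
# f_classif and k=2 are assumptions made for this sketch.
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X, y)

for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: {score:.1f}")

selected = [iris.feature_names[i] for i in selector.get_support(indices=True)]
print("selected:", selected)
```

With this score function, the two petal features are the ones kept, which matches the ranking described above.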

Another way to select features is to use a model. If we already know which model will be used for the business case, we can simply use that model. If not, we can try a linear model such as LogisticRegression or LinearSVC for classification, or Lasso for regression. As mentioned previously, this is an iterative development process. We can come back and refine the selection as we gain more insight into the case.

To do this, instantiate a model, pass it to SelectFromModel as the estimator, and then fit SelectFromModel to the Iris dataset. In the example below, the charts show the coefficients that a LinearSVC estimator assigns to the features. The larger the absolute value of a coefficient, the more important the feature is. We can also cap the number of selected features with the max_features parameter, for example, 2. As a result, petal length and petal width are suggested by the selector.
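The steps above can be sketched as follows. The LinearSVC settings (C=1.0, max_iter=10000, random_state=0) are assumptions, and threshold=-np.inf is used so that the selection depends on max_features only. Which two features come out on top depends on these settings and on whether the data has been scaled, so the sketch prints the result rather than asserting it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

iris = load_iris()
X, y = iris.data, iris.target

# Hyper-parameters below are illustrative assumptions, not the article's settings.
estimator = LinearSVC(C=1.0, max_iter=10000, random_state=0)

# threshold=-np.inf disables the importance threshold, so exactly
# max_features features are kept.
selector = SelectFromModel(estimator, max_features=2, threshold=-np.inf)
selector.fit(X, y)

# For a multiclass linear model, coef_ has one row per class; summing the
# absolute values per column gives one importance value per feature.
importance = np.abs(selector.estimator_.coef_).sum(axis=0)
for name, value in zip(iris.feature_names, importance):
    print(f"{name}: {value:.3f}")

selected = [iris.feature_names[i] for i in selector.get_support(indices=True)]
print("selected:", selected)
```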

Data Transformation

Data transformation, a math job, is an essential preprocessing task. Why would we need to transform the data at all? The main reason is that many models tend to perform better when the data is roughly normally distributed and on a comparable scale. In fact, some models, such as linear SVMs, are sensitive to whether the data is standardized or not. For instance, when SelectFromModel is used alongside a LinearSVC estimator to select features, the coefficients differ greatly between a standardized dataset and a non-standardized one.

As for data, not many sets are as easy as Iris. When we get such a dataset at work, we can lean back in the chair and think about grabbing a coffee. Take the California housing data as an example: it has columns such as income, house age, average number of rooms, average number of bedrooms, population, average number of household members, latitude and longitude. They use different measurement units, and their values vary widely. Standardized scaling is recommended before the data is fed into a regressor. Additionally, we may have date and time fields, text fields, categorical fields and so on that need to be converted to numerical values.

Scikit-learn provides a number of data transformation tools. StandardScaler, a commonly used one, transforms each feature so that it has a mean of 0 and a standard deviation of 1. Technically, it doesn’t change the shape of the distribution; it only shifts and rescales the feature’s values.

There are non-linear transformers available as well, such as PowerTransformer and QuantileTransformer. PowerTransformer maps data closer to a normal distribution using the Box-Cox or Yeo-Johnson method (Box-Cox requires strictly positive values), whereas QuantileTransformer can map it to either a normal or a uniform distribution.

The graph below shows what the distributions of several common data samples look like after being transformed by these tools. Look at the bimodal data in particular: it looks like a Gaussian distribution after the quantile transformation. On the other hand, such a transformation raises the concern that it could distort relationships within and across the features.

For more information about how to generate this graph, see: Map data to a normal distribution

In this walkthrough, we’ll train several models with a standardized set, a Box-Coxed set and a quantiled set, and see what we get. The following charts illustrate how the features distribute after transformation.
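A minimal sketch of how the three transformed sets could be produced. The dictionary keys, n_quantiles=100 and random_state=0 are assumptions for illustration. Box-Cox is applicable here because every Iris measurement is strictly positive.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import PowerTransformer, QuantileTransformer, StandardScaler

X = load_iris().data

transformers = {
    "standardized": StandardScaler(),
    # Box-Cox only accepts strictly positive values, which holds for Iris.
    "box-coxed": PowerTransformer(method="box-cox"),
    # n_quantiles must not exceed the number of samples (150 here).
    "quantiled": QuantileTransformer(
        n_quantiles=100, output_distribution="normal", random_state=0
    ),
}

datasets = {"original": X}
for name, transformer in transformers.items():
    datasets[name] = transformer.fit_transform(X)

for name, data in datasets.items():
    print(f"{name}: mean={data.mean(axis=0).round(2)}, std={data.std(axis=0).round(2)}")
```

In a real project the transformer would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into preprocessing.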

Model Selection

Choosing the right estimator is critical in a machine learning case. Scikit-learn provides a reference process for model selection; refer to: Choosing the right estimator. Let’s follow it step by step to perform the initial selection for Iris.

Step 1: > 50 samples? Yes.

Step 2: Predicting a category? Yes.

Step 3: Do you have labeled data? Yes.

Step 4: <100K samples? Yes.

Step 5: LinearSVC. At this step, we’ve reached a recommended model; however, we’d like to go further and pick more candidate models so that we can compare them in a later section.

Step 6: Text data? No.

Step 7: KNeighbors Classifier. Good, we get another one. Keep moving on.

Step 8: SVC and Ensemble Classifiers. Ensemble classifiers commonly used in the field include RandomForestClassifier and GradientBoostingClassifier, both of which are built on decision trees; we also include a single DecisionTreeClassifier for comparison. If we categorize scikit-learn models in a simple manner, we can group them into simple models, for example LinearSVC and LogisticRegression, and complex models, for example SVC and the tree-based classifiers. Excellent. We’ll try both simple and complex models for Iris classification. Here is the full list of the initially picked models.

- LinearSVC
- LogisticRegression
- SVC(linear): Linear kernel
- SVC(rbf): Rbf kernel
- DecisionTreeClassifier
- RandomForestClassifier
- GradientBoostingClassifier

At this moment, we are still stuck with a bunch of abstract names. How does a model decide an instance’s type? I am not talking about the inner mathematical algorithm, because we don’t modify it at any point in the process. Even the parameters are taken care of by the model itself in the training phase. What we can do is try a variety of hyper-parameter combinations to find the best estimator. The point here is that we need something visible and tangible to help us build a mental picture. To get an idea of this, we can visualize the decision boundary of a model.

Remember, when we did the feature selection, petal length and petal width were suggested as the two most important features. We’re going to use them as the input data, and use the DecisionBoundaryDisplay class to visualize the decision boundaries of all the selected models.

Note: An 80-to-20 ratio is used to split the train-test data.

For more information about how to draw a decision boundary, see: Models comparison.

The linear models draw straight lines to separate the Iris classes; the SVC(rbf) model builds curved boundaries; the decision tree and the ensembles compose maze-like blocks. Clearly, their decision areas differ in shape. The colors in the charts above represent Iris classes: when an instance falls into the green area, it is classified as Setosa; in the red area, Versicolor; in the purple area, Virginica.

This, visually and conceptually, helps us understand how an estimator does the classification. But it won’t tell us which estimator is a better fit for Iris, because we don’t know how the features relate to the label. Is it a linear relationship or a non-linear one? To put it more precisely, we assume there are patterns hidden in the data, and machine learning is an effective means to find them and hence predict new data based on them. In gaining this business intelligence and subsequently contributing to business decision making, we rely on models. Needless to say, we need to make sure the models learn the patterns rather than the noise in the data.

Tools to Evaluate a Model

Scikit-learn provides tools to evaluate the quality of a classifier’s output. For example, for a binary problem, we can draw a ROC curve to see its performance; for a multiclass problem, we can check the confusion matrix. Deep down, prediction accuracy is what we are looking at.

Receiver Operating Characteristic (ROC) Curve

In a nutshell, the larger the area under the curve (AUC), the better the estimator performs. The top left point represents the ideal case, meaning all positives are predicted as positive and no negatives are predicted as positive. For the Iris data, if we modify the target values a little, for example, by treating the target value 1 as True and the others, 0 and 2, as False, it becomes a binary problem. With this modified dataset, we plot the ROC curves for LinearSVC, SVC and RandomForestClassifier in the chart below. This can be done by using the RocCurveDisplay class. As the chart shows, SVC has the largest AUC, so we can assume it performs best in this case.
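A minimal sketch of the binary setup described above, computing the ROC points and the AUC numerically with roc_curve and roc_auc_score instead of plotting them. Only SVC is shown; the default RBF kernel, the stratified split and random_state=0 are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X = iris.data
# Target value 1 becomes the positive class; 0 and 2 become the negative class.
y_binary = (iris.target == 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, stratify=y_binary, random_state=0
)

model = SVC(kernel="rbf").fit(X_train, y_train)

# ROC analysis needs a continuous score per sample, not the hard 0/1 prediction.
scores = model.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC: {auc:.3f}")
```

RocCurveDisplay.from_estimator(model, X_test, y_test) would draw the same curve on a Matplotlib axis.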

Confusion Matrix

The diagonal values of the confusion matrix are the counts of correct predictions. The higher they are, the better the estimator. It works for a multiclass problem. Using Iris as an example, we use the ConfusionMatrixDisplay class to draw the confusion matrices for the picked models, as shown in the following view. In this example, the test data contains 10 Setosa, 8 Versicolor and 12 Virginica samples.

The confusion matrix can be normalized. When it is normalized over the true labels, each diagonal value is calculated with this formula: True Positive / (True Positive + False Negative). It follows the same rule: the greater the values, the better the estimator’s performance. A value of 1 stands for the ideal result. Below are the normalized matrices for those models.

Confusion Matrix with Normalization
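A minimal sketch of computing both the raw and the normalized confusion matrix with confusion_matrix. LogisticRegression stands in for the picked models, and the stratified split (which yields 10 test samples per class rather than the 10/8/12 split mentioned above) is an assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# normalize="true" divides each row by the number of true samples in that class,
# so each diagonal entry equals TP / (TP + FN) for its class.
cm_norm = confusion_matrix(y_test, y_pred, normalize="true")

print(cm)
print(np.round(cm_norm, 2))
```

ConfusionMatrixDisplay.from_predictions(y_test, y_pred, normalize="true") would render the same matrix as a chart.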

Model Comparison

In this section, we’ll put all the selected models to work. We’ll train each model not only with the original dataset but also with the transformed ones, namely the standardized set, the Box-Coxed set and the quantiled set. Meanwhile, cross-validation with 4 folds will be carried out to check whether the models overfit.

The results are visualized in the graph below, with the train score as the X axis and the test score as the Y axis. If a solid circle is under the dotted reference line, it means the train score is greater than the test score. If above, the train score is smaller.
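A minimal sketch of the 4-fold comparison on the original dataset, using cross_validate with return_train_score=True. Hyper-parameters are left at their defaults apart from max_iter and random_state, which are assumptions, so the numbers will not necessarily match the article's results.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_iter and random_state values are assumptions for this sketch.
models = {
    "LinearSVC": LinearSVC(max_iter=10000, random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC(linear)": SVC(kernel="linear"),
    "SVC(rbf)": SVC(kernel="rbf"),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}

summary = {}
for name, model in models.items():
    # cv=4 uses stratified 4-fold splitting for classifiers.
    result = cross_validate(model, X, y, cv=4, return_train_score=True)
    summary[name] = (result["train_score"].mean(), result["test_score"].mean())
    print(f"{name}: train={summary[name][0]:.3f}, test={summary[name][1]:.3f}")
```

Repeating the loop for each transformed dataset gives the remaining rows of the comparison.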

The means of the train scores and test scores by model by dataset are summarized in the following table.

Dataset	Model	Mean of Train Scores	Mean of Test Scores
Original	LinearSVC	0.973	0.947
	LogisticRegression	0.975	0.967
	SVC(linear)	0.984	0.960
	SVC(rbf)	0.969	0.960
	DecisionTreeClassifier	1.000	0.940
	RandomForestClassifier	1.000	0.947
	GradientBoostingClassifier	1.000	0.954
Standardized	LinearSVC	0.958	0.934
	LogisticRegression	0.976	0.954
	SVC(linear)	0.982	0.940
	SVC(rbf)	0.980	0.960
	DecisionTreeClassifier	1.000	0.940
	RandomForestClassifier	1.000	0.947
	GradientBoostingClassifier	1.000	0.954
Box-Coxed	LinearSVC	0.958	0.920
	LogisticRegression	0.976	0.947
	SVC(linear)	0.982	0.947
	SVC(rbf)	0.973	0.954
	DecisionTreeClassifier	1.000	0.940
	RandomForestClassifier	1.000	0.947
	GradientBoostingClassifier	1.000	0.954
Quantiled	LinearSVC	0.962	0.934
	LogisticRegression	0.982	0.954
	SVC(linear)	0.986	0.967
	SVC(rbf)	0.982	0.967
	DecisionTreeClassifier	1.000	0.940
	RandomForestClassifier	1.000	0.947
	GradientBoostingClassifier	1.000	0.954

An ideal model should be able to find the general pattern rather than the noise and specific details of the data. Such a model generalizes well, which makes it able to predict unseen data consistently. The test score, showing how well an estimator generalizes, is the key indicator that we should look at. The train score is important as well, as it indicates how well an estimator learns the patterns in the training data.

The ideal scenario is that both the train score and the test score are high, and the delta, the discrepancy between them, is trivial. A high train score and a much lower test score usually suggest overfitting, meaning the model performs well on the training data but badly on new data due to a lack of generalization. For instance, a train score of 1 combined with a test score of less than 1 is a typical sign of overfitting. If both the train score and the test score are low, it indicates underfitting, since the model failed to capture the underlying relationships in the data. What about a lower train score and a higher test score? It is unlikely. It may suggest that the data split happens to be unrepresentative for this particular case.

The complex models other than SVC(rbf), namely DecisionTreeClassifier, RandomForestClassifier and GradientBoostingClassifier, repeated this typical overfitting pattern in every fold during cross-validation. Therefore, they are dropped and will not be included in the grid search phase.

In the table above, no single model has the highest test score on every dataset: LogisticRegression leads on the original set, SVC(rbf) on the standardized and Box-Coxed sets, and SVC(linear) ties with SVC(rbf) on the quantiled set. In any case, it’s hard to say which is the best model at this point, since the hyper-parameters of each model haven’t been optimized yet.

Consequently, the three simple models, LinearSVC, LogisticRegression and SVC(linear), along with the complex model SVC(rbf), are taken to the grid search phase.

For more information about overfitting, refer to: Overfitting vs Underfitting.

Grid Search

Before diving into GridSearch, we’d like to get an idea of which hyper-parameter matters more to a model’s performance when it varies.

C Regularization

As C increases, both the train score and the test score go up and then reach a plateau. For SVC, the test score starts going down once C exceeds a certain point, as illustrated by the following charts.

Clearly, C has a big effect on a model’s performance.
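A minimal sketch of how the effect of C could be measured with validation_curve. The range np.logspace(-3, 3, 7), the RBF kernel and cv=4 are assumptions chosen to match the rest of the walkthrough.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

C_range = np.logspace(-3, 3, 7)  # 0.001, 0.01, ..., 1000

# For every value of C, the estimator is cross-validated with 4 folds.
train_scores, test_scores = validation_curve(
    SVC(kernel="rbf"), X, y, param_name="C", param_range=C_range, cv=4
)

# Each row corresponds to one value of C, each column to one fold.
for C, train, test in zip(C_range, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"C={C:g}: train={train:.3f}, test={test:.3f}")
```

Swapping param_name to "gamma" (with a suitable range) gives the corresponding curve for the gamma hyper-parameter.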

Gamma

For SVC with a linear kernel, the change in gamma doesn’t influence the model’s performance at all, because the linear kernel does not use gamma.

For SVC with an RBF kernel, both the train and test scores pick up slowly, then the test score starts decreasing when gamma becomes greater than 0.1. As gamma grows beyond 10, it drops sharply.

Tolerance

All the models show the same pattern, moving from a high plateau to a low plateau when tolerance increases beyond a threshold.

Max Iteration

As long as a model can converge with enough iterations, the max_iter hyper-parameter seems to have no effect on a model’s accuracy.

Learning Curve

Early on, increasing the size of the training set results in higher scores. After the size reaches 80, the train score doesn’t go up any more, although the learning curve isn’t a perfectly straight line. This suggests that adding more samples will no longer improve the model’s accuracy. This applies to all models. Unless stated otherwise, an 80-to-20 ratio is used to split the Iris dataset.

C, the inverse of the regularization strength, is the key hyper-parameter we are going to grid-search. Others, such as penalty, loss and solver, haven’t been included in the picture. They are designed for particular purposes. Take penalty in LinearSVC as an example: L1 performs automatic feature selection by driving irrelevant feature weights to exactly zero, whereas L2 pushes all weights toward zero without setting them to zero, which suits a scenario where all features are useful. For Iris, introducing either penalty or loss into the grid search didn’t produce an improved result. The details are omitted from this article.
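A minimal sketch of how a learning curve could be computed with learning_curve. The choice of SVC(kernel="linear"), ten evenly spaced training sizes and cv=4 are assumptions. shuffle=True matters here because the Iris samples are ordered by class, so an unshuffled small subset could contain a single class only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="linear"),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 10),  # fractions of the available training fold
    cv=4,
    shuffle=True,      # Iris is sorted by class, so shuffle before subsetting
    random_state=0,
)

# Each row corresponds to one training-set size, each column to one fold.
for size, train, test in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={size}: train={train:.3f}, test={test:.3f}")
```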

Exhaustive Grid Search and Cross Validation

Below are the grids designed for the candidate models.

Model	C	gamma
LinearSVC	np.logspace(-3, 3, 7)	-
LogisticRegression	np.logspace(-3, 3, 7)	-
SVC(linear)	np.logspace(-3, 3, 7)	-
SVC(rbf)	np.logspace(-3, 3, 7)	np.logspace(-3, 1, 5)
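As an illustration, the SVC(rbf) row of the grid above could be passed to GridSearchCV as follows. The stratified 80-to-20 split, cv=4 inside the search and random_state=0 are assumptions; the other models would only need a grid over C.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 7 values of C times 5 values of gamma = 35 candidate combinations.
param_grid = {
    "C": np.logspace(-3, 3, 7),
    "gamma": np.logspace(-3, 1, 5),
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=4)
search.fit(X_train, y_train)

train_score = search.score(X_train, y_train)
test_score = search.score(X_test, y_test)
print(search.best_estimator_)
print(f"train={train_score:.3f}, test={test_score:.3f}")
```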

Run an exhaustive grid search. The results are summarized in the table below.

Dataset	Model	Train Score	Test Score	BestEstimator
Original	LinearSVC	0.967	0.933	LinearSVC(C=1.0)
	LogisticRegression	0.992	0.967	LogisticRegression(C=10.0)
	SVC(linear)	0.992	0.967	SVC(kernel='linear', C=1.0)
	SVC(rbf)	0.983	0.967	SVC(C=10.0, gamma=0.01)
Standardized	LinearSVC	0.967	0.967	LinearSVC(C=1000.0)
	LogisticRegression	0.983	0.967	LogisticRegression(C=10.0)
	SVC(linear)	0.983	0.967	SVC(C=1000.0, kernel='linear')
	SVC(rbf)	0.975	0.967	SVC(C=100.0, gamma=0.01)
Box-Coxed	LinearSVC	0.967	0.967	LinearSVC(C=1000.0)
	LogisticRegression	0.983	0.967	LogisticRegression(C=10.0)
	SVC(linear)	0.983	0.967	SVC(kernel='linear', C=1.0)
	SVC(rbf)	0.992	0.967	SVC(C=1000.0, gamma=0.001)
Quantiled	LinearSVC	0.975	0.967	LinearSVC(C=10.0)
	LogisticRegression	0.992	0.967	LogisticRegression(C=10.0)
	SVC(linear)	0.992	0.967	SVC(kernel='linear', C=1.0)
	SVC(rbf)	0.992	1.000	SVC(C=100.0, gamma=0.001)

*Note: The dual hyper-parameter in LinearSVC is set to ‘auto’.

At first glance, with high train scores, high test scores and low deltas, the results look pretty good. But for the SVC(rbf) model on the quantiled set, the test score is greater than the train score. With only 30 test samples, this is most likely an artifact of this particular data split.

Before concluding which model performs better for the classification, we’re going to collect more performance data. Cross-validation is an effective approach for this. We divide the Iris data into four folds; each time, we pick three folds as the training set and leave one as the test set, and we continue until every fold has been used for testing. So, we run 4 iterations in total. We can then compare the average scores of each model.

The cross_validate function in scikit-learn also reports execution time, namely fit time and score time. The cross-validation results, including scores and execution time, are illustrated in the chart and the table below.

Dataset	Model	Mean of Train Score	Mean of Test Score	Mean of Fit Time (s)	Mean of Score Time (s)
Original	LinearSVC	0.973	0.947	0.004	0.002
	LogisticRegression	0.987	0.960	0.061	0.000
	SVC(linear)	0.984	0.961	0.002	0.000
	SVC(rbf)	0.969	0.961	0.002	0.001
Standardized	LinearSVC	0.987	0.934	0.008	0.002
	LogisticRegression	0.982	0.947	0.020	0.002
	SVC(linear)	0.991	0.947	0.005	0.003
	SVC(rbf)	0.976	0.940	0.005	0.002
Box-Coxed	LinearSVC	0.989	0.934	0.010	0.001
	LogisticRegression	0.984	0.947	0.039	0.003
	SVC(linear)	0.982	0.947	0.025	0.005
	SVC(rbf)	0.976	0.940	0.031	0.004
Quantiled	LinearSVC	0.978	0.934	0.014	0.005
	LogisticRegression	0.984	0.954	0.029	0.008
	SVC(linear)	0.987	0.967	0.012	0.008
	SVC(rbf)	0.978	0.974	0.017	0.011
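A minimal sketch of collecting scores and timings for one tuned candidate, SVC(rbf) on quantile-transformed data. Wrapping the transformer and the model in a pipeline is an assumption of this sketch (the article does not say how the transformed sets were fed to cross-validation); it makes each fold fit the transformer on its own training part only. The hyper-parameters are taken from the grid search table above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

model = make_pipeline(
    # n_quantiles must not exceed the size of a training fold (about 112 here).
    QuantileTransformer(n_quantiles=100, output_distribution="normal", random_state=0),
    SVC(kernel="rbf", C=100.0, gamma=0.001),
)

result = cross_validate(model, X, y, cv=4, return_train_score=True)

# fit_time and score_time are reported in seconds, one value per fold.
for key in ("train_score", "test_score", "fit_time", "score_time"):
    print(f"mean {key}: {result[key].mean():.3f}")
```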

Scores

Both the test scores and the train scores are high, with the test scores slightly lower. It seems to be an ideal scenario. Let’s take a further look at each model’s performance.

For the quantiled dataset, SVC(rbf) takes the lead, scoring 0.974; SVC(linear) follows.

For the original dataset, both SVC models, SVC(rbf) and SVC(linear), scored 0.961.

At the same time, LogisticRegression and SVC(linear) lead on the standardized and Box-Coxed datasets, both scoring 0.947.

Another interesting observation is that standardizing the data doesn’t improve the models’ performance. All models gave lower test scores than they did on the original dataset. This could be explained by the nature of the Iris data. All the features are numerical lengths and widths, sharing the same physical unit. Standardization appears unnecessary.

Execution Time

Another aspect of comparing models’ performance is how much computing resource they consume, especially when the number of samples and the number of features grow substantially. As for Iris, given its small scale, the execution time may not be critical. But we include it in scope on purpose, to observe a more complete development process.

In most cases, LinearSVC had a lower score time than the other models, although the difference is very small. LogisticRegression comes next.

So, here comes the recommendation.

If we simply look at the scores, SVC(rbf) seems to be a better option. The data needs to be quantile transformed.

On the other hand, SVC(linear) achieves either top or second highest scores across all the datasets. On the original set, which is a reasonable choice given the nature of Iris features, SVC(linear) outperforms SVC(rbf) in execution time. Hence, it is recommended as a more balanced option.

Model Persistence

It seems fine to persist the model and deploy it to production. What if more fresh data comes in? Yes, growth in the data could trigger re-training of the model, or even result in a different model after going through the selection process again. But at this point, it is appropriate to save the model for future use.

Scikit-learn documents several ways to persist a trained model. Among them, pickle, a module in the Python standard library, is often used. For more information, refer to: Model persistence.
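A minimal sketch of persisting a trained model with pickle. The model is serialized to bytes in memory to keep the example self-contained; in practice the bytes would be written to a file with pickle.dump and read back with pickle.load. The choice of SVC(kernel="linear", C=1.0) follows the recommendation above.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC(kernel="linear", C=1.0).fit(X, y)

# Serialize the fitted estimator to bytes, then restore it.
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model should give exactly the same predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("predictions match:", same_predictions)
```

Two cautions apply: pickled data should only be loaded from a trusted source, and a model is best restored with the same scikit-learn version that was used to save it.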

What about algorithms inside models?

The algorithm implemented inside a model is actually the most important part, the core. But we gave it a low priority in this article. As users, we apply these pre-designed models to a variety of business scenarios. There is not much we can do to the inner algorithms. Another reason is that profound expertise is required when it comes to developing the algorithms themselves. We’re lucky that the scikit-learn developers work on them.

Using linear regression as an example, we could get a glimpse of it.

A problem is modeled by using a linear equation as follows:

$$Y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

where {x} stands for the features, Y represents the predicted target, {w} are the coefficients, and w0 is the intercept.

If there is only one feature, it is simplified as:

$$Y = w_0 + w_1 x_1$$

The goal is to find the optimal parameters, {w} and w0, so that the equation best describes the relationship between the input and the target variable.

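How? We minimize the mean squared error (MSE), either in closed form, as ordinary least squares allows, or through iterative calculations that stop once the improvement falls below a tolerance value.

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - Y_i)^2$$

that is, the sum of squared differences between the observed targets y and the predicted targets Y, averaged over the m samples.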
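To make this concrete, here is a minimal sketch that fits a one-feature linear model in two ways: by solving the least-squares problem directly with NumPy, and with scikit-learn's LinearRegression. The synthetic data (true intercept 2, true slope 3, Gaussian noise) is made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y = 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * x[:, 0] + rng.normal(scale=0.5, size=100)

# Least squares solved directly: prepend a column of ones for the intercept w0.
A = np.column_stack([np.ones(len(x)), x])
w0, w1 = np.linalg.lstsq(A, y, rcond=None)[0]

# The same fit through scikit-learn.
model = LinearRegression().fit(x, y)

# Mean squared error between observed targets y and predicted targets Y.
Y = model.predict(x)
mse = np.mean((y - Y) ** 2)

print(f"NumPy:        w0={w0:.3f}, w1={w1:.3f}")
print(f"scikit-learn: w0={model.intercept_:.3f}, w1={model.coef_[0]:.3f}")
print(f"MSE: {mse:.3f}")
```

Both approaches minimize the same objective, so they should recover the same intercept and slope up to floating-point error.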

For more information, see: Linear Models.