Supervised Learning

Author

Brit Davidson

Supervised Learning

Introduction to Supervised Learning

In contrast to our previous weeks, we are now looking at supervised learning, which has a different approach, set of questions, and set up than unsupervised learning. Supervised learning is where we are training models using a labelled dataset.Hence, the model is trained by example. In it’s simplest form, we will have a dataset with the target variable (i.e., the variable we are predicting) and we split this into a training set and a test set. The training set is what we use to train the model, and once the model is trained, we test it using the test set. Essentially we can then evaluate how well the model works by comparing the predicted and actual predicted values.

Another way to think about this, is to compare this idea to classical programming (see Figure below). Where in classical programming, rules are inputted (aka a program) and data is then processed according to these rules… and then outputs the answer or solution. In machine learning, data is inputted as well as the answers expected from the data, and then outputs a series of rules. These rules can then be applied to new data to produce predictions (aka a model).

Overview of Classical v Machine Learning Programming

So… in supervised ML… how do machines learn?

In its simplest form, to train a ML model, we need three key components:

Input data – the variables (e.g., images of cats and dogs)
Examples of the expected output (e.g., labels saying ‘dog’, ‘cat’)
Evaluation Methods – how do we know the model is performing well?

The ML model takes in the data and transforms the data into meaningful outputs. These models therefore try to find appropriate representations of the data. For example:

Learning Representations from Data If we look at the LHS image, we can see a series of black and white points. Our input data here are the coordinates of our points, the expected output are the colors of our points (e.g., black or white), and our evaluation will be the % of correctly classified points. The LHS is our raw data. In the middle, the model will transform the data in order to ‘split’ the data into homogeneous groups for prediction, and on the RHS providing a ‘better representation’ of the data in order to classify the points. Using this extremely simple example, looking at the RHS, assuming the x/y axes origin is [0,0], we can classify the white and black points easily with the following two rules:

Black points; x >0
White points; x <0

Machine Learning Pipeline

Similar to clustering, there are a number of steps involved before you get to ML modelling. First, we need to prepare/clean/explore the data (e.g., look back to weeks 1-2), and this is where a lot of time will be spent. We then can consider what models are appropriate and consider those approaches. We can train models, critically evaluate the performance of them, and see how we can improve our model performances. However, the steps in the data analytics/ML pipeline are often varied and a lot of time can be spent iterating between them:

Types of Supervised ML

Broadly, there are two types of supervised ML in terms of the target variable:

Classification - where the target variable is categorical (e.g., blue, orange, purple). They can be further divided into:
- Binary Classification - where the target is binary (e.g., SPAM or HAM) or [0,1]
- Multi-class Classification - where the target has more than 2 classes (e.g., England, Scotland, Wales, N. Ireland, S. Ireland)
Regression - where the target variable is numerical (e.g., house prices)

In addition, supervised ML can have different learning styles:

Lazy - also known as instance-based learning and memory-based learning, where there the algorithm stores the training set in its original form without deriving general rules from it. When new data are fed into the algorithm, it searches for the training data most similar and produces the label. Here, we have less training time and more prediction time.
Eager - is an approach where the algorithm constructs a model during training. These methods try to uncover the relations and patterns hidden in training data. Hence, the resulting model is a compact and abstract representation of the training dataset used.

Classification

A couple notes

First we will use two approaches to our supervised learning. We will first use the original packages for each algorithm itself (e.g., ranger for random forests) to show you how these packages work. Then we will focus on mlr3, which is effectively a wrapper that allows us to have one consistent, neat pipeline of code for our ML. This overall will help you on your ML journey, where you will use a ‘wrapper’ like package, as it means everything you may want to do with your ML modelling (e.g., evaluation, tuning, train/test splits, inner/outer loop specs) are all in-built and we simply define them rather than needing to manually do this. Note: there are other wrapper packages, such as caret and tidymodels, but we won’t be looking at those. This course will use mlr3 as it continues to have (in my opinion) the highest functionality and to a degree looks and feels similar to python, which will aim to help you transition if you plan to.

Second – whenever you deploy any ML… UNDERSTAND HOW THE ALGO WORKS CONCEPTUALLY. Make sure you understand them and how to treat the data going into it…for example:

if you have a KNN as we talked about, they rely on distance measures – probably they benefit from scaling
if you scale your data before putting into trees… this is not needed and likely makes interpretation harder/worse
understand if you can put mixed variable types into the algorithm… many do not allow this

Evaluation Metrics for Classification

Before we jump into training models, it is important to show you some evaluation metrics so we can understand if our models (broadly) are performing well… Please note, this is not an exhaustive list. Next week we will look at some metrics to evaluate regression learning models.

Confusion Matrix

The Confusion Matrix is the visual representation of the Actual vs Predicted values. It is a performance evaluation tool for classification algorithms. This can be used for binary or multi-class task evaluation. It produces a table that looks like the figure below:

So on the LHS we have the predicted values that the model produced, and we compare those to the actual values. In other words, we use the training data and this contains the label for training the model… and when we test the model on the test data, we remove the labels as we are testing the model on unseen data, and then use those labels compare to evaluate the performance.

What the quadrants mean:

TP = True Positive. You predicted positive and it is true
TN = True Negative. You Predicted negative and it is true
FP = False Positive (type 1 error). You predicted positive and it is false
FN = False Negative (type 2 error). You predicted negative and it is false

Additional Measures from your Confusion Matrix…

Based on your confusion matrix, you can actually calculate a lot of additional metrics, such as Precision, Recall, F-Score… Let’s have a look at this further: Confusion Matrix and additional metrics

Here we can see several other metrics that tell you something different about your model. It is incredibly important to never only focus on one performance metric, but you should be looking across several to understand where your model performs well and where it may struggle.

The main ones that I think are useful to highlight from here are:

Precision: also known as the positive predictive value, is defined as the proportion of positive examples that are truly positive; in other words, when a model predicts the positive class, how often is it correct? A precise model will only predict the positive class in cases very likely to be positive
Sensitivity/Recall: is a measure of how complete the results are. This is defined as the number of true positives over the total number of positives. A more demonstrated example: if we had a dataset with people having a disease or not; sensitivity refers to a model’s ability to designate an individual with disease as positive. A highly sensitive test means that there are few false negative results, and thus fewer cases of disease are missed.
Specificity: In comparison to sensitivity above, specificity would test the models ability to designate an observation who does not have a disease as negative.
Classification Accuracy: is what we usually mean, when we use the term accuracy. It is the ratio of number of correct predictions to the total number of input samples. This tends to go hand in hand with classification error, where classification error is something we want to minimize, and it is often a % (e.g., 12% error), and it is the inverse to accuracy, which we want to maximize (e.g., 88% accurate).

Balanced Accuracy

Balanced accuracy can be used in both binary and multi-class classification. Essentially it is the arithmetic mean of sensitivity and specificity. It is particularly helpful to check when you have imbalanced data. Where: balanced accuracy = (sensitivity + specificity)/2.

Area Under the Curve - Receiver Operating Characteristics (AUC-ROC)

AUC-ROC is a performance measurement for the classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. Essentially, this tells us how much the model is capable of distinguishing between classes. Higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. Broadly, an excellent model has AUC near to the 1,which means it has a good measure of separability. A poor model has an AUC near 0 which means it has the worst measure of separability.What this means it is reciprocating the result, aka predicting 0s as 1s and 1s as 0s. When AUC is 0.5, it means the model has no class separation capacity whatsoever, aka it is effectively a random classifier.

These are useful for binary problems, but you can adapt it for a multi-class problem… let’s say we are predicting apples, oranges, and blueberries. We can make this into multiple binary problems such as: apples v other, blueberries v other. Via this one v all, you can set up the problem to evaluate it with AUC.

Disentangling this further, the ROC is a curve of probability, so if we plot those distributions, we’ll get something like: A perfect classifier

Noting if we consider a classifier predicting whether someone has a disease (red) or not (green). Then, the red distribution curve is of the positive class (patients with disease) and the green distribution curve is of the negative class (patients with no disease). This shows the best outcome as it the model is working at 100% accuracy, thus it is perfectly able to distinguish between classes.

Note: if you actually have this in the real world, please check your model, as it is 99.9% likely to be doing something wrong…

In comparison, when a model is not operating with an AUC = 1, the distribution curves overlap, where we then introduce type 1 and type 2 errors. In the figures below, we have an AUC=0.7, which means there is a 70% chance the model can distinguish between classes. Classifier with AUC = 0.7 Looking at an AUC of 0.5, you’ve hopefully guessed what this will look like, but the two curves effectively overlay each other. Here you’ve got a random classifier, where it is 50/50 on being able to predict anything correctly.

Finally, to show you what hopefully shouldn’t ever happen: a classifier that is reciprocating the predictions (aka, predicting them backwards, as 0=1 and 1=0):

Precision-Recall Curve (PRAUC)

Briefly: much like the ROC curve, the precision-recall curve is used for evaluating the performance of binary classification algorithms. It is often used in situations where classes are heavily imbalanced. Also like ROC curves, precision-recall curves provide a graphical representation of a classifier’s performance across many thresholds, rather than a single value (e.g., accuracy, f-1 score, etc.). The precision-recall curve is constructed by calculating and plotting the precision against the recall for a single classifier at a variety of thresholds. When you plot these, you’ll see something like this:

Let’s now look at some algorithms.

k-Nearest Neighbours

The k-nearest neighbors algorithm, also known as KNN/kNN, k-NN, is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. Aka, the algorithm will classify similar things together. Remember from the lecture, the KNN algorithm is a lazy learner, where it stores the training set verbatim. When new data are added, it on-the-fly classifies them, following the broad steps in the figure below:

Let’s try it out

We will use the iris dataset for simplicity and ease - using the e1071, caTools, and class packages.

library('e1071')
library('caTools')
library('class')

# we'll use the iris package and we'll be predicting species
data(iris)
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

set.seed(127) #super important for reproducibility 
# so we need to manually split our data into a train/test split
# it is common to do 70:30
#there are many ways to do this -- this is one:
split <- sample.split(iris, SplitRatio = 0.7)
train_cl <- subset(iris, split == "TRUE")
test_cl <- subset(iris, split == "FALSE")

# It is common to do some form of Scaling - it's try it 
train_scale <- scale(train_cl[, 1:4]) #so this is of course scaling cols 1:4
test_scale <- scale(test_cl[, 1:4])

Now our data are split and set up… we can fit the model. As you hopefully recall from the lecture, we need to specify k, which can be done via trial and error and seeing what works best, or we can follow a rule-of-thumb. Here, for the sake of showing you the process and the impact on performance, we’ll try a bunch of different values for k.

# here we are fitting the model, where we call the knn algorithm command
# we define the test data, the train data, and define k 
classifier_knn <- knn(train = train_scale,
                    test = test_scale,
                    cl = train_cl$Species,
                    k = 1)

classifier_knn #this is the output

 [1] setosa     setosa     setosa     setosa     setosa     setosa    
 [7] setosa     setosa     setosa     setosa     setosa     setosa    
[13] setosa     setosa     setosa     setosa     setosa     setosa    
[19] setosa     setosa     versicolor virginica  versicolor versicolor
[25] versicolor versicolor versicolor versicolor virginica  versicolor
[31] versicolor versicolor versicolor versicolor versicolor versicolor
[37] versicolor versicolor versicolor versicolor virginica  virginica 
[43] virginica  virginica  virginica  virginica  virginica  versicolor
[49] virginica  virginica  virginica  virginica  virginica  versicolor
[55] virginica  virginica  virginica  virginica  virginica  virginica 
Levels: setosa versicolor virginica

# How do we know if the algo worked? We can look at a confusion matrix 
cm <- table(test_cl$Species, classifier_knn)
cm

            classifier_knn
             setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         18         2
  virginica       0          2        18

# lets look at the accuracy/class error
#note this is quite a manual process! (there are likely other ways)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError)) #notice the 1- to give the accuracy not error

[1] "Accuracy = 0.933333333333333"

print(paste('Classification Error =', misClassError))

[1] "Classification Error = 0.0666666666666667"

So here, with our k=1 value, lets interpret our confusion matrix. We have three classes here not only two. As you likely noticed, we are looking initial at the left top to right bottom diagonal. We want to see the biggest numbers here, showing correct classifications.

Lets look at each species. 20 setosa were correctly classified as setosa. Out of the 20 veriscolor, we can see 17 were predicted correctly as veriscolor, but 3 were misclassified as virginica. Out of the 20 virginica, 17 were correctly classified, but 3 were incorrectly classified as veriscolor.

Looking at the overall accuracy score - we have 0.93 or 93%, which is really good! This also means our classification error is 7%.

Let’s look at other values of k

A few things to think about:

As we decrease the value of k to 1, our predictions become less stable. Just think for a minute, imagine k=1 and we have a query point surrounded by several reds and one green, but the green is the single nearest neighbor. Reasonably, we would think the query point is most likely red, but because k=1, KNN incorrectly predicts that the query point is green.
Inversely, as we increase the value of k, our predictions become more stable due to majority voting / averaging, and thus, more likely to make more accurate predictions (up to a certain point). Eventually, we begin to witness an increasing number of errors. It is at this point we know we have pushed the value of K too far.
In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker.

We will look at this visually after we’ve tested out different values to see the impact of changing k on the performance.

# K = 3
classifier_knn <- knn(train = train_scale,
                    test = test_scale,
                    cl = train_cl$Species,
                    k = 3)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))

[1] "Accuracy = 0.933333333333333"

# K = 5
classifier_knn <- knn(train = train_scale,
                    test = test_scale,
                    cl = train_cl$Species,
                    k = 5)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))

[1] "Accuracy = 0.95"

# K = 7
classifier_knn <- knn(train = train_scale,
                    test = test_scale,
                    cl = train_cl$Species,
                    k = 7)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))

[1] "Accuracy = 0.966666666666667"

# K = 15
classifier_knn <- knn(train = train_scale,
                    test = test_scale,
                    cl = train_cl$Species,
                    k = 15)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))

[1] "Accuracy = 0.983333333333333"

# K = 19
classifier_knn <- knn(train = train_scale,
                    test = test_scale,
                    cl = train_cl$Species,
                    k = 19)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))

[1] "Accuracy = 0.966666666666667"

Here we can see a lot of variance of performance with k. We can see that k=1 worked well, but increasing k helped.

As we mentioned before, these decision boundaries can be unstable with very low k values, so lets see what this actually looks like. I’ve created a plot below that shows the boundaries with different k values. Uoi ca see how the boundaries change when you use different k. You can see that as we increase k those boundaries are slowly becoming smoother. Note: this is a fairly small dataset!

Activities

rerun this without scaling. What is the impact?
try this using the diamonds dataset to predict cut
Note: some variables in diamonds are categorical, and you cannot put those into knn - why is that?

Machine Learning in R universe - MLR3

Now you’ve seen one example using the algo packages with knn, we will jump over to MLR3 now to continue. First, a little backgorund on MLR. When working with MLR3, there is a bit more set up to do, but it allows you to then easily exchange algorithms and work efficiently on your problems. I have linked the overall online book that comes with mlr3 and this will be your guide for future work. It is heavy and complex, so we will get you started with the set up and pipelines.

At the simplest level, there are two main things you need to set up to get started:

Task: Tasks are objects that contain the (usually tabular) data and additional metadata that define a machine learning problem. The metadata contain, for example, the name of the target feature (e.g., what you’re predicting) for supervised machine learning problems. This information is extracted automatically when required, so the user does not have to specify the prediction target every time a model is trained. So what type of tasks are there?
- Classification
- Regression
- And many more (e.g., survival, density, cluster…)
Learner: Objects of class Learner provide a unified interface to many popular machine learning algorithms in R. They consist of methods or alogrithm that you will use to train for a Task and provide meta-information about the learners. All learners have a two stage procedure - train and predict. There are lots of types of learners, we’ll focus on predefined ones (e.g., knn, naive bayes, DT).

Let’s look at the knn algo again

We will use the iris data again just for the sake of comparison and knn for the first example.

library('mlr3')
library('mlr3verse')


Attaching package: 'mlr3verse'

The following object is masked from 'package:e1071':

    tune

library('mlr3learners')
library('kknn')
library('dplyr')

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

iris$Species = as.factor(iris$Species) #making sure this is a factor for our prediction

#lets create our task, using iris, and setting our target as Species
task = as_task_classif(iris,
                       target='Species') 

task # it is always good to print this and check it, is this what we want

<TaskClassif:iris> (150 x 5)
* Target: Species
* Properties: multiclass
* Features (4):
  - dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width

#make sure to look - is Species in the TARGET or feature list
#make sure it is in the TARGET 

set.seed(127) #critical, never forget to do this
#as if you dont, each time you run this, you'll get diff train/test splits
train_set = sample(task$row_ids, 0.7 * task$nrow)
test_set = setdiff(task$row_ids, train_set)

#lets choose our learner - knn
learner = lrn("classif.kknn")
learner #again print this, and check

<LearnerClassifKKNN:classif.kknn>
* Model: -
* Parameters: k=7
* Packages: mlr3, mlr3learners, kknn
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: multiclass, twoclass

# note, it picked k=7 for us, we can change this in the learner

learner = lrn("classif.kknn", k=19)
learner #you can see k has changed and this is the one we'll use

<LearnerClassifKKNN:classif.kknn>
* Model: -
* Parameters: k=19
* Packages: mlr3, mlr3learners, kknn
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: multiclass, twoclass

Next we can then train and test our model. First we train it, using the training data, then we can test it. We can use it to test on our training data initially and evaluate it. We will get the confusion matix and also pull through some performance metrics: accuracy and classifcation error. Note: there are lots more [metrics you can automatically pull](https://mlr3.mlr-org.com/reference/mlr_measures.html) - but it is CRITICAL you understand them and what they mean. We will go over more of these in the coming weeks.

# training
learner$train(task, row_ids = train_set)

#lets use the model to predict (training set!!)
pred_train = learner$predict(task, row_ids=train_set) # predicting 
pred_train$confusion #print the conf matrix

            truth
response     setosa versicolor virginica
  setosa         39          0         0
  versicolor      0         32         2
  virginica       0          0        32

measures = msrs(c('classif.acc', 'classif.ce')) #accuracy and class error
pred_train$score(measures) #print the scores

classif.acc  classif.ce 
 0.98095238  0.01904762

# use the model to predict on the TEST dataset 
pred_test = learner$predict(task, row_ids=test_set) #predicting
pred_test$confusion #get the confusion matrix

            truth
response     setosa versicolor virginica
  setosa         11          0         0
  versicolor      0         18         4
  virginica       0          0        12

measures = msrs(c('classif.acc', 'classif.ce')) #acc and error
pred_test$score(measures) #print

classif.acc  classif.ce 
 0.91111111  0.08888889

Activity

Change k and see how it impacts performance here
change the % train and test split and see what impact it has on the performance on the TEST dataset

Support Vector Machines

A support vector machine (SVM) is a type of supervised learning algorithm used in machine learning to solve classification and regression tasks. We will focus on the classification task. Generally, SVMs in their usual form can only be used in binary classification problems. (You can adapt your problem into a binary one, where the iris might be: target = setosa or NOT, rather than predicting each specie).

The aim of a support vector machine algorithm is to find the best possible line, or decision boundary, that separates the data points of different data classes. This boundary is called a hyperplane when working in high-dimensional feature spaces. The idea is to maximize the margin, which is the distance between the hyperplane and the closest data points of each category, thus making it easy to distinguish data classes. SVMs are useful for analyzing complex data that can’t be separated by a simple straight line. Called nonlinear SMVs, they do this by using a mathematical trick that transforms data into higher-dimensional space, where it is easier to find a boundary.

The key idea behind SVMs is to transform the input data into a higher-dimensional feature space. This transformation makes it easier to find a linear separation or to more effectively classify the data set. To do this, SVMs use a kernel function (e.g., linear, polynomial, sigmoid,…). Instead of explicitly calculating the coordinates of the transformed space, the kernel function enables the SVM to implicitly compute the dot products between the transformed feature vectors and avoid handling expensive, unnecessary computations for extreme cases.

SVMs can handle both linearly separable and non-linearly separable data. They do this by using different types of kernel functions, such as the linear kernel, polynomial kernel, or radial basis function (RBF) kernel. These kernels enable SVMs to effectively capture complex relationships and patterns in the data.

Health Diabetes dataset

First we read in the dataset, download it from here.

Rows: 2768 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (10): Id, Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, B...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We need to remove the Id column and change the outcome columns into a factor.

HealthcareDiabetes$Outcome = as.factor(HealthcareDiabetes$Outcome)
is.factor(HealthcareDiabetes$Outcome) #check

[1] TRUE

Obviously, for your own datasets, you need to do EDA, among cleaning and processing. Here we will jump straight in. It is important that when you go ahead with MLR3, you check the learner itself and the parameters the learner can take.

#lets create our task, and setting our target as Outcome
task = as_task_classif(HealthcareDiabetes,
                       target='Outcome') 
task

<TaskClassif:HealthcareDiabetes> (2768 x 10)
* Target: Outcome
* Properties: twoclass
* Features (9):
  - dbl (9): Age, BMI, BloodPressure, DiabetesPedigreeFunction,
    Glucose, Id, Insulin, Pregnancies, SkinThickness

#you hopefully noticed that the 'id' variable is considered a variable...
#this is not good and needs to be dropped
HealthcareDiabetes = select(HealthcareDiabetes,
                            -Id)

#lets set the task yp again and check...
task = as_task_classif(HealthcareDiabetes,
                       target='Outcome') 
task #fine now

<TaskClassif:HealthcareDiabetes> (2768 x 9)
* Target: Outcome
* Properties: twoclass
* Features (8):
  - dbl (8): Age, BMI, BloodPressure, DiabetesPedigreeFunction,
    Glucose, Insulin, Pregnancies, SkinThickness

set.seed(127) #critical, never forget to do this
#as if you dont, each time you run this, you'll get diff train/test splits
train_set = sample(task$row_ids, 0.7 * task$nrow)
test_set = setdiff(task$row_ids, train_set)

#lets choose our learner - svm
learner = lrn("classif.svm",
              predict_type = "prob", #provides probabilities if you add 'response' it'll give labels
              kernel="linear")
learner

<LearnerClassifSVM:classif.svm>
* Model: -
* Parameters: kernel=linear
* Packages: mlr3, mlr3learners, e1071
* Predict Types:  response, [prob]
* Feature Types: logical, integer, numeric
* Properties: multiclass, twoclass

# training
learner$train(task, row_ids = train_set)

#lets use the model to predict (training set!!)
pred_train = learner$predict(task, row_ids=train_set) # predicting 
pred_train$confusion #print the conf matrix

        truth
response    0    1
       0 1116  291
       1  136  394

measures = msrs(c('classif.acc', 'classif.ce')) #accuracy and class error
pred_train$score(measures) #print the scores

classif.acc  classif.ce 
   0.779556    0.220444

# use the model to predict on the TEST dataset 
pred_test = learner$predict(task, row_ids=test_set) #predicting
pred_test$confusion #get the confusion matrix

        truth
response   0   1
       0 504 118
       1  60 149

measures = msrs(c('classif.acc', 'classif.ce')) #acc and error
pred_test$score(measures) #print

classif.acc  classif.ce 
  0.7858002   0.2141998

Activity

where are the misclassifications happening?
consider in this context the difference between a FP and a FN - what is worse? [no wrong answers]
look up the different supported kernel functions on mlr3 for this svm, and try out different kernels… do any improve the performance?

Naive Bayes

Naive Bayes classifiers are based on Bayes’ Theorm, it is worth you spending time to revise this and understand it. If you want some resources, let me know and I can dig some out for you. However, this classifier gets its name of naive due to a couple underlying assumptions it takes:

It assumes that predictors in a Naïve Bayes model are conditionally independent, or unrelated to any of the other feature in the model.
It also assumes that all features contribute equally to the outcome.

While these assumptions are often violated in real-world scenarios (e.g. a subsequent word in an e-mail is dependent upon the word that precedes it), it simplifies a classification problem by making it more computationally tractable. Despite this unrealistic independence assumption, the classification algorithm performs well, particularly with small sample sizes. It also can handle multi-class problems too, unlike an SVM in its usual form. Let’s use the healthcare dataset again and look at this.

As always, as you get more into mlr3, check the documentation for each learner so you can see the associated parameters you can set. Naive Bayes in comparison to others, has very few!

#lets create our task, and setting our target as Outcome
task = as_task_classif(HealthcareDiabetes,
                       target='Outcome') 
task #again always check that the right variables are in there!

<TaskClassif:HealthcareDiabetes> (2768 x 9)
* Target: Outcome
* Properties: twoclass
* Features (8):
  - dbl (8): Age, BMI, BloodPressure, DiabetesPedigreeFunction,
    Glucose, Insulin, Pregnancies, SkinThickness

set.seed(127) #critical, never forget to do this
#as if you dont, each time you run this, you'll get diff train/test splits
train_set = sample(task$row_ids, 0.7 * task$nrow)
test_set = setdiff(task$row_ids, train_set)

#lets choose our learner - NB 
learner = lrn("classif.naive_bayes",
              predict_type = "prob") #provides probabilities if you add 'response' it'll give labels
          
learner

<LearnerClassifNaiveBayes:classif.naive_bayes>
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3learners, e1071
* Predict Types:  response, [prob]
* Feature Types: logical, integer, numeric, factor
* Properties: multiclass, twoclass

# training
learner$train(task, row_ids = train_set)

#lets use the model to predict (training set!!)
pred_train = learner$predict(task, row_ids=train_set) # predicting 
pred_train$confusion #print the conf matrix

        truth
response    0    1
       0 1042  265
       1  210  420

measures = msrs(c('classif.acc', 'classif.ce')) #accuracy and class error
pred_train$score(measures) #print the scores

classif.acc  classif.ce 
  0.7547754   0.2452246

# use the model to predict on the TEST dataset 
pred_test = learner$predict(task, row_ids=test_set) #predicting
pred_test$confusion #get the confusion matrix

        truth
response   0   1
       0 456 100
       1 108 167

measures = msrs(c('classif.acc', 'classif.ce')) #acc and error
pred_test$score(measures) #print

classif.acc  classif.ce 
  0.7496992   0.2503008

Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. DTs are simple to understand and interpret, they do not need a huge amount of data prep, where they can often handle missing data, too (but be careful and check up on how this is treated, etc). Note that, DTs can easily overfit and thus not generalize to unseen data well, they can also be unstable (so we might play with parameters, e.g., max depth, min number of samples required at leaf nodes, etc).

I will remind you to look at the rpart documentation, where you will see:

there are a LOT of parameters here
you can include mixed data types into this algorithm

#lets create our task, and setting our target as Outcome
task = as_task_classif(HealthcareDiabetes,
                       target='Outcome') 
task

<TaskClassif:HealthcareDiabetes> (2768 x 9)
* Target: Outcome
* Properties: twoclass
* Features (8):
  - dbl (8): Age, BMI, BloodPressure, DiabetesPedigreeFunction,
    Glucose, Insulin, Pregnancies, SkinThickness

set.seed(127) #critical, never forget to do this
#as if you dont, each time you run this, you'll get diff train/test splits
train_set = sample(task$row_ids, 0.7 * task$nrow)
test_set = setdiff(task$row_ids, train_set)

#lets choose our learner - NB 
learner = lrn("classif.rpart",
              predict_type = "prob") #provides probabilities if you add 'response' it'll give labels
          
learner

<LearnerClassifRpart:classif.rpart>: Classification Tree
* Model: -
* Parameters: xval=0
* Packages: mlr3, rpart
* Predict Types:  response, [prob]
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, multiclass, selected_features,
  twoclass, weights

# training
learner$train(task, row_ids = train_set)

#lets use the model to predict (training set!!)
pred_train = learner$predict(task, row_ids=train_set) # predicting 
pred_train$confusion #print the conf matrix

        truth
response    0    1
       0 1079   85
       1  173  600

measures = msrs(c('classif.acc', 'classif.ce')) #accuracy and class error
pred_train$score(measures) #print the scores

classif.acc  classif.ce 
  0.8668043   0.1331957

# use the model to predict on the TEST dataset 
pred_test = learner$predict(task, row_ids=test_set) #predicting
pred_test$confusion #get the confusion matrix

        truth
response   0   1
       0 481  50
       1  83 217

measures = msrs(c('classif.acc', 'classif.ce')) #acc and error
pred_test$score(measures) #print

classif.acc  classif.ce 
  0.8399519   0.1600481

Activity

DTs tend to perform well from the get go, as you can see here, although the performance does drop a little when we apply to the test data.

try different train/test splits and see the performance changes
print the following metric outputs from the DT on the health dataset: balanced accuracy, precision, sensitivity, specificity, and AUC
if you wanted to adapt parameters of the learner, where would you do that in your code? Try the following:
- read about the maxdepth parameter - what does it do? set several different values - does It impact the performance?
- read about the minsplit parameter - what does this do? try several values and see how it impacts performance.

Regression

As a quick recap, regression models are predicting a numerical/continuous target variable rather then a categorical one. This by nature means we also need to think about (a) how we set up our models and (b) how we evaluate them. There are a number of algorithms that can do regression analysis in machine learning, for example, Decision Trees.

Decision Tree

#lets create our task, and setting our target as Glucose
task = as_task_regr(HealthcareDiabetes,
                       target='Glucose') 
task #note: the outcome variable from our class predictions is in there

<TaskRegr:HealthcareDiabetes> (2768 x 9)
* Target: Glucose
* Properties: -
* Features (8):
  - dbl (7): Age, BMI, BloodPressure, DiabetesPedigreeFunction,
    Insulin, Pregnancies, SkinThickness
  - fct (1): Outcome

Activity

Discuss whether the outcome variable should be included? Why, why not?
Run it both ways and see what impact it has…

set.seed(127) #critical, never forget to do this
#as if you dont, each time you run this, you'll get diff train/test splits
train_set = sample(task$row_ids, 0.7 * task$nrow)
test_set = setdiff(task$row_ids, train_set)

#lets choose our learner - NB 
learner = lrn("regr.rpart") 
learner

<LearnerRegrRpart:regr.rpart>: Regression Tree
* Model: -
* Parameters: xval=0
* Packages: mlr3, rpart
* Predict Types:  [response]
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, selected_features, weights

# training
learner$train(task, row_ids = train_set)

#lets use the model to predict (training set!!)
pred_train = learner$predict(task, row_ids=train_set) # predicting 

# you of course cannot use a confusion matrix when predicting a numeric outcome...
#when you run it, it returns NULL 
pred_train$confusion

NULL

#we have to evaluate this differently!

Regression Supervised Learning Evaluation Metrics

In comparison to classification problems, where there are lots of ways to demonstrate model performance, it is somewhat harder to do with regression models. Here, it is less about predicting the exact value but rather more about seeing how close your prediction is to the real value.There are some approaches/loss functions for evaluating how accurately your ML model is predicting. The loss function will take two items as input: the output value of our model and the ground truth expected value. The output of the loss function is called the loss which is a measure of how well our model did at predicting the outcome. Note: there are many other loss functions, we are just talking about a few.

R-Squared: R-squared is a statistical measure that represents the goodness of fit of a regression model. The value of R-square lies between 0 to 1. Where we get R-square equals 1 when the model perfectly fits the data and there is no difference between the predicted value and actual value. However, we get R-square equals 0 when the model does not predict any variability in the model and it does not learn any relationship between the dependent and independent variables. One major issue with R-squared is that its value will always go up if we keep adding more and more variables into the model, even if they are redundant and add nothing. So… to address this, we use the Adjusted R-Squared.
Adjusted R Square: The adjusted R-squared is a modified version of R-squared that adjusts for predictors that are not significant in a regression model. So.. if you have a model with additional input variables, a lower adjusted R-squared indicates that the additional input variables are not adding value to the model. In contrast, if you have a model with additional input variables, a higher adjusted R-squared indicates that the additional input variables are adding value to the model.
Mean Square Error (MSE): this is a fairly straightforward loss function, where essentially to calculate the MSE, you take the difference between the model predictions and the actual values, square it, and take the average across the entire dataset. Note: this will never be negative, so if you managed to get a negative value, something has gone wrong!
- note: the MSE will place a higher weight on outliers/or bad predictions… so when this happens, these errors are magnified with squaring. It is also common to actually take the ROOT mean squared error, therefore the interpretation of the output is the average distance between the actual and predicted points.
Mean Absolute Error (MAE): this is veyr similar to MSE but effectively the opposite. To calculate the MAE, we take the difference between the model’s predictions and the actual values, we take the absolute value of that difference and then average this across the dataset.
- the MAE helps sort out the weighting issue of the MSE, where here since we take the absolute value, all of the errors are weighted on the same linear scale, hence any outliers or bad predictions won’t be as effected.

We’ll test out some of these. Note that MLR3 has lots of measures for regression machine learning: Examples of MLR3 Regression Measures

library('mlr3measures', quietly = TRUE, warn.conflicts = FALSE) # we need to install the measures mlr package

In order to avoid name clashes, do not attach 'mlr3measures'. Instead, only load the namespace with `requireNamespace("mlrmeasures")` and access the measures directly via `::`, e.g. `mlr3measures::auc()`.

measures = msrs(c('regr.mse', 'regr.mae')) #MSE and MAE
pred_train$score(measures) #print the scores

 regr.mse  regr.mae 
678.50623  19.61002

# use the model to predict on the TEST dataset 
pred_test = learner$predict(task, row_ids=test_set) #predicting

measures = msrs(c('regr.mse', 'regr.mae')) #MSE and MAE
pred_test$score(measures) #print

 regr.mse  regr.mae 
605.84977  18.86629

Supervised Learning

Supervised Learning

Introduction to Supervised Learning

So… in supervised ML… how do machines learn?

Machine Learning Pipeline

Types of Supervised ML

Classification

A couple notes

Evaluation Metrics for Classification

Confusion Matrix

Additional Measures from your Confusion Matrix…

Balanced Accuracy

Area Under the Curve - Receiver Operating Characteristics (AUC-ROC)

Precision-Recall Curve (PRAUC)

k-Nearest Neighbours

Let’s try it out

Let’s look at other values of k

Activities

Machine Learning in R universe - MLR3

Let’s look at the knn algo again

Activity

Support Vector Machines

Health Diabetes dataset

Activity

Naive Bayes

Decision Trees

Activity

Regression

Decision Tree

Activity

Regression Supervised Learning Evaluation Metrics

Ensembles

Bagging

Boosting

Brief intro to Tuning