Azure ML Thursday 6: xgboost in R
22 September 2016
The last few Azure ML Thursdays we explored how to do our Machine Learning in Python. Python in Azure ML doesn't include one particularly successful algorithm though - xgboost. Python packages are available, just not yet for Windows - which means not inside Azure ML Studio either. But they are available inside R! Today, we take the same approach as two weeks ago: first, we move out of Azure ML to do our first ML in R, then (next week) we'll upload and use our trained R model inside Azure ML Studio.
Today, I'll show you how to use xgboost on the still-ongoing Cortana Intelligence Competition "Women's Health Risk Assessment" (WHRA). At the moment of writing, the leaderboard has stayed the same for over three weeks, with only 336 participants - but the competition ends in a week, with a grand prize of $3,000.
So rush to participate, and use the knowledge shared here to win - all code presented below can be run in order and will result in a trained model for the WHRA dataset!
ICYMI: What is R?
R is a statistical language widely used by academics and data scientists alike. R is open source, very powerful, and reminds some people of Matlab[ref]and even more of S[/ref]. Having originated in statistics, R has a pretty solid collection of Machine Learning libraries. One of the great advantages of R is that it's used extensively by researchers in the fields of statistics and Machine Learning, which means the newest, best-performing algorithms will pretty much always be available in R. Its open-source and organic character initially made enterprises somewhat hesitant to start using it, but currently more and more large vendors are backing it with enterprise-grade support.
When you're getting started using R, I highly recommend downloading and using R Studio[ref]I know Microsoft has R tools inside Visual Studio too, but I haven't used them yet. What I do like about R Studio is that all the pros - and not just the Microsoft platform pros - use it, blog about it, and are willing to help if you get stuck.[/ref]. You can download R Studio at https://www.rstudio.com/
Machine Learning inside R
The steps of doing Machine Learning in R are not very different from the steps we've taken earlier in Python: it's still about transforming the dataset, splitting it into train and test sets, training the model using parameters, and scoring the model by testing it.
Prerequisites: the libraries
We'll use four libraries while working in our local R environment:
library(xgboost)
library(ade4)
library(data.table)
library(caret)
If one or more libraries are missing, add them one by one using install.packages:
install.packages("xgboost")
Loading the dataset
To load the dataset - here the Women's Health training set of the (still ongoing) Cortana Intelligence Competition - we use read.csv.
After that, we combine the three to-be-predicted columns into one column (for a detailed description of the dataset, see this document) and remove the columns with which we would be able to identify a patient (read my earlier post about overfitting if you wonder why):
dataset1 <- read.csv("../Documents/WomenHealth_Training.csv")

#Create one combined label for prediction purposes
combined_label <- 100*dataset1$geo + 10*dataset1$segment + dataset1$subgroup
dataset1 <- cbind(dataset1, combined_label)

#Get rid of identification column(s)
dataset1["patientID"] = NULL
dataset1["INTNR"] = NULL
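Because segment and subgroup each fit in a single digit (that's what the 10/1 weighting relies on), a combined label can always be decoded back into its parts with integer arithmetic. A quick sketch - not part of the pipeline, just to show the encoding is reversible:

#Decode a combined label into its parts (assumes single-digit segment/subgroup):
geo.part <- combined_label %/% 100
segment.part <- (combined_label %% 100) %/% 10
subgroup.part <- combined_label %% 10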
One-hot encoding
In my earlier post "Azure ML Thursday 4: ML in Python", I explained what One-Hot Encoding is (and why you need it). Inside R, you usually don't use a "one-hot encoder" as in sklearn. Instead, you can use acm.disjonctif (from the ade4 library) to create dummies:
ohe_feats = c('religion')
for (f in ohe_feats){
  dataset1_dummy = acm.disjonctif(dataset1[f])
  dataset1[f] = NULL
  dataset1 = cbind(dataset1, dataset1_dummy)
}
Basically, acm.disjonctif does the same as a One-Hot Encoder in sklearn, with one exception: sklearn's One-Hot Encoder separates "fit" from "transform", which means it can re-apply exactly the same transformation to new datasets. acm.disjonctif, on the other hand, derives the column names from the contents of the columns. This can introduce new columns when values appear in production data that weren't present in the training data, so you might need to prepare for that (one way to do so is sketched below). In the WHRA example this isn't a problem though - the only column we'll remove here is the empty religion:
dataset1["religion."] = NULL
Splitting the data sets
The easiest way to split the dataset is the createDataPartition function from the caret library. The parameter p states how much of the data is reserved for training:
train.index <- createDataPartition(dataset1$combined_label, p = .75, list = FALSE)
trainset <- dataset1[ train.index,]
testset <- dataset1[-train.index,]
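Note that createDataPartition samples at random, so if you want a reproducible split while experimenting, fix the random seed first:

#Fix the seed (any constant will do) for a reproducible train/test split:
set.seed(42)
train.index <- createDataPartition(dataset1$combined_label, p = .75, list = FALSE)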
The caret library is not available in Azure ML Studio, but that's no problem: as the training is done on-premises, we don't need to split training and test data inside Azure ML Studio.
Prepare labels for xgboost
xgboost expects labels and features in separate sets, so we split X from Y and clear the label columns in X - and only then convert the feature sets to matrices (doing it the other way around would leak the labels into the features):
#Create label sets, separate X from Y
train.label <- as.matrix(trainset["combined_label"])
test.label <- as.matrix(testset["combined_label"])
trainset["combined_label"] = NULL
testset["combined_label"] = NULL
trainset["segment"] = 0
testset["segment"] = 0
trainset["subgroup"] = 0
testset["subgroup"] = 0

#Now that the labels are cleared, convert the feature sets to matrices
train <- as.matrix(trainset)
test <- as.matrix(testset)
xgboost also expects the labels to be zero-based numerics. It's not too hard to achieve that, but keep in mind that you should be able to translate the predicted values back to the corresponding classes!
#Translate labels to numeric (0-based) values, as xgboost expects:
nlabels.train <- as.numeric(as.factor(train.label)) - 1
nlabels.test <- as.numeric(as.factor(test.label)) - 1

#Store the factor levels to translate predictions back to the corresponding classes:
labels.factorized = levels(as.factor(train.label))
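To see what this conversion does, here's a toy example (the values are made up, not from the WHRA data):

#Toy example: factor levels are sorted, then numbered 0-based
x <- c(311, 122, 311)
as.numeric(as.factor(x)) - 1 #-> 1 0 1
levels(as.factor(x)) #-> "122" "311"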
Train the xgboost model
After having prepared the dataset, we can now train xgboost. It's pretty easy:
#Convert the feature matrices to xgboost's DMatrix format
dtrain <- xgb.DMatrix(train, label = nlabels.train, missing = NA)
dtest <- xgb.DMatrix(test, missing = NA)

#Train the xgboost model
trained_model <- xgboost(data = dtrain, max.depth = 25, eta = 0.1, nthread = 8,
                         nround = 3, objective = "multi:softmax", num_class = 37)
Note that the parameters of xgboost used here fall into four categories:
- General parameters
  - nthread (number of threads used; here 8 = the number of cores in my laptop)
- Booster parameters
  - max.depth (maximum depth of a tree)
  - eta (the learning rate, a.k.a. shrinkage)
- Learning task parameters
  - objective: the type of learning task (multi:softmax for multiclass classification)
  - num_class: needed for the softmax objective: how many classes to predict
- Command line parameters
  - nround: number of rounds for boosting
For a complete overview of parameters see https://github.com/dmlc/xgboost/blob/master/doc/parameter.md.
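The values used above (max.depth = 25, nround = 3) are my starting point, not tuned optima. If you want to pick a better number of boosting rounds, the xgboost package ships a cross-validation helper; here's a sketch (same parameters as above; the fold count and nrounds = 50 are arbitrary choices of mine):

#Sketch: 5-fold cross-validation to compare numbers of boosting rounds
cv <- xgb.cv(params = list(max.depth = 25, eta = 0.1, nthread = 8,
                           objective = "multi:softmax", num_class = 37),
             data = dtrain, nrounds = 50, nfold = 5)
#xgb.cv reports the train/test error per round; pick nround where the test error flattens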
Predictions using the trained xgboost model
To predict new cases using the just-trained xgboost model, use the function predict. In the example below, I've included three lines to test the performance too:
#Predict test set using the trained xgboost model:
pred <- predict(trained_model, dtest)

#Performance of the model (accuracy on the test set):
n_correct <- sum(nlabels.test == pred)
n_err <- sum(nlabels.test != pred)
n_correct / (n_err + n_correct)
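This is where the stored factor levels come into play: the predictions are 0-based indices into labels.factorized, so translating them back to the original combined labels is a one-liner:

#Translate the 0-based numeric predictions back to the original combined labels:
predicted.labels <- labels.factorized[pred + 1]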
Conclusion
As you can see, it's not too hard to use the winning xgboost algorithm inside R. All code presented above can be executed in order and will result in a working predictive model for the ongoing Women's Health Risk Assessment (WHRA) challenge! Next week (just before the deadline) I'll show you how to import the model inside Azure ML Studio, but if you want to do it earlier, I'm pretty sure you can figure out how to do it yourself (or just ask me to share it with you).
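If you want a head start on next week: to take the trained model into Azure ML Studio, you'll need it as a file first. A minimal sketch using xgboost's own save/load functions (the filename is just an example, and this may differ from how next week's post does it):

#Save the trained model to disk...
xgb.save(trained_model, "whra_xgboost.model")
#...and load it again in a later session:
trained_model2 <- xgb.load("whra_xgboost.model")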
Comments (2)
RedTemp
Thanks for the post...how do you actually import the model inside Azure ML Studio?
Koos van Strien
That's explained in this post: http://www.msbiblog.com/2016/09/29/azure-ml-thursday-7-xgboost-in-azure-ml-studio/. Note that it can be a bit outdated, as it was written in 2016 (I haven't worked with Azure ML recently, but if you're new to ML on Azure I strongly suggest you also take a look at Machine Learning Workbench).