Azure ML Thursday 6: xgboost in R


In the last few Azure ML Thursdays we explored how to do our Machine Learning in Python. Python in Azure ML doesn't include one particularly successful algorithm though: xgboost. Python packages are available, but not yet for Windows, which means they're also not available inside Azure ML Studio. But they are available in R! Today, we take the same approach as two weeks ago: first, we move out of Azure ML to do our first ML in R, then (next week) we'll upload and use our trained R model inside Azure ML Studio.

Today, I'll show you how to use xgboost on the still ongoing Cortana Intelligence Competition "Women's Health Risk Assessment" (WHRA). At the moment of writing, the leaderboard has stayed the same for over three weeks, with only 336 participants - but the competition ends in a week, with a grand prize of $3,000.

So rush to participate, and use the knowledge shared here to win - all code presented below can be run in order and will result in a trained model for the WHRA dataset!

ICYMI: What is R?

R is a statistical language widely used by academics and data scientists alike. R is open source, very powerful, and reminds some people of Matlab (and even more of S). Having originated in statistics, R has a pretty solid collection of Machine Learning libraries. One of the great advantages of R is that it's used extensively by researchers in the field of statistics and Machine Learning, which means the newest, best-performing algorithms will pretty much always be available in R. Its open source and organic character made enterprises initially somewhat hesitant to start using it, but currently more and more large vendors are backing it with enterprise-grade support.

When you're getting started with R, I highly recommend downloading and using R Studio. (I know Microsoft has R tools inside Visual Studio too, but I haven't used them yet. What I do like about R Studio is that all the pros - and not just the Microsoft platform pros - use it, blog about it, and are willing to help if you get stuck.) You can download R Studio on

Machine Learning inside R

The steps of doing Machine Learning are not very different from the steps we've taken earlier in Python: it's still about transforming the dataset, splitting it into train and test sets, training the model using parameters and scoring the model by testing it.

Prerequisites: the libraries

We'll use four libraries while working in our local R environment:


If one or more libraries are missing, add them one by one using install.packages:
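
A minimal sketch of both steps, assuming the packages in question include xgboost, caret and ade4. These are the three that the code in this post calls directly; the fourth library is not named here, so it is left out:

```r
# Packages the code in this post calls directly (assumption: the fourth
# library mentioned above is not named in this post, so it is omitted).
needed <- c("xgboost", "caret", "ade4")

# Which of them are not installed yet?
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them one by one:
  # for (p in missing_pkgs) install.packages(p)
} else {
  # Everything is there, so attach the libraries
  for (p in needed) library(p, character.only = TRUE)
}
```

The install line is commented out on purpose, so running the snippet never changes your R installation without you asking for it.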


Loading the dataset

To load the dataset (here, the Women's Health training set of the still ongoing Cortana Intelligence Competition), we use read.csv.

After that, we combine the three to-be-predicted columns inside one column (for a detailed description of the dataset see this document) and remove the columns with which we would be able to identify a patient (read my earlier post about overfitting if you wonder why):

dataset1 <- read.csv("../Documents/WomenHealth_Training.csv")

#Create one combined label for prediction purposes
combined_label <- 100*dataset1$geo + 10*dataset1$segment + dataset1$subgroup
dataset1 <- cbind(dataset1, combined_label)

#Get rid of identification column(s)
dataset1["patientID"] = NULL
dataset1["INTNR"] = NULL
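
Because the combined label is just the three codes packed into one number, it can be unpacked again with integer division. A small sketch, assuming segment and subgroup are single-digit codes (otherwise the weights 100 and 10 would collide); the values below are made up and not taken from the dataset:

```r
# Hypothetical example values, not taken from the WHRA dataset
geo      <- c(1, 3, 9)
segment  <- c(1, 2, 1)
subgroup <- c(2, 1, 2)

# Same packing as used for combined_label above
combined <- 100 * geo + 10 * segment + subgroup

# Unpack the three codes again
decoded <- data.frame(
  geo      = combined %/% 100,          # hundreds and up
  segment  = (combined %/% 10) %% 10,   # tens digit
  subgroup = combined %% 10             # units digit
)
```

This is handy later on, when a predicted combined label has to be reported as separate geo, segment and subgroup values.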

One-hot encoding

In my earlier post "Azure ML Thursday 4: ML in Python", I explained what One-Hot Encoding is (and why you need it). Inside R, you usually don't use a "one-hot encoder" as in sklearn. Instead, you can use the acm.disjonctif function from the ade4 package to create dummies:

ohe_feats = c('religion')
for (f in ohe_feats){
  #Create one dummy column per value found in column f
  dataset1_dummy = acm.disjonctif(dataset1[f])
  #Replace the original column by its dummies
  dataset1[f] = NULL
  dataset1 = cbind(dataset1, dataset1_dummy)
}

Basically, acm.disjonctif does the same as a One-Hot Encoder in sklearn, with one exception: the One-Hot Encoder in sklearn separates "fit" from "transform", which means it can re-apply the same transformation to new datasets. acm.disjonctif, on the other hand, derives the column names from the values it finds in each column. New values in production data that weren't present in the training data therefore produce new columns, so you might need to prepare for that. In the WHRA example this isn't a problem though - the only column we'll remove here is the one created for the empty religion value:

dataset1["religion."] = NULL
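
If you do need the "fit" / "transform" separation described above, one way to prepare is to store the values seen in the training data and encode every later dataset against that stored list. The sketch below uses base R only; it is not what acm.disjonctif does internally, and the helper name encode_religion and the toy values are hypothetical:

```r
# Hypothetical toy data; the values are made up for illustration
train_values <- c("Hindu", "Muslim", "Hindu", "")
new_values   <- c("Muslim", "Buddhist")   # "Buddhist" was never seen in training

# "Fit": remember the values seen during training
religion_levels <- levels(factor(train_values))

# "Transform": one 0/1 column per stored level, whatever the input contains
encode_religion <- function(values, lvls) {
  m <- outer(as.character(values), lvls, "==") * 1L
  colnames(m) <- paste("religion", lvls, sep = ".")
  m
}

enc_train <- encode_religion(train_values, religion_levels)
enc_new   <- encode_religion(new_values, religion_levels)
```

Both matrices now have exactly the same columns, and a value that was unseen during training simply gets zeros everywhere instead of a new column.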

Splitting the data sets

The easiest way to split the dataset is the createDataPartition function from the caret library. The p argument states which fraction of the data is reserved for training (here 75%):

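#Assumption: dataset1_cleared is the cleaned dataset1 from the steps above
dataset1_cleared <- dataset1

train.index <- createDataPartition(dataset1_cleared$combined_label, p = .75, list = FALSE)

trainset <- dataset1_cleared[ train.index,]
testset  <- dataset1_cleared[-train.index,]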

train <- as.matrix(trainset)
test <- as.matrix(testset)

The caret library is not available in Azure ML Studio, but that's no problem - as the training is done on-premises, we don't need to split train and test data in Azure ML Studio.
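
Should you ever need a split in an environment without caret, a plain base-R alternative is sketched below. It is not caret's implementation: it simply draws 75% of the rows of every label value. The toy data frame is hypothetical and only stands in for the real dataset:

```r
set.seed(42)  # make the split reproducible

# Hypothetical toy data standing in for the real dataset
toy <- data.frame(x = rnorm(200),
                  combined_label = sample(c(111, 112, 211), 200, replace = TRUE))

# Group the row numbers by label value
idx_by_label <- split(seq_len(nrow(toy)), toy$combined_label)

# Draw 75% of the row numbers within each label value
train.index <- unlist(lapply(idx_by_label, function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train.index, ]
toy_test  <- toy[-train.index, ]
```

Sampling within each label value keeps rare classes represented in both sets, which matters with as many classes as the WHRA dataset has.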

Prepare labels for xgboost

xgboost expects labels and features in separate sets, so we should split X from Y, and clear the labels in X:

#Create label sets, separate X from Y
train.label <- as.matrix(trainset["combined_label"])
test.label <- as.matrix(testset["combined_label"])

trainset["combined_label"] = NULL
testset["combined_label"] = NULL

trainset["segment"] = 0
testset["segment"] = 0

trainset["subgroup"] = 0
testset["subgroup"] = 0

xgboost also expects the labels to be a zero-based numeric. It's not too hard to achieve that, but keep in mind you should be able to translate the predicted values to the corresponding classes!

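#Store the factor levels to translate predictions to the corresponding classes:
labels.factorized = levels(as.factor(train.label))

#Translate labels to numeric (0-based) values, as xgboost expects.
#Both sets use the training levels, so a number always means the same class
#(a test label that never occurs in the training set becomes NA):
nlabels.train <- as.numeric(factor(train.label, levels = labels.factorized))-1
nlabels.test <- as.numeric(factor(test.label, levels = labels.factorized))-1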
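
Translating predicted values back to classes then comes down to indexing the stored levels with the prediction plus one. A small sketch with made-up labels and predictions (the .toy names are hypothetical):

```r
# Hypothetical labels and predictions, for illustration only
train.label.toy <- c(112, 321, 112, 912, 321)

# Stored levels, sorted: "112" "321" "912"
levels.toy <- levels(as.factor(train.label.toy))

# Zero-based class indices, as a multi:softmax model returns them
pred.toy <- c(0, 2, 1)

# Shift by one (R indexing is 1-based) and look up the original class
pred.classes <- as.numeric(levels.toy[pred.toy + 1])
```

Here pred.classes holds 112, 912 and 321: the original combined labels that belong to indices 0, 2 and 1.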


Train the xgboost model

After having prepared the dataset, we can now train xgboost. It's pretty easy:

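#Convert the feature sets to matrices (now that the label columns are cleared)
train <- as.matrix(trainset)
test <- as.matrix(testset)
dtrain <- xgb.DMatrix(train, label=nlabels.train, missing=NA)
dtest <- xgb.DMatrix(test, missing=NA)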

#Train xgboost model
trained_model <- xgboost(data = dtrain, max.depth = 25, eta = 0.1, nthread = 8, nround = 3, objective = "multi:softmax", num_class=37)

Note that the parameters of xgboost used here fall into four categories:

  • General parameters
    • nthread (number of threads used, here 8 = the number of cores in my laptop)
  • Booster parameters
    • max.depth (of tree)
    • eta (learning rate: how much each new tree's contribution is shrunk)
  • Learning task parameters
    • objective: type of learning task (softmax for multiclass classification)
    • num_class: needed for the "softmax" algorithm: how many classes to predict?
  • Command Line Parameters
    • nround: number of rounds for boosting
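
The same grouping can be made visible in code by collecting the settings in a parameter list. This is only a sketch: it uses the values from the call above, assumes you'd rather detect the number of cores than hard-code it, and writes the tree depth as max_depth. The commented-out xgb.train call, where the number of rounds is the nrounds argument, is an alternative to the xgboost() call above and is not run here:

```r
# General parameters (assumption: use all cores R can detect, else 1)
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1

params <- list(
  # General
  nthread   = n_cores,
  # Booster
  max_depth = 25,
  eta       = 0.1,
  # Learning task
  objective = "multi:softmax",
  num_class = 37
)

# Number of boosting rounds
nrounds <- 3

# With dtrain from above, training would then look like this (not run here):
# trained_model <- xgb.train(params = params, data = dtrain, nrounds = nrounds)
```

Keeping the parameters in one list makes it easier to try a few different settings without retyping the whole call.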

For a complete overview of parameters see

Predictions using the trained xgboost model

To predict new cases using the just-trained xgboost model, use the predict function. In the example below, I've included three lines that measure the accuracy on the test set too:

#Predict test set using the trained xgboost model:
pred <- predict(trained_model, dtest)

#Performance of the model:
n_correct <- sum(nlabels.test == pred)
n_err <- sum(nlabels.test != pred)
n_correct / (n_err + n_correct)
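
A single accuracy number hides which classes go wrong. As a sketch in base R, with made-up class indices instead of the real predictions (the .toy names are hypothetical), a confusion table and the accuracy per class can be computed like this:

```r
# Hypothetical true and predicted class indices (zero-based)
truth.toy <- c(0, 0, 1, 1, 2, 2)
pred.toy  <- c(0, 1, 1, 1, 2, 0)

# Overall accuracy: share of predictions that match the truth
accuracy <- mean(pred.toy == truth.toy)

# Fixing the levels keeps the table square, even if a class is never predicted
classes <- 0:2
confusion <- table(truth     = factor(truth.toy, levels = classes),
                   predicted = factor(pred.toy,  levels = classes))

# Share of each true class that was predicted correctly
per_class <- diag(confusion) / rowSums(confusion)
```

In this toy example the overall accuracy is 4 out of 6, while per_class shows that class 1 is always right and classes 0 and 2 are right only half of the time.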


As you see, it's not too hard to use the winning xgboost algorithm inside R. All code presented above can be executed in order, and will result in a working predictive model for the ongoing Women's Health Risk Assessment (WHRA) challenge! Next week (just before the deadline) I'll show you how to import the model inside Azure ML Studio, but if you want to do it earlier, I'm pretty sure you can figure out how to do it yourself (or just ask me to share it with you).


Comments (2)

  1. RedTemp

    Thanks for the post. How do you actually import the model inside Azure ML Studio?

    1. That's explained in this post: Note that it can be a bit outdated, as it was written in 2016. (I haven't worked with AzureML recently, but if you're new to ML on Azure I strongly suggest you also take a look at Machine Learning Workbench.)
