Azure ML Thursday 7: xgboost in Azure ML Studio

Last week, we trained an xgboost model for our dataset inside R. In order to use your trained model in Azure ML, you need to export & upload it, much like we did two weeks ago in Python. Today, I'll show how to import the trained R model into Azure ML Studio, thus enabling you to use xgboost there. If you combine last week's knowledge of training xgboost models with today's knowledge of importing them into Azure ML Studio, it's not too hard to climb the leaderboards of the (still ongoing) WHRA challenge!

Because the high-level path of bringing trained R models from the local R environment to the Azure ML cloud is almost identical to the Python one I showed two weeks ago, I'll use the same four steps to guide you through the process:

  1. Export the trained model
  2. Zip the exported files
  3. Upload to the Azure ML environment
  4. Embed in your Azure ML solution

Step 1: Export the trained model

In order to export the trained xgboost model, you can use the xgb.save function. When the trained xgboost model is stored inside the R variable trained_model (like we did last week), the following line will store the model in the file xgb_test.model inside the current working directory:

xgb.save(trained_model, 'xgb_test.model')

Remember that you need to re-apply all transformations to the production dataset too! In last week's post we used factorized labels to train the model. Therefore, I'll also store the vector describing the factorization, using R's save function:

labels.factorized = levels(as.factor(train.label)) # Last week's factorization for labels
save(labels.factorized, file="labels_factorized.RData")
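
Before zipping and uploading anything, it doesn't hurt to verify the export locally. This is just a quick sanity check (assuming both files are in the current working directory), not part of the Azure ML workflow itself:

library(xgboost)

# Reload the exported model and the factorization vector to confirm the export worked
reloaded_model <- xgb.load('xgb_test.model')
load('labels_factorized.RData')   # restores labels.factorized
head(labels.factorized)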

Step 2: Zip the exported files

In Python, we needed to zip the files before uploading them. In R, this is not strictly necessary - Azure ML can "natively" handle exported R datasets. However, a zip provides a clean way of handling (and updating) a set of bundled variables, so I still choose to use one. You can even include helper scripts inside the zip.
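
How you build the zip doesn't matter - any archiver will do. As a minimal sketch, you could bundle the two files from step 1 straight from R with the built-in zip function (this requires a zip program on your system; the file name xgboost_model.zip is just an example):

# Bundle the exported model and the factorization vector into a single zip file
zip('xgboost_model.zip', files = c('xgb_test.model', 'labels_factorized.RData'))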

Step 3: Upload to the Azure ML environment

Upload the zipped file as a dataset to Azure ML Studio:

[Screenshot: uploading the zip file as a new dataset in Azure ML Studio]

Step 4: Embed in your Azure ML solution

With the zipped file available as a "dataset", we can embed it inside an experiment. Inside your experiment, the zip we just uploaded is available under My Datasets. In order to use it, throw it onto the canvas and connect it to the right (as opposed to center or left) input port of an Execute R Script module[ref]Hey, that sentence is a literal copy out of the Python post two weeks ago![/ref]:

[Screenshot: the uploaded zip connected to the third input port of an Execute R Script module]

Notice that you need to set the R Version (a property of Execute R Script) to Microsoft R Open (at the time of writing 3.2.2) in order to use xgboost!

[Screenshot: the R Version property set to Microsoft R Open 3.2.2]

After that, you can access the files inside the zip using the regular R and xgboost functions for loading data files. Just remember that the contents of the zipped file are inside the "src" directory:

library(xgboost)

# Load the labels.factorized variable from the RData file:
load("src/labels_factorized.RData")

# Load the trained xgboost model:
trained_model <- xgb.load("src/xgb_test.model")

If you're not fluent in R, notice that R's load function assigns variables for you. As opposed to Python, where the result of joblib.load needs to be assigned to a variable, the contents are automatically exposed under the original variable name - in our case labels.factorized - and can be used as such:

predicted_class <- as.numeric(labels.factorized[as.numeric(predicted_class) + 1]) - 1
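
If you'd rather have the explicit assignment you know from Python's joblib, you could use saveRDS/readRDS instead of save/load - a minimal sketch, assuming you'd then zip and upload labels_factorized.rds instead of the RData file:

# Alternative: saveRDS/readRDS make the assignment explicit instead of load()'s implicit variables
saveRDS(labels.factorized, file = "labels_factorized.rds")     # locally, before zipping
labels.factorized <- readRDS("src/labels_factorized.rds")      # inside the Execute R Script module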

One More Thing: trained R models in Azure ML

Azure ML Studio also has embedded functionality to use trained R models just like you'd use the native Azure ML models. Although this can lead to a much cleaner way of integrating R models inside your Azure ML environment, currently the only supported version is CRAN R 3.1.0 - which means that you cannot use xgboost (yet).

If you try to use xgboost with CRAN R 3.1.0 as the R version, you'll end up with the following error:

Error 0063: The following error occurred during evaluation of R script:
---------- Start of error message from R ----------
there is no package called 'xgboost'


there is no package called 'xgboost'
----------- End of error message from R -----------

Conclusion + code listing

I just showed you how to embed your offline-built R xgboost model in Azure ML Studio. As you can deduce from the length of this post, it is actually very easy to do. Personally, I have way more experience with Python than with R - still, working with R already feels more natural, clean and easy when building ML models. I'll close off with the code I use inside the Execute R Script module:

# Use Microsoft R Open 3.2.2 for R Version module parameter to get xgboost library

dataset1 <- maml.mapInputPort(1)

# zip file is in directory src:
# source("src/yourfile.R");
# load("src/yourData.rdata");
# xgb.load("src/yourmodel.model");

library(ROCR)
library(xgboost)
library(Matrix)
library(ade4)
library(data.table)

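# One-hot encode the categorical features with ade4::acm.disjonctif (the same transformation applied during training)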
ohe_feats = c('religion')
for (f in ohe_feats) {
  dataset1_dummy = acm.disjonctif(dataset1[f])
  dataset1[f] = NULL
  dataset1 = cbind(dataset1, dataset1_dummy)
}
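
# Set the patientID column aside so it can be re-attached to the predictions later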
patientIDs = dataset1["patientID"]
dataset1["patientID"] = NULL
dataset1_cleared = dataset1
dataset1_cleared["combined_label"] = NULL
dataset1_cleared["segment"] = 0
dataset1_cleared["subgroup"] = 0
dataset1_cleared["INTNR"] = NULL
test <- as.matrix(dataset1_cleared)

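# Wrap the feature matrix in an xgb.DMatrix; NA marks missing values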
dtest <- xgb.DMatrix(test, missing=NA)

# Load model from file
trained_model <- xgb.load("src/xgb_test.model"); 
scores <- predict(trained_model, dtest)

# Make predictions
predicted_class <- round(scores)

# Translate back to the original labels
load("src/labels_factorized.RData")
predicted_class <- as.numeric(labels.factorized[as.numeric(predicted_class) + 1]) - 1


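# Split the predicted combined label into its geo, segment and subgroup digits and map the output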
data.set <- data.frame(
  "patientID"=patientIDs, 
  "Geo_Pred"=as.numeric(substring(predicted_class, 1, 1)),
  "Segment_Pred"=as.numeric(substring(predicted_class, 2, 2)), 
  "Subgroup_Pred"=as.numeric(substring(predicted_class, 3, 3)))
maml.mapOutputPort("data.set");

Combine last week's knowledge of using xgboost with today's knowledge of using it inside Azure ML Studio, and it's not too hard to climb the leaderboards of the (still ongoing) WHRA challenge!