Azure ML Thursday 5: trained Python models

Last week, we stepped out of Azure ML to look at building ML models in Python using scikit-learn. Today, we focus on getting the trained model back into Azure ML - the place where my ML solutions live in a managed, enterprise environment.

The path from a trained model in your local Python/Anaconda environment to the Azure ML cloud is roughly as follows:

  1. Export the trained model
  2. Zip the exported files
  3. Upload to the Azure ML environment
  4. Embed in your Azure ML solution

Sounds simple, and indeed it isn't too hard. The things getting in the way of "just" doing it are primarily a lack of Python / scikit-learn knowledge ("how do you export a trained model in the first place?") and a general lack of ML experience (remember that you need to apply all transformations you did on the training data exactly the same way in production!). As soon as you've learned how to tackle the first hurdle and seen the trick of importing models inside Azure ML Studio, hardly anything is holding you back from deploying your locally developed masterpieces to production.

Step 1: Export the trained model

Remember that your trained model in Python is stored in "just" another variable - just as you're used to in (almost) any object-oriented language. Python can export the contents of any variable using a process called pickling[ref]For non-native speakers: that's a verb - to pickle. It has to be a pun on pickling herring for preservation[/ref]. When you pickle an object, the bytes currently in memory representing the object are dumped to (and can be loaded again from) a file.

It's actually quite easy:

import pickle

#Export variable my_ml_model to the file ExportedFileName.p
pickle.dump( my_ml_model, open( "ExportedFileName.p", "wb" ) )

#Import the file ExportedFileName.p, store results in variable my_ml_model
my_ml_model = pickle.load( open( "ExportedFileName.p", "rb" ) )

For scikit-learn, it's recommended to use joblib as a replacement for pickle[ref]joblib is basically more efficient at saving most numpy matrices[/ref]. Using joblib isn't mandatory (you can also use pickle), but it's more efficient. Plus, it's even easier to write: you don't have to worry about file-opening modes like the "wb" and "rb" above.

from sklearn.externals import joblib

#Export variable my_ml_model to file ExportedFileName.p
joblib.dump(my_ml_model, 'ExportedFileName.p')

[Screenshot: exporting a model with joblib.dump]

For large objects, joblib often saves the contents in multiple files, whose filenames will be appended with _(counter).npy.

[Screenshot: joblib saving a large object as multiple files]

You must keep all files representing a single object together in one folder when loading it, but you don't have to interact with any of the '.npy' files: you only interact with the file you saved explicitly.

from sklearn.externals import joblib

#Import the file ExportedFileName.p, store contents in variable my_ml_model
#Even if the object is stored in numerous ".npy" files, we still only point to the filename we used when saving: ExportedFileName.p in this case
my_ml_model = joblib.load('ExportedFileName.p')

 

Step 2: Zip the exported files

In order to use pickled objects inside Azure ML's Execute Python Script module, we need to zip everything and upload it as a dataset. Inside the zip file, all pickled objects should be in the root.

[Screenshot: contents of the zip file]
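
You can of course create this zip by hand, but it's just as easy to do it from Python. Below is a minimal sketch using the standard library's zipfile module; the file name ExportedFileName.p matches the examples above, and ScriptBundle.zip is just an illustrative name.

import zipfile
import glob

#Collect the pickled object plus any accompanying "_(counter).npy" files joblib created,
#and write them to the root of the zip archive (no subfolders)
with zipfile.ZipFile('ScriptBundle.zip', 'w') as bundle:
    for file_name in glob.glob('ExportedFileName*'):
        bundle.write(file_name, arcname=file_name)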

Besides pickled objects, you can include Python scripts in the zip file too. For example, you could add a Python script that unpickles the objects you need for a particular ML model, so you don't have to remember the syntax and the exact paths where Azure ML stores the contents. These other scripts can easily be consumed by the Execute Python Script module, as I'll show in step 4.

from sklearn.externals import joblib
import sys
#These preprocessing imports are used by the data preparation function that is omitted below
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

sys.path.append(r".\\Script Bundle\\") #Azure unzips the zip contents to the folder "Script Bundle"
trained_model = joblib.load('.\\Script Bundle\\MyExportedFile.p')

def predict(data):
  #I usually include a data preparation function too, omitted here for conciseness
  label = trained_model.predict(data)
  return label

 

Step 3: Upload to the Azure ML environment

Azure ML has no way to upload "just" libraries - all files are treated equally. The zip file should be uploaded as a dataset:

[Screenshot: uploading the zip file as a dataset]

Step 4: Embed in Azure ML experiment

With the zipped file available as a "dataset", we can embed it inside an experiment. Inside your experiment, the zip we just uploaded is available under My Datasets. In order to use it, throw it onto the canvas and connect it to the right (as opposed to center or left) input port of an Execute Python Script module:

[Screenshot: connecting the uploaded dataset to the Execute Python Script module]

When running the experiment, Azure ML Studio extracts the files inside the zip dataset to the folder "Script Bundle". From within Python you can access the files via that relative path:

sys.path.append(r".\\Script Bundle\\")
trained_model = joblib.load('.\\Script Bundle\\MyExportedFile.p')

As described under step 2, you could also include helper scripts. To use a helper script, you don't have to memorize the path: you can just import it using Python's import statement:

import pandas as pd

#azureml_main is required and can have two inputs, representing the two leftmost input ports of the Execute Python Script module
def azureml_main(data_frame): 
    import my_helper_script #loads my_helper_script.py
    data_frame["Scored Labels"] = my_helper_script.predict(data_frame)
    return data_frame

Through the use of a helper script, the amount of code inside the Execute Python Script module is kept to a minimum - which makes your datasets more portable and easier to maintain.

One More Thing: Including all transformations

In order to repeat all transformations you did on the training set in production, it's important to export not only the trained ML model: all transformations need to be exported too. When using last week's sample code, there are four objects to be exported (Imputer, Religion-mapper, One-hot encoder and the actual trained model).
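
As a minimal sketch, exporting all four with joblib could look like this. Note that the variable and file names are assumptions based on last week's sample code - use whatever names your own script uses.

from sklearn.externals import joblib

#Variable and file names below are assumptions based on last week's sample code
joblib.dump(imputer, 'Imputer.p')                #Imputer fitted on the training data
joblib.dump(religion_mapper, 'ReligionMapper.p') #encoder that maps the Religion column
joblib.dump(one_hot_encoder, 'OneHotEncoder.p')  #OneHotEncoder for the categorical columns
joblib.dump(trained_model, 'TrainedModel.p')     #the trained ML model itself

All four files then go into the zip from step 2, and your helper script loads and applies them in the same order during scoring.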

If you use only transformers from within scikit-learn, you can make your life a lot easier by using pipelines. Check out the pipeline documentation in the scikit-learn docs, as well as an example of a pipeline constructed from one Imputer and one RandomForestRegressor, if you want to know how. Don't worry - you'll find it's pretty easy :).
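
For illustration, here's a minimal sketch of such a pipeline. X_train and y_train are placeholders for your own training data; the Imputer import matches the scikit-learn version used throughout this series.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.externals import joblib

#Chain the imputation and the regressor into a single estimator
pipeline = Pipeline([
    ('imputer', Imputer(strategy='median')),
    ('regressor', RandomForestRegressor())
])

#Fit and export the whole pipeline as one object:
#the imputation is then repeated automatically on every predict() call
pipeline.fit(X_train, y_train) #X_train, y_train: placeholders for your training data
joblib.dump(pipeline, 'ExportedPipeline.p')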

Conclusion

Last week, I gave you a brief summary of using Python with scikit-learn to train your ML models. This vastly expands your possibilities for applying ML techniques. Today was the follow-up: how to use your trained Python ML models within Azure ML again.

With today's knowledge, it's perfectly doable to participate in an Azure ML competition using your enhanced Python ML models, or, with the help of your personal data scientist, to port their ingenious models to the managed Azure ML environment.

Comments (4)

  1. Can you tell me what version of Scikit-Learn you used? AzureML currently runs version 15.1 which makes it difficult to use serialised models from later versions of Scikit-Learn on AzureML.

    1. What problems do you run into?

      If I remember correctly, I used the version that came with Anaconda 4.1.1 (64-bit) - which is version 0.17.1. However, I might have used Anaconda 4.0 in order to remain compatible (Anaconda 4.0 is the one used on Azure).

  2. ameen

    How do I solve this problem?
    Error 0085: The following error occurred during script evaluation, please view the output log for more information:
    ---------- Start of error message from Python interpreter ----------
    Caught exception while executing function: Traceback (most recent call last):
    File "C:\pyhome\lib\pickle.py", line 268, in _getattribute
    obj = getattr(obj, subpath)
    AttributeError: module 'sklearn.externals.joblib.numpy_pickle' has no attribute 'NumpyArrayWrapper'
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
    File "C:\server\invokepy.py", line 199, in batch
    odfs = mod.azureml_main(*idfs)
    File "C:\temp\23b0a8ea22e745bdbc79961d5cf1d10a.py", line 22, in azureml_main
    import helper #loads my_helper_script.py
    File ".\Script Bundle\helper.py", line 10, in
    trained_model = joblib.load('.\\Script Bundle\\clfGBR.p')
    File "C:\pyhome\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 459, in load
    obj = unpickler.load()
    File "C:\pyhome\lib\pickle.py", line 1039, in load
    dispatch[key[0]](self)
    File "C:\pyhome\lib\pickle.py", line 1343, in load_stack_global
    self.append(self.find_class(module, name))
    File "C:\pyhome\lib\pickle.py", line 1386, in find_class
    return _getattribute(sys.modules[module], name)[0]
    File "C:\pyhome\lib\pickle.py", line 271, in _getattribute
    .format(name, obj))
    AttributeError: Can't get attribute 'NumpyArrayWrapper' on
    Process returned with non-zero exit code 1

    ---------- End of error message from Python interpreter ----------
    Start time: UTC 10/31/2017 12:54:11
    End time: UTC 10/31/2017 12:54:28

    1. Sadiq

      Hey ameen,

      Did you ever figure out this error?
