Azure ML Thursday 5: trained Python models
15 September 2016
Last week, we stepped out of Azure ML to look at building ML models in Python using scikit-learn. Today, we focus on getting the trained model back into Azure ML - the place where my ML solutions live in a managed, enterprise environment.
The path of bringing a trained model from the local Python/Anaconda environment to cloud Azure ML is roughly as follows:
- Export the trained model
- Zip the exported files
- Upload to the Azure ML environment
- Embed in your Azure ML solution
Sounds simple, and indeed it isn't too hard. The things getting in the way of "just" doing it are primarily a lack of Python / scikit-learn knowledge ("how do you export a trained model in the first place?") and a general lack of ML experience (remember that you need to perform all transformations you applied to the training data in exactly the same way in production!). As soon as you've learned how to tackle the first hurdle and seen the trick of importing models inside Azure ML Studio, hardly anything is holding you back from deploying your locally-developed masterpieces to production.
Step 1: Export the trained model
Remember that your trained model in Python is stored in "just" another variable - just as you're used to in (almost) any object-oriented language. Python can export the content of any variable using a process called pickling[ref]For non-native speakers: that's a verb - to pickle. It's got to be a pun on pickling herring for preservation[/ref]. When you pickle an object, the bytes currently in memory representing the object are dumped to (and can be loaded again from) a file.
It's actually quite easy:
import pickle

# Export variable my_ml_model to the file ExportedFileName.p
pickle.dump(my_ml_model, open("ExportedFileName.p", "wb"))

# Import the file ExportedFileName.p, store results in variable my_ml_model
my_ml_model = pickle.load(open("ExportedFileName.p", "rb"))
For scikit-learn, it's recommended you use the joblib replacement for pickle[ref]joblib is basically more efficient at saving most numpy matrices[/ref]. It's not necessary to use joblib (you can also use pickle), but it's more efficient. Plus, it's even easier to write: you don't have to worry about file-opening modes like the "wb" and "rb" above.
from sklearn.externals import joblib

# Export variable my_ml_model to file ExportedFileName.p
joblib.dump(my_ml_model, 'ExportedFileName.p')
For large objects, joblib often saves the contents in multiple files, whose filenames are suffixed with _(counter).npy. When loading, you must keep all files representing a single object together in one folder, but you don't have to interact with any of the '.npy' files: you only interact with the file you saved explicitly.
from sklearn.externals import joblib

# Import the file ExportedFileName.p, store contents in variable my_ml_model.
# Even if the object is stored in numerous ".npy" files, we still only point
# to the filename we used when saving: ExportedFileName.p in this case
my_ml_model = joblib.load('ExportedFileName.p')
Step 2: Zip the exported files
In order to use pickled objects inside Azure ML's Execute Python Script module, we need to zip everything and upload it as a dataset. Inside the zip file, all pickled objects should be in the root.
Besides pickled objects, you can include Python scripts in the zip file too. For example, you could add a Python script that unpickles the objects you need for a particular ML model, so you don't have to remember the syntax and exact paths where Azure ML stores the contents. These other scripts can be consumed easily by the Execute Python Script module, as I'll show in step 4.
from sklearn.externals import joblib
import sys
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# Azure unzips the zip contents to the folder "Script Bundle"
sys.path.append(r".\\Script Bundle\\")
trained_model = joblib.load('.\\Script Bundle\\MyExportedFile.p')

def predict(data):
    # I usually include a data preparation function too, omitted here for conciseness
    label = trained_model.predict(data)
    return label
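Creating the zip file itself can also be scripted. A minimal sketch using Python's built-in zipfile module (the file names below are examples, not fixed requirements; the placeholder files are only there so the sketch runs standalone):

```python
import zipfile

# Illustrative file names - in practice these are the files you exported in step 1
files_to_zip = ["MyExportedFile.p", "my_helper_script.py"]

# For this standalone sketch, create placeholder files to zip
for name in files_to_zip:
    with open(name, "wb") as f:
        f.write(b"placeholder")

with zipfile.ZipFile("ScriptBundle.zip", "w") as bundle:
    for name in files_to_zip:
        # arcname=name keeps every file in the root of the zip,
        # which is where Azure ML expects the pickled objects
        bundle.write(name, arcname=name)

print(zipfile.ZipFile("ScriptBundle.zip").namelist())
```

The important part is that the pickled objects end up in the zip's root, not in a subfolder.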
Step 3: Upload to the Azure ML environment
Azure ML has no way to upload "just" libraries - all files are treated equally. The zip file should be uploaded as a dataset:
Step 4: Embed in Azure ML experiment
With the zipped file available as a "dataset", we can embed it inside an experiment. Inside your experiment, the zip we just uploaded is available under My Datasets. In order to use it, throw it onto the canvas and connect it to the right (as opposed to center or left) input port of an Execute Python Script module:
When running the experiment, Azure ML Studio extracts the files inside the zip dataset to the folder "Script Bundle". From within Python you can access the files via that relative path:
import sys
from sklearn.externals import joblib

sys.path.append(r".\\Script Bundle\\")
trained_model = joblib.load('.\\Script Bundle\\MyExportedFile.p')
As described under step 2, you could also include helper scripts. To use helper scripts, you don't have to memorize the path: you can just import the script using Python's import function:
import pandas as pd

# azureml_main is required and can have up to two inputs, representing the two
# leftmost input ports of the Execute Python Script module
def azureml_main(data_frame):
    import my_helper_script  # loads my_helper_script.py
    data_frame["Scored Labels"] = my_helper_script.predict(data_frame)
    return data_frame
Through the use of a helper script, the amount of code inside the Execute Python Script module is kept to a minimum - which makes your datasets more portable and easier to maintain.
One More Thing: Including all transformations
In order to repeat all transformations you did on the training set in production, it's important to export not only the trained ML model: all transformations need to be exported too. When using last week's sample code, there are four objects to be exported (Imputer, Religion-mapper, One-hot encoder and the actual trained model).
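As a sketch of what that combined export could look like: you can pickle all fitted objects into a single dictionary, so they always travel together. The variable names below are assumptions based on last week's sample code (here replaced by empty stand-ins so the sketch runs standalone):

```python
import pickle

# Stand-ins for the four fitted objects from last week's sample code
# (in practice: a fitted Imputer, the Religion-mapper, a fitted
# one-hot encoder and the trained model)
imputer, religion_mapper, one_hot_encoder, trained_model = {}, {}, {}, {}

# Bundle everything needed in production into one dictionary...
bundle = {
    "imputer": imputer,
    "religion_mapper": religion_mapper,
    "one_hot_encoder": one_hot_encoder,
    "model": trained_model,
}
pickle.dump(bundle, open("AllObjects.p", "wb"))

# ...and in production, load them all back in one go
bundle = pickle.load(open("AllObjects.p", "rb"))
trained_model = bundle["model"]
```

This way there's only one file to zip and upload, and you can't accidentally forget a transformer.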
If you use just the transformers from within scikit-learn, you could make your life a lot easier by using pipelines. Check out the pipeline documentation in the scikit-learn docs as well as an example of a pipeline constructed from one Imputer and one RandomForestRegressor if you want to know how. Don't worry - you'll find it's pretty easy :).
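A minimal sketch of such a pipeline, with made-up toy data (note: in recent scikit-learn versions, Imputer has been replaced by SimpleImputer):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # called Imputer in older scikit-learn
from sklearn.ensemble import RandomForestRegressor

# Toy training data containing a missing value
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 1.0], [2.0, 2.0]])
y = np.array([1.0, 2.0, 3.0, 1.5])

# Imputation and model become one object: fit once, pickle once
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipe.fit(X, y)
predictions = pipe.predict(X)  # missing values are imputed automatically
```

Pickling pipe exports the imputation statistics and the trained forest together, so step 1 above becomes a single dump call.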
Last week, I showed you a brief summary of using Python with scikit-learn to train your ML models. This enhances your possibilities of applying ML techniques vastly. Today was the follow-up: how to use your trained Python ML models again within Azure ML.
With today's knowledge, it's perfectly doable to participate in an Azure ML competition using your enhanced Python ML models, or, with the help of your personal data scientist, to port their ingenious models to the managed Azure ML environment.
Can you tell me what version of Scikit-Learn you used? AzureML currently runs version 15.1 which makes it difficult to use serialised models from later versions of Scikit-Learn on AzureML.
Koos van Strien
What problems do you run into?
If I remember correctly, I used the version that came with Anaconda 4.1.1 (64-bit) - which is version 0.17.1. However, I might have used Anaconda 4.0 in order to remain compatible (Anaconda 4.0 is the one used on Azure).
How do I solve this problem?
Error 0085: The following error occurred during script evaluation, please view the output log for more information:
---------- Start of error message from Python interpreter ----------
Caught exception while executing function: Traceback (most recent call last):
File "C:\pyhome\lib\pickle.py", line 268, in _getattribute
obj = getattr(obj, subpath)
AttributeError: module 'sklearn.externals.joblib.numpy_pickle' has no attribute 'NumpyArrayWrapper'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\server\invokepy.py", line 199, in batch
odfs = mod.azureml_main(*idfs)
File "C:\temp\23b0a8ea22e745bdbc79961d5cf1d10a.py", line 22, in azureml_main
import helper #loads my_helper_script.py
File ".\Script Bundle\helper.py", line 10, in
trained_model = joblib.load('.\\Script Bundle\\clfGBR.p')
File "C:\pyhome\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 459, in load
obj = unpickler.load()
File "C:\pyhome\lib\pickle.py", line 1039, in load
File "C:\pyhome\lib\pickle.py", line 1343, in load_stack_global
File "C:\pyhome\lib\pickle.py", line 1386, in find_class
return _getattribute(sys.modules[module], name)
File "C:\pyhome\lib\pickle.py", line 271, in _getattribute
AttributeError: Can't get attribute 'NumpyArrayWrapper' on
Process returned with non-zero exit code 1
---------- End of error message from Python interpreter ----------
Start time: UTC 10/31/2017 12:54:11
End time: UTC 10/31/2017 12:54:28
Did you ever figure out this error?