Azure ML Thursday 4: ML in Python
8 September 2016
In this fourth installment of the Azure ML Thursday series we move our ML solution out of Azure ML and take our first steps in Python with scikit-learn. Today we look at using "just" Python for doing ML; next week we bring the trained models to Azure ML. You'll notice there's a lot more to tweak and improve once you do your Machine Learning here! ML in Python is quite a large topic, so many subjects will only be touched on lightly. Nonetheless, I try to give just enough samples and basics to get your first ML models running!
Python, Anaconda, Jupyter and Azure ML Studio
Python is often used in conjunction with the scikit-learn collection of libraries. The most important libraries used for ML in Python are grouped inside a distribution called Anaconda. This is the distribution that's also used inside Azure ML[ref]at the time of writing Azure ML uses Anaconda 4.0[/ref]. Besides Python and scikit-learn, Anaconda contains all kinds of Data Science-oriented packages. It's a good idea to install Anaconda as a distribution and use Jupyter (formerly IPython) as your development environment: Anaconda gives you almost the same environment on your local machine as the one your code will run in once inside Azure ML, and Jupyter gives you a nice way to keep code (in Python) and documentation (in Markdown) together.
Anaconda can be downloaded from https://www.continuum.io/downloads.
Jupyter - cloud or local?
Jupyter can be run from within Azure ML Studio too (currently in preview, under the name "Azure Notebooks"). The datasets that are available within your experiments can easily be used in a notebook, and Jupyter plays nicely with Azure ML Experiments overall. Personally, I still favor a local Jupyter installation, for two reasons:
- On Azure Notebooks, if you're idle for more than one hour the notebook server is reclaimed and recycled (so there's no point in starting a grid search CV that runs for multiple hours)
- I couldn't find a way to locate my pickled models on the Azure Notebooks server
I do still use Azure Notebooks though, especially to double-check my locally developed models in the cloud (I've had some strange issues there). Jupyter is included with Anaconda.
The actual model development
When you've taken your first steps inside Azure ML and cross over to Python/sklearn to perform Machine Learning, there are a few "new" things to learn:
- X and Y
- NA-values
- Non-existence of multiclass classifiers
- Splitting column types for preprocessing
- Feature stacking
X and Y
Inside Azure ML studio the terms X and Y are not used - but I think they're the most common terms in supervised Machine Learning:
- X: the features used as input for the predictive model. In the Iris Flower example: sepal_length, sepal_width, petal_width, petal_length
- Y: the feature(s) to be predicted. In the Iris Flower example: class
Xtrain and Ytrain are X and Y, but limited to the rows of the training set.
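To make this concrete, here is a minimal sketch in pandas. The file name iris.csv is a hypothetical example; the column names are taken from the Iris example above:

import pandas as pd

iris = pd.read_csv("iris.csv")  # hypothetical file containing the Iris Flower dataset
X = iris[["sepal_length", "sepal_width", "petal_width", "petal_length"]]  # features used as input
y = iris["class"]                                                         # the label to predict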
NA-values
Scikit-learn algorithms cannot handle blank values (here encoded as a NaN). In Azure ML experiments, you usually clean blanks using "Clean Missing Data".
In scikit-learn, blanks are filled easily using an Imputer:
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='median', axis=0)
X = imp.fit_transform(X)
Two things to remember here:
- Imputer cannot handle textual columns - so in order to impute the most frequent values on textual (categorical) columns, you need to convert them to numbers first[ref]Another, perhaps better, option is to build a more advanced imputer yourself, which handles different column types in different ways. A good example is given in this StackOverflow answer[/ref]
- Imputer can use the median, mean or the most frequent value to fill the blanks. The median used here is the median of the training set. When processing the test set or predicting real-world values, remember to use the already trained Imputer too: every transformation applied for training should be repeated for prediction (see the sketch below)
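As a minimal sketch of that last point (X_train and X_test are assumed names for the training and test feature matrices):

from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='median', axis=0)
X_train = imp.fit_transform(X_train)  # learn the medians on the training set and fill its blanks
X_test = imp.transform(X_test)        # reuse the *fitted* imputer for the test set - don't fit again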
Non-existence of multiclass classifiers (and how to work around that)
Azure ML Studio provides "multiclass models", which can be trained with one label column containing multiple classes. For example: one column "religion", four possible values: "Roman Catholic", "Muslim", "Eastern Orthodox" and "Jew". Behind the scenes, however, classification models often work a little differently: any classification column can only be one or zero, but you can predict multiple columns, all of them binary. Notice this enables you to transform the model: the column "religion" can be transformed from one four-value column into four binary columns "isRomanCatholic", "isMuslim", "isEasternOrthodox" and "isJew". For every row, one of these columns contains a one, the rest contain zeroes.
The process of translating the single multiclass column to multiple two-class columns is called One-Hot Encoding:
from sklearn.preprocessing import OneHotEncoder

#X_Categorical is the subset of columns that are categorical from X
enc = OneHotEncoder()
X_Categorical = enc.fit_transform(X_Categorical)
Again, remember that exactly this process needs to be repeated for all predictions too!
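As a hedged one-liner to illustrate (X_New_Categorical is an assumed name for the categorical columns of the rows you want predictions for):

X_New_Categorical = enc.transform(X_New_Categorical)  # reuse the fitted encoder; never fit on prediction data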
Splitting column types for preprocessing
One-Hot encoding only needs to be done on multiclass columns that must be translated to multiple binary-class columns. In other words, we don't want to include numeric columns like age. In order to preprocess the different column types separately, we first split the columns. I use a helper function to do this[ref]this one could probably be made more pythonic[/ref]:
def indexesof(searchlist, indexlist):
    returnlist = []
    for i in indexlist:
        returnlist.append(searchlist.index(i))
    return returnlist
Using this helper function, I define which column headers are used to split X into X_Categorical, X_Numeric and X_Binary[ref]It's a good idea to do this before imputing: it enables you to impute values per column type, as mentioned above[/ref]:
numeric_columns = ["age", "Debut", "babydoc"]
binary_columns = ["christian", "muslim", "hindu", "other", "cellphone", "motorcycle",
                  "radio", "cooker", "fridge", "furniture", "computer", "cart",
                  "irrigation", "thrasher", "car", "generator", "EVER_HAD_SEX",
                  "EVER_BEEN_PREGNANT", "CHILDREN", "india", "married", "inschool",
                  "ownincome", "LaborDeliv", "ModCon", "usecondom", "hivknow",
                  "lowlit", "highlit", "urban", "rural", "single"]
categorical_columns = ["geo", "REGION_PROVINCE", "DISTRICT", "electricity", "tribe",
                       "foodinsecurity", "religion", "educ", "multpart", "literacy",
                       "urbanicity"]

categorical_features = indexesof(column_mapping, categorical_columns)
X_Categorical = X.swapaxes(0, 1)[categorical_features].swapaxes(0, 1)
X_Numeric = X.swapaxes(0, 1)[indexesof(column_mapping, numeric_columns)].swapaxes(0, 1)
X_Binary = X.swapaxes(0, 1)[indexesof(column_mapping, binary_columns)].swapaxes(0, 1)
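As a side note, the swapaxes calls above are just a way to select columns by index; numpy's fancy indexing does the same more directly. A sketch, assuming X is a two-dimensional numpy array at this point:

X_Categorical = X[:, indexesof(column_mapping, categorical_columns)]
X_Numeric = X[:, indexesof(column_mapping, numeric_columns)]
X_Binary = X[:, indexesof(column_mapping, binary_columns)]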
Feature stacking
After preprocessing, we need to paste the preprocessed columns back together. This is called feature stacking, and numpy provides it through hstack. Note that OneHotEncoder returns a sparse matrix, so it's converted to a dense array first:
import numpy as np

X = np.hstack((X_Categorical.toarray(), X_Numeric, X_Binary))
Training the model
The actual model training is quite easy - here is all the code you need:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=120, n_jobs=1)
clf.fit(X_train, y_train)
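Once fitted, getting predictions and a quick score is equally short. A minimal sketch, assuming X_valid and y_valid come from a hold-out split such as the stratified split shown below:

y_pred = clf.predict(X_valid)  # predicted classes for the validation rows
print("Validation accuracy: {0:0.3f}".format(clf.score(X_valid, y_valid)))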
Of course, stratified splits, grid searches and cross-validations are available too. I won't spell out every detail, but just to show how easy it is:
Stratified split
from sklearn.cross_validation import StratifiedKFold

eval_size = 0.4
kf = StratifiedKFold(y, round(1. / eval_size))
train_indices, valid_indices = next(iter(kf))
X_train, y_train = X[train_indices], y[train_indices]
X_valid, y_valid = X[valid_indices], y[valid_indices]
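A simpler alternative for a single stratified hold-out split is train_test_split; this sketch assumes a scikit-learn version in which it supports the stratify argument:

from sklearn.cross_validation import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)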
Gridsearch
As you can see, this is quite an extensive search (it takes some hours): not something I'd recommend, but it shows how it works! Notice that this grid search does a cross-validation by default (here with 5 folds).
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error
from time import time

tuned_params = {'n_estimators': [120, 300, 500, 800, 1200],
                'max_depth': [5, 8, 15, 25, 30, None],
                'min_samples_split': [1, 2, 5, 10, 15, 100],
                'min_samples_leaf': [1, 2, 5, 10],
                'max_features': ['log2', 'sqrt', None]}

print("# Tuning hyper-parameters for mean_squared_error")
gscv = GridSearchCV(RandomForestClassifier(random_state=0), tuned_params, cv=5, scoring='mean_squared_error')

start = time()
gscv.fit(X_train, y_train)
print("The grid cross validation lasted {0:0.2f} seconds".format(time() - start))

print(gscv.best_params_)
print("Grid scores on development set:")
for params, mean_score, scores in gscv.grid_scores_:
    print("{0:0.3f} (+/-{1:0.3f}) for {2:s}".format(mean_score, scores.std() * 2, str(params)))
Cross-validation
Imagine GridSearchCV, but with all parameters fixed...
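If you just want cross-validated scores for a fixed model, cross_val_score is the shortest route. A minimal sketch, reusing the clf, X and y from above:

from sklearn.cross_validation import cross_val_score

scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation, one score per fold
print("Accuracy: {0:0.3f} (+/- {1:0.3f})".format(scores.mean(), scores.std() * 2))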
Conclusion
The whole zoo of Python, Anaconda and pandas may seem daunting at first, but it's not very hard to make the move from Azure ML to Machine Learning in Python. I spent a lot of time figuring out how to mold the dataframes for the different operations - in shape (as_matrix, unstack, etc.) as well as in conversions. Next week we'll look at how to include trained scikit-learn models in Azure ML. To close this week's Azure ML Thursday, I'll finish with one last code listing for a basic model on the Women's Health dataset[ref]Mind you, this challenge is still running[/ref]. Disclaimer: it's deliberately kept very basic: no custom imputer per datatype, not a very good predictive model... It's up to you to tweak it with other models, tweak parameters and so on. Feel free to post your improvements in the comments below!
import pandas as pd
import numpy as np

w = pd.read_csv("../Documents/WomenHealth_Training.csv")
w["combined_label"] = w["geo"] * 100 + w["segment"] * 10 + w["subgroup"]

#Construct X by removing identification column & labels:
X = w.drop(["segment", "subgroup", "INTNR", "combined_label"], axis=1)

#Convert textual column to numeric.
#This is one-hot encoding, but in order to one-hot encode you can't have NAs.
#But in order to do imputing, you can't have textual columns.
X["religion"] = X["religion"].map({'Hindu': 1, 'Evangelical/Bo': 2, 'Muslim': 3,
                                   'Roman Catholic': 4, 'Other Christia': 5, 'Buddhist': 6,
                                   'Russian/Easter': 7, 'Traditional/An': 8, 'Other': 9,
                                   'Jewish': 10})
column_mapping = X.columns.tolist()

#Construct Y
#unstack as matrix, so an array with the right shape for creating a stratified K fold
y = w[["combined_label"]].unstack().as_matrix()

#Impute missing values
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
X = imp.fit_transform(X)

#Add some knowledge about column types.
numeric_columns = ["age", "Debut", "babydoc"]
binary_columns = ["christian", "muslim", "hindu", "other", "cellphone", "motorcycle",
                  "radio", "cooker", "fridge", "furniture", "computer", "cart",
                  "irrigation", "thrasher", "car", "generator", "EVER_HAD_SEX",
                  "EVER_BEEN_PREGNANT", "CHILDREN", "india", "married", "inschool",
                  "ownincome", "LaborDeliv", "ModCon", "usecondom", "hivknow",
                  "lowlit", "highlit", "urban", "rural", "single"]
categorical_columns = ["geo", "REGION_PROVINCE", "DISTRICT", "electricity", "tribe",
                       "foodinsecurity", "religion", "educ", "multpart", "literacy",
                       "urbanicity"]

#Helper function for returning a list of column numbers
def indexesof(searchlist, indexlist):
    returnlist = []
    for i in indexlist:
        returnlist.append(searchlist.index(i))
    return returnlist

#Group Numeric & Binary columns (select columns, not rows, from the numpy array)
X_Numeric = X[:, indexesof(column_mapping, numeric_columns)]
X_Binary = X[:, indexesof(column_mapping, binary_columns)]

#Perform One-Hot encoding on categorical columns
from sklearn.preprocessing import OneHotEncoder
categorical_features = indexesof(column_mapping, categorical_columns)
X_Categorical = X[:, categorical_features]
enc = OneHotEncoder()
X_Categorical = enc.fit_transform(X_Categorical)

#Stack features (OneHotEncoder returns a sparse matrix, so convert it to a dense array first)
X = np.hstack((X_Categorical.toarray(), X_Numeric, X_Binary))

#Train Gaussian Naive Bayes model
#(Hint: this doesn't perform very well ;-))
from sklearn.naive_bayes import GaussianNB
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error
from time import time

tuned_params = {}
print("# Tuning hyper-parameters for mean_squared_error")
gscv = GridSearchCV(GaussianNB(), tuned_params, cv=5, scoring='mean_squared_error')

start = time()
gscv.fit(X, y)
print("The grid cross validation lasted {0:0.2f} seconds".format(time() - start))

print(gscv.best_params_)
print("Grid scores on development set:")
for params, mean_score, scores in gscv.grid_scores_:
    print("{0:0.3f} (+/-{1:0.3f}) for {2:s}".format(mean_score, scores.std() * 2, str(params)))