In the past I’ve posted about the various categorical encoding methods one can use for machine learning tasks, like one-hot, ordinal, or binary encoding. In my OSS package, category_encoders, I’ve added a single scikit-learn-compatible encoder called BaseNEncoder, which allows the user to pick a base (2 for binary, N for ordinal, 1 for one-hot, or anything in between) and get consistently encoded categorical variables out. Note that base 1 and one-hot aren’t really the same thing, but in this case it’s convenient to treat them as such.
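To make the idea concrete, here is a toy, pure-Python sketch of base-N encoding (this is not category_encoders’ actual implementation): each distinct category gets an ordinal code, and that code is then written out as digits in the chosen base, one output column per digit.

```python
def base_n_encode(values, base):
    # Toy sketch only, not the category_encoders implementation:
    # assign each distinct category an ordinal code, then write that
    # code as digits in the chosen base, one output column per digit.
    levels = {v: i for i, v in enumerate(sorted(set(values)))}
    # how many base-N digits are needed to represent every code
    n_digits = 1
    while base ** n_digits < len(levels):
        n_digits += 1
    rows = []
    for v in values:
        code = levels[v]
        digits = []
        for _ in range(n_digits):
            digits.append(code % base)
            code //= base
        rows.append(digits[::-1])  # most significant digit first
    return rows

# With base=2 this reproduces binary encoding: three categories need two columns.
base_n_encode(['a', 'b', 'c', 'a'], 2)  # [[0, 0], [0, 1], [1, 0], [0, 0]]
```

With a large enough base every category fits in a single digit, which recovers ordinal encoding; smaller bases trade column count for denser codes.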
Practically, this adds very little new functionality; rarely does anyone use base 3 or base 8 or any option other than ordinal or binary in real problems. Where it becomes useful, however, is when this encoder is coupled with a grid search.
```python
from __future__ import print_function

from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from category_encoders.basen import BaseNEncoder
from examples.source_data.loaders import get_mushroom_data

# first we get data from the mushroom dataset
X, y, _ = get_mushroom_data()
X = X.values  # use numpy array not dataframe here
n_samples = X.shape[0]

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# create a pipeline
ppl = Pipeline([
    ('enc', BaseNEncoder(base=2, return_df=False, verbose=True)),
    ('clf', LogisticRegression())
])

# Set the parameters by cross-validation
tuned_parameters = {
    'enc__base': [1, 2, 3, 4, 5, 6]
}

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s\n" % score)

    clf = GridSearchCV(ppl, tuned_parameters, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:\n")
    print(clf.best_params_)
    print("\nGrid scores on development set:\n")
    # each entry of grid_scores_ is (parameters, mean CV score, per-fold CV scores)
    for params, mean_score, cv_scores in clf.grid_scores_:
        print("%s (+/-%s) for %s" % (params, mean_score * 2, cv_scores))

    print("\nDetailed classification report:\n")
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.\n")
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
```
This code, from HERE, uses a normal scikit-learn grid search to find the optimal base for encoding categorical variables. The trade-off between how well pairwise distances between categories are preserved and the dimensionality of the final dataset is no longer a difficult parameter to tune.
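The dimensionality side of that trade-off is easy to quantify. As a rough sketch (treating “base 1” as one-hot, as above, and ignoring any extra columns a real encoder might emit), the number of output columns for a categorical variable is:

```python
def n_columns(n_categories, base):
    # Rough sketch of output width per categorical variable:
    # one-hot ("base 1" here) needs one column per category; any
    # other base needs enough digits to represent every ordinal code.
    if base == 1:
        return n_categories
    digits = 1
    while base ** digits < n_categories:
        digits += 1
    return digits

# A 20-level categorical: one-hot costs 20 columns, binary only 5,
# and bases 3 and up get it down to 3 columns.
[n_columns(20, b) for b in (1, 2, 3, 4)]  # [20, 5, 3, 3]
```

Larger bases shrink the dataset but pack more categories into each column, distorting the pairwise distances between them; the grid search picks the base that balances the two for the model at hand.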
By running the above script we get:
```
# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (+/-1.99905151856) for [ 1.  1.  1.  1.  0.9976247]
{'enc__base': 2} (+/-1.98737951324) for [ 0.9805492  0.99763033  0.99621212  0.9964455  0.9976247 ]
{'enc__base': 3} (+/-1.95968049624) for [ 0.99411765  0.98387419  0.9651717  0.96970966  0.98633155]
{'enc__base': 4} (+/-1.96534331006) for [ 0.99500636  0.96541172  0.98387419  0.99013767  0.97892831]
{'enc__base': 5} (+/-1.96034803727) for [ 0.97773263  0.97556628  0.98636545  0.97058734  0.99063232]
{'enc__base': 6} (+/-1.93791104567) for [ 0.96788716  0.95480882  0.97648608  0.97769848  0.96790524]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062

# Tuning hyper-parameters for recall

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (+/-1.99802826596) for [ 0.99761905  1.  1.  1.  0.99744898]
{'enc__base': 2} (+/-1.98660963142) for [ 0.98904035  0.98854962  0.99745547  1.  0.99148239]
{'enc__base': 3} (+/-1.88434381179) for [ 0.95086332  0.8547619  0.94664667  0.98862857  0.97008487]
{'enc__base': 4} (+/-1.98025257596) for [ 0.99261178  0.98005271  0.98436023  0.99618321  0.99744898]
{'enc__base': 5} (+/-1.93166516505) for [ 0.98530534  0.98657761  0.89642857  0.9800385  0.98086735]
{'enc__base': 6} (+/-1.94647463413) for [ 0.96687568  0.97385496  0.99507452  0.95912053  0.97123861]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062
```
This shows that for this relatively simple problem, with a small dataset, the dimension-inefficient one-hot encoding (base=1) is the best option available.
We’ve got a lot of cool projects in the pipeline in preparation for the 1.3.0 release, which will be the first since the library was included in scikit-learn-contrib, so if you’re interested in this kind of work, head over to GitHub or reach out here to get involved.