BaseN Encoding and Grid Search in category_encoders

In the past I’ve posted about the various categorical encoding methods one can use for machine learning tasks, such as one-hot, ordinal, and binary encoding.  In my OSS package, category_encoders, I’ve added a single scikit-learn-compatible encoder called BaseNEncoder, which lets the user pick a base (2 for binary, N for ordinal, 1 for one-hot, or anything in between) and get consistently encoded categorical variables out.  Note that base 1 and one-hot aren’t really the same thing, but in this case it’s convenient to treat them as such.
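
To make the base parameter concrete, here’s a minimal sketch of what it controls (the toy column is my own; it assumes BaseNEncoder’s base and cols arguments): the same categorical column gets wider or narrower as the base changes.

import pandas as pd
from category_encoders import BaseNEncoder

# a toy column with four distinct categories
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'yellow', 'red']})

# the same column encoded at several bases; higher bases pack the
# categories into fewer output columns
for base in [1, 2, 3]:
    encoded = BaseNEncoder(cols=['color'], base=base).fit_transform(df)
    print('base=%d -> %d output columns' % (base, encoded.shape[1]))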

Practically, this adds very little new functionality; people rarely use base 3, base 8, or any base other than ordinal or binary in real problems.  Where it becomes useful, however, is when this encoder is coupled with a grid search.

from __future__ import print_function
from sklearn import datasets
# note: in scikit-learn >= 0.18 these two modules were merged into sklearn.model_selection
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from category_encoders.basen import BaseNEncoder
from examples.source_data.loaders import get_mushroom_data
from sklearn.linear_model import LogisticRegression

# first we get data from the mushroom dataset
X, y, _ = get_mushroom_data()
X = X.values  # use numpy array not dataframe here
n_samples = X.shape[0]

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# create a pipeline
ppl = Pipeline([
    ('enc', BaseNEncoder(base=2, return_df=False, verbose=True)),
    ('clf', LogisticRegression())
])


# Set the parameters by cross-validation
tuned_parameters = {
    'enc__base': [1, 2, 3, 4, 5, 6]
}

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %sn" % score)
    clf = GridSearchCV(ppl, tuned_parameters, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:n")
    print(clf.best_params_)
    print("nGrid scores on development set:n")
    for mean, std, params in clf.grid_scores_:
        print("%s (+/-%s) for %s" % (mean, std * 2, params))

    print("nDetailed classification report:n")
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.n")
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))

This code, from the examples directory of the category_encoders repository, uses an ordinary scikit-learn grid search to find the optimal base for encoding categorical variables.  The trade-off between how well pairwise distances between categories are preserved and the dimensionality of the final dataset is no longer a parameter that has to be tuned by hand.
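
To put a rough number on the dimensionality side of that trade-off: for base >= 2, a column with k distinct categories needs roughly ceil(log_base(k)) encoded columns, while one-hot needs k of them.  A quick back-of-the-envelope helper (basen_width is my own illustrative function, not part of category_encoders):

import math

def basen_width(n_categories, base):
    # rough number of output columns a BaseN encoding needs for a
    # column with n_categories distinct values (illustrative only)
    if base == 1:
        return n_categories  # one-hot: one column per category
    return int(math.ceil(math.log(n_categories, base)))

# a 100-category column shrinks quickly as the base grows
for base in [1, 2, 3, 4, 8]:
    print('base=%d -> ~%d columns' % (base, basen_width(100, base)))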

Running the full grid-search script above, we get:

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (mean 0.99952575928) for [ 1.         1.         1.         1.         0.9976247]
{'enc__base': 2} (mean 0.99368975662) for [ 0.9805492   0.99763033  0.99621212  0.9964455   0.9976247 ]
{'enc__base': 3} (mean 0.97984024812) for [ 0.99411765  0.98387419  0.9651717   0.96970966  0.98633155]
{'enc__base': 4} (mean 0.98267165503) for [ 0.99500636  0.96541172  0.98387419  0.99013767  0.97892831]
{'enc__base': 5} (mean 0.980174018635) for [ 0.97773263  0.97556628  0.98636545  0.97058734  0.99063232]
{'enc__base': 6} (mean 0.968955522835) for [ 0.96788716  0.95480882  0.97648608  0.97769848  0.96790524]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062

# Tuning hyper-parameters for recall

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (mean 0.99901413298) for [ 0.99761905  1.          1.          1.          0.99744898]
{'enc__base': 2} (mean 0.99330481571) for [ 0.98904035  0.98854962  0.99745547  1.          0.99148239]
{'enc__base': 3} (mean 0.942171905895) for [ 0.95086332  0.8547619   0.94664667  0.98862857  0.97008487]
{'enc__base': 4} (mean 0.99012628798) for [ 0.99261178  0.98005271  0.98436023  0.99618321  0.99744898]
{'enc__base': 5} (mean 0.965832582525) for [ 0.98530534  0.98657761  0.89642857  0.9800385   0.98086735]
{'enc__base': 6} (mean 0.973237317065) for [ 0.96687568  0.97385496  0.99507452  0.95912053  0.97123861]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062

This shows that for a relatively simple problem like this, with a small dataset, the dimension-inefficient one-hot encoding (base=1) is the best option available.
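
As an aside, if you’re on scikit-learn 0.18 or newer, the grid_search and cross_validation modules used above have been merged into sklearn.model_selection, and grid_scores_ has been replaced by the cv_results_ dict.  The report loop would look roughly like this (a sketch reusing ppl, tuned_parameters, and the train split from the script above):

from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(ppl, tuned_parameters, cv=5, scoring='precision_macro')
clf.fit(X_train, y_train)

# cv_results_ exposes parallel arrays instead of a list of tuples
for mean, std, params in zip(clf.cv_results_['mean_test_score'],
                             clf.cv_results_['std_test_score'],
                             clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))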
We’ve got a lot of cool projects in the pipeline in preparation for the 1.3.0 release, which will be the first since category_encoders was included in scikit-learn-contrib, so if you’re interested in this kind of work, head over to GitHub or reach out here to get involved.