Automatically Learning From Data - Logistic Regression With L2 Regularization in Python

Logistic Regressionis susceptible to overfitting, possibly making
Logistic regression is used for binary classificationlogistic-regression-without-regularization look worse
problems -- where you have some examples thatthan it is.
are "on" and other examples that are "off." YouThe Python Codefrom scipy.optimize.optimize
get as input a training set; which has someimport fmin_cg, fmin_bfgs, fminimport numpy as
examples of each class along with a label sayingnpdef sigmoid(x):return 1.0 / (1.0 +
whether each example is "on" or "off". The goal isnp.exp(-x))class SyntheticClassifierData():def
to learn a model from the training data so that__init__(self, N, d):
you can predict the label of new examples that""" Create N instances of d dimensional input
you haven't seen before and don't know the labelvectors and a 1Dclass label (-1 or 1). """means =
of..05 * np.random.randn(2, d)self.X_train =
np.zeros((N, d))self.Y_train = np.zeros(N)for i in
One of the first models that would be worthrange(N):if np.random.random() > .5:y = 1else:y
trying is logistic regression.= 0self.X_train[i, :] = np.random.random(d) +
Coding it upmeans[y, :]self.Y_train[i] = 2.0 * y - 1self.X_test =
I wasn't working on this exact problem, but I wasnp.zeros((N, d))self.Y_test = np.zeros(N)for i in
working on something close. Being one to practicerange(N):if np.random.randn() > .5:y = 1else:y =
what I preach, I started looking for a dead simple0self.X_test[i, :] = np.random.random(d) +
Python logistic regression class. The onlymeans[y, :]self.Y_test[i] = 2.0 * y - 1class
requirement is that I wanted it to support L2LogisticRegression():
regularization (more on this later). I'm also sharing""" A simple logistic regression model with L2
this code with a bunch of other people on manyregularization (zero-mean
platforms, so I wanted as few dependencies onGaussian priors on parameters). """def
external libraries as possible.__init__(self, x_train=None, y_train=None,
I couldn't find exactly what I wanted, so I decidedx_test=None, y_test=None,alpha=.1,
to take a stroll down memory lane and implementsynthetic=False):
it myself. I've written it in C++ and Matlab before# Set L2 regularization strengthself.alpha = alpha
but never in Python.# Set the data.self.set_data(x_train, y_train,
I won't do the derivation, but there are plenty ofx_test, y_test)
good explanations out there to follow if you're not# Initialize parameters to zero, for lack of a
afraid of a little calculus. Just do a little Googlingbetter choice.self.betas =
for "logistic regression derivation." The big idea isnp.zeros(self.x_train.shape[1])def negative_lik(self,
to write down the probability of the data givenbetas):return -1 * self.lik(betas)def lik(self, betas):
some setting of internal parameters, then to take""" Likelihood of the data under the current
the derivative, which will tell you how to changesettings of parameters. """
the internal parameters to make the data more# Data likelihoodl = 0for i in range(self.n):l +=
likely. Got it? Good.log(sigmoid(self.y_train[i] *np.dot(betas,
For those of you out there that know logisticself.x_train[i,:])))
regression inside and out, take a look at how# Prior likelihoodfor k in range(1,
short the train() method is. I really like how easyself.x_train.shape[1]):l -= (self.alpha / 2.0) *
it is to do in Python.self.betas[k]**2return ldef train(self):
Regularization""" Define the gradient and hand it off to a scipy
I caught a little indirect flak during March madnessgradient-basedoptimizer. """
season for talking about how I regularized the# Define the derivative of the likelihood with
latent vectors in my matrix-factorization model ofrespect to beta_k.
team offensive and defensive strengths when# Need to multiply by -1 because we will be
predicting outcomes in NCAA basketball.minimizing.dB_k = lambda B, k : np.sum([-self.alpha
Apparently people thought I was talking nonsense* B[k] +self.y_train[i] * self.x_train[i, k]
-- crazy, right?*sigmoid(-self.y_train[i] *np.dot(B,
But seriously, guys -- regularization is a good idea.self.x_train[i,:]))for i in range(self.n)]) * -1
Let me drive home the point. Take a look at the# The full gradient is just an array of
results of running the code (linked at the bottom).componentwise derivativesdB = lambda B :
Take a look at the top row.np.array([dB_k(B, k)for k in
On the left side, you have the training set. Thererange(self.x_train.shape[1])])
are 25 examples laid out along the x axis, and the# Optimizeself.betas = fmin_bfgs(self.negative_lik,
y axis tells you if the example is "on" (1) or "off"self.betas, fprime=dB)def set_data(self, x_train,
(0). For each of these examples, there's a vectory_train, x_test, y_test):
describing its attributes that I'm not showing.""" Take data that's already been generated.
After training the model, I ask the model to"""self.x_train = x_trainself.y_train =
ignore the known training set labels and toy_trainself.x_test = x_testself.y_test =
estimate the probability that each label is "on"y_testself.n = y_train.shape[0]def
based only on the examples's description vectorstraining_reconstruction(self):p_y1 =
and what the model has learned (hopefully thingsnp.zeros(self.n)for i in range(self.n):p_y1[i] =
like stronger earthquakes and older buildingssigmoid(np.dot(self.betas, self.x_train[i,:]))return
increase the likelihood of collapse). The probabilitiesp_y1def test_predictions(self):p_y1 =
are shown by the red X's. In the top left, the rednp.zeros(self.n)for i in range(self.n):p_y1[i] =
X's are right on top of the blue dots, so it is verysigmoid(np.dot(self.betas, self.x_test[i,:]))return
sure about the labels of the examples, and it'sp_y1def
always correct.), .5 + .5 * self.y_train, 'bo')plot(np.arange(self.n),
Now on the right side, we have some newself.training_reconstruction(), 'rx')ylim([-.1, 1.1])def
examples that the model hasn't seen before. Thisplot_test_predictions(self):plot(np.arange(self.n), .5
is called the test set. This is essentially the same+ .5 * self.y_test, 'yo')plot(np.arange(self.n),
as the left side, but the model knows nothingself.test_predictions(), 'rx')ylim([-.1, 1.1])if
about the test set class labels (yellow dots). What__name__ == "__main__":from pylab import *
you see is that it still does a decent job of# Create 20 dimensional data set with 25 points
predicting the labels, but there are some troubling-- this will be
cases where it is very confident and very wrong.# susceptible to overfitting.data =
This is known as overfitting.SyntheticClassifierData(25, 20)
This is where regularization comes in. As you go# Run for a variety of regularization
down the rows, there is stronger L2 regularizationstrengthsalphas = [0, .001, .01, .1]for j, a in
-- or equivalently, pressure on the internalenumerate(alphas):
parameters to be zero. This has the effect of# Create a new learner, but use the same data
reducing the model's certainty. Just because it canfor each runlr =
perfectly reconstruct the training set doesn'tLogisticRegression(x_train=data.X_train,
mean that it has everything figured out. You cany_train=data.Y_train,x_test=data.X_test,
imagine that if you were relying on this model toy_test=data.Y_test,alpha=a)print "Initial
make important decisions, it would be desirable tolikelihood:"print lr.lik(lr.betas)
have at least a bit of regularization in there.# Train the modellr.train()
And here's the code. It looks long, but most of it# Display execution infoprint "Final betas:"print
is to generate the data and plot the results. Thelr.betasprint "Final lik:"print lr.lik(lr.betas)
bulk of the work is done in the train() method,# Plot the resultssubplot(len(alphas), 2, 2*j +
which is only three (dense) lines. It requires1)lr.plot_training_reconstruction()ylabel("Alpha=%s"
numpy, scipy, and pylab.% a)if j == 0:title("Training set
* For full disclosure, I should admit that Ireconstructions")subplot(len(alphas), 2, 2*j + 2)lr.
generated my random data in a way such that it