Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
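
Note that y includes a random noise draw, so re-running this cell produces slightly different data. A minimal sketch if a reproducible draw is wanted (the seed value here is arbitrary and purely illustrative):

np.random.seed(12345)  # arbitrary seed, only for reproducibility of the noise
y = y_true + sig * np.random.normal(size=nsample)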

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     885.1
Date:                Mon, 24 Jun 2019   Prob (F-statistic):           1.13e-40
Time:                        17:25:18   Log-Likelihood:                -1.3477
No. Observations:                  50   AIC:                             10.70
Df Residuals:                      46   BIC:                             18.34
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9304      0.088     55.816      0.000       4.753       5.108
x1             0.5071      0.014     37.226      0.000       0.480       0.535
x2             0.4869      0.054      9.092      0.000       0.379       0.595
x3            -0.0199      0.001    -16.620      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        2.243   Durbin-Watson:                   2.286
Prob(Omnibus):                  0.326   Jarque-Bera (JB):                2.048
Skew:                           0.482   Prob(JB):                        0.359
Kurtosis:                       2.771   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
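
The summary above is display-only; the same quantities are available programmatically as attributes of the results object. A short sketch:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors of the coefficients
print(olsres.conf_int())  # 95% confidence intervals by default
print(olsres.rsquared)    # R-squared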

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.43345487  4.91154935  5.35126612  5.72606828  6.01899596  6.22545276
  6.35396097  6.42476133  6.46648759  6.511462    6.59038491  6.72729076
  6.93559932  7.21591082  7.55590735  7.93237667  8.31502591  8.67145859
  8.97249728  9.19697738  9.33522437  9.39064334  9.37915955  9.3266019
  9.26445938  9.22470747  9.2345547   9.31197232  9.46274171  9.67950452
  9.94297158 10.2250921  10.49366232 10.71761651 10.87213096 10.94270245
 10.92753109 10.83781606 10.69591558 10.53167354 10.37751623 10.26312477
 10.21055864 10.2306313  10.32113364 10.46719686 10.64373543 10.81956854
 10.96254303 11.04481606]
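
Passing the original design matrix here is optional: with no argument, predict uses the exog from estimation, and the same values are stored on the results object as fittedvalues. A minimal check:

assert np.allclose(olsres.predict(), olsres.fittedvalues)
assert np.allclose(olsres.predict(), ypred)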

Create a new sample of explanatory variables Xnew, predict out of sample, and plot

In [6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # out-of-sample prediction
print(ynewpred)
[11.03606797 10.89854225 10.65133418 10.33795955 10.01570043  9.74158052
  9.5584038   9.48427457  9.50816493  9.59261478]
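
These are point predictions only. A minimal sketch of obtaining confidence and prediction intervals via the results object's get_prediction method (the alpha value is illustrative):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_ci_*, obs_ci_* columns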

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas makes both estimation and prediction a lot easier.

In [8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I to indicate use of the identity transform; i.e., we don't want any formula expansion magic from using **2. The expression inside I(...) is evaluated as ordinary Python.
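
To see how the formula maps onto design-matrix columns, one can call patsy (the formula engine statsmodels uses) directly; a minimal sketch:

import patsy
dm = patsy.dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data)
print(dm.design_info.column_names)
# ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)']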

In [9]:
res.params
Out[9]:
Intercept           4.930441
x1                  0.507136
np.sin(x1)          0.486924
I((x1 - 5) ** 2)   -0.019879
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    11.036068
1    10.898542
2    10.651334
3    10.337960
4    10.015700
5     9.741581
6     9.558404
7     9.484275
8     9.508165
9     9.592615
dtype: float64
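
Because the fitted formula stores its transformations, the new data can equivalently be supplied as a pandas DataFrame with an x1 column; a brief sketch:

import pandas as pd
print(res.predict(pd.DataFrame({"x1": x1n})))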