Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
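
Because y includes random noise, the exact numbers below will differ from run to run. A minimal sketch for reproducibility, assuming you want repeatable draws (the seed value is arbitrary):

np.random.seed(12345)  # arbitrary seed; fixes the noise draws across runs
y = y_true + sig * np.random.normal(size=nsample)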

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     598.9
Date:                Fri, 19 Jul 2019   Prob (F-statistic):           7.46e-37
Time:                        16:51:03   Log-Likelihood:                -9.1418
No. Observations:                  50   AIC:                             26.28
Df Residuals:                      46   BIC:                             33.93
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0345      0.103     48.768      0.000       4.827       5.242
x1             0.4938      0.016     31.014      0.000       0.462       0.526
x2             0.5429      0.063      8.674      0.000       0.417       0.669
x3            -0.0199      0.001    -14.224      0.000      -0.023      -0.017
==============================================================================
Omnibus:                        0.350   Durbin-Watson:                   1.998
Prob(Omnibus):                  0.839   Jarque-Bera (JB):                0.519
Skew:                           0.004   Prob(JB):                        0.772
Kurtosis:                       2.501   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
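
The fitted coefficients can be read off the results object and compared with the true beta used to generate the data; a quick check (output varies with the random draw):

print(olsres.params)  # estimated [const, x1, x2, x3], close to beta
print(beta)           # true coefficients: [5.0, 0.5, 0.5, -0.02]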

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.53743082  5.03230555  5.48515196  5.86638374  6.15709211  6.35215251
  6.46106654  6.50640084  6.52007953  6.5381391   6.59480773  6.71688173
  6.91932295  7.20280059  7.55358127  7.94578539  8.34563904  8.71702299
  9.02740697  9.2531943   9.38359874  9.42241668  9.38740337  9.30735578
  9.21738136  9.15312998  9.14493675  9.21283805  9.36327961  9.5880575
  9.86566601 10.16483095 10.44964798 10.6854815  10.84465526 10.91099946
 10.88250737 10.77166509 10.60340008 10.41098566 10.23057428 10.09525782
 10.02963063 10.0457488  10.14114934 10.29925487 10.4920972  10.68491247
 10.84185274 10.93187566]
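
predict(X) on the estimation sample simply reproduces the fitted values, so the in-sample residuals can be recovered either way; a quick consistency check (sketch):

print(np.allclose(y - ypred, olsres.resid))  # True: residuals = observed - fitted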

Create a new sample of explanatory variables Xnew, predict out of sample, and plot

In [6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.92106956 10.76782737 10.49343867 10.14641972  9.79063502  9.48966099
  9.29122031  9.2154977   9.25019802  9.35355652]
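
Point predictions alone carry no uncertainty information. The results object can also produce standard errors and intervals via get_prediction; a short sketch (the 0.05 alpha level is an arbitrary choice):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_se, confidence and prediction intervals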

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
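
For a more self-explanatory figure you might also label the axes; a purely cosmetic sketch:

ax.set_xlabel("x1")
ax.set_ylabel("y")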

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I() function to indicate the identity transform, i.e., we don't want any expansion magic from using **2 in the formula.
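
To see the expansion that would otherwise occur, apply ** to a sum of terms; a small illustration, assuming patsy (the formula backend statsmodels uses) is importable:

import patsy

# In a formula, ** is the interaction-power operator: (a + b)**2
# expands to a + b + a:b instead of squaring the values.
dm = patsy.dmatrix("(x1 + np.sin(x1))**2", data)
print(dm.design_info.column_names)
# e.g. ['Intercept', 'x1', 'np.sin(x1)', 'x1:np.sin(x1)']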

In [9]:
res.params
Out[9]:
Intercept           5.034542
x1                  0.493786
np.sin(x1)          0.542877
I((x1 - 5) ** 2)   -0.019884
dtype: float64
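
Apart from the term names, these estimates match the array-based fit above; a quick check (sketch):

print(np.allclose(res.params.values, olsres.params))  # True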

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are generated automatically.

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    10.921070
1    10.767827
2    10.493439
3    10.146420
4     9.790635
5     9.489661
6     9.291220
7     9.215498
8     9.250198
9     9.353557
dtype: float64
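
Since the formula's transformations are applied to x1n automatically, these values match ynewpred from the array-based model; a quick check (sketch):

print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))  # True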