Robust Linear Models¶

%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std

Estimation¶

Load data:

data = sm.datasets.stackloss.load(as_pandas=False)
data.exog = sm.add_constant(data.exog)

Huber's T norm with the (default) median absolute deviation scaling

huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
hub_results = huber_t.fit()
print(hub_results.params)
print(hub_results.bse)
print(hub_results.summary(yname='y',
            xname=['var_%d' % i for i in range(len(hub_results.params))]))

[-41.02649835   0.82938433   0.92606597  -0.12784672]
[9.79189854 0.11100521 0.30293016 0.12864961]
                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                      y   No. Observations:                   21
Model:                            RLM   Df Residuals:                       17
Method:                          IRLS   Df Model:                            3
Norm:                          HuberT                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Fri, 19 Jul 2019                                         
Time:                        16:52:24                                         
No. Iterations:                    19                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
var_0        -41.0265      9.792     -4.190      0.000     -60.218     -21.835
var_1          0.8294      0.111      7.472      0.000       0.612       1.047
var_2          0.9261      0.303      3.057      0.002       0.332       1.520
var_3         -0.1278      0.129     -0.994      0.320      -0.380       0.124
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore .

Huber's T norm with 'H2' covariance matrix

hub_results2 = huber_t.fit(cov="H2")
print(hub_results2.params)
print(hub_results2.bse)

[-41.02649835   0.82938433   0.92606597  -0.12784672]
[9.08950419 0.11945975 0.32235497 0.11796313]

Andrew's Wave norm with Huber's Proposal 2 scaling and 'H3' covariance matrix

andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
print('Parameters: ', andrew_results.params)

Parameters:  [-40.8817957    0.79276138   1.04857556  -0.13360865]

See help(sm.RLM.fit) for more options and module sm.robust.scale for scale options

Comparing OLS and RLM¶

Artificial data with outliers:

nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1-5)**2))
X = sm.add_constant(X)
sig = 0.3   # smaller error variance makes OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig*1. * np.random.normal(size=nsample)
y2[[39,41,43,45,48]] -= 5   # add some outliers (10% of nsample)

Example 1: quadratic function with linear truth¶

Note that the quadratic term in OLS regression will capture outlier effects.

res = sm.OLS(y2, X).fit()
print(res.params)
print(res.bse)
print(res.predict())

[ 5.21834986  0.48921255 -0.01103338]
[0.43666444 0.0674151  0.0059652 ]
[ 4.94251529  5.18538997  5.42458839  5.66011054  5.89195643  6.12012606
  6.34461942  6.56543653  6.78257737  6.99604194  7.20583026  7.41194231
  7.6143781   7.81313763  8.0082209   8.1996279   8.38735864  8.57141312
  8.75179133  8.92849329  9.10151898  9.2708684   9.43654157  9.59853847
  9.75685911  9.91150349 10.06247161 10.20976346 10.35337905 10.49331838
 10.62958144 10.76216825 10.89107879 11.01631307 11.13787108 11.25575283
 11.36995832 11.48048755 11.58734052 11.69051722 11.79001766 11.88584184
 11.97798976 12.06646141 12.1512568  12.23237593 12.30981879 12.3835854
 12.45367574 12.52008982]

Estimate RLM:

resrlm = sm.RLM(y2, X).fit()
print(resrlm.params)
print(resrlm.bse)

[ 5.15821804e+00  4.73158546e-01 -7.81084559e-04]
[0.1242246  0.0191786  0.00169701]

Draw a plot to compare OLS estimates to the robust estimates:

fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.plot(x1, y2, 'o',label="data")
ax.plot(x1, y_true2, 'b-', label="True")
prstd, iv_l, iv_u = wls_prediction_std(res)
ax.plot(x1, res.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm.fittedvalues, 'g.-', label="RLM")
ax.legend(loc="best")

<matplotlib.legend.Legend at 0x7f2d29388320>

Example 2: linear function with linear truth¶

Fit a new OLS model using only the linear term and the constant:

X2 = X[:,[0,1]]
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)

[5.66306274 0.37887873]
[0.37358961 0.03218999]

Estimate RLM:

resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)

[5.18401372 0.4663218 ]
[0.101559   0.00875073]

Draw a plot to compare OLS estimates to the robust estimates:

prstd, iv_l, iv_u = wls_prediction_std(res2)

fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x1, y2, 'o', label="data")
ax.plot(x1, y_true2, 'b-', label="True")
ax.plot(x1, res2.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm2.fittedvalues, 'g.-', label="RLM")
legend = ax.legend(loc="best")