Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.987
Model:                            OLS   Adj. R-squared:                  0.987
Method:                 Least Squares   F-statistic:                     1199.
Date:                Sat, 21 Nov 2020   Prob (F-statistic):           1.17e-43
Time:                        06:55:42   Log-Likelihood:                 6.8591
No. Observations:                  50   AIC:                            -5.718
Df Residuals:                      46   BIC:                             1.930
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9101      0.075     65.501      0.000       4.759       5.061
x1             0.5045      0.012     43.634      0.000       0.481       0.528
x2             0.5178      0.045     11.394      0.000       0.426       0.609
x3            -0.0201      0.001    -19.774      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        3.619   Durbin-Watson:                   2.064
Prob(Omnibus):                  0.164   Jarque-Bera (JB):                2.597
Skew:                          -0.513   Prob(JB):                        0.273
Kurtosis:                       3.442   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
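The warning above notes that the reported standard errors assume a correctly specified error covariance. As an aside (not part of the original example), the same model can be refit with a heteroskedasticity-robust covariance estimator via the cov_type argument; the choice of HC3 below is illustrative:

olsres_robust = sm.OLS(y, X).fit(cov_type="HC3")  # robust (HC3) covariance; coefficients are unchanged
print(olsres_robust.bse)                          # robust standard errors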

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.40832498  4.89835336  5.3479237   5.7288141   6.0229878   6.22555656
  6.34558379  6.40459546  6.43304343  6.46530232  6.53402183  6.66476251
  6.87179622  7.1557613   7.50355772  7.89049933  8.28436977  8.65071552
  8.95850647  9.18523416  9.32060977  9.36825461  9.34510511  9.27863023
  9.20231839  9.15017541  9.15113743  9.22431684  9.37586211  9.59794765
  9.8700595  10.16236576 10.44061784 10.67177745 10.82944466 10.89819535
 10.87611537 10.77511497 10.61897206 10.4394255  10.27096051 10.14514289
 10.08543279 10.1033303  10.19648602 10.34908706 10.53445503 10.71942927
 10.869815   10.95600147]
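As a quick sanity check (an addition to the original example), predicting on the estimation sample reproduces the fitted values stored on the results object:

print(np.allclose(ypred, olsres.fittedvalues))  # True: in-sample predictions equal the fitted values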

Create a new sample of explanatory variables Xnew, predict, and plot.

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.94557829 10.79873555 10.53578105 10.20299365  9.86129262  9.57132246
  9.37860504  9.30239423  9.33096189  9.42446934]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any of the formula language's expansion magic applied to **2.
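To see what the formula actually builds, one option (an illustrative sketch using patsy, the formula engine statsmodels relies on) is to construct the design matrix directly and inspect its columns:

from patsy import dmatrix

design = dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data)
print(design.design_info.column_names)  # ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)']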

In [8]:
res.params
Out[8]:
Intercept           4.910118
x1                  0.504463
np.sin(x1)          0.517841
I((x1 - 5) ** 2)   -0.020072
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.945578
1    10.798736
2    10.535781
3    10.202994
4     9.861293
5     9.571322
6     9.378605
7     9.302394
8     9.330962
9     9.424469
dtype: float64
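
To get uncertainty around the out-of-sample predictions as well, the results object exposes get_prediction; the sketch below (the alpha level is illustrative) returns point predictions together with confidence intervals for the predicted mean:

pred = res.get_prediction(exog=dict(x1=x1n))  # same new data as above
print(pred.summary_frame(alpha=0.05))         # columns include mean, mean_ci_lower, mean_ci_upper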