endog, exog, what’s that?

statsmodels uses endog and exog as names for the data, that is, the observed variables that are used in an estimation problem. Other names that are often used in different statistical packages or textbooks are, for example:

endog                   exog
y                       x
y variable              x variable
left hand side (LHS)    right hand side (RHS)
dependent variable      independent variable
regressand              regressors
outcome                 design
response variable       explanatory variable

The usage is quite often domain and model specific; however, we have chosen to use endog and exog almost exclusively. A mnemonic hint to keep the two terms apart is that exogenous has an “x”, as in x-variable, in its name.

x and y are one-letter names that are sometimes used for temporary variables and are not informative in themselves. To avoid one-letter names, we decided to use descriptive names and settled on endog and exog. Since this choice has been criticized, it might change in the future.

Background

Some informal definitions of the terms are

endogenous: caused by factors within the system

exogenous: caused by factors outside the system

Endogenous variables designate variables in an economic/econometric model that are explained, or predicted, by that model. http://stats.oecd.org/glossary/detail.asp?ID=794

Exogenous variables designate variables that appear in an economic/econometric model, but are not explained by that model (i.e. they are taken as given by the model). http://stats.oecd.org/glossary/detail.asp?ID=890

In econometrics and statistics the terms are defined more formally, and different definitions of exogeneity (weak, strong, strict) are used depending on the model. The usage in statsmodels as variable names cannot always be interpreted in a formal sense, but tries to follow the same principle.

In the simplest form, a model relates an observed variable, y, to another set of variables, x, in some linear or nonlinear form:

y = f(x, beta) + noise
y = x * beta + noise

However, to have a statistical model we need additional assumptions on the properties of the explanatory variables, x, and the noise. One standard assumption for many basic models is that x is not correlated with the noise. In a more general definition, x being exogenous means that we do not have to consider how the explanatory variables in x were generated, whether by design or by random draws from some underlying distribution, when we want to estimate the effect or impact that x has on y, or test a hypothesis about this effect.

In other words, y is endogenous to our model, and x is exogenous to our model for the purposes of estimation.
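The role of the "x is not correlated with the noise" assumption can be illustrated with a small simulation. This is only a sketch under assumed values (a true slope of 2.0, normal draws, and a hypothetical helper `ols_slope`), not part of statsmodels. It compares the least squares slope when x is drawn independently of the noise with the slope when x shares a component with the noise.

```python
import numpy as np

rng = np.random.default_rng(12345)
nobs = 20000
beta = 2.0  # true slope (assumed value for the illustration)


def ols_slope(x, y):
    """Least squares slope of y on x, with an intercept included."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


noise = rng.normal(size=nobs)

# Case 1: x is generated independently of the noise.
x_exog = rng.normal(size=nobs)
slope_exog = ols_slope(x_exog, x_exog * beta + noise)

# Case 2: x contains part of the noise, so the two are correlated.
x_corr = rng.normal(size=nobs) + 0.8 * noise
slope_corr = ols_slope(x_corr, x_corr * beta + noise)

print(slope_exog)  # close to the true slope
print(slope_corr)  # systematically larger than the true slope
```

In the first case the way x was generated can be ignored when estimating beta. In the second case it cannot, because the estimate picks up the correlation between x and the noise even in a large sample.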

As an example, suppose you run an experiment and, for the second session, some subjects are no longer available. Is the drop-out relevant for the conclusions you draw from the experiment? In other words, can we treat the drop-out decision as exogenous for our problem?

It is up to the user to know (or to consult a textbook to find out) what the underlying statistical assumptions of the models are. For example, exog in OLS can include lagged dependent variables if the error or noise term is independently distributed over time (or uncorrelated over time). However, if the error terms are autocorrelated, then OLS with lagged dependent variables does not have good statistical properties (it is inconsistent), and a more appropriate model is ARMAX. statsmodels has functions for regression diagnostics to test whether some of the assumptions are justified.