.. currentmodule:: statsmodels.stats.contingency_tables

.. _contingency_tables:


Contingency tables
==================

Statsmodels supports a variety of approaches for analyzing contingency
tables, including methods for assessing independence, symmetry, and
homogeneity, as well as methods for working with collections of tables
from a stratified population.

The methods described here are mainly for two-way tables.  Multi-way
tables can be analyzed using log-linear models.  Statsmodels does not
currently have a dedicated API for loglinear modeling, but Poisson
regression in :class:`statsmodels.genmod.GLM` can be used for this
purpose.

A contingency table is a multi-way table that describes a data set in
which each observation belongs to one category for each of several
variables.  For example, if there are two variables, one with
:math:`r` levels and one with :math:`c` levels, then we have a
:math:`r \times c` contingency table.  The table can be described in
terms of the number of observations that fall into a given cell of the
table, e.g. :math:`T_{ij}` is the number of observations that have
level :math:`i` for the first variable and level :math:`j` for the
second variable.  Note that each variable must have a finite number of
levels (or categories), which can be either ordered or unordered.  In
different contexts, the variables defining the axes of a contingency
table may be called **categorical variables** or **factor variables**.
They may be either **nominal** (if their levels are unordered) or
**ordinal** (if their levels are ordered).
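
In pandas, this distinction can be made explicit with categorical data
types; for example (illustrative values only):

```python
import pandas as pd

# A nominal factor: levels have no intrinsic order.
color = pd.Categorical(["red", "blue", "red"], ordered=False)

# An ordinal factor: levels carry an explicit order.
improved = pd.Categorical(["None", "Marked", "Some"],
                          categories=["None", "Some", "Marked"],
                          ordered=True)
```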

The underlying population for a contingency table is described by a
**distribution table** :math:`P_{i, j}`.  The elements of :math:`P`
are probabilities, and the sum of all elements in :math:`P` is 1.
Methods for analyzing contingency tables use the data in :math:`T` to
learn about properties of :math:`P`.
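
For example, with a hypothetical count table :math:`T`, the empirical
cell proportions :math:`T_{ij}/n` provide the natural estimate of
:math:`P`:

```python
import numpy as np

# Hypothetical 2x3 table of observed counts T.
T = np.asarray([[29, 7, 7], [13, 7, 21]])

# Empirical estimate of the distribution table P: proportions summing to 1.
P_hat = T / T.sum()

# Marginal distributions of the row and column factors.
row_margin = P_hat.sum(axis=1)
col_margin = P_hat.sum(axis=0)
```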

:class:`statsmodels.stats.Table` is the most basic class for
working with contingency tables.  We can create a ``Table`` object
directly from any rectangular array-like object containing the
contingency table cell counts:

.. ipython:: python

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = sm.datasets.get_rdataset("Arthritis", "vcd").data

    tab = pd.crosstab(df['Treatment'], df['Improved'])
    tab = tab.loc[:, ["None", "Some", "Marked"]]
    table = sm.stats.Table(tab)

Alternatively, we can pass the raw data and let the Table class
construct the array of cell counts for us:

.. ipython:: python

    data = df[["Treatment", "Improved"]]
    table = sm.stats.Table.from_data(data)


Independence
------------

**Independence** is the property that the row and column factors occur
independently. **Association** is the lack of independence.  If the
joint distribution is independent, it can be written as the outer
product of the row and column marginal distributions:

.. math::

    P_{ij} = \left(\sum_k P_{ik}\right) \cdot \left(\sum_k P_{kj}\right) \quad \text{for all} \quad  i, j

We can obtain the best-fitting independent distribution for our
observed data, and then view residuals which identify particular cells
that most strongly violate independence:

.. ipython:: python

    print(table.table_orig)
    print(table.fittedvalues)
    print(table.resid_pearson)

In this example, compared to a sample from a population in which the
rows and columns are independent, we have too many observations in the
placebo/no improvement and treatment/marked improvement cells, and too
few observations in the placebo/marked improvement and treated/no
improvement cells.  This reflects the apparent benefits of the
treatment.

If the rows and columns of a table are unordered (i.e. are nominal
factors), then the most common approach for formally assessing
independence is using Pearson's :math:`\chi^2` statistic.  It's often
useful to look at the cell-wise contributions to the :math:`\chi^2`
statistic to see where the evidence for dependence is coming from.

.. ipython:: python

    rslt = table.test_nominal_association()
    print(rslt.pvalue)
    print(table.chi2_contribs)

For tables with ordered row and column factors, we can use the **linear
by linear** association test to obtain more power against alternative
hypotheses that respect the ordering.  The test statistic for the
linear by linear association test is

.. math::

    \sum_{ij} r_i c_j T_{ij}

where :math:`r_i` and :math:`c_j` are row and column scores.  Often
these scores are set to the sequences 0, 1, ....  This gives the
'Cochran-Armitage trend test'.

.. ipython:: python

    rslt = table.test_ordinal_association()
    print(rslt.pvalue)
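
The statistic itself is easy to compute directly.  The sketch below
uses a hypothetical 2x3 count table and the default-style scores
0, 1, ...; ``test_ordinal_association`` accepts ``row_scores`` and
``col_scores`` arguments for other choices.

```python
import numpy as np

# Hypothetical 2x3 table of counts.
T = np.asarray([[29, 7, 7], [13, 7, 21]])

# Row and column scores 0, 1, ... (the Cochran-Armitage-style default).
r = np.arange(T.shape[0])
c = np.arange(T.shape[1])

# Linear-by-linear statistic: sum over cells of r_i * c_j * T_ij.
stat = (r[:, None] * c[None, :] * T).sum()
```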

We can assess the association in an :math:`r\times c` table by
constructing a series of :math:`2\times 2` tables and calculating
their odds ratios.  There are two ways to do this.  The **local odds
ratios** construct :math:`2\times 2` tables from adjacent row and
column categories.

.. ipython:: python

    print(table.local_oddsratios)
    taloc = sm.stats.Table2x2(np.asarray([[29, 7], [13, 7]]))
    print(taloc.oddsratio)
    taloc = sm.stats.Table2x2(np.asarray([[7, 7], [7, 21]]))
    print(taloc.oddsratio)

The **cumulative odds ratios** construct :math:`2\times 2` tables by
dichotomizing the row and column factors at each possible point.

.. ipython:: python

    print(table.cumulative_oddsratios)
    tab1 = np.asarray([[29, 7 + 7], [13, 7 + 21]])
    tacum = sm.stats.Table2x2(tab1)
    print(tacum.oddsratio)
    tab1 = np.asarray([[29 + 7, 7], [13 + 7, 21]])
    tacum = sm.stats.Table2x2(tab1)
    print(tacum.oddsratio)

A mosaic plot is a graphical approach to informally assessing
dependence in two-way tables.

.. ipython:: python

    from statsmodels.graphics.mosaicplot import mosaic
    fig, _ = mosaic(data, index=["Treatment", "Improved"])


Symmetry and homogeneity
------------------------

**Symmetry** is the property that :math:`P_{i, j} = P_{j, i}` for
every :math:`i` and :math:`j`.  **Homogeneity** is the property that
the marginal distribution of the row factor and the column factor are
identical, meaning that

.. math::

    \sum_j P_{ij} = \sum_j P_{ji} \quad \text{for all} \quad i

Note that for these properties to be applicable the table :math:`P`
(and :math:`T`) must be square, and the row and column categories must
be identical and must occur in the same order.

To illustrate, we load a data set, create a contingency table, and
calculate the row and column margins.  The :class:`Table` class
contains methods for analyzing :math:`r \times c` contingency tables.
The data set loaded below contains assessments of visual acuity in
people's left and right eyes.  We first load the data and create a
contingency table.

.. ipython:: python

    df = sm.datasets.get_rdataset("VisualAcuity", "vcd").data
    df = df.loc[df.gender == "female", :]
    tab = df.set_index(['left', 'right'])
    del tab["gender"]
    tab = tab.unstack()
    tab.columns = tab.columns.get_level_values(1)
    print(tab)

Next we create a :class:`SquareTable` object from the contingency
table.

.. ipython:: python

    sqtab = sm.stats.SquareTable(tab)
    row, col = sqtab.marginal_probabilities
    print(row)
    print(col)


The ``summary`` method prints results for the symmetry and homogeneity
testing procedures.

.. ipython:: python

    print(sqtab.summary())

If we had the individual case records in a dataframe called ``data``,
we could also perform the same analysis by passing the raw data using
the ``SquareTable.from_data`` class method.

::

    sqtab = sm.stats.SquareTable.from_data(data[['left', 'right']])
    print(sqtab.summary())


A single 2x2 table
------------------

Several methods for working with individual 2x2 tables are provided in
the :class:`Table2x2` class.  The ``summary`` method displays
several measures of association between the rows and columns of the
table.

.. ipython:: python

    table = np.asarray([[35, 21], [25, 58]])
    t22 = sm.stats.Table2x2(table)
    print(t22.summary())

Note that the risk ratio is not symmetric, so different results will be
obtained if the transposed table is analyzed.

.. ipython:: python

    table = np.asarray([[35, 21], [25, 58]])
    t22 = sm.stats.Table2x2(table.T)
    print(t22.summary())


Stratified 2x2 tables
---------------------

Stratification occurs when we have a collection of contingency tables
defined by the same row and column factors.  In the example below, we
have a collection of 2x2 tables reflecting the joint distribution of
smoking and lung cancer in each of several regions of China.  It is
possible that the tables all have a common odds ratio, even while the
marginal probabilities vary among the strata.  The 'Breslow-Day'
procedure tests whether the data are consistent with a common odds
ratio.  It appears below as the `Test of constant OR`.  The
Mantel-Haenszel procedure tests whether this common odds ratio is
equal to one.  It appears below as the `Test of OR=1`.  It is also
possible to estimate the common odds and risk ratios and obtain
confidence intervals for them.  The ``summary`` method displays all of
these results.  Individual results can be obtained from the class
methods and attributes.

.. ipython:: python

    data = sm.datasets.china_smoking.load_pandas()

    mat = np.asarray(data.data)
    tables = [np.reshape(x.tolist(), (2, 2)) for x in mat]

    st = sm.stats.StratifiedTable(tables)
    print(st.summary())


Module Reference
----------------

.. module:: statsmodels.stats.contingency_tables
   :synopsis: Contingency table analysis

.. currentmodule:: statsmodels.stats.contingency_tables

.. autosummary::
   :toctree: generated/

   Table
   Table2x2
   SquareTable
   StratifiedTable
   mcnemar
   cochrans_q
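
The :func:`mcnemar` and :func:`cochrans_q` functions listed above can
be called directly on arrays of counts.  For example, a sketch of
McNemar's test for marginal homogeneity of a paired 2x2 table, with
hypothetical counts:

```python
import numpy as np
import statsmodels.stats.contingency_tables as ct

# Hypothetical paired 2x2 table; the off-diagonal (discordant) cells
# carry the information for the test.
table = np.asarray([[101, 121], [59, 33]])

rslt = ct.mcnemar(table, exact=False, correction=True)
print(rslt.statistic, rslt.pvalue)
```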

See also
--------

Scipy_ has several functions for analyzing contingency tables,
including Fisher's exact test, which is not currently available in
Statsmodels.

.. _Scipy: https://docs.scipy.org/doc/scipy/reference/stats.html#contingency-table-functions