The Datasets Package
statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.
Using Datasets from Stata
Using Datasets from R
The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function. The actual data is accessible through the data attribute. For example:
In [1]: import statsmodels.api as sm
In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
In [3]: print(duncan_prestige.__doc__)
+----------+-------------------+
| Duncan | R Documentation |
+----------+-------------------+
Duncan's Occupational Prestige Data
-----------------------------------
Description
~~~~~~~~~~~
The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
prestige and other characteristics of 45 U. S. occupations in 1950.
Usage
~~~~~
::
Duncan
Format
~~~~~~
This data frame contains the following columns:
type
Type of occupation. A factor with the following levels: ``prof``,
professional and managerial; ``wc``, white-collar; ``bc``,
blue-collar.
income
Percent of males in occupation earning $3500 or more in 1950.
education
Percent of males in occupation in 1950 who were high-school
graduates.
prestige
Percent of raters in NORC study rating occupation as excellent or
good in prestige.
Source
~~~~~~
Duncan, O. D. (1961) A socioeconomic index for all occupations. In
Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free Press
[Table VI-1].
References
~~~~~~~~~~
Fox, J. (2008) *Applied Regression Analysis and Generalized Linear
Models*, Second Edition. Sage.
Fox, J. and Weisberg, S. (2011) *An R Companion to Applied Regression*,
Second Edition, Sage.
In [4]: duncan_prestige.data.head(5)
Out[4]:
type income education prestige
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90
R Datasets Function Reference
Available Datasets
- American National Election Survey 1996
- Breast Cancer Data
- Bill Greene’s credit scoring data.
- Mauna Loa Weekly Atmospheric CO2 Data
- First 100 days of the US House of Representatives 1995
- World Copper Market 1951-1975 Dataset
- US Capital Punishment dataset.
- El Nino - Sea Surface Temperatures
- Engel (1857) food expenditure data
- Affairs dataset
- Grunfeld (1950) Investment Data
- Transplant Survival Data
- Longley dataset
- United States Macroeconomic data
- Travel Mode Choice
- Nile River flows at Aswan 1871-1970
- RAND Health Insurance Experiment Data
- Taxation Powers Vote for the Scottish Parliament 1997
- Spector and Mazzeo (1980) - Program Effectiveness Data
- Stack loss data
- Star98 Educational Dataset
- Statewide Crime Data 2009
- U.S. Strike Duration Data
- Yearly sunspots data 1700-2008
Usage
Load a dataset:
In [5]: import statsmodels.api as sm
In [6]: data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern explained in the datasets proposal: values are stored as in a dictionary and are also reachable as attributes. The full dataset is available in the data attribute.
In [7]: data.data
Out[7]:
rec.array([(60323.0, 83.0, 234289.0, 2356.0, 1590.0, 107608.0, 1947.0),
(61122.0, 88.5, 259426.0, 2325.0, 1456.0, 108632.0, 1948.0),
(60171.0, 88.2, 258054.0, 3682.0, 1616.0, 109773.0, 1949.0),
(61187.0, 89.5, 284599.0, 3351.0, 1650.0, 110929.0, 1950.0),
(63221.0, 96.2, 328975.0, 2099.0, 3099.0, 112075.0, 1951.0),
(63639.0, 98.1, 346999.0, 1932.0, 3594.0, 113270.0, 1952.0),
(64989.0, 99.0, 365385.0, 1870.0, 3547.0, 115094.0, 1953.0),
(63761.0, 100.0, 363112.0, 3578.0, 3350.0, 116219.0, 1954.0),
(66019.0, 101.2, 397469.0, 2904.0, 3048.0, 117388.0, 1955.0),
(67857.0, 104.6, 419180.0, 2822.0, 2857.0, 118734.0, 1956.0),
(68169.0, 108.4, 442769.0, 2936.0, 2798.0, 120445.0, 1957.0),
(66513.0, 110.8, 444546.0, 4681.0, 2637.0, 121950.0, 1958.0),
(68655.0, 112.6, 482704.0, 3813.0, 2552.0, 123366.0, 1959.0),
(69564.0, 114.2, 502601.0, 3931.0, 2514.0, 125368.0, 1960.0),
(69331.0, 115.7, 518173.0, 4806.0, 2572.0, 127852.0, 1961.0),
(70551.0, 116.9, 554894.0, 4007.0, 2827.0, 130081.0, 1962.0)],
dtype=[('TOTEMP', '<f8'), ('GNPDEFL', '<f8'), ('GNP', '<f8'), ('UNEMP', '<f8'), ('ARMED', '<f8'), ('POP', '<f8'), ('YEAR', '<f8')])
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [8]: data.endog[:5]
Out[8]: array([ 60323., 61122., 60171., 61187., 63221.])
In [9]: data.exog[:5,:]
Out[9]:
array([[ 83. , 234289. , 2356. , 1590. , 107608. , 1947. ],
[ 88.5, 259426. , 2325. , 1456. , 108632. , 1948. ],
[ 88.2, 258054. , 3682. , 1616. , 109773. , 1949. ],
[ 89.5, 284599. , 3351. , 1650. , 110929. , 1950. ],
[ 96.2, 328975. , 2099. , 3099. , 112075. , 1951. ]])
Univariate datasets, however, do not have an exog attribute.
Variable names can be obtained by typing:
In [10]: data.endog_name
Out[10]: 'TOTEMP'
In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.
In [12]: type(data.data)
Out[12]: numpy.core.records.recarray
In [13]: type(data.raw_data)
Out[13]: numpy.ndarray
In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
Loading data as pandas objects
For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules is equipped with a load_pandas function which returns a Dataset instance with the data readily available as pandas objects:
In [15]: data = sm.datasets.longley.load_pandas()
In [16]: data.exog
Out[16]:
GNPDEFL GNP UNEMP ARMED POP YEAR
0 83.0 234289 2356 1590 107608 1947
1 88.5 259426 2325 1456 108632 1948
2 88.2 258054 3682 1616 109773 1949
3 89.5 284599 3351 1650 110929 1950
4 96.2 328975 2099 3099 112075 1951
5 98.1 346999 1932 3594 113270 1952
6 99.0 365385 1870 3547 115094 1953
7 100.0 363112 3578 3350 116219 1954
8 101.2 397469 2904 3048 117388 1955
9 104.6 419180 2822 2857 118734 1956
10 108.4 442769 2936 2798 120445 1957
11 110.8 444546 4681 2637 121950 1958
12 112.6 482704 3813 2552 123366 1959
13 114.2 502601 3931 2514 125368 1960
14 115.7 518173 4806 2572 127852 1961
15 116.9 554894 4007 2827 130081 1962
In [17]: data.endog
Out[17]:
0 60323
1 61122
2 60171
3 61187
4 63221
5 63639
6 64989
7 63761
8 66019
9 67857
10 68169
11 66513
12 68655
13 69564
14 69331
15 70551
Name: TOTEMP, dtype: float64
The full DataFrame is available in the data attribute of the Dataset object:
In [18]: data.data
Out[18]:
TOTEMP GNPDEFL GNP UNEMP ARMED POP YEAR
0 60323 83.0 234289 2356 1590 107608 1947
1 61122 88.5 259426 2325 1456 108632 1948
2 60171 88.2 258054 3682 1616 109773 1949
3 61187 89.5 284599 3351 1650 110929 1950
4 63221 96.2 328975 2099 3099 112075 1951
5 63639 98.1 346999 1932 3594 113270 1952
6 64989 99.0 365385 1870 3547 115094 1953
7 63761 100.0 363112 3578 3350 116219 1954
8 66019 101.2 397469 2904 3048 117388 1955
9 67857 104.6 419180 2822 2857 118734 1956
10 68169 108.4 442769 2936 2798 120445 1957
11 66513 110.8 444546 4681 2637 121950 1958
12 68655 112.6 482704 3813 2552 123366 1959
13 69564 114.2 502601 3931 2514 125368 1960
14 69331 115.7 518173 4806 2572 127852 1961
15 70551 116.9 554894 4007 2827 130081 1962
With pandas integration in the estimation classes, the metadata (such as variable names) is attached to the model results.
Extra Information
If you want to know more about the dataset itself, you can access the following, again using the Longley dataset as an example:
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
Additional information
- The idea for a datasets package was originally proposed by David Cournapeau, with updates by Skipper Seabold.
- To add datasets, see the notes on adding a dataset.