The Datasets Package¶
statsmodels
provides data sets (i.e. data and meta-data) for use in
examples, tutorials, model testing, etc.
Using Datasets from Stata¶
webuse (data[, baseurl, as_df]) |
Download and return an example dataset from Stata. |
Using Datasets from R¶
The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels by using the get_rdataset
function. The actual data is accessible by the data
attribute. For example:
In [1]: import statsmodels.api as sm
In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
In [3]: print(duncan_prestige.__doc__)
+----------+-------------------+
| Duncan | R Documentation |
+----------+-------------------+
Duncan's Occupational Prestige Data
-----------------------------------
Description
~~~~~~~~~~~
The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
prestige and other characteristics of 45 U. S. occupations in 1950.
Usage
~~~~~
::
Duncan
Format
~~~~~~
This data frame contains the following columns:
type
Type of occupation. A factor with the following levels: ``prof``,
professional and managerial; ``wc``, white-collar; ``bc``,
blue-collar.
income
Percent of males in occupation earning $3500 or more in 1950.
education
Percent of males in occupation in 1950 who were high-school
graduates.
prestige
Percent of raters in NORC study rating occupation as excellent or
good in prestige.
Source
~~~~~~
Duncan, O. D. (1961) A socioeconomic index for all occupations. In
Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free Press
[Table VI-1].
References
~~~~~~~~~~
Fox, J. (2008) *Applied Regression Analysis and Generalized Linear
Models*, Second Edition. Sage.
Fox, J. and Weisberg, S. (2011) *An R Companion to Applied Regression*,
Second Edition, Sage.
In [4]: duncan_prestige.data.head(5)
Out[4]:
type income education prestige
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90
R Datasets Function Reference¶
get_rdataset (dataname[, package, cache]) |
download and return R dataset |
get_data_home ([data_home]) |
Return the path of the statsmodels data dir. |
clear_data_home ([data_home]) |
Delete all the content of the data home cache. |
Available Datasets¶
- American National Election Survey 1996
- Breast Cancer Data
- Bill Greene’s credit scoring data.
- Smoking and lung cancer in eight cities in China.
- Mauna Loa Weekly Atmospheric CO2 Data
- First 100 days of the US House of Representatives 1995
- World Copper Market 1951-1975 Dataset
- US Capital Punishment dataset.
- El Nino - Sea Surface Temperatures
- Engel (1857) food expenditure data
- Affairs dataset
- World Bank Fertility Data
- Grunfeld (1950) Investment Data
- Transplant Survival Data
- Longley dataset
- United States Macroeconomic data
- Travel Mode Choice
- Nile River flows at Ashwan 1871-1970
- RAND Health Insurance Experiment Data
- Taxation Powers Vote for the Scottish Parliamant 1997
- Spector and Mazzeo (1980) - Program Effectiveness Data
- Stack loss data
- Star98 Educational Dataset
- Statewide Crime Data 2009
- U.S. Strike Duration Data
- Yearly sunspots data 1700-2008
Usage¶
Load a dataset:
In [5]: import statsmodels.api as sm
In [6]: data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern explained in proposal. The full dataset is available in the data
attribute.
In [7]: data.data
Out[7]:
rec.array([( 60323., 83. , 234289., 2356., 1590., 107608., 1947.),
( 61122., 88.5, 259426., 2325., 1456., 108632., 1948.),
( 60171., 88.2, 258054., 3682., 1616., 109773., 1949.),
( 61187., 89.5, 284599., 3351., 1650., 110929., 1950.),
( 63221., 96.2, 328975., 2099., 3099., 112075., 1951.),
( 63639., 98.1, 346999., 1932., 3594., 113270., 1952.),
( 64989., 99. , 365385., 1870., 3547., 115094., 1953.),
( 63761., 100. , 363112., 3578., 3350., 116219., 1954.),
( 66019., 101.2, 397469., 2904., 3048., 117388., 1955.),
( 67857., 104.6, 419180., 2822., 2857., 118734., 1956.),
( 68169., 108.4, 442769., 2936., 2798., 120445., 1957.),
( 66513., 110.8, 444546., 4681., 2637., 121950., 1958.),
( 68655., 112.6, 482704., 3813., 2552., 123366., 1959.),
( 69564., 114.2, 502601., 3931., 2514., 125368., 1960.),
( 69331., 115.7, 518173., 4806., 2572., 127852., 1961.),
( 70551., 116.9, 554894., 4007., 2827., 130081., 1962.)],
dtype=[('TOTEMP', '<f8'), ('GNPDEFL', '<f8'), ('GNP', '<f8'), ('UNEMP', '<f8'), ('ARMED', '<f8'), ('POP', '<f8'), ('YEAR', '<f8')])
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [8]: data.endog[:5]
Out[8]: array([ 60323., 61122., 60171., 61187., 63221.])
In [9]: data.exog[:5,:]
Out[9]:
array([[ 83. , 234289. , 2356. , 1590. , 107608. , 1947. ],
[ 88.5, 259426. , 2325. , 1456. , 108632. , 1948. ],
[ 88.2, 258054. , 3682. , 1616. , 109773. , 1949. ],
[ 89.5, 284599. , 3351. , 1650. , 110929. , 1950. ],
[ 96.2, 328975. , 2099. , 3099. , 112075. , 1951. ]])
Univariate datasets, however, do not have an exog attribute.
Variable names can be obtained by typing:
In [10]: data.endog_name
Out[10]: 'TOTEMP'
In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.
In [12]: type(data.data)
Out[12]: numpy.recarray
In [13]: type(data.raw_data)
Out[13]: numpy.recarray
In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
Loading data as pandas objects¶
For many users it may be preferable to get the datasets as a pandas DataFrame or
Series object. Each of the dataset modules is equipped with a load_pandas
method which returns a Dataset
instance with the data readily available as pandas objects:
In [15]: data = sm.datasets.longley.load_pandas()
In [16]: data.exog
Out[16]:
GNPDEFL GNP UNEMP ARMED POP YEAR
0 83.0 234289.0 2356.0 1590.0 107608.0 1947.0
1 88.5 259426.0 2325.0 1456.0 108632.0 1948.0
2 88.2 258054.0 3682.0 1616.0 109773.0 1949.0
3 89.5 284599.0 3351.0 1650.0 110929.0 1950.0
4 96.2 328975.0 2099.0 3099.0 112075.0 1951.0
5 98.1 346999.0 1932.0 3594.0 113270.0 1952.0
6 99.0 365385.0 1870.0 3547.0 115094.0 1953.0
7 100.0 363112.0 3578.0 3350.0 116219.0 1954.0
8 101.2 397469.0 2904.0 3048.0 117388.0 1955.0
9 104.6 419180.0 2822.0 2857.0 118734.0 1956.0
10 108.4 442769.0 2936.0 2798.0 120445.0 1957.0
11 110.8 444546.0 4681.0 2637.0 121950.0 1958.0
12 112.6 482704.0 3813.0 2552.0 123366.0 1959.0
13 114.2 502601.0 3931.0 2514.0 125368.0 1960.0
14 115.7 518173.0 4806.0 2572.0 127852.0 1961.0
15 116.9 554894.0 4007.0 2827.0 130081.0 1962.0
In [17]: data.endog
Out[17]:
0 60323.0
1 61122.0
2 60171.0
3 61187.0
4 63221.0
5 63639.0
6 64989.0
7 63761.0
8 66019.0
9 67857.0
10 68169.0
11 66513.0
12 68655.0
13 69564.0
14 69331.0
15 70551.0
Name: TOTEMP, dtype: float64
The full DataFrame is available in the data
attribute of the Dataset object
In [18]: data.data
Out[18]:
TOTEMP GNPDEFL GNP UNEMP ARMED POP YEAR
0 60323.0 83.0 234289.0 2356.0 1590.0 107608.0 1947.0
1 61122.0 88.5 259426.0 2325.0 1456.0 108632.0 1948.0
2 60171.0 88.2 258054.0 3682.0 1616.0 109773.0 1949.0
3 61187.0 89.5 284599.0 3351.0 1650.0 110929.0 1950.0
4 63221.0 96.2 328975.0 2099.0 3099.0 112075.0 1951.0
5 63639.0 98.1 346999.0 1932.0 3594.0 113270.0 1952.0
6 64989.0 99.0 365385.0 1870.0 3547.0 115094.0 1953.0
7 63761.0 100.0 363112.0 3578.0 3350.0 116219.0 1954.0
8 66019.0 101.2 397469.0 2904.0 3048.0 117388.0 1955.0
9 67857.0 104.6 419180.0 2822.0 2857.0 118734.0 1956.0
10 68169.0 108.4 442769.0 2936.0 2798.0 120445.0 1957.0
11 66513.0 110.8 444546.0 4681.0 2637.0 121950.0 1958.0
12 68655.0 112.6 482704.0 3813.0 2552.0 123366.0 1959.0
13 69564.0 114.2 502601.0 3931.0 2514.0 125368.0 1960.0
14 69331.0 115.7 518173.0 4806.0 2572.0 127852.0 1961.0
15 70551.0 116.9 554894.0 4007.0 2827.0 130081.0 1962.0
With pandas integration in the estimation classes, the metadata will be attached to model results:
Extra Information¶
If you want to know more about the dataset itself, you can access the following, again using the Longley dataset as an example
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
Additional information¶
- The idea for a datasets package was originally proposed by David Cournapeau and can be found here with updates by Skipper Seabold.
- To add datasets, see the notes on adding a dataset.