Generalized Linear Models (Formula) ===================================== .. _glm_formula_notebook: `Link to Notebook GitHub <https://github.com/statsmodels/statsmodels/blob/master/examples/notebooks/glm_formula.ipynb>`_ .. raw:: html <div class="cell border-box-sizing text_cell rendered"> <div class="prompt input_prompt"> </div> <div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p>This notebook illustrates how you can use R-style formulas to fit Generalized Linear Models.</p> <p>To begin, we load the <code>Star98</code> dataset and we construct a formula and pre-process the data:</p> </div> </div> </div> <div class="cell border-box-sizing code_cell rendered"> <div class="input"> <div class="prompt input_prompt">In [ ]:</div> <div class="inner_cell"> <div class="input_area"> <div class=" highlight hl-ipython3"><pre><span class="kn">from</span> <span class="nn">__future__</span> <span class="k">import</span> <span class="n">print_function</span> <span class="kn">import</span> <span class="nn">statsmodels.api</span> <span class="k">as</span> <span class="nn">sm</span> <span class="kn">import</span> <span class="nn">statsmodels.formula.api</span> <span class="k">as</span> <span class="nn">smf</span> <span class="n">star98</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">star98</span><span class="o">.</span><span class="n">load_pandas</span><span class="p">()</span><span class="o">.</span><span class="n">data</span> <span class="n">formula</span> <span class="o">=</span> <span class="s">'SUCCESS ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT + </span><span class="se">\</span> <span class="s"> PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'</span> <span class="n">dta</span> <span class="o">=</span> <span class="n">star98</span><span class="p">[[</span><span class="s">'NABOVE'</span><span class="p">,</span> <span class="s">'NBELOW'</span><span class="p">,</span> <span class="s">'LOWINC'</span><span class="p">,</span> <span class="s">'PERASIAN'</span><span class="p">,</span> <span class="s">'PERBLACK'</span><span class="p">,</span> <span class="s">'PERHISP'</span><span class="p">,</span> <span class="s">'PCTCHRT'</span><span class="p">,</span> <span class="s">'PCTYRRND'</span><span class="p">,</span> <span class="s">'PERMINTE'</span><span class="p">,</span> <span class="s">'AVYRSEXP'</span><span class="p">,</span> <span class="s">'AVSALK'</span><span class="p">,</span> <span class="s">'PERSPENK'</span><span class="p">,</span> <span class="s">'PTRATIO'</span><span class="p">,</span> <span class="s">'PCTAF'</span><span class="p">]]</span> <span class="n">endog</span> <span class="o">=</span> <span class="n">dta</span><span class="p">[</span><span class="s">'NABOVE'</span><span class="p">]</span> <span class="o">/</span> <span class="p">(</span><span class="n">dta</span><span class="p">[</span><span class="s">'NABOVE'</span><span class="p">]</span> <span class="o">+</span> <span class="n">dta</span><span class="o">.</span><span class="n">pop</span><span class="p">(</span><span class="s">'NBELOW'</span><span class="p">))</span> <span class="k">del</span> <span class="n">dta</span><span class="p">[</span><span class="s">'NABOVE'</span><span class="p">]</span> <span class="n">dta</span><span class="p">[</span><span class="s">'SUCCESS'</span><span class="p">]</span> <span class="o">=</span> <span class="n">endog</span> </pre></div> </div> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"> <div class="prompt input_prompt"> </div> <div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p>Then, we fit the GLM model:</p> </div> </div> </div> <div class="cell border-box-sizing code_cell rendered"> <div class="input"> <div class="prompt input_prompt">In [ ]:</div> <div class="inner_cell"> <div class="input_area"> <div class=" highlight hl-ipython3"><pre><span class="n">mod1</span> <span class="o">=</span> <span class="n">smf</span><span class="o">.</span><span class="n">glm</span><span class="p">(</span><span class="n">formula</span><span class="o">=</span><span class="n">formula</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">dta</span><span class="p">,</span> <span class="n">family</span><span class="o">=</span><span class="n">sm</span><span class="o">.</span><span class="n">families</span><span class="o">.</span><span class="n">Binomial</span><span class="p">())</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span> <span class="n">mod1</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span> </pre></div> </div> </div> </div> <div class="output_wrapper"> <div class="output"> <div class="output_area"><div class="prompt"></div> <div class="output_subarea output_stream output_stderr output_text"> <pre>/Users/tom.augspurger/Envs/py3/lib/python3.4/site-packages/IPython/kernel/__main__.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy </pre> </div> </div> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"> <div class="prompt input_prompt"> </div> <div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p>Finally, we define a function to operate customized data transformation using the formula framework:</p> </div> </div> </div> <div class="cell border-box-sizing code_cell rendered"> <div class="input"> <div class="prompt input_prompt">In [ ]:</div> <div class="inner_cell"> <div class="input_area"> <div class=" highlight hl-ipython3"><pre><span class="k">def</span> <span class="nf">double_it</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">x</span> <span class="n">formula</span> <span class="o">=</span> <span class="s">'SUCCESS ~ double_it(LOWINC) + PERASIAN + PERBLACK + PERHISP + PCTCHRT + </span><span class="se">\</span> <span class="s"> PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'</span> <span class="n">mod2</span> <span class="o">=</span> <span class="n">smf</span><span class="o">.</span><span class="n">glm</span><span class="p">(</span><span class="n">formula</span><span class="o">=</span><span class="n">formula</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">dta</span><span class="p">,</span> <span class="n">family</span><span class="o">=</span><span class="n">sm</span><span class="o">.</span><span class="n">families</span><span class="o">.</span><span class="n">Binomial</span><span class="p">())</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span> <span class="n">mod2</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span> </pre></div> </div> </div> </div> <div class="output_wrapper"> <div class="output"> <div class="output_area"><div class="prompt"></div> </div> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"> <div class="prompt input_prompt"> </div> <div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p>As expected, the coefficient for <code>double_it(LOWINC)</code> in the second model is half the size of the <code>LOWINC</code> coefficient from the first model:</p> </div> </div> </div> <div class="cell border-box-sizing code_cell rendered"> <div class="input"> <div class="prompt input_prompt">In [ ]:</div> <div class="inner_cell"> <div class="input_area"> <div class=" highlight hl-ipython3"><pre><span class="nb">print</span><span class="p">(</span><span class="n">mod1</span><span class="o">.</span><span class="n">params</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="nb">print</span><span class="p">(</span><span class="n">mod2</span><span class="o">.</span><span class="n">params</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span> </pre></div> </div> </div> </div> <div class="output_wrapper"> <div class="output"> <div class="output_area"><div class="prompt"></div> </div> </div> </div> </div> <script src="https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"type="text/javascript"></script> <script type="text/javascript"> init_mathjax = function() { if (window.MathJax) { // MathJax loaded MathJax.Hub.Config({ tex2jax: { // I'm not sure about the \( and \[ below. It messes with the // prompt, and I think it's an issue with the template. -SS inlineMath: [ ['$','$'], ["\\(","\\)"] ], displayMath: [ ['$$','$$'], ["\\[","\\]"] ] }, displayAlign: 'left', // Change this to 'center' to center equations. "HTML-CSS": { styles: {'.MathJax_Display': {"margin": 0}} } }); MathJax.Hub.Queue(["Typeset",MathJax.Hub]); } } init_mathjax(); // since we have to load this in a ..raw:: directive we will add the css // after the fact function loadcssfile(filename){ var fileref=document.createElement("link") fileref.setAttribute("rel", "stylesheet") fileref.setAttribute("type", "text/css") fileref.setAttribute("href", filename) document.getElementsByTagName("head")[0].appendChild(fileref) } // loadcssfile({{pathto("_static/nbviewer.pygments.css", 1) }}) // loadcssfile({{pathto("_static/nbviewer.min.css", 1) }}) loadcssfile("../../../_static/nbviewer.pygments.css") loadcssfile("../../../_static/ipython.min.css") </script>