NumPy¶

import numpy as np
import wooldridge as woo
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import pickle

Regresión Simple¶

\[y = \beta_0 +\beta_1 x + u\]

\[\hat\beta_0 = \bar{y}-\hat\beta_1 \bar{x}\]

\[\hat\beta_1= \frac{Cov(x,y)}{Var(x)}\]

Wooldridge 2016, ejemplo-2-3¶

Cargamos las regresiones¶

archi_o = open(r'../Lab8/regsmf.pkl', 'rb')
reg_smf = pickle.load(archi_o)
archi_o.close()

reg_smf.fit().summary()

OLS Regression Results
Dep. Variable:	salary	R-squared:	0.013
Model:	OLS	Adj. R-squared:	0.008
Method:	Least Squares	F-statistic:	2.767
Date:	Fri, 21 May 2021	Prob (F-statistic):	0.0978
Time:	07:48:57	Log-Likelihood:	-1804.5
No. Observations:	209	AIC:	3613.
Df Residuals:	207	BIC:	3620.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	963.1913	213.240	4.517	0.000	542.790	1383.592
roe	18.5012	11.123	1.663	0.098	-3.428	40.431

Omnibus:	311.096	Durbin-Watson:	2.105
Prob(Omnibus):	0.000	Jarque-Bera (JB):	31120.902
Skew:	6.915	Prob(JB):	0.00
Kurtosis:	61.158	Cond. No.	43.3

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Bondad de ajuste¶

Suma Total de Cuadrados

\[ \mathrm{STC} \equiv \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \]

Suma Explicada de Cuadrados

\[ \mathrm{SEC} \equiv \sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2} \]

Suma Residual de Cuadrados

\[ \mathrm{SRC} \equiv \sum_{i=1}^{n} \hat{u}_{i}^{2} \]

Bondad de ajuste

\[ R^{2} \equiv \mathrm{SEC} / \mathrm{STC}=1-\mathrm{SRC} / \mathrm{STC} \]

y = reg_smf.endog
x = reg_smf.exog[:,1]

ym = np.mean(y)
yh = reg_smf.fit().fittedvalues
yh.head()
# imprimir tipos de datos

  1224.058071
  1164.854261
  1397.969216
  1072.348338
  1218.507712
dtype: float64

yh = reg_smf.fit().predict(exog=dict(roe=x))
yh.head()

  1224.058071
  1164.854261
  1397.969216
  1072.348338
  1218.507712
dtype: float64

stc = np.sum(np.power(y-ym,2))
stc

391732982.00956935

sec = np.sum(np.power(yh-ym,2))
sec

5166419.039866708

uh = y - yh   # residuales

src = np.sum(uh ** 2)
src

386566562.96970266

stc == sec + src

True

r2 = 1 - src/stc
print("Bondad de ajuste",r2)

Bondad de ajuste 0.01318862408103405

Errores estándar¶

Estimador de la varianza \(\sigma^2\) de los errores \(u_i\)

\[ \hat{\sigma}^{2}=\frac{1}{(n-2)} \sum_{i=1}^{n} \hat{u}_{i}^{2}=\operatorname{SRC} /(n-2) \]

Error estándar de la regresion (EER, error estandar de la estimación, raí< del error cuadratico medio) es la estimación de la desviación estandar \(\sigma\) de los errores \(u_i\)

\[\hat\sigma = \sqrt{\hat{\sigma}^{2}}=\sqrt{\frac{1}{(n-2)} \sum_{i=1}^{n} \hat{u}_{i}^{2}}=\sqrt{\operatorname{SRC} /(n-2)}\]

Error estándar de \(\hat\beta_0\), es el estimador de la desviación estándar de \(\hat\beta_0\)

\[ ee(\beta_0)=\hat{\operatorname{de}}\left(\hat{\beta}_{0}\right)=\hat{\sigma} \sqrt{\frac{\mathrm{SC}_x}{n\mathrm{STC}_x}}=\hat{\sigma} \sqrt{\frac{\sum_{i=1}^{n}x_i^2}{n\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}} \]

Error estándar de \(\hat\beta_1\), es el estimador de la desviación estándar de \(\hat\beta_1\)

\[ ee(\beta_1)=\hat{\operatorname{de}}\left(\hat{\beta}_{1}\right)=\hat{\sigma} / \sqrt{\mathrm{STC}_x}=\hat{\sigma} /\sqrt{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \]

n = reg_smf.nobs
varhu = src/(n-2)
varhu

1867471.3186942157

eer = np.sqrt(varhu)
eer

1366.5545428903363

xm = np.mean(x)
stcx = np.sum(np.power(x-xm,2))
scx = np.sum(np.power(x,2))
eeb0 = eer*np.sqrt(scx/(n*stcx))
eeb0

213.24025690501887

eeb1 = eer/np.sqrt(stcx)
eeb1

11.123250903287637

Estadísticos¶

Bajo las condiciones adecuadas se tiene que:

\[ \left(\hat{\beta}_{j}-\beta_{j}\right) / \operatorname{ee}\left(\hat{\beta}_{j}\right) \sim t_{n-2},\quad j\in\{0,1\} \]

Se pueden plantear las siguientes pruebas de hipótesis

\[H_0:\beta_j=0 \qquad H_a: \beta_j\neq 0\]

Para cada muestra de datos \((x,y)_m\) se tienen las estimaciones \(\hat\beta_j\) de \(\beta_j\), es por ello que se habla de la distribución de la variable aleatoria indicada.

Para una muestra dada, si aplicamos la hipótesis nula

\[\hat{\beta}_{j}/ \operatorname{ee}\left(\hat{\beta}_{j}\right)=t_{\hat\beta_j}\]

regresion = reg_smf.fit()
b0=regresion.params[0]
tb0 = b0/eeb0
tb0

4.516930107158806

b1=regresion.params[1]
tb1 = b1/eeb1
tb1

1.66328949208065

Valor \(p\)¶

\[p_{\hat\beta_j}\text{-value}=P(|t_{n-2}|> |t_{\hat\beta_j}|) = P(t_{n-2}<-|t_{\hat\beta_j}|\cup |t_{\hat\beta_j}|<t_{n-2})=P(t_{n-2}<-|t_{\hat\beta_j}|)+P(|t_{\hat\beta_j}|<t_{n-2})\]

Para ello necesitamos el submódulo de scipy: stats, en este caso la distribución \(t\) tiene la siguiente descripción

\[ f(x, \nu)=\frac{\Gamma((\nu+1) / 2)}{\sqrt{\pi \nu} \Gamma(\nu / 2)}\left(1+x^{2} / \nu\right)^{-(\nu+1) / 2} \]

donde \(\nu\) son los grados de libertad, y t.pdf(x, df, loc, scale), con la sigeuinte tranformación y = (x - loc) / scale.

from scipy.stats import t

df = 207

rv = t(df)

pb0 = rv.cdf(-tb0) + (1-rv.cdf(tb0)) # argumentos de cdf: loc = 0, scale = 1 
pb0

1.0533519915197438e-05

pb1 = rv.cdf(-tb1) + (1-rv.cdf(tb1))
pb1

0.09776774891928593

regresion.pvalues

Intercept    0.000011
roe          0.097768
dtype: float64

Introducción a la ciencia de datos