Class07
Regression
Python
Import packages
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
import statsmodels.api as sm
from linearmodels import PanelOLS
Import data
url = 'https://www.dropbox.com/s/uso1u9asqam7rp1/merged.dta?dl=1'
data = pd.read_stata(url)
data['yyyymm'] = data['yyyymm'].astype(int)
data = data.sort_values(['cusip', 'yyyymm'], ignore_index=True)
Prepare regression data
data['ret'] = pd.to_numeric(data['ret'], errors='coerce')
data['year'] = (data['yyyymm']/100).astype(int)
data = data.drop_duplicates(['cusip', 'year']).copy()
data['lnme'] = np.log(data['me'])
data = data.sort_values(['cusip', 'year'], ignore_index=True)
data['lag_lnme'] = data.groupby('cusip')['lnme'].shift(1)
data['lag_year'] = data.groupby('cusip')['year'].shift(1)
data['year_diff'] = data['year'] - data['lag_year']
data.loc[data['year_diff']!=1, 'lag_lnme'] = np.nan
data = data.dropna(subset=['lag_lnme', 'ret'], how='any')
data = data[['cusip', 'year', 'ret', 'lag_lnme']]
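The lag construction above can be checked on a small example. The sketch below uses a made-up toy panel (the firms "A" and "B" and their values are hypothetical, not from merged.dta) to show that the within-firm shift never crosses firms, and that the year_diff filter discards a lag that spans a missing year.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: firm "A" skips 2002, firm "B" has no gaps.
toy = pd.DataFrame({
    'cusip': ['A', 'A', 'A', 'B', 'B'],
    'year':  [2000, 2001, 2003, 2000, 2001],
    'lnme':  [1.0, 1.5, 2.0, 3.0, 3.5],
})
toy = toy.sort_values(['cusip', 'year'], ignore_index=True)

# Shift within each firm so that lags never cross firm boundaries.
toy['lag_lnme'] = toy.groupby('cusip')['lnme'].shift(1)
toy['lag_year'] = toy.groupby('cusip')['year'].shift(1)

# Keep the lag only when the previous row is exactly one year earlier.
# The first row of each firm has a missing lag_year, so it is blanked too.
toy.loc[toy['year'] - toy['lag_year'] != 1, 'lag_lnme'] = np.nan

print(toy[['cusip', 'year', 'lnme', 'lag_lnme']])
```

Only the 2001 rows keep a lag here: the 2000 rows have no predecessor, and firm "A" in 2003 follows a two-year gap.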
OLS regression
est = sm.OLS(data['ret'], sm.add_constant(data['lag_lnme'])).fit()
est.summary()
# Heteroskedasticity-robust (HC0) standard errors;
# use_t=True reports t statistics instead of z statistics
(sm.OLS(data['ret'], sm.add_constant(data['lag_lnme']))
    .fit(cov_type='hc0', use_t=True).summary())
Dep. Variable: | ret | R-squared: | 0.001 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.001 |
Method: | Least Squares | F-statistic: | 45.41 |
Date: | Wed, 16 Mar 2022 | Prob (F-statistic): | 1.61e-11 |
Time: | 22:05:10 | Log-Likelihood: | 29162. |
No. Observations: | 60822 | AIC: | -5.832e+04 |
Df Residuals: | 60820 | BIC: | -5.830e+04 |
Df Model: | 1 | | |
Covariance Type: | hc0 | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0224 | 0.002 | 9.586 | 0.000 | 0.018 | 0.027 |
| lag_lnme | -0.0022 | 0.000 | -6.739 | 0.000 | -0.003 | -0.002 |
Omnibus: | 51007.444 | Durbin-Watson: | 1.971 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 7467699.262 |
Skew: | 3.364 | Prob(JB): | 0.00 |
Kurtosis: | 56.865 | Cond. No. | 20.4 |
Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)
Fitted values and residuals
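# Index the coefficients by label; positional indexing such as est.params[0]
# on a label-indexed Series is deprecated in recent pandas versions
data['a'] = est.params['const']
data['b'] = est.params['lag_lnme']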
data['p_calc'] = data['a'] + data['b']*data['lag_lnme']
data['p_est'] = est.predict()
data['e_calc'] = data['ret'] - data['p_calc']
data['e_est'] = est.resid
Panel regression
data1 = data.set_index(['cusip', 'year'])
panel_est = (PanelOLS(data1['ret'], sm.add_constant(data1['lag_lnme']),
entity_effects=True).fit())
panel_est.summary
| Dep. Variable: | ret | R-squared: | 0.0178 |
|---|---|---|---|
| Estimator: | PanelOLS | R-squared (Between): | -0.2650 |
| No. Observations: | 60822 | R-squared (Within): | 0.0178 |
| Date: | Wed, Mar 16 2022 | R-squared (Overall): | -0.1220 |
| Time: | 22:05:11 | Log-likelihood | 3.549e+04 |
| Cov. Estimator: | Unadjusted | | |
| | | F-statistic: | 966.69 |
| Entities: | 7407 | P-value | 0.0000 |
| Avg Obs: | 8.2114 | Distribution: | F(1,53414) |
| Min Obs: | 1.0000 | | |
| Max Obs: | 17.000 | F-statistic (robust): | 966.69 |
| | | P-value | 0.0000 |
| Time periods: | 17 | Distribution: | F(1,53414) |
| Avg Obs: | 3577.8 | | |
| Min Obs: | 3156.0 | | |
| Max Obs: | 4437.0 | | |
| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | 0.1768 | 0.0054 | 32.602 | 0.0000 | 0.1662 | 0.1874 |
| lag_lnme | -0.0276 | 0.0009 | -31.092 | 0.0000 | -0.0293 | -0.0258 |
F-test for Poolability: 1.6684
P-value: 0.0000
Distribution: F(7406,53414)
Included effects: Entity
Panel regression with robust standard error
panel_est = (PanelOLS(data1['ret'], sm.add_constant(data1['lag_lnme']),
entity_effects=True).fit(cov_type='robust'))
panel_est.summary
| Dep. Variable: | ret | R-squared: | 0.0178 |
|---|---|---|---|
| Estimator: | PanelOLS | R-squared (Between): | -0.2650 |
| No. Observations: | 60822 | R-squared (Within): | 0.0178 |
| Date: | Wed, Mar 16 2022 | R-squared (Overall): | -0.1220 |
| Time: | 22:05:11 | Log-likelihood | 3.549e+04 |
| Cov. Estimator: | Robust | | |
| | | F-statistic: | 966.69 |
| Entities: | 7407 | P-value | 0.0000 |
| Avg Obs: | 8.2114 | Distribution: | F(1,53414) |
| Min Obs: | 1.0000 | | |
| Max Obs: | 17.000 | F-statistic (robust): | 541.47 |
| | | P-value | 0.0000 |
| Time periods: | 17 | Distribution: | F(1,53414) |
| Avg Obs: | 3577.8 | | |
| Min Obs: | 3156.0 | | |
| Max Obs: | 4437.0 | | |
| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | 0.1768 | 0.0073 | 24.053 | 0.0000 | 0.1624 | 0.1912 |
| lag_lnme | -0.0276 | 0.0012 | -23.270 | 0.0000 | -0.0299 | -0.0253 |
F-test for Poolability: 1.6684
P-value: 0.0000
Distribution: F(7406,53414)
Included effects: Entity
Attention
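The linearmodels robust standard errors (cov_type='robust') differ from the robust standard errors Stata reports for xtreg with the fe option. linearmodels uses a heteroskedasticity-robust (White) estimator here, whereas Stata's xtreg, fe robust clusters on the panel variable.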
Panel regression with clustered standard error
panel_est = (PanelOLS(data1['ret'], sm.add_constant(data1['lag_lnme']),
entity_effects=True).fit(cov_type='clustered', cluster_entity=True))
panel_est.summary
| Dep. Variable: | ret | R-squared: | 0.0178 |
|---|---|---|---|
| Estimator: | PanelOLS | R-squared (Between): | -0.2650 |
| No. Observations: | 60822 | R-squared (Within): | 0.0178 |
| Date: | Wed, Mar 16 2022 | R-squared (Overall): | -0.1220 |
| Time: | 22:05:11 | Log-likelihood | 3.549e+04 |
| Cov. Estimator: | Clustered | | |
| | | F-statistic: | 966.69 |
| Entities: | 7407 | P-value | 0.0000 |
| Avg Obs: | 8.2114 | Distribution: | F(1,53414) |
| Min Obs: | 1.0000 | | |
| Max Obs: | 17.000 | F-statistic (robust): | 509.43 |
| | | P-value | 0.0000 |
| Time periods: | 17 | Distribution: | F(1,53414) |
| Avg Obs: | 3577.8 | | |
| Min Obs: | 3156.0 | | |
| Max Obs: | 4437.0 | | |
| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | 0.1768 | 0.0074 | 23.806 | 0.0000 | 0.1622 | 0.1913 |
| lag_lnme | -0.0276 | 0.0012 | -22.571 | 0.0000 | -0.0300 | -0.0252 |
F-test for Poolability: 1.6684
P-value: 0.0000
Distribution: F(7406,53414)
Included effects: Entity
Attention
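The linearmodels standard errors clustered by entity (cov_type='clustered', cluster_entity=True) are the same as the Stata robust standard errors from xtreg with the fe option, because Stata's robust option there clusters on the panel variable.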
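To make the distinction between the two covariance estimators concrete, here is a minimal NumPy sketch of both sandwich formulas on a synthetic panel. It is an illustration under simplifying assumptions, not a reimplementation of either package: the helper names sandwich_se and demean are hypothetical, and no finite-sample corrections are applied, so the numbers will not reproduce linearmodels or Stata output exactly.

```python
import numpy as np

def sandwich_se(X, resid, groups=None):
    """Sandwich standard errors without finite-sample corrections.

    groups=None gives the White (heteroskedasticity-robust) estimator;
    otherwise the scores are summed within each group before forming the meat.
    """
    X = np.asarray(X, dtype=float)
    scores = X * np.asarray(resid, dtype=float)[:, None]
    if groups is not None:
        _, codes = np.unique(groups, return_inverse=True)
        summed = np.zeros((codes.max() + 1, X.shape[1]))
        np.add.at(summed, codes, scores)  # accumulate scores per cluster
        scores = summed
    bread = np.linalg.inv(X.T @ X)
    meat = scores.T @ scores
    return np.sqrt(np.diag(bread @ meat @ bread))

rng = np.random.default_rng(3)
n_firms, n_years = 50, 8
firm = np.repeat(np.arange(n_firms), n_years)
x = rng.normal(size=firm.size)
# Errors follow a random walk within each firm, so they are serially correlated.
u = rng.normal(size=(n_firms, n_years)).cumsum(axis=1).ravel()
y = -0.5 * x + u

def demean(v):
    """Subtract firm means (the entity fixed-effects transformation)."""
    means = np.bincount(firm, weights=v) / np.bincount(firm)
    return v - means[firm]

X_w = demean(x)[:, None]
y_w = demean(y)
beta = np.linalg.lstsq(X_w, y_w, rcond=None)[0]
resid = y_w - X_w @ beta

se_white = sandwich_se(X_w, resid)                  # like cov_type='robust'
se_cluster = sandwich_se(X_w, resid, groups=firm)   # like clustering by entity
print(se_white, se_cluster)
```

The only difference between the two calls is whether the scores are summed within firm before being squared. If every observation were its own cluster, the two estimators would coincide.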
Stata
Import data
local data_url "https://www.dropbox.com/s/uso1u9asqam7rp1/merged.dta?dl=1"
use "`data_url'", clear
Prepare regression data
rename ret ret_str
gen ret = real(ret_str)
drop ret_str
gen year = int(yyyymm/100)
duplicates drop cusip year, force
gen lnme = ln(me)
sort cusip year
by cusip: gen lag_lnme = lnme[_n-1]
by cusip: gen lag_year = year[_n-1]
gen year_diff = year - lag_year
replace lag_lnme = . if year_diff!=1
OLS regression
reg ret lag_lnme
reg ret lag_lnme, robust
Fitted values and residuals
gen a = _b[_cons]
gen b = _b[lag_lnme]
gen y_calc = a + b*lag_lnme
predict y_est
gen e_calc = ret - y_calc
predict e_est, resid
Panel regression
encode cusip, gen(stock_id)
xtset stock_id year
xtreg ret lag_lnme, fe
Panel regression with robust standard error
xtreg ret lag_lnme, fe robust