Class07

Regression

Python

  • Import packages

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
import statsmodels.api as sm
from linearmodels import PanelOLS
  • Import data

url = 'https://www.dropbox.com/s/uso1u9asqam7rp1/merged.dta?dl=1'
data = pd.read_stata(url)
data['yyyymm'] = data['yyyymm'].astype(int)
data = data.sort_values(['cusip', 'yyyymm'], ignore_index=True)
  • Prepare regression data

data['ret'] = pd.to_numeric(data['ret'], errors='coerce')
data['year'] = (data['yyyymm']/100).astype(int)
data = data.drop_duplicates(['cusip', 'year']).copy()
data['lnme'] = np.log(data['me'])
data = data.sort_values(['cusip', 'year'], ignore_index=True)
data['lag_lnme'] = data.groupby('cusip')['lnme'].shift(1)
data['lag_year'] = data.groupby('cusip')['year'].shift(1)
data['year_diff'] = data['year'] - data['lag_year']
data.loc[data['year_diff']!=1, 'lag_lnme'] = np.nan
data = data.dropna(subset=['lag_lnme', 'ret'], how='any')
data = data[['cusip', 'year', 'ret', 'lag_lnme']]
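The lag construction above can be checked on a toy panel. This is a minimal sketch with made-up values (the `toy` frame and its numbers are illustrative, not taken from merged.dta). It shows why the lag is taken within `cusip` and why it is set to missing when the previous row is not exactly one year earlier.

```python
import numpy as np
import pandas as pd

# Toy panel: firm A skips 2002, firm B is complete (hypothetical values)
toy = pd.DataFrame({
    'cusip': ['A', 'A', 'A', 'B', 'B'],
    'year':  [2000, 2001, 2003, 2000, 2001],
    'lnme':  [1.0, 1.1, 1.3, 2.0, 2.1],
})
toy = toy.sort_values(['cusip', 'year'], ignore_index=True)

# Lag within each firm, so B's first row does not pick up A's last value
toy['lag_lnme'] = toy.groupby('cusip')['lnme'].shift(1)
toy['lag_year'] = toy.groupby('cusip')['year'].shift(1)

# Keep the lag only when the previous row is exactly one year earlier
toy.loc[toy['year'] - toy['lag_year'] != 1, 'lag_lnme'] = np.nan
print(toy[['cusip', 'year', 'lnme', 'lag_lnme']])
```

Only A/2001 and B/2001 keep a lag; A/2003 loses it because its previous observation is from 2001.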
  • OLS regression

est = sm.OLS(data['ret'], sm.add_constant(data['lag_lnme'])).fit()
est.summary()

# heteroskedasticity-robust (HC0) standard errors
(sm.OLS(data['ret'], sm.add_constant(data['lag_lnme']))
    .fit(cov_type='hc0', use_t=True).summary())
OLS Regression Results

Dep. Variable:                    ret   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     45.41
Date:                Wed, 16 Mar 2022   Prob (F-statistic):           1.61e-11
Time:                        22:05:10   Log-Likelihood:                 29162.
No. Observations:               60822   AIC:                        -5.832e+04
Df Residuals:                   60820   BIC:                        -5.830e+04
Df Model:                           1
Covariance Type:                  hc0

                 coef    std err          t      P>|t|      [0.025      0.975]
const          0.0224      0.002      9.586      0.000       0.018       0.027
lag_lnme      -0.0022      0.000     -6.739      0.000      -0.003      -0.002

Omnibus:                    51007.444   Durbin-Watson:                   1.971
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          7467699.262
Skew:                           3.364   Prob(JB):                         0.00
Kurtosis:                      56.865   Cond. No.                         20.4

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)
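For reference, a minimal numpy sketch of what a heteroskedasticity-robust covariance computes, on simulated data. All names and numbers below are illustrative assumptions, and this is the textbook sandwich formula rather than statsmodels' internal code. HC1, which adds the n/(n-k) small-sample factor and is generally understood to be what Stata's `reg ..., robust` reports, is shown for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
# Heteroskedastic noise: the error spread grows with |x| (simulated data)
y = 0.02 - 0.002 * x + rng.normal(size=n) * (0.5 + np.abs(x))

X = np.column_stack([np.ones(n), x])      # constant + regressor
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                  # OLS coefficients
u = y - X @ beta                          # residuals
k = X.shape[1]

# Classical covariance: s^2 (X'X)^-1
cov_classic = (u @ u) / (n - k) * XtX_inv
# HC0 sandwich: (X'X)^-1 [sum_i u_i^2 x_i x_i'] (X'X)^-1
meat = (X * u[:, None] ** 2).T @ X
cov_hc0 = XtX_inv @ meat @ XtX_inv
# HC1 rescales HC0 by n / (n - k)
cov_hc1 = cov_hc0 * n / (n - k)

se_classic = np.sqrt(np.diag(cov_classic))
se_hc0 = np.sqrt(np.diag(cov_hc0))
se_hc1 = np.sqrt(np.diag(cov_hc1))
print(se_classic, se_hc0, se_hc1)
```

The coefficients are identical under every covariance choice; only the standard errors, and therefore the t-statistics and confidence intervals, change.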
  • Fitted values and residuals

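# index the coefficients by name; positional [0]/[1] on a labelled Series is deprecated in pandas
data['a'] = est.params['const']
data['b'] = est.params['lag_lnme']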

data['p_calc'] = data['a'] + data['b']*data['lag_lnme']
data['p_est'] = est.predict()

data['e_calc'] = data['ret'] - data['p_calc']
data['e_est'] = est.resid
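The manual columns (`p_calc`, `e_calc`) should agree with the library ones (`p_est`, `e_est`). A small sketch on simulated data, assuming nothing beyond numpy (the variable names mirror the ones above, but the data are made up), illustrates the calculation and two properties the residuals of an OLS fit with a constant should satisfy.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 + 2.0 * x + rng.normal(size=200)   # simulated data

# Least-squares intercept a and slope b
X = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

p_calc = a + b * x      # fitted values
e_calc = y - p_calc     # residuals

# With a constant in the model, residuals sum to zero
# and are orthogonal to the regressor (up to rounding error)
print(e_calc.sum(), (e_calc * x).sum())
```

On the real data, `np.allclose(data['p_calc'], data['p_est'])` is a quick way to run the same check.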
  • Panel regression

data1 = data.set_index(['cusip', 'year'])
panel_est = (PanelOLS(data1['ret'], sm.add_constant(data1['lag_lnme']),
    entity_effects=True).fit())
panel_est.summary
PanelOLS Estimation Summary

Dep. Variable:                    ret   R-squared:                      0.0178
Estimator:                   PanelOLS   R-squared (Between):           -0.2650
No. Observations:               60822   R-squared (Within):             0.0178
Date:                Wed, Mar 16 2022   R-squared (Overall):           -0.1220
Time:                        22:05:11   Log-likelihood               3.549e+04
Cov. Estimator:            Unadjusted
                                        F-statistic:                    966.69
Entities:                        7407   P-value                         0.0000
Avg Obs:                       8.2114   Distribution:               F(1,53414)
Min Obs:                       1.0000
Max Obs:                       17.000   F-statistic (robust):           966.69
                                        P-value                         0.0000
Time periods:                      17   Distribution:               F(1,53414)
Avg Obs:                       3577.8
Min Obs:                       3156.0
Max Obs:                       4437.0

Parameter Estimates

            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
const          0.1768     0.0054     32.602     0.0000      0.1662      0.1874
lag_lnme      -0.0276     0.0009    -31.092     0.0000     -0.0293     -0.0258

F-test for Poolability: 1.6684
P-value: 0.0000
Distribution: F(7406,53414)

Included effects: Entity
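The slope changes from -0.0022 in pooled OLS to -0.0276 once entity effects are included. A minimal sketch of why, on simulated data (firm counts, coefficients, and names below are illustrative assumptions, not the course data): the entity-effects slope equals OLS on data demeaned within each firm, which also equals OLS with one dummy per firm.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_firms, n_years = 50, 6
firm = np.repeat(np.arange(n_firms), n_years)
alpha = rng.normal(size=n_firms)[firm]          # firm fixed effect
x = alpha + rng.normal(size=firm.size)          # regressor correlated with the effect
y = alpha - 0.03 * x + 0.1 * rng.normal(size=firm.size)
df = pd.DataFrame({'firm': firm, 'x': x, 'y': y})

# Within transformation: subtract each firm's own mean
xd = df['x'] - df.groupby('firm')['x'].transform('mean')
yd = df['y'] - df.groupby('firm')['y'].transform('mean')
b_within = (xd * yd).sum() / (xd ** 2).sum()

# Same slope from OLS with one dummy per firm (LSDV)
D = pd.get_dummies(df['firm']).to_numpy(dtype=float)
X = np.column_stack([df['x'].to_numpy(), D])
b_lsdv = np.linalg.lstsq(X, df['y'].to_numpy(), rcond=None)[0][0]

# Pooled OLS ignores the firm effect and gives a different slope
b_pooled = np.polyfit(df['x'], df['y'], 1)[0]
print(b_within, b_lsdv, b_pooled)
```

In this simulation the pooled slope is pulled away from the true value because `x` is correlated with the firm effect; demeaning removes that effect.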
  • Panel regression with robust standard error

panel_est = (PanelOLS(data1['ret'], sm.add_constant(data1['lag_lnme']),
    entity_effects=True).fit(cov_type='robust'))
panel_est.summary
PanelOLS Estimation Summary

Dep. Variable:                    ret   R-squared:                      0.0178
Estimator:                   PanelOLS   R-squared (Between):           -0.2650
No. Observations:               60822   R-squared (Within):             0.0178
Date:                Wed, Mar 16 2022   R-squared (Overall):           -0.1220
Time:                        22:05:11   Log-likelihood               3.549e+04
Cov. Estimator:                Robust
                                        F-statistic:                    966.69
Entities:                        7407   P-value                         0.0000
Avg Obs:                       8.2114   Distribution:               F(1,53414)
Min Obs:                       1.0000
Max Obs:                       17.000   F-statistic (robust):           541.47
                                        P-value                         0.0000
Time periods:                      17   Distribution:               F(1,53414)
Avg Obs:                       3577.8
Min Obs:                       3156.0
Max Obs:                       4437.0

Parameter Estimates

            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
const          0.1768     0.0073     24.053     0.0000      0.1624      0.1912
lag_lnme      -0.0276     0.0012    -23.270     0.0000     -0.0299     -0.0253

F-test for Poolability: 1.6684
P-value: 0.0000
Distribution: F(7406,53414)

Included effects: Entity

Attention

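The linearmodels robust standard errors (cov_type='robust') differ from those of Stata's `xtreg, fe robust`. linearmodels applies a heteroskedasticity-robust (White) estimator without clustering, whereas Stata's `robust` option in `xtreg, fe` clusters by the panel variable.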

  • Panel regression with clustered standard error

panel_est = (PanelOLS(data1['ret'], sm.add_constant(data1['lag_lnme']),
    entity_effects=True).fit(cov_type='clustered', cluster_entity=True))
panel_est.summary
PanelOLS Estimation Summary

Dep. Variable:                    ret   R-squared:                      0.0178
Estimator:                   PanelOLS   R-squared (Between):           -0.2650
No. Observations:               60822   R-squared (Within):             0.0178
Date:                Wed, Mar 16 2022   R-squared (Overall):           -0.1220
Time:                        22:05:11   Log-likelihood               3.549e+04
Cov. Estimator:             Clustered
                                        F-statistic:                    966.69
Entities:                        7407   P-value                         0.0000
Avg Obs:                       8.2114   Distribution:               F(1,53414)
Min Obs:                       1.0000
Max Obs:                       17.000   F-statistic (robust):           509.43
                                        P-value                         0.0000
Time periods:                      17   Distribution:               F(1,53414)
Avg Obs:                       3577.8
Min Obs:                       3156.0
Max Obs:                       4437.0

Parameter Estimates

            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
const          0.1768     0.0074     23.806     0.0000      0.1622      0.1913
lag_lnme      -0.0276     0.0012    -22.571     0.0000     -0.0300     -0.0252

F-test for Poolability: 1.6684
P-value: 0.0000
Distribution: F(7406,53414)

Included effects: Entity

Attention

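The linearmodels entity-clustered standard errors (cov_type='clustered', cluster_entity=True) are the same as those of Stata's `xtreg, fe robust`, because Stata's `robust` option clusters by the panel variable in a fixed-effects regression.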
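To make the difference between the robust and the clustered estimator concrete, here is a rough numpy sketch of a cluster-robust covariance for plain OLS on simulated data. The helper `ols_cluster_cov` is hypothetical and written for illustration, it omits the small-sample corrections that linearmodels and Stata may apply, and for a fixed-effects model the same formula would be applied to the demeaned data.

```python
import numpy as np

def ols_cluster_cov(X, y, groups):
    """Textbook cluster-robust covariance, without small-sample corrections."""
    XtX_inv = np.linalg.inv(X.T @ X)
    u = y - X @ (XtX_inv @ X.T @ y)             # OLS residuals
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        s = X[groups == g].T @ u[groups == g]   # score summed within cluster g
        meat += np.outer(s, s)
    return XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(3)
n_firms, n_years = 40, 5
firm = np.repeat(np.arange(n_firms), n_years)
x = rng.normal(size=firm.size)
# Errors share a firm-level component, so they are correlated within firm
y = 1.0 + 0.5 * x + rng.normal(size=n_firms)[firm] + rng.normal(size=firm.size)
X = np.column_stack([np.ones(firm.size), x])

se_cluster = np.sqrt(np.diag(ols_cluster_cov(X, y, firm)))
# With every observation in its own cluster, the formula collapses to HC0
se_hc0 = np.sqrt(np.diag(ols_cluster_cov(X, y, np.arange(firm.size))))
print(se_cluster, se_hc0)
```

Clustering by firm lets residuals of the same firm be correlated across years, whereas the unclustered robust estimator treats every firm-year as independent.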

Stata

  • Import data

local data_url "https://www.dropbox.com/s/uso1u9asqam7rp1/merged.dta?dl=1"
use "`data_url'", clear
  • Prepare regression data

rename ret ret_str
gen ret = real(ret_str)
drop ret_str

gen year = int(yyyymm/100)

duplicates drop cusip year, force

gen lnme = ln(me)

sort cusip year
by cusip: gen lag_lnme = lnme[_n-1]
by cusip: gen lag_year = year[_n-1]

gen year_diff = year - lag_year

replace lag_lnme = . if year_diff!=1
  • OLS regression

reg ret lag_lnme

reg ret lag_lnme, robust
  • Fitted values and residuals

gen a = _b[_cons]
gen b = _b[lag_lnme]

gen y_calc = a + b*lag_lnme
predict y_est

gen e_calc = ret - y_calc
predict e_est, resid
  • Panel regression

encode cusip, gen(stock_id)
xtset stock_id year

xtreg ret lag_lnme, fe
  • Panel regression with robust standard error

xtreg ret lag_lnme, fe robust