# Class03 ## Compustat North America overview - Data: accounting and price data - Coverage: US and Canada - Time period: 1950 - present (Fundamentals Annual) - Frequency - Accounting data: yearly and quarterly - Price data: daily and monthly - Dataset structure - Fundamentals Annual - Fundamentals Quarterly - Security Daily - Security Monthly ### Web access Downlaod data from Compustat North America via WRDS web access. - Dataset: Compustat North Ameria - Fundamentals Annual - Time period: Jan 2000 to Dec 2020 - Search the entire database - Screening variables - Consolidation Level: **C** - Industry Format: **INDL** - Data Format: **STD** - Population Source: **D** - Currency: **USD** - Company Status: **Active (A)** and **Inactive (I)** - Variables - gvkey - cusip - exchg - sic - fyear - at - seq - ebit - Output format - Tab-delimited text - Uncompressed - Date format: YYMMDDn8 ### Industry Format
INDFMT
FS - financial services (includes banks, insurance companies, broker/dealers, real estate and other financial services)
INDL - other than financial services
|gvkey |fyear |indfmt |at |capr1 | |-------|------|-------|----------|------| |007647 |2020 |FS |2819627.0 |13.5 | |007647 |2020 |INDL |2819627.0 | | ### Identifiers
GVKEY
Compustat permanent firm level identifier
They are permanent and never change
CUSIP
Current 9-digit CUSIP (security level)
They can be changed.
CIK
Central Index Key used for SEC filings
```{attention} **Leading zeros of GVKEY** **GVKEY** is 6-digit character in string fromat. However, **GVKEY** contains numbers only and it will be recoganised as numerical values in some software by default. Therefore, the leading zeros will be dropped. For example, GVKEY 001050 might become 1050. This will cause matching problem if one dataset has 6-digit GVKEY and another one has GVKEY without leading zero. ``` ### Datadate vs fiscal year - Datadate is the date of fiscal year end - Fiscal year (fyear) |Calendar Year |Fiscal Year-end Month |Fiscal Year| |--------------|----------------------|-----------| |2019 |January |2018 | |2019 |February |2018 | |2019 |March |2018 | |2019 |April |2018 | |2019 |May |2018 | |2019 |June |2019 | |2019 |July |2019 | |2019 |August |2019 | |2019 |September |2019 | |2019 |October |2019 | |2019 |November |2019 | |2019 |December |2019 | ```{attention} Keep in mind that we do not know the accounting information on **datadate**. The accounting numbers are public avaiable after companies release their financial reports. ```
Fama and French. (1992). Journal of Finance.
Source: http://www.sec.gov/about/forms/form10-k.pdf
Source: https://www.investor.gov/introduction-investing/investing-basics/glossary/form-10-k
### Stock exchange code
EXCHG
11
NYSE
12
AMEX
14
NASDAQ
0
Subsidiary/Private
19
OTC
See other stock exchange codes here ## Python - Import packages ```{jupyter-execute} import pandas as pd import numpy as np ```` - Import Compustat North America raw data ```{jupyter-execute} url = 'https://www.dropbox.com/s/qro99n7ngq0aztr/funda_raw.txt?dl=1' funda_raw = pd.read_csv(url, sep='\t', low_memory=False) # View first 5 rows and first 5 columns funda_raw.iloc[:5, :5] ``` ```{jupyter-execute} # Data types funda_raw.info() ``` ```{jupyter-execute} # Read the data again with consideration of the data type of gvkey funda_raw = pd.read_csv(url, sep='\t', low_memory=False, dtype={'gvkey': str}) # View first 5 rows and first 5 columns funda_raw.iloc[:5, :5] ``` ```{jupyter-execute} # Data types funda_raw.info() ``` - Clean dataset ```{jupyter-execute} funda_raw = funda_raw.sort_values(['gvkey', 'fyear', 'datadate'], ignore_index=True) print(f'Number of obs: {len(funda_raw)}') # Keep the most recent one if there are duplicates funda = funda_raw.drop_duplicates(['gvkey', 'fyear'], keep='last').copy() print(f'Number of obs after removing duplicates: {len(funda)}') # Keep main stock exchanges funda = funda[funda['exchg'].isin([11, 12, 14])].copy() print(f'Number of obs after keeping main stock exchanges: {len(funda)}') # Drop unnecessary variables drop_vars = ['indfmt', 'consol', 'popsrc', 'datafmt', 'costat', 'curcd'] funda = funda.drop(columns=drop_vars) # Drop if total asset and equity are not positive funda = funda.query('at>0 & seq>0').copy() print(f'Number of obs after requiring positive asset: {len(funda)}') ``` **API reference**: pandas.dataframe.sort_values - Firm-year obs in each stock exchange ```{jupyter-execute} funda['exchg'].value_counts() ``` - Seasonality of fiscal year end ```{jupyter-execute} funda['date'] = pd.to_datetime(funda['datadate'], format='%Y%m%d') funda['month'] = funda['date'].dt.month funda['month'].value_counts() ``` - ROA and ROE ```{jupyter-execute} funda['roa'] = funda['ebit'] / funda['at'] funda['roe'] = funda['ebit'] / funda['seq'] ``` - Asset growth - Generate lagged variables and calculate growth rate ```{jupyter-execute} funda = funda.sort_values(['gvkey', 'fyear'], ignore_index=True) funda['lag_at'] = funda.groupby('gvkey')['at'].shift(1) funda['lag_fyear'] = funda.groupby('gvkey')['fyear'].shift(1) funda['ag'] = funda['at'] / funda['lag_at'] - 1 ``` - Check year gap ```{jupyter-execute} funda['fyear_diff'] = funda['fyear'] - funda['lag_fyear'] funda.query('fyear_diff>1')[['gvkey', 'fyear', 'lag_fyear', 'fyear_diff', 'ag']].head() ``` - Deal with gap ```{jupyter-execute} funda.loc[funda['fyear_diff']>1, 'ag'] = np.nan ``` - Summary statistics ```{jupyter-execute} round(funda[['roa', 'roe', 'ag']].describe(), 3) ``` - Deal with outliers ```{jupyter-execute} # Winsorising at 1% and 99% percentile for i in ['roa', 'roe', 'ag']: funda['p1'] = funda.groupby('fyear')[i].transform(lambda x: x.quantile(0.01)) funda['p99'] = funda.groupby('fyear')[i].transform(lambda x: x.quantile(0.99)) funda.loc[funda[i]funda['p99'], i] = funda['p99'] funda = funda.drop(columns=['p1', 'p99']) round(funda[['roa', 'roe', 'ag']].describe(), 3) ``` ## Stata - Import data from a url ```{code-block} stata local data_url "https://www.dropbox.com/s/qro99n7ngq0aztr/funda_raw.txt?dl=1" import delimited "`data_url'", clear ``` - List first five observations ```{code-block} stata l gvkey datadate at seq ebit in 1/5 ``` - Data types ```{code-block} stata describe ``` - Specify data type when importing data ```{code-block} stata local data_url "https://www.dropbox.com/s/qro99n7ngq0aztr/funda_raw.txt?dl=1" import delimited "`data_url'", stringcols(1) clear describe l gvkey datadate at seq ebit in 1/5 ``` - Keep the most recent one if there are duplicates ```{code-block} stata sort gvkey fyear datadate by gvkey fyear: gen id = _n by gvkey fyear: gen n = _N keep if id == n ``` - Keep main stock exchanges ```{code-block} stata keep if exchg==11 | exchg==12 | exchg==14 ``` - Drop unnecessary variables ```{code-block} stata drop indfmt consol popsrc datafmt costat curcd ``` - Drop if total asset and equity are not positive ```{code-block} stata // Pay attention: // missing value (.) is larger than any other values // at>0 does not remove missing values // seq>0 does not remove missing values // This is different from Python keep if at>0 & seq>0 & at!=. & seq!=. ``` - Firm-year observations in each stock exchange ```{code-block} stata tab exchg ``` - Seasonality of fiscal year end ```{code-block} stata tostring datadate, replace gen date = date(datadate, "YMD") format date %td gen month = month(date) tab month ``` - ROA and ROE ```{code-block} stata gen roa = ebit / at gen roe = ebit / seq ``` - Asset growth ```{code-block} stata sort gvkey fyear // Generate lagged variables and calculate growth rate by gvkey: gen lag_at = at[_n-1] by gvkey: gen lag_fyear = fyear[_n-1] gen ag = at / lag_at - 1 // Check year gap gen fyear_diff = fyear - lag_fyear // Deal with gap replace ag = . if fyear_diff>1 ``` - Summary statistics ```{code-block} stata summ roa roe ag ``` - Deal with outliers ```{code-block} stata foreach i of varlist roa roe ag { bysort fyear: egen p1 = pctile(`i'), p(1) bysort fyear: egen p99 = pctile(`i'), p(99) replace `i' = p1 if `i' < p1 replace `i' = p99 if `i' > p99 drop p1 p99 } sort gvkey fyear summ roa roe ag ```