# Class02 ## CRSP overview - Data: security price (and many other trading related data) - Coverage: US stock exchanges - Time period: 1925 - present - Frequency: daily and monthly ### Web access Downlaod data from CRSP via WRDS web access. - Dataset: CRSP monthly stock file - Time period: Jan 2011 to Dec 2020 - Search the entire database - Variables - shrcd - exchcd - siccd - prc - ret - shrout - cfacpr - permco - cusip - ncusip - Output format - Tab-delimited text - Uncompressed - Date format: YYMMDDn8 ### Identifiers

PERMNO: CRSP permanent security level identifier; They are permanent and never change
PERMCO: CRSP permanent firm level identifier; One PERMCO might have different PERMNO because a company can issue different securities. For example, Alphabet (Google) has class A and class C shares.; They are permanent and never change
CUSIP: Current 8-digit CUSIP (security level); They can be changed. And we can use NCUSIP to track the change history.
NCUSIP: Historical 8-digit CUSIP (security level)

```{note} TICKER and company name (COMNAM) are also available identifiers in CRSP. However, we usually do not use them to match stocks/firms unless we have no other widely used identifiers. ``` ### Stock exchange code

1: NYSE
2: AMEX
3: NASDAQ

Chinco, Neuhierl and Weber. (2021). Journal of Financial Economics.

Lee and Swaminathan. (2000). Journal of Finance.

See other stock exchange codes here ### Share code

10 or 11: Common stocks

Antoniou, Doukas and Subrahmanyam. (2016). Management Science.

Han, Huang, Huang and Zhou. (forthcoming). Journal of Financial Economics.

See other share codes here ### Stock price

PRC: Stock price (unadjusted); Some stock prices have negative sign (-); Need to take absolute value of price

- $\text{adjusted price} = \frac{\left|prc\right|}{cfacpr}$ - It is important to adjust stock price when evaluating stock performance Take the example of Microsoft stock price. The vertical red lines indicate corporate actions (e.g. stock splits, dividends). Clearly, the stock return is not correct if unadjusted price is applied. ![](https://raw.githubusercontent.com/mk0417/image-lib/master/adjprc.png) ```{note} RET (stock return) in CRSP is already adjusted. You can download it and there is no need to do any adjustment. ``` ### Market value $\text{market value (millions)} = \frac{|prc| \times shrout}{1000}$

SHROUT: number of shares outstanding (in 1,000 shares)

### Volume

VOL: Number of traded shares; 1 share in Daily Stock File; 100 shares in Monthly Stock File

## Python - Import packages ```{jupyter-execute} import pandas as pd import numpy as np ``` - Import CRSP raw data - Import data from local directory ```{code-block} python # Please make sure to change the path to where you save your data file_path = '/Users/ml/Dropbox/teaching/data/crsp_month_raw.txt' crsp_raw = pd.read_csv(file_path, sep='\t', low_memory=False) # View first 5 rows # crsp_raw.head() ```` - Import data from a url ```{jupyter-execute} # This url contains sample data from CRSP url = 'https://www.dropbox.com/s/6mk86g97uji2f80/crsp_month_raw.txt?dl=1' crsp_raw = pd.read_csv(url, sep='\t', low_memory=False) # View first 5 rows and first 5 columns crsp_raw.iloc[:5, :5] ```` **API reference**: pandas.read_csv
```{attention} **Comments** - Any lines starting with `#` will be treated as comments and Python will not execute the lines. - This is useful when you need to write some comments or notes **Folder path and operating system** - Use **pathlib** library is a better solution if you want to run your codes without problem on different operating systems. - Import the package: `from pathlib import Path` - MacOS (Linux) - Forward slash: **/folder/file.txt** - Windows - Backslash: **C:\folder\file.txt** - Please use double slashes when typing path: `C:\\folder\\file.txt` - Check the folder path tips ``` - Data dimension ```{jupyter-execute} # Observations len(crsp_raw) ``` ```{jupyter-execute} # Number of rows and columns (variables) crsp_raw.shape ``` - Data type ```{jupyter-execute} # List all variables in the dataset crsp_raw.columns ``` ```{jupyter-execute} # Data type of variables crsp_raw.dtypes crsp_raw.info() ``` ```{jupyter-execute} # Rename uppercase to lowercase # Typing lowercase is easier than uppercase crsp_raw.columns = crsp_raw.columns.str.lower() # Convert data type of date to date format # So that we can apply date functions when manipulating dates crsp_raw['date'] = pd.to_datetime(crsp_raw['date'], format='%Y%m%d') # Convert data type of ret to numerical value # Stock return should be numerical values rather than string # For example, # string: '1' + '2' = '12' # numerical: 1 + 2 = 3 crsp_raw['ret'] = pd.to_numeric(crsp_raw['ret'], errors='coerce') # Convert siccd to numerical value crsp_raw['siccd'] = pd.to_numeric(crsp_raw['siccd'], errors='coerce') crsp_raw.info() ``` - Stock exchanges and common shares ```{jupyter-execute} crsp = crsp_raw.copy() # Keep NYSE/AMEX/NASDAQ crsp = crsp[crsp['exchcd'].isin([1, 2, 3])] # Keep common shares crsp = crsp[crsp['shrcd'].isin([10, 11])] # Convert shrcd and exchcd to int crsp[['exchcd', 'shrcd']] = crsp[['exchcd', 'shrcd']].astype(int) len(crsp) ``` - Drop duplicates It is a good practice to check if there are duplicated observations. ```{jupyter-execute} crsp = crsp.drop_duplicates(['permno', 'date']) len(crsp) ``` **API reference**: pandas.dataframe.drop_duplicates
```{tip} Categorical data is a good way to reduce the memory usage especially when the data eats up too much your computer memory. ``` ```{jupyter-execute} temp = crsp.copy() # Convert shrcd and exchcd to category temp[['exchcd', 'shrcd']] = temp[['exchcd', 'shrcd']].astype('category') temp.info() ``` **!!!** 38M vs. 80M - Save data ```{code-block} python # Remember to change the file path on your machine outpath = '/Users/ml/Dropbox/teaching/data/crsp_month.txt' crsp.to_csv(outpath, sep='\t', index=False) ``` **API reference**: pandas.dataframe.to_csv - Read clean data ```{jupyter-execute} # If you want to import from local # file_path = '/Users/ml/Dropbox/teaching/data/crsp_month.txt' # crsp = pd.read_csv(file_path, sep='\t', parse_dates=['date']) # Import from url url = 'https://www.dropbox.com/s/0nuxwo3cf7vfcy3/crsp_month.txt?dl=1' crsp = pd.read_csv(url, sep='\t', parse_dates=['date']) ``` - Generate new variables ```{jupyter-execute} # Market value crsp['price'] = crsp['prc'].abs() crsp['me'] = (crsp['price']*crsp['shrout']) / 1000 crsp['lnme'] = np.log(crsp['me']) # Adjusted price crsp.loc[crsp['cfacpr']>0, 'adjprc'] = crsp['price'] / crsp['cfacpr'] # Holding period returns crsp['yyyymm'] = crsp['date'].dt.year*100 + crsp['date'].dt.month # Monthly index # From 1 to n # For example, month index will be from 1 to 120 if we have 120 months data month_idx = crsp.drop_duplicates('yyyymm')[['yyyymm']].copy() month_idx = month_idx.sort_values('yyyymm', ignore_index=True) month_idx['midx'] = month_idx.index + 1 crsp = crsp.merge(month_idx, how='left', on='yyyymm') # Past 6-month returns crsp['logret'] = np.log(crsp['ret']+1) crsp = crsp.sort_values(['permno', 'yyyymm'], ignore_index=True) crsp['hpr'] = (crsp.groupby('permno')['logret'] .rolling(window=6, min_periods=6).sum().reset_index(drop=True)) crsp['hpr'] = np.exp(crsp['hpr']) - 1 crsp['midx_lag'] = crsp.groupby('permno')['midx'].shift(5) crsp['gap'] = crsp['midx'] - crsp['midx_lag'] temp1 = crsp.query('permno==10028 & 201107<=yyyymm<=201212').copy() # Replace it by missing if there is month gap. crsp.loc[crsp['gap']!=5, 'hpr'] = np.nan temp2 = crsp.query('permno==10028 & 201107<=yyyymm<=201212').copy() ``` **API reference**: pandas.dataframe.loc **API reference**: pandas.dataframe.query
```{attention} Pay attention to month gaps when we use information over multiple periods. Your calculation might be right but data might not be always perfect. Do we have past 6-month data when we compute past 6-month returns? ``` ```{jupyter-execute} temp1[['permno', 'yyyymm', 'midx', 'midx_lag', 'gap', 'ret', 'hpr']] ``` ```{jupyter-execute} temp2[['permno', 'yyyymm', 'midx', 'midx_lag', 'gap', 'ret', 'hpr']] ``` Alternative way is to fill the gaps and assume the returns during the gaps are 0. This will keep more valid cumulative returns. - Summary statistics ```{jupyter-execute} crsp[['ret', 'lnme']].describe() ``` ```{jupyter-execute} # Percentiles crsp[['ret', 'lnme']].describe(percentiles=[0.1, 0.9]) ``` ```{jupyter-execute} # Summary statistics by year crsp['year'] = crsp['date'].dt.year round(crsp.groupby('year')[['ret', 'lnme']].describe() .loc[:, (slice(None), ['mean', '50%', 'std'])], 4) ``` ```{jupyter-execute} # Summary statistics by stock exchange round(crsp.groupby('exchcd')[['ret', 'lnme']].describe() .loc[:, (slice(None), ['mean', '50%', 'std'])], 4) ``` ```{jupyter-execute} # Summary statistics in subsamples print('Before 2015') print(crsp.query('year<=2015')[['ret', 'lnme']].describe()) print('\nAfter 2015') print(crsp.query('year>=2016')[['ret', 'lnme']].describe()) ``` **API reference**: pandas.dataframe.describe ## Stata - Import CRSP raw data - Import data from a url ```{code-block} stata local repo_url "https://raw.githubusercontent.com/mk0417/financial-database" local data_url "/master/data/crsp_demo.txt" local url = "`repo_url'" + "`data_url'" import delimited "`url'", clear ``` - Import data from local directory ```{code-block} stata clear set more off cd "/Users/ml/Dropbox/teaching/data/" import delimited crsp_month_raw.txt, clear // list first five observations l permno date shrcd exchcd ret in 1/5 ``` ```{code-block} stata-output +------------------------------------------------+ | permno date shrcd exchcd ret | |------------------------------------------------| 1. | 10026 20201231 11 3 0.072598 | 2. | 10028 20201231 11 2 0.125541 | 3. | 10032 20201231 11 3 0.046848 | 4. | 10044 20201231 11 3 -0.051522 | 5. | 10051 20201231 11 1 -0.030851 | +------------------------------------------------+ ``` - Data dimension ```{code-block} stata // Number of observations count di _N ``` - Data type ```{code-block} stata describe ``` ```{code-block} stata-output Contains data obs: 870,692 vars: 12 ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- permno long %12.0g PERMNO date long %12.0g shrcd byte %8.0g SHRCD exchcd byte %8.0g EXCHCD siccd str4 %9s SICCD ncusip str8 %9s NCUSIP permco long %12.0g PERMCO cusip str8 %9s CUSIP prc float %9.0g PRC ret str9 %9s RET shrout long %12.0g SHROUT cfacpr float %9.0g CFACPR ------------------------------------------------------------------------------- Sorted by: Note: Dataset has changed since last saved. ``` ```{code-block} stata // Convert date to date format tostring date, replace gen date1 = date(date, "YMD") l permno date date1 ret in 1/5 ``` ```{code-block} stata-output +-------------------------------------------+ | permno date date1 ret | |-------------------------------------------| 1. | 10001 20110131 31jan2011 0.028992 | 2. | 10001 20110228 28feb2011 0.022727 | 3. | 10001 20110331 31mar2011 0.072404 | 4. | 10001 20110429 29apr2011 -0.038789 | 5. | 10001 20110531 31may2011 0.028050 | +-------------------------------------------+ ``` ```{code-block} stata // Convert ret to numerical gen ret1 = real(ret) drop ret rename ret1 ret // Convert siccd to numerical gen siccd1 = real(siccd) drop siccd rename siccd1 siccd describe ``` ```{code-block} stata-output Contains data obs: 870,692 vars: 13 ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- permno long %12.0g PERMNO date str8 %9s shrcd byte %8.0g SHRCD exchcd byte %8.0g EXCHCD ncusip str8 %9s NCUSIP permco long %12.0g PERMCO cusip str8 %9s CUSIP prc float %9.0g PRC shrout long %12.0g SHROUT cfacpr float %9.0g CFACPR date1 float %td ret float %9.0g siccd float %9.0g ------------------------------------------------------------------------------- Sorted by: Note: Dataset has changed since last saved. ``` - Stock exchanges and common shares ```{code-block} stata // Keep NYSE/AMEX/NASDAQ keep if inlist(exchcd, 1, 2, 3) * Equivalently, * keep if exchcd==1 | exchcd==2 | exchcd==3 // Keep common shares keep if inlist(shrcd, 10, 11) * Equivalently * keep if shrcd==10 | shrcd==11 ``` - Drop duplicates ```{code-block} stata duplicates drop permno date, force ``` - Save data ```{code-block} stata save crsp_month, replace ``` - Read clean data ```{code-block} stata use crsp_month, clear ``` - Generate new variables ```{code-block} stata // Market value gen price = abs(prc) gen me = (price*shrout) / 1000 gen lnme = ln(me) // Adjusted price gen adjprc = price / cfacpr // Holding period return gen yyyymm = mofd(date1) format yyyymm %tm gen logret = ln(1+ret) sort permno yyyymm forvalues i=1/5 { by permno: gen l`i' = logret[_n-`i'] } gen hpr = logret+l1+l2+l3+l4+l5 by permno: gen yyyymm_lag = yyyymm[_n-5] format yyyymm_lag %tm gen gap = yyyymm - yyyymm_lag replace hpr = . if gap!=5 replace hpr = exp(hpr) - 1 ``` - Summary statistics ```{code-block} stata summ ret lnme summ ret lnme, d // Summary statistics by year gen year = year(date1) tabstat ret lnme, by(year) stat(mean p50 sd) long tabstat ret lnme, by(year) stat(mean p50 sd) long nototal tabstat ret lnme, by(year) stat(mean p50 sd) long nototal col(stat) // Summary statistics by stock exchange tabstat ret lnme, by(exchcd) stat(mean p50 sd) long nototal col(stat) // Summary statistics in subsamples summ ret lnme if year<=2015 summ ret lnme if year>2015 ```