Class02

CRSP overview

  • Data: security price (and many other trading related data)

  • Coverage: US stock exchanges

  • Time period: 1925 - present

  • Frequency: daily and monthly

Web access

Downlaod data from CRSP via WRDS web access.

  • Dataset: CRSP monthly stock file

  • Time period: Jan 2011 to Dec 2020

  • Search the entire database

  • Variables

    • shrcd

    • exchcd

    • siccd

    • prc

    • ret

    • shrout

    • cfacpr

    • permco

    • cusip

    • ncusip

  • Output format

    • Tab-delimited text

    • Uncompressed

    • Date format: YYMMDDn8

Identifiers

PERMNO
CRSP permanent security level identifier
They are permanent and never change
PERMCO
CRSP permanent firm level identifier
One PERMCO might have different PERMNO because a company can issue different securities. For example, Alphabet (Google) has class A and class C shares.
They are permanent and never change
CUSIP
Current 8-digit CUSIP (security level)
They can be changed. And we can use NCUSIP to track the change history.
NCUSIP
Historical 8-digit CUSIP (security level)

Note

TICKER and company name (COMNAM) are also available identifiers in CRSP. However, we usually do not use them to match stocks/firms unless we have no other widely used identifiers.

Stock exchange code

EXCHCD
1
NYSE
2
AMEX
3
NASDAQ
Chinco, Neuhierl and Weber. (2021). Journal of Financial Economics.
Lee and Swaminathan. (2000). Journal of Finance.

See other stock exchange codes here

Share code

SHRCD
10 or 11
Common stocks
Antoniou, Doukas and Subrahmanyam. (2016). Management Science.
Han, Huang, Huang and Zhou. (forthcoming). Journal of Financial Economics.

See other share codes here

Stock price

PRC
Stock price (unadjusted)
Some stock prices have negative sign (-)
Need to take absolute value of price
  • $\text{adjusted price} = \frac{\left|prc\right|}{cfacpr}$

  • It is important to adjust stock price when evaluating stock performance

Take the example of Microsoft stock price. The vertical red lines indicate corporate actions (e.g. stock splits, dividends). Clearly, the stock return is not correct if unadjusted price is applied.

Note

RET (stock return) in CRSP is already adjusted. You can download it and there is no need to do any adjustment.

Market value

$\text{market value (millions)} = \frac{|prc| \times shrout}{1000}$

SHROUT
number of shares outstanding (in 1,000 shares)

Volume

VOL
Number of traded shares
1 share in Daily Stock File
100 shares in Monthly Stock File

Python

  • Import packages

import pandas as pd
import numpy as np
  • Import CRSP raw data

    • Import data from local directory

    # Please make sure to change the path to where you save your data
    file_path = '/Users/ml/Dropbox/teaching/data/crsp_month_raw.txt'
    crsp_raw = pd.read_csv(file_path, sep='\t', low_memory=False)
    
    # View first 5 rows
    # crsp_raw.head()
    
    • Import data from a url

    # This url contains sample data from CRSP
    url = 'https://www.dropbox.com/s/6mk86g97uji2f80/crsp_month_raw.txt?dl=1'
    crsp_raw = pd.read_csv(url, sep='\t', low_memory=False)
    
    # View first 5 rows and first 5 columns
    crsp_raw.iloc[:5, :5]
    
    PERMNO date SHRCD EXCHCD SICCD
    0 10001 20110131 11.0 2.0 4925
    1 10001 20110228 11.0 2.0 4925
    2 10001 20110331 11.0 2.0 4925
    3 10001 20110429 11.0 2.0 4925
    4 10001 20110531 11.0 2.0 4925

    API reference: pandas.read_csv


    Attention

    Comments

    • Any lines starting with # will be treated as comments and Python will not execute the lines.

      • This is useful when you need to write some comments or notes

    Folder path and operating system

    • Use pathlib library is a better solution if you want to run your codes without problem on different operating systems.

      • Import the package: from pathlib import Path

    • MacOS (Linux)

      • Forward slash: /folder/file.txt

    • Windows

      • Backslash: C:\folder\file.txt

      • Please use double slashes when typing path: C:\\folder\\file.txt

    • Check the folder path tips

  • Data dimension

# Observations
len(crsp_raw)
870692
# Number of rows and columns (variables)
crsp_raw.shape
(870692, 12)
  • Data type

# List all variables in the dataset
crsp_raw.columns
Index(['PERMNO', 'date', 'SHRCD', 'EXCHCD', 'SICCD', 'NCUSIP', 'PERMCO',
       'CUSIP', 'PRC', 'RET', 'SHROUT', 'CFACPR'],
      dtype='object')
# Data type of variables
crsp_raw.dtypes

crsp_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 870692 entries, 0 to 870691
Data columns (total 12 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   PERMNO  870692 non-null  int64  
 1   date    870692 non-null  int64  
 2   SHRCD   865287 non-null  float64
 3   EXCHCD  865287 non-null  float64
 4   SICCD   865287 non-null  object 
 5   NCUSIP  865287 non-null  object 
 6   PERMCO  870692 non-null  int64  
 7   CUSIP   870692 non-null  object 
 8   PRC     855391 non-null  float64
 9   RET     860572 non-null  object 
 10  SHROUT  864704 non-null  float64
 11  CFACPR  864704 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 79.7+ MB
# Rename uppercase to lowercase
# Typing lowercase is easier than uppercase
crsp_raw.columns = crsp_raw.columns.str.lower()

# Convert data type of date to date format
# So that we can apply date functions when manipulating dates
crsp_raw['date'] = pd.to_datetime(crsp_raw['date'], format='%Y%m%d')

# Convert data type of ret to numerical value
# Stock return should be numerical values rather than string
# For example,
#  string: '1' + '2' = '12'
#  numerical: 1 + 2 = 3
crsp_raw['ret'] = pd.to_numeric(crsp_raw['ret'], errors='coerce')

# Convert siccd to numerical value
crsp_raw['siccd'] = pd.to_numeric(crsp_raw['siccd'], errors='coerce')

crsp_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 870692 entries, 0 to 870691
Data columns (total 12 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   permno  870692 non-null  int64         
 1   date    870692 non-null  datetime64[ns]
 2   shrcd   865287 non-null  float64       
 3   exchcd  865287 non-null  float64       
 4   siccd   865153 non-null  float64       
 5   ncusip  865287 non-null  object        
 6   permco  870692 non-null  int64         
 7   cusip   870692 non-null  object        
 8   prc     855391 non-null  float64       
 9   ret     849541 non-null  float64       
 10  shrout  864704 non-null  float64       
 11  cfacpr  864704 non-null  float64       
dtypes: datetime64[ns](1), float64(7), int64(2), object(2)
memory usage: 79.7+ MB
  • Stock exchanges and common shares

crsp = crsp_raw.copy()

# Keep NYSE/AMEX/NASDAQ
crsp = crsp[crsp['exchcd'].isin([1, 2, 3])]

# Keep common shares
crsp = crsp[crsp['shrcd'].isin([10, 11])]

# Convert shrcd and exchcd to int
crsp[['exchcd', 'shrcd']] = crsp[['exchcd', 'shrcd']].astype(int)

len(crsp)
444358
  • Drop duplicates It is a good practice to check if there are duplicated observations.

crsp = crsp.drop_duplicates(['permno', 'date'])
len(crsp)
444358

API reference: pandas.dataframe.drop_duplicates


Tip

Categorical data is a good way to reduce the memory usage especially when the data eats up too much your computer memory.

temp = crsp.copy()
# Convert shrcd and exchcd to category
temp[['exchcd', 'shrcd']] = temp[['exchcd', 'shrcd']].astype('category')
temp.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 444358 entries, 0 to 870691
Data columns (total 12 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   permno  444358 non-null  int64         
 1   date    444358 non-null  datetime64[ns]
 2   shrcd   444358 non-null  category      
 3   exchcd  444358 non-null  category      
 4   siccd   444256 non-null  float64       
 5   ncusip  444358 non-null  object        
 6   permco  444358 non-null  int64         
 7   cusip   444358 non-null  object        
 8   prc     441959 non-null  float64       
 9   ret     439721 non-null  float64       
 10  shrout  444036 non-null  float64       
 11  cfacpr  444036 non-null  float64       
dtypes: category(2), datetime64[ns](1), float64(5), int64(2), object(2)
memory usage: 38.1+ MB

!!! 38M vs. 80M

  • Save data

# Remember to change the file path on your machine
outpath = '/Users/ml/Dropbox/teaching/data/crsp_month.txt'
crsp.to_csv(outpath, sep='\t', index=False)

API reference: pandas.dataframe.to_csv

  • Read clean data

# If you want to import from local
# file_path = '/Users/ml/Dropbox/teaching/data/crsp_month.txt'
# crsp = pd.read_csv(file_path, sep='\t', parse_dates=['date'])

# Import from url
url = 'https://www.dropbox.com/s/0nuxwo3cf7vfcy3/crsp_month.txt?dl=1'
crsp = pd.read_csv(url, sep='\t', parse_dates=['date'])
  • Generate new variables

# Market value
crsp['price'] = crsp['prc'].abs()
crsp['me'] = (crsp['price']*crsp['shrout']) / 1000
crsp['lnme'] = np.log(crsp['me'])

# Adjusted price
crsp.loc[crsp['cfacpr']>0, 'adjprc'] = crsp['price'] / crsp['cfacpr']

# Holding period returns
crsp['yyyymm'] = crsp['date'].dt.year*100 + crsp['date'].dt.month

# Monthly index
# From 1 to n
# For example, month index will be from 1 to 120 if we have 120 months data
month_idx = crsp.drop_duplicates('yyyymm')[['yyyymm']].copy()
month_idx = month_idx.sort_values('yyyymm', ignore_index=True)
month_idx['midx'] = month_idx.index + 1

crsp = crsp.merge(month_idx, how='left', on='yyyymm')

# Past 6-month returns
crsp['logret'] = np.log(crsp['ret']+1)
crsp = crsp.sort_values(['permno', 'yyyymm'], ignore_index=True)
crsp['hpr'] = (crsp.groupby('permno')['logret']
  .rolling(window=6, min_periods=6).sum().reset_index(drop=True))
crsp['hpr'] = np.exp(crsp['hpr']) - 1

crsp['midx_lag'] = crsp.groupby('permno')['midx'].shift(5)
crsp['gap'] = crsp['midx'] - crsp['midx_lag']

temp1 = crsp.query('permno==10028 & 201107<=yyyymm<=201212').copy()

# Replace it by missing if there is month gap.
crsp.loc[crsp['gap']!=5, 'hpr'] = np.nan

temp2 = crsp.query('permno==10028 & 201107<=yyyymm<=201212').copy()

API reference: pandas.dataframe.loc

API reference: pandas.dataframe.query


Attention

Pay attention to month gaps when we use information over multiple periods. Your calculation might be right but data might not be always perfect. Do we have past 6-month data when we compute past 6-month returns?

temp1[['permno', 'yyyymm', 'midx', 'midx_lag', 'gap', 'ret', 'hpr']]
permno yyyymm midx midx_lag gap ret hpr
305 10028 201107 7 2.0 5.0 0.247887 1.004524
306 10028 201108 8 3.0 5.0 -0.032731 1.056241
307 10028 201109 9 4.0 5.0 -0.054842 0.588236
308 10028 201110 10 5.0 5.0 0.029630 0.437932
309 10028 201111 11 6.0 5.0 -0.070743 0.347828
310 10028 201112 12 7.0 5.0 -0.036129 0.052114
311 10028 201201 13 8.0 5.0 -0.026774 -0.179457
312 10028 201202 14 9.0 5.0 0.081155 -0.082847
313 10028 201203 15 10.0 5.0 -0.048346 -0.076543
314 10028 201211 23 11.0 12.0 -0.246524 -0.324221
315 10028 201212 24 12.0 12.0 -0.036551 -0.299355
temp2[['permno', 'yyyymm', 'midx', 'midx_lag', 'gap', 'ret', 'hpr']]
permno yyyymm midx midx_lag gap ret hpr
305 10028 201107 7 2.0 5.0 0.247887 1.004524
306 10028 201108 8 3.0 5.0 -0.032731 1.056241
307 10028 201109 9 4.0 5.0 -0.054842 0.588236
308 10028 201110 10 5.0 5.0 0.029630 0.437932
309 10028 201111 11 6.0 5.0 -0.070743 0.347828
310 10028 201112 12 7.0 5.0 -0.036129 0.052114
311 10028 201201 13 8.0 5.0 -0.026774 -0.179457
312 10028 201202 14 9.0 5.0 0.081155 -0.082847
313 10028 201203 15 10.0 5.0 -0.048346 -0.076543
314 10028 201211 23 11.0 12.0 -0.246524 NaN
315 10028 201212 24 12.0 12.0 -0.036551 NaN

Alternative way is to fill the gaps and assume the returns during the gaps are 0. This will keep more valid cumulative returns.

  • Summary statistics

crsp[['ret', 'lnme']].describe()
ret lnme
count 439721.000000 441959.000000
mean 0.011124 6.421212
std 0.177839 2.155586
min -0.993600 -2.426619
25% -0.058187 4.866619
50% 0.004907 6.397188
75% 0.066725 7.890045
max 19.883589 14.629090
# Percentiles
crsp[['ret', 'lnme']].describe(percentiles=[0.1, 0.9])
ret lnme
count 439721.000000 441959.000000
mean 0.011124 6.421212
std 0.177839 2.155586
min -0.993600 -2.426619
10% -0.140673 3.625798
50% 0.004907 6.397188
90% 0.152479 9.241648
max 19.883589 14.629090
# Summary statistics by year
crsp['year'] = crsp['date'].dt.year
round(crsp.groupby('year')[['ret', 'lnme']].describe()
.loc[:, (slice(None), ['mean', '50%', 'std'])], 4)
ret lnme ret lnme ret lnme
mean mean 50% 50% std std
year
2011 -0.0064 6.0642 -0.0089 6.0190 0.1486 2.0849
2012 0.0161 6.1141 0.0102 6.0783 0.1459 2.0934
2013 0.0325 6.3779 0.0228 6.3643 0.1331 2.0869
2014 0.0041 6.5273 0.0026 6.4729 0.1388 2.0568
2015 -0.0048 6.4665 -0.0062 6.4348 0.1783 2.1173
2016 0.0155 6.3939 0.0091 6.3736 0.1734 2.1637
2017 0.0134 6.5546 0.0066 6.5825 0.1531 2.1840
2018 -0.0120 6.6320 -0.0089 6.6748 0.1615 2.1934
2019 0.0197 6.5616 0.0137 6.5667 0.2078 2.2513
2020 0.0346 6.5497 0.0095 6.4663 0.2837 2.2452
# Summary statistics by stock exchange
round(crsp.groupby('exchcd')[['ret', 'lnme']].describe()
.loc[:, (slice(None), ['mean', '50%', 'std'])], 4)
ret lnme ret lnme ret lnme
mean mean 50% 50% std std
exchcd
1 0.0106 7.8429 0.0095 7.8152 0.1306 1.7410
2 0.0061 4.0257 -0.0104 3.8878 0.2336 1.4255
3 0.0119 5.7896 0.0030 5.6982 0.1956 1.9301
# Summary statistics in subsamples
print('Before 2015')
print(crsp.query('year<=2015')[['ret', 'lnme']].describe())
print('\nAfter 2015')
print(crsp.query('year>=2016')[['ret', 'lnme']].describe())
Before 2015
                 ret           lnme
count  223628.000000  224712.000000
mean        0.008081       6.308403
std         0.150606       2.096421
min        -0.935356      -0.896832
25%        -0.054922       4.771121
50%         0.004207       6.272747
75%         0.061369       7.744562
max        15.984456      13.528774

After 2015
                 ret           lnme
count  216093.000000  217247.000000
mean        0.014274       6.537897
std         0.202149       2.209074
min        -0.993600      -2.426619
25%        -0.062021       4.975455
50%         0.005594       6.533329
75%         0.072987       8.032318
max        19.883589      14.629090

API reference: pandas.dataframe.describe

Stata

  • Import CRSP raw data

    • Import data from a url

    local repo_url "https://raw.githubusercontent.com/mk0417/financial-database"
    local data_url "/master/data/crsp_demo.txt"
    local url = "`repo_url'" + "`data_url'"
    import delimited "`url'", clear
    
    • Import data from local directory

    clear
    set more off
    
    cd "/Users/ml/Dropbox/teaching/data/"
    
    import delimited crsp_month_raw.txt, clear
    
    // list first five observations
    l permno date shrcd exchcd ret in 1/5
    
       +------------------------------------------------+
       | permno       date   shrcd   exchcd         ret |
       |------------------------------------------------|
    1. |  10026   20201231      11        3    0.072598 |
    2. |  10028   20201231      11        2    0.125541 |
    3. |  10032   20201231      11        3    0.046848 |
    4. |  10044   20201231      11        3   -0.051522 |
    5. |  10051   20201231      11        1   -0.030851 |
       +------------------------------------------------+
    
  • Data dimension

// Number of observations
count
di _N
  • Data type

describe
Contains data
  obs:       870,692
 vars:            12
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
permno          long    %12.0g                PERMNO
date            long    %12.0g
shrcd           byte    %8.0g                 SHRCD
exchcd          byte    %8.0g                 EXCHCD
siccd           str4    %9s                   SICCD
ncusip          str8    %9s                   NCUSIP
permco          long    %12.0g                PERMCO
cusip           str8    %9s                   CUSIP
prc             float   %9.0g                 PRC
ret             str9    %9s                   RET
shrout          long    %12.0g                SHROUT
cfacpr          float   %9.0g                 CFACPR
-------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
// Convert date to date format
tostring date, replace
gen date1 = date(date, "YMD")

l permno date date1 ret in 1/5
     +-------------------------------------------+
     | permno       date       date1         ret |
     |-------------------------------------------|
  1. |  10001   20110131   31jan2011    0.028992 |
  2. |  10001   20110228   28feb2011    0.022727 |
  3. |  10001   20110331   31mar2011    0.072404 |
  4. |  10001   20110429   29apr2011   -0.038789 |
  5. |  10001   20110531   31may2011    0.028050 |
     +-------------------------------------------+
// Convert ret to numerical
gen ret1 = real(ret)
drop ret
rename ret1 ret

// Convert siccd to numerical
gen siccd1 = real(siccd)
drop siccd
rename siccd1 siccd

describe
Contains data
  obs:       870,692
 vars:            13
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
permno          long    %12.0g                PERMNO
date            str8    %9s
shrcd           byte    %8.0g                 SHRCD
exchcd          byte    %8.0g                 EXCHCD
ncusip          str8    %9s                   NCUSIP
permco          long    %12.0g                PERMCO
cusip           str8    %9s                   CUSIP
prc             float   %9.0g                 PRC
shrout          long    %12.0g                SHROUT
cfacpr          float   %9.0g                 CFACPR
date1           float   %td
ret             float   %9.0g
siccd           float   %9.0g
-------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
  • Stock exchanges and common shares

// Keep NYSE/AMEX/NASDAQ
keep if inlist(exchcd, 1, 2, 3)
* Equivalently,
* keep if exchcd==1 | exchcd==2 | exchcd==3

// Keep common shares
keep if inlist(shrcd, 10, 11)
* Equivalently
* keep if shrcd==10 | shrcd==11
  • Drop duplicates

duplicates drop permno date, force
  • Save data

save crsp_month, replace
  • Read clean data

use crsp_month, clear
  • Generate new variables

// Market value
gen price = abs(prc)
gen me = (price*shrout) / 1000
gen lnme = ln(me)

// Adjusted price
gen adjprc = price / cfacpr

// Holding period return
gen yyyymm = mofd(date1)
format yyyymm %tm

gen logret = ln(1+ret)

sort permno yyyymm

forvalues i=1/5 {
	by permno: gen l`i' = logret[_n-`i']
}

gen hpr = logret+l1+l2+l3+l4+l5

by permno: gen yyyymm_lag = yyyymm[_n-5]
format yyyymm_lag %tm
gen gap = yyyymm - yyyymm_lag

replace hpr = . if gap!=5

replace hpr = exp(hpr) - 1
  • Summary statistics

summ ret lnme

summ ret lnme, d

// Summary statistics by year
gen year = year(date1)

tabstat ret lnme, by(year) stat(mean p50 sd) long

tabstat ret lnme, by(year) stat(mean p50 sd) long nototal

tabstat ret lnme, by(year) stat(mean p50 sd) long nototal col(stat)

// Summary statistics by stock exchange
tabstat ret lnme, by(exchcd) stat(mean p50 sd) long nototal col(stat)

// Summary statistics in subsamples
summ ret lnme if year<=2015

summ ret lnme if year>2015