Class03¶

Compustat North America overview¶

Data: accounting and price data
Coverage: US and Canada
Time period: 1950 - present (Fundamentals Annual)
Frequency
- Accounting data: yearly and quarterly
- Price data: daily and monthly
Dataset structure
- Fundamentals Annual
- Fundamentals Quarterly
- Security Daily
- Security Monthly

Web access¶

Downlaod data from Compustat North America via WRDS web access.

Dataset: Compustat North Ameria - Fundamentals Annual
Time period: Jan 2000 to Dec 2020
Search the entire database
Screening variables
- Consolidation Level: C
- Industry Format: INDL
- Data Format: STD
- Population Source: D
- Currency: USD
- Company Status: Active (A) and Inactive (I)
Variables
- gvkey
- cusip
- exchg
- sic
- fyear
- at
- seq
- ebit
Output format
- Tab-delimited text
- Uncompressed
- Date format: YYMMDDn8

Industry Format¶

INDFMT: FS - financial services (includes banks, insurance companies, broker/dealers, real estate and other financial services); INDL - other than financial services

gvkey	fyear	indfmt	at	capr1
007647	2020	FS	2819627.0	13.5
007647	2020	INDL	2819627.0

Identifiers¶

GVKEY: Compustat permanent firm level identifier; They are permanent and never change
CUSIP: Current 9-digit CUSIP (security level); They can be changed.
CIK: Central Index Key used for SEC filings

Attention

Leading zeros of GVKEY

GVKEY is 6-digit character in string fromat. However, GVKEY contains numbers only and it will be recoganised as numerical values in some software by default. Therefore, the leading zeros will be dropped. For example, GVKEY 001050 might become 1050. This will cause matching problem if one dataset has 6-digit GVKEY and another one has GVKEY without leading zero.

Datadate vs fiscal year¶

Datadate is the date of fiscal year end
Fiscal year (fyear)

Calendar Year	Fiscal Year-end Month	Fiscal Year
2019	January	2018
2019	February	2018
2019	March	2018
2019	April	2018
2019	May	2018
2019	June	2019
2019	July	2019
2019	August	2019
2019	September	2019
2019	October	2019
2019	November	2019
2019	December	2019

Attention

Keep in mind that we do not know the accounting information on datadate. The accounting numbers are public avaiable after companies release their financial reports.

Fama and French. (1992). Journal of Finance.

Source: http://www.sec.gov/about/forms/form10-k.pdf

Source: https://www.investor.gov/introduction-investing/investing-basics/glossary/form-10-k

Stock exchange code¶

11: NYSE
12: AMEX
14: NASDAQ
0: Subsidiary/Private
19: OTC

See other stock exchange codes here

Python¶

Import packages

import pandas as pd
import numpy as np

Import Compustat North America raw data

url = 'https://www.dropbox.com/s/qro99n7ngq0aztr/funda_raw.txt?dl=1'
funda_raw = pd.read_csv(url, sep='\t', low_memory=False)

# View first 5 rows and first 5 columns
funda_raw.iloc[:5, :5]

	gvkey	datadate	fyear	indfmt	consol
0	1004	20000531	1999	INDL	C
1	1004	20010531	2000	INDL	C
2	1004	20020531	2001	INDL	C
3	1004	20030531	2002	INDL	C
4	1004	20040531	2003	INDL	C

# Data types
funda_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199080 entries, 0 to 199079
Data columns (total 16 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   gvkey     199080 non-null  int64  
 1   datadate  199080 non-null  int64  
 2   fyear     199080 non-null  int64  
 3   indfmt    199080 non-null  object 
 4   consol    199080 non-null  object 
 5   popsrc    199080 non-null  object 
 6   datafmt   199080 non-null  object 
 7   tic       198983 non-null  object 
 8   cusip     198983 non-null  object 
 9   curcd     199080 non-null  object 
 10  at        161570 non-null  float64
 11  ebit      160025 non-null  float64
 12  seq       161564 non-null  float64
 13  exchg     198983 non-null  float64
 14  costat    199080 non-null  object 
 15  sic       199075 non-null  float64
dtypes: float64(5), int64(3), object(8)
memory usage: 24.3+ MB

# Read the data again with consideration of the data type of gvkey
funda_raw = pd.read_csv(url, sep='\t', low_memory=False, dtype={'gvkey': str})

# View first 5 rows and first 5 columns
funda_raw.iloc[:5, :5]

	gvkey	datadate	fyear	indfmt	consol
0	001004	20000531	1999	INDL	C
1	001004	20010531	2000	INDL	C
2	001004	20020531	2001	INDL	C
3	001004	20030531	2002	INDL	C
4	001004	20040531	2003	INDL	C

# Data types
funda_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199080 entries, 0 to 199079
Data columns (total 16 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   gvkey     199080 non-null  object 
 1   datadate  199080 non-null  int64  
 2   fyear     199080 non-null  int64  
 3   indfmt    199080 non-null  object 
 4   consol    199080 non-null  object 
 5   popsrc    199080 non-null  object 
 6   datafmt   199080 non-null  object 
 7   tic       198983 non-null  object 
 8   cusip     198983 non-null  object 
 9   curcd     199080 non-null  object 
 10  at        161570 non-null  float64
 11  ebit      160025 non-null  float64
 12  seq       161564 non-null  float64
 13  exchg     198983 non-null  float64
 14  costat    199080 non-null  object 
 15  sic       199075 non-null  float64
dtypes: float64(5), int64(2), object(9)
memory usage: 24.3+ MB

Clean dataset

funda_raw = funda_raw.sort_values(['gvkey', 'fyear', 'datadate'], ignore_index=True)
print(f'Number of obs: {len(funda_raw)}')

# Keep the most recent one if there are duplicates
funda = funda_raw.drop_duplicates(['gvkey', 'fyear'], keep='last').copy()
print(f'Number of obs after removing duplicates: {len(funda)}')

# Keep main stock exchanges
funda = funda[funda['exchg'].isin([11, 12, 14])].copy()
print(f'Number of obs after keeping main stock exchanges: {len(funda)}')

# Drop unnecessary variables
drop_vars = ['indfmt', 'consol', 'popsrc', 'datafmt', 'costat', 'curcd']
funda = funda.drop(columns=drop_vars)

# Drop if total asset and equity are not positive
funda = funda.query('at>0 & seq>0').copy()
print(f'Number of obs after requiring positive asset: {len(funda)}')

Number of obs: 199080
Number of obs after removing duplicates: 199080
Number of obs after keeping main stock exchanges: 125063
Number of obs after requiring positive asset: 101899

API reference: pandas.dataframe.sort_values

Firm-year obs in each stock exchange

funda['exchg'].value_counts()

0    57467
0    39778
0     4654
Name: exchg, dtype: int64

Seasonality of fiscal year end

funda['date'] = pd.to_datetime(funda['datadate'], format='%Y%m%d')
funda['month'] = funda['date'].dt.month
funda['month'].value_counts()

  78281
    5295
    5055
    4175
    2690
   1245
    1088
     918
     913
     900
     699
    640
Name: month, dtype: int64

ROA and ROE

funda['roa'] = funda['ebit'] / funda['at']
funda['roe'] = funda['ebit'] / funda['seq']

Asset growth

Generate lagged variables and calculate growth rate

funda = funda.sort_values(['gvkey', 'fyear'], ignore_index=True)
funda['lag_at'] = funda.groupby('gvkey')['at'].shift(1)
funda['lag_fyear'] = funda.groupby('gvkey')['fyear'].shift(1)
funda['ag'] = funda['at'] / funda['lag_at'] - 1

Check year gap

funda['fyear_diff'] = funda['fyear'] - funda['lag_fyear']
funda.query('fyear_diff>1')[['gvkey', 'fyear', 'lag_fyear', 'fyear_diff', 'ag']].head()

	gvkey	fyear	lag_fyear	fyear_diff	ag
42	001038	2001	1999.0	2.0	0.076013
49	001045	2007	2003.0	4.0	-0.025878
50	001045	2014	2007.0	7.0	0.532008
283	001161	2009	2007.0	2.0	-0.214026
289	001161	2016	2014.0	2.0	-0.118397

Deal with gap

funda.loc[funda['fyear_diff']>1, 'ag'] = np.nan

Summary statistics

round(funda[['roa', 'roe', 'ag']].describe(), 3)

	roa	roe	ag
count	100932.000	100932.000	90153.000
mean	0.012	0.105	8.231
std	1.601	62.213	1456.622
min	-44.088	-16772.000	-1.000
25%	0.001	0.002	-0.024
50%	0.043	0.139	0.060
75%	0.096	0.249	0.192
max	226.310	7305.250	409183.500

Deal with outliers

# Winsorising at 1% and 99% percentile
for i in ['roa', 'roe', 'ag']:
    funda['p1'] = funda.groupby('fyear')[i].transform(lambda x: x.quantile(0.01))
    funda['p99'] = funda.groupby('fyear')[i].transform(lambda x: x.quantile(0.99))
    funda.loc[funda[i]<funda['p1'], i] = funda['p1']
    funda.loc[funda[i]>funda['p99'], i] = funda['p99']
    funda = funda.drop(columns=['p1', 'p99'])

round(funda[['roa', 'roe', 'ag']].describe(), 3)

	roa	roe	ag
count	100932.000	100932.000	90153.000
mean	-0.002	0.048	0.189
std	0.230	0.652	0.608
min	-1.645	-6.951	-0.616
25%	0.001	0.002	-0.024
50%	0.043	0.139	0.060
75%	0.096	0.249	0.192
max	0.432	2.902	10.965

Stata¶

Import data from a url

local data_url "https://www.dropbox.com/s/qro99n7ngq0aztr/funda_raw.txt?dl=1"
import delimited "`data_url'", clear

List first five observations

l gvkey datadate at seq ebit in 1/5

Data types

describe

Specify data type when importing data

local data_url "https://www.dropbox.com/s/qro99n7ngq0aztr/funda_raw.txt?dl=1"
import delimited "`data_url'", stringcols(1) clear

describe

l gvkey datadate at seq ebit in 1/5

Keep the most recent one if there are duplicates

sort gvkey fyear datadate

by gvkey fyear: gen id = _n
by gvkey fyear: gen n = _N

keep if id == n

Keep main stock exchanges

keep if exchg==11 | exchg==12 | exchg==14

Drop unnecessary variables

drop indfmt consol popsrc datafmt costat curcd

Drop if total asset and equity are not positive

// Pay attention:
//  missing value (.) is larger than any other values
//  at>0 does not remove missing values
//  seq>0 does not remove missing values
//  This is different from Python

keep if at>0 & seq>0 & at!=. & seq!=.

Firm-year observations in each stock exchange

tab exchg

Seasonality of fiscal year end

tostring datadate, replace
gen date = date(datadate, "YMD")
format date %td

gen month = month(date)

tab month

ROA and ROE

gen roa = ebit / at

gen roe = ebit / seq

Asset growth

sort gvkey fyear

// Generate lagged variables and calculate growth rate
by gvkey: gen lag_at = at[_n-1]
by gvkey: gen lag_fyear = fyear[_n-1]
gen ag = at / lag_at - 1

// Check year gap
gen fyear_diff = fyear - lag_fyear

// Deal with gap
replace ag = . if fyear_diff>1

Summary statistics

summ roa roe ag

Deal with outliers

foreach i of varlist roa roe ag {
	bysort fyear: egen p1 = pctile(`i'), p(1)
	bysort fyear: egen p99 = pctile(`i'), p(99)
	replace `i' = p1 if `i' < p1
	replace `i' = p99 if `i' > p99
	drop p1 p99
}

sort gvkey fyear

summ roa roe ag

Class03¶

Compustat North America overview¶

Web access¶

Industry Format¶

Identifiers¶

Datadate vs fiscal year¶

Stock exchange code¶

Python¶

Stata¶

Table of Contents

Related Topics

This Page