# Hypothesis Testing using T-Tests

## Overview

Cleans, compiles, and tests a dataset to evaluate the hypothesis that the financial value of homes in university towns is less affected by recessions. The hypothesis is evaluated with a statistical t-test.

#### Datasets

• From the Zillow research data site, housing data for the United States. In particular, the city-level data file, City_Zhvi_AllHomes.csv, has fine-grained median home sale prices.
• From the Wikipedia page on college towns, a list of university towns in the United States, which has been copied and pasted into the file university_towns.txt.
• From the Bureau of Economic Analysis, US Department of Commerce, the GDP of the United States over time in current dollars (the chained 2009-dollar values will be used), in quarterly intervals, in the file gdplev.xls.

#### Definitions

• A quarter is a specific three-month period: Q1 is January through March, Q2 is April through June, Q3 is July through September, and Q4 is October through December.
• A recession is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth.
• A recession bottom is the quarter within a recession which had the lowest GDP.
• A university town is a city which has a high percentage of university students compared to the total population of the city.

• The Alternative Hypothesis, which is ultimately being tested, is that home prices in university towns are impacted less by recessions than home prices in non-university towns.

• The Null Hypothesis, which we hope to “reject,” is that home prices in university towns are not impacted less by recessions than home prices in non-university towns.

• Both hypotheses will be evaluated by examining the price ratio, defined below, for homes in university towns and non-university towns.

$$price\ ratio = \frac{price\ at\ recession\ bottom}{price\ before\ recession}$$
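For example, with made-up numbers, a town whose median home price was $200,000 before the recession and $180,000 at the recession bottom has a price ratio of 0.9; ratios closer to 1 indicate a smaller recession impact:

```python
# Hypothetical prices, for illustration only.
price_before_recession = 200_000
price_at_recession_bottom = 180_000

# A ratio closer to 1.0 means prices fell less during the recession.
price_ratio = price_at_recession_bottom / price_before_recession
print(price_ratio)  # 0.9
```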

```python
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
```

### Get List of University Towns

The result of this step is a list of states and “RegionNames” (or cities) that are labeled as University towns according to Wikipedia.

```python
t = (pd.read_csv('data/university_towns.txt',
                 sep='\n',
                 engine='python',
                 header=None)
       .rename(columns={0: 'State'}))
# Lines without a "[edit]" marker are cities, not state headers.
t.loc[~t['State'].str.contains('[', regex=False),
      'RegionName'] = t['State']
t.loc[t['State'] == t['RegionName'],
      'State'] = np.nan
# Strip the "[edit]" suffix from state names, then forward-fill
# each state down over its cities.
t['State'] = (t['State'].str.replace(r'\[.*', '', regex=True)
                        .ffill())
# Strip the trailing "(University ...)" annotation from city names.
t['RegionName'] = t['RegionName'].str.replace(r'\W*\(.*', '', regex=True)
t = (t.drop(t[t['RegionName'].isna()].index)
      .reset_index(drop=True))
t['Type'] = 'University Town'
print('t (rows, columns) = ' + str(t.shape))
t.head()
```
```
t (rows, columns) = (517, 3)

     State    RegionName             Type
0  Alabama        Auburn  University Town
1  Alabama      Florence  University Town
2  Alabama  Jacksonville  University Town
3  Alabama    Livingston  University Town
4  Alabama    Montevallo  University Town
```

### Get GDP Timeseries

The result of this step is a list of “YearQ,” or combinations of years and quarter labels, and the corresponding US GDP, as measured in chained value in 2009 dollars.

```python
gdp = (pd.read_excel('data/gdplev.xls',
                     skiprows=7)
         .drop(columns=['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2',
                        'Unnamed: 3', 'Unnamed: 5', 'Unnamed: 7'])
         .rename(columns={'Unnamed: 4': 'YearQ',
                          'Unnamed: 6': 'GDP'}))
gdp = (gdp[gdp['YearQ'] > '2000'].copy()
       .reset_index(drop=True))
print('gdp (rows, columns) = ' + str(gdp.shape))
gdp.head()
```
```
gdp (rows, columns) = (66, 2)

    YearQ      GDP
0  2000q1  12359.1
1  2000q2  12592.5
2  2000q3  12607.7
3  2000q4  12679.3
4  2001q1  12643.3
```

### Get Critical Points of Recession

This step uses the GDP time series from the previous step to determine the start, bottom, and end of the most recent recession, as those terms are defined above.

```python
# Start: the last quarter before two consecutive quarters of GDP decline.
rec_start = gdp[(gdp['GDP'] > gdp['GDP'].shift(-1)) &
                (gdp['GDP'].shift(-1) > gdp['GDP'].shift(-2))].copy()
rec_start_year_q = rec_start.iloc[0, 0]

# End: the second of two consecutive quarters of GDP growth after the start.
gdp = gdp[gdp['YearQ'] >= rec_start_year_q].copy()
rec_end = gdp[(gdp['GDP'] > gdp['GDP'].shift(1)) &
              (gdp['GDP'].shift(1) > gdp['GDP'].shift(2))].copy()
rec_end_year_q = rec_end.iloc[0, 0]

# Bottom: the quarter with the lowest GDP between start and end.
gdp = gdp[gdp['YearQ'] <= rec_end_year_q].copy()
rec_bottom = gdp[gdp['GDP'] == gdp['GDP'].min()].copy()
rec_bottom_year_q = rec_bottom.iloc[0, 0]

print('Recession Start  : ' + rec_start_year_q)
print('          Bottom : ' + rec_bottom_year_q)
print('          End    : ' + rec_end_year_q)
```
```
Recession Start  : 2008q2
          Bottom : 2009q2
          End    : 2009q4
```
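The shift-based detection above can be illustrated on a toy GDP series (the values below are made up for illustration): a peak, two quarters of decline, a bottom, then two quarters of growth.

```python
import pandas as pd

# Toy quarterly GDP series: peak at q2, decline through q5, recovery after.
gdp = pd.DataFrame({'YearQ': ['q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7'],
                    'GDP':   [100,  102,  101,  99,   98,   100,  103]})

# Start: the last quarter before two consecutive quarters of decline.
start = gdp[(gdp['GDP'] > gdp['GDP'].shift(-1)) &
            (gdp['GDP'].shift(-1) > gdp['GDP'].shift(-2))].iloc[0, 0]

# End: the second of two consecutive quarters of growth after the start.
after = gdp[gdp['YearQ'] >= start]
end = after[(after['GDP'] > after['GDP'].shift(1)) &
            (after['GDP'].shift(1) > after['GDP'].shift(2))].iloc[0, 0]

# Bottom: the quarter with the lowest GDP between start and end.
span = after[after['YearQ'] <= end]
bottom = span.loc[span['GDP'].idxmin(), 'YearQ']

print(start, bottom, end)  # q2 q5 q7
```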


### Convert Housing Data to Quarters

Converts the Zillow housing research data from months to quarters. The data is indexed by State and RegionName, and the columns are for each “YearQ” from 2000q1 through 2016q3. The values in the dataset are median home sales prices.

```python
# This dictionary maps state acronyms to state names.
states_acronyms_to_names = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois', 'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho', 'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin', 'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam', 'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas', 'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri', 'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana', 'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands', 'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}
# This dictionary maps month numbers to the appropriate quarter.
month_names_to_qtr = {'01':'q1','02':'q1','03':'q1','04':'q2','05':'q2','06':'q2','07':'q3','08':'q3','09':'q3','10':'q4','11':'q4','12':'q4'}
h = pd.read_csv('data/City_Zhvi_AllHomes.csv')
h['State'] = h['State'].map(states_acronyms_to_names)

# Reshape the monthly price columns into rows, one per region and month.
h = pd.DataFrame(h.drop(columns=['Metro', 'CountyName', 'SizeRank'])
                  .set_index(['RegionID', 'State', 'RegionName'])
                  .stack()
                  .reset_index()
                  .rename(columns={'level_3': 'Year-Mo',
                                   0: 'MedianPrice'}))
h['YearQtr'] = h['Year-Mo'].str[0:4] + (h['Year-Mo'].str[-2:]
                                        .map(month_names_to_qtr))
h = h[h['YearQtr'] >= '2000q1'].copy()
# Average the months of each quarter, then pivot quarters back to columns.
h = (h.drop(columns=['Year-Mo'])
      .groupby(['RegionID', 'State', 'RegionName', 'YearQtr'])
      .mean()
      .unstack())
h.index = h.index.droplevel()      # drop the RegionID index level
h.columns = h.columns.droplevel()  # drop the 'MedianPrice' column level
h.columns.name = None              # clear the columns' name
h[['2000q1', '2000q2', '2000q3', '2000q4', '2001q1']].head()
```
```
                               2000q1         2000q2         2000q3         2000q4         2001q1
State      RegionName
New Jersey Absecon      113500.000000  117633.333333  118033.333333  118833.333333  121033.333333
Maryland   Accokeek               NaN            NaN            NaN            NaN            NaN
Oklahoma   Ada                    NaN            NaN            NaN            NaN            NaN
Georgia    Adairsville   85966.666667   87300.000000   85100.000000   86033.333333   86466.666667
Tennessee  Adams        109833.333333  111100.000000  112133.333333  113100.000000  114500.000000
```
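As an aside (not the approach used above), pandas can derive the same quarter labels directly with `PeriodIndex`, which would avoid the hand-built `month_names_to_qtr` dictionary; a small sketch:

```python
import pandas as pd

# Month column labels like '2000-01' become quarter labels like '2000q1'.
months = ['2000-01', '2000-06', '2016-09']
quarters = pd.PeriodIndex(months, freq='Q').strftime('%Yq%q')
print(list(quarters))  # ['2000q1', '2000q2', '2016q3']
```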

### Run T-Test

Finally, this step performs the t-test using the data cleaned and consolidated in the previous steps.

```python
housing = h[[rec_start_year_q, rec_bottom_year_q]].copy()
housing['Ratio'] = housing[rec_bottom_year_q] / housing[rec_start_year_q]
housing = housing[housing['Ratio'].notna()].copy()
# Label each town by merging with the university-towns list; towns
# absent from that list become 'Non'.
housing = (pd.merge(housing, t,
                    on=['State', 'RegionName'],
                    how='left')
             .fillna('Non'))
print(housing['Type'].value_counts())
housing.head()
```
```
Non                9599
University Town     257
Name: Type, dtype: int64

        State   RegionName         2008q2         2009q2     Ratio             Type
0  New Jersey      Absecon  231866.666667  225066.666667  0.970673              Non
1     Georgia  Adairsville  114000.000000   97000.000000  0.850877              Non
2   Tennessee        Adams  157300.000000  153233.333333  0.974147              Non
3    Illinois      Addison  263000.000000  237866.666667  0.904436              Non
4    Michigan       Albion   71000.000000   65033.333333  0.915962  University Town
```
```python
uni_towns = housing[housing['Type'] == 'University Town'].copy()
non_uni_towns = housing[housing['Type'] != 'University Town'].copy()

ttest_pval = ttest_ind(uni_towns['Ratio'], non_uni_towns['Ratio'])[1]
print('    University Towns mean price ratio : ' + str(uni_towns['Ratio'].mean())[0:6])
print('Non-University Towns mean price ratio : ' + str(non_uni_towns['Ratio'].mean())[0:6])
print('                       T-test P-Value : ' + str(ttest_pval)[0:6])
```
```
    University Towns mean price ratio : 0.9549
Non-University Towns mean price ratio : 0.9387
                       T-test P-Value : 0.0033
```
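Since the two groups differ substantially in size (257 vs. 9,599) and possibly in variance, a common robustness check is Welch's t-test, which drops the equal-variance assumption that `ttest_ind` makes by default. A sketch with made-up ratio samples (not the real data):

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical price ratios standing in for the two groups above.
uni = np.array([0.96, 0.95, 0.97, 0.94, 0.955])
non_uni = np.array([0.93, 0.94, 0.92, 0.95, 0.935, 0.94])

# equal_var=False selects Welch's t-test, which does not assume
# the two groups share a common variance.
t_stat, p_val = ttest_ind(uni, non_uni, equal_var=False)
print(round(float(t_stat), 3), round(float(p_val), 3))
```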


Note that:

• The P-Value is less than the threshold of 0.01, and
• The mean price ratio for University Towns is greater than the mean price ratio for Non-University Towns.

Given these results, we “reject the null hypothesis” and conclude that home prices in university towns appear to be less affected by recessions.