Hypothesis Testing using T-Tests

Overview

Cleans, compiles, and tests a dataset to evaluate the hypothesis that the financial value of homes in University towns are less effected by recessions. The hypothesis testing is accomplished by a statistical T-Test.

Datasets

  • From the Zillow research data site there is housing data for the United States. In particular the datafile for homes at a city level, City_Zhvi_AllHomes.csv, has median home sale prices at a fine grained level.
  • From the Wikipedia page on college towns is a list of university towns in the United States which has been copy and pasted into the file university_towns.txt.
  • From Bureau of Economic Analysis, US Department of Commerce, the GDP over time of the United States in current dollars (chained value in 2009 dollars will be used), in quarterly intervals, in the file gdplev.xls.

Definitions

  • A quarter is a specific three month period, Q1 is January through March, Q2 is April through June, Q3 is July through September, Q4 is October through December.
  • A recession is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth.
  • A recession bottom is the quarter within a recession which had the lowest GDP.
  • A university town is a city which has a high percentage of university students compared to the total population of the city.

  • The Alternative Hypothesis, which is ultimately being tested, is that home prices in University towns are impacted less by recessions than homes in non-University towns.

  • The Null Hypothesis, which we home to “reject,” is that homes in University towns are not impacted less than homes in non-University towns by recessions.

  • Both tests will be conducted examining the price ratio, defined above, for homes in university towns and non-university towns.

$$price\ ratio = \frac{price\ at\ recession\ bottom}{price\ before\ recession}$$

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

Get List of University Towns

The result of this step is a list of states and “RegionNames” (or cities) that are labeled as University towns according to Wikipedia.

t = (pd.read_csv('data/university_towns.txt',
                 sep='/n',
                 engine='python',
                 header=None)
     .rename(columns={0:'State'}))
t.loc[~t['State'].str.contains("[edit]", 
                               regex=False), 
      'RegionName'] = t['State']
t.loc[t['State']==t['RegionName'],
      'State'] = np.nan
t['State'] = (t['State'].str.replace(r'\[.*','')
              .fillna(method='ffill'))
t['RegionName'] = t['RegionName'].str.replace(r'\W*\(.*','')
t = (t.drop(t[t['RegionName'].isna()].index)
     .reset_index(drop=True))
t['Type'] = 'University Town'
print('t (rows, columns) = ' + str(t.shape))
t.head()
t (rows, columns) = (517, 3)
State RegionName Type
0 Alabama Auburn University Town
1 Alabama Florence University Town
2 Alabama Jacksonville University Town
3 Alabama Livingston University Town
4 Alabama Montevallo University Town

Get GDP Timeseries

The result of this step is a list of “YearQ,” or combinations of years and quarter labels, and the corresponding US GDP, as measured in chained value in 2009 dollars.

gdp = (pd.read_excel('data/gdplev.xls',
                     skiprows=7)
       .drop(columns=['Unnamed: 0','Unnamed: 1','Unnamed: 2',
                      'Unnamed: 3','Unnamed: 5','Unnamed: 7'])
       .rename(columns={'Unnamed: 4':'YearQ',
                        'Unnamed: 6':'GDP'}))
gdp = (gdp[gdp['YearQ']>'2000'].copy()
       .reset_index(drop=True))
print('gdp (rows, columns) = ' + str(gdp.shape))
gdp.head()
gdp (rows, columns) = (66, 2)
YearQ GDP
0 2000q1 12359.1
1 2000q2 12592.5
2 2000q3 12607.7
3 2000q4 12679.3
4 2001q1 12643.3

Get Critical Points of Recession

This step utilizes the gdp timeseries data in the previous step to determine the start, bottom, and end of the most recent recession, with those terms as defined above.

rec_start = gdp[(gdp['GDP'] > gdp['GDP'].shift(-1))&
                (gdp['GDP'].shift(-1) > gdp['GDP'].shift(-2))].copy()
rec_start_year_q = rec_start.iloc[0,0]

gdp = gdp[gdp['YearQ'] >= rec_start_year_q].copy()
rec_end = gdp[(gdp['GDP'] > gdp['GDP'].shift(1))&
              (gdp['GDP'].shift(1) > gdp['GDP'].shift(2))].copy()
rec_end_year_q = rec_end.iloc[0,0]

gdp = gdp[gdp['YearQ'] <= rec_end_year_q].copy()
rec_bottom = gdp[gdp['GDP']==gdp['GDP'].min()].copy()
rec_bottom_year_q = rec_bottom.iloc[0,0]

print('Recession Start  : ' + rec_start_year_q)
print('          Bottom : ' + rec_bottom_year_q)
print('          End    : ' + rec_end_year_q)
Recession Start  : 2008q2
          Bottom : 2009q2
          End    : 2009q4

Convert Housing Data to Quarters

Converts the Zillow housing research data from months to quarters. The data is indexed by State and RegionName, and the columns are for each “YearQ” from 2000q1 through 2016q3. The values in the dataset are median home sales prices.

# This dictionary maps from state acronyms to state names.
states_acronyms_to_names = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois', 'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho', 'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin', 'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam', 'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas', 'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri', 'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana', 'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands', 'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}
# This dictionary maps month number to the appropriate quarter.
month_names_to_qtr = {'01':'q1','02':'q1','03':'q1','04':'q2','05':'q2','06':'q2','07':'q3','08':'q3','09':'q3','10':'q4','11':'q4','12':'q4'}
h = pd.read_csv('data/City_Zhvi_AllHomes.csv')
h['State'] = h['State'].map(states_acronyms_to_names)

h = pd.DataFrame(h.drop(columns=['Metro','CountyName','SizeRank'])
                 .set_index(['RegionID','State','RegionName'])
                 .stack()
                 .reset_index()
                 .rename(columns={'level_3':'Year-Mo',
                                  0:'MedianPrice'}))
h['YearQtr'] = h['Year-Mo'].str[0:4] + (h['Year-Mo'].str[-2:]
                                        .map(month_names_to_qtr))
h = h[h['YearQtr']>='2000q1'].copy()
h = (h.drop(columns=['Year-Mo'])
     .groupby(['RegionID','State','RegionName','YearQtr'])
     .mean()
     .unstack())
h.index = h.index.droplevel()
h.columns = h.columns.droplevel()
del h.columns.name
h[['2000q1','2000q2','2000q3','2000q4','2001q1']].head()
2000q1 2000q2 2000q3 2000q4 2001q1
State RegionName
New Jersey Absecon 113500.000000 117633.333333 118033.333333 118833.333333 121033.333333
Maryland Accokeek NaN NaN NaN NaN NaN
Oklahoma Ada NaN NaN NaN NaN NaN
Georgia Adairsville 85966.666667 87300.000000 85100.000000 86033.333333 86466.666667
Tennessee Adams 109833.333333 111100.000000 112133.333333 113100.000000 114500.000000

Run T-Test

Finally, this step performs the t-test using the data cleaned and consolidated in the previous steps.

housing = h[[rec_start_year_q,rec_bottom_year_q]].copy()
housing['Ratio'] = housing[rec_bottom_year_q]/housing[rec_start_year_q]
housing = housing[housing['Ratio'].notna()].copy()
housing = (pd.merge(housing, t,
                    on=['State','RegionName'],
                    how='left')
           .fillna('Non'))
print(housing['Type'].value_counts())
housing.head()
Non                9599
University Town     257
Name: Type, dtype: int64
State RegionName 2008q2 2009q2 Ratio Type
0 New Jersey Absecon 231866.666667 225066.666667 0.970673 Non
1 Georgia Adairsville 114000.000000 97000.000000 0.850877 Non
2 Tennessee Adams 157300.000000 153233.333333 0.974147 Non
3 Illinois Addison 263000.000000 237866.666667 0.904436 Non
4 Michigan Albion 71000.000000 65033.333333 0.915962 University Town
uni_towns = housing[housing['Type']=='University Town'].copy()
non_uni_towns = housing[housing['Type']!='University Town'].copy()

ttest_pval = ttest_ind(uni_towns['Ratio'], non_uni_towns['Ratio'])[1]
print('    University Towns mean price ratio : ' + str(uni_towns['Ratio'].mean())[0:6])
print('Non-University Towns mean price ratio : ' + str(non_uni_towns['Ratio'].mean())[0:6])
print('                       T-test P-Value : ' + str(ttest_pval)[0:6])
    University Towns mean price ratio : 0.9549
Non-University Towns mean price ratio : 0.9387
                       T-test P-Value : 0.0033

Note that:

  • The P-Value is less than the threshold of 0.01, and
  • The mean price ratio for University Towns is greater than the mean price ratio for Non-University Towns.

Given these results, we “reject the null hypothesis,” and conclude that that university homes appear to be less effected by recessions.