# Hypothesis Testing using T-Tests

## Overview

Cleans, compiles, and tests a dataset to evaluate the hypothesis that the financial value of homes in university towns is less affected by recessions. The hypothesis is evaluated with a statistical t-test.

#### Datasets

• From the Zillow research data site, housing data for the United States. In particular, the city-level data file, City_Zhvi_AllHomes.csv, has fine-grained median home sale prices.
• From the Wikipedia page on college towns, a list of university towns in the United States, which has been copied and pasted into the file university_towns.txt.
• From the Bureau of Economic Analysis, US Department of Commerce, the GDP of the United States over time in current dollars (the chained 2009-dollar values will be used), in quarterly intervals, in the file gdplev.xls.

#### Definitions

• A quarter is a specific three-month period: Q1 is January through March, Q2 is April through June, Q3 is July through September, and Q4 is October through December.
• A recession is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth.
• A recession bottom is the quarter within a recession which had the lowest GDP.
• A university town is a city which has a high percentage of university students compared to the total population of the city.

• The Alternative Hypothesis, which is ultimately being tested, is that home prices in university towns are impacted less by recessions than home prices in non-university towns.

• The Null Hypothesis, which we hope to “reject,” is that home prices in university towns are not impacted less by recessions than home prices in non-university towns.

• Both hypotheses will be evaluated by examining the price ratio, defined below, for homes in university towns and non-university towns.

$$price\ ratio = \frac{price\ at\ recession\ bottom}{price\ before\ recession}$$
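For example, with made-up numbers, a town whose median home price was $200,000 before the recession and $180,000 at the recession bottom has a price ratio of 0.9; ratios closer to 1 indicate a smaller recession impact:

```python
# Hypothetical prices, for illustration only.
price_before_recession = 200_000
price_at_recession_bottom = 180_000

# A ratio closer to 1.0 means prices fell less during the recession.
price_ratio = price_at_recession_bottom / price_before_recession
print(price_ratio)  # 0.9
```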

```python
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
```

### Get List of University Towns

The result of this step is a list of states and “RegionNames” (or cities) that are labeled as University towns according to Wikipedia.

```python
t = (pd.read_csv('data/university_towns.txt',
                 sep='\n',
                 engine='python',
                 header=None)
       .rename(columns={0: 'State'}))
# Lines without a "[edit]" marker are cities, not state headers.
t.loc[~t['State'].str.contains('[', regex=False),
      'RegionName'] = t['State']
t.loc[t['State'] == t['RegionName'],
      'State'] = np.nan
# Strip the "[edit]" suffix from state names, then forward-fill
# each state down over its cities.
t['State'] = (t['State'].str.replace(r'\[.*', '', regex=True)
                        .ffill())
# Strip the trailing "(University ...)" annotation from city names.
t['RegionName'] = t['RegionName'].str.replace(r'\W*\(.*', '', regex=True)
t = (t.drop(t[t['RegionName'].isna()].index)
      .reset_index(drop=True))
t['Type'] = 'University Town'
print('t (rows, columns) = ' + str(t.shape))
t.head()
```
```
t (rows, columns) = (517, 3)

     State    RegionName             Type
0  Alabama        Auburn  University Town
1  Alabama      Florence  University Town
2  Alabama  Jacksonville  University Town
3  Alabama    Livingston  University Town
4  Alabama    Montevallo  University Town
```

### Get GDP Timeseries

The result of this step is a list of “YearQ,” or combinations of years and quarter labels, and the corresponding US GDP, as measured in chained value in 2009 dollars.

```python
gdp = (pd.read_excel('data/gdplev.xls',
                     skiprows=7)
         .drop(columns=['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2',
                        'Unnamed: 3', 'Unnamed: 5', 'Unnamed: 7'])
         .rename(columns={'Unnamed: 4': 'YearQ',
                          'Unnamed: 6': 'GDP'}))
gdp = (gdp[gdp['YearQ'] > '2000'].copy()
       .reset_index(drop=True))
print('gdp (rows, columns) = ' + str(gdp.shape))
gdp.head()
```
```
gdp (rows, columns) = (66, 2)

    YearQ      GDP
0  2000q1  12359.1
1  2000q2  12592.5
2  2000q3  12607.7
3  2000q4  12679.3
4  2001q1  12643.3
```

### Get Critical Points of Recession

This step uses the GDP time series from the previous step to determine the start, bottom, and end of the most recent recession, as those terms are defined above.

```python
# Start: the last quarter before two consecutive quarters of GDP decline.
rec_start = gdp[(gdp['GDP'] > gdp['GDP'].shift(-1)) &
                (gdp['GDP'].shift(-1) > gdp['GDP'].shift(-2))].copy()
rec_start_year_q = rec_start.iloc[0, 0]

# End: the second of two consecutive quarters of GDP growth after the start.
gdp = gdp[gdp['YearQ'] >= rec_start_year_q].copy()
rec_end = gdp[(gdp['GDP'] > gdp['GDP'].shift(1)) &
              (gdp['GDP'].shift(1) > gdp['GDP'].shift(2))].copy()
rec_end_year_q = rec_end.iloc[0, 0]

# Bottom: the quarter with the lowest GDP between start and end.
gdp = gdp[gdp['YearQ'] <= rec_end_year_q].copy()
rec_bottom = gdp[gdp['GDP'] == gdp['GDP'].min()].copy()
rec_bottom_year_q = rec_bottom.iloc[0, 0]

print('Recession Start  : ' + rec_start_year_q)
print('          Bottom : ' + rec_bottom_year_q)
print('          End    : ' + rec_end_year_q)
```
```
Recession Start  : 2008q2
          Bottom : 2009q2
          End    : 2009q4
```
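The shift-based detection above can be illustrated on a toy GDP series (the values below are made up for illustration): a peak, two quarters of decline, a bottom, then two quarters of growth.

```python
import pandas as pd

# Toy quarterly GDP series: peak at q2, decline through q5, recovery after.
gdp = pd.DataFrame({'YearQ': ['q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7'],
                    'GDP':   [100,  102,  101,  99,   98,   100,  103]})

# Start: the last quarter before two consecutive quarters of decline.
start = gdp[(gdp['GDP'] > gdp['GDP'].shift(-1)) &
            (gdp['GDP'].shift(-1) > gdp['GDP'].shift(-2))].iloc[0, 0]

# End: the second of two consecutive quarters of growth after the start.
after = gdp[gdp['YearQ'] >= start]
end = after[(after['GDP'] > after['GDP'].shift(1)) &
            (after['GDP'].shift(1) > after['GDP'].shift(2))].iloc[0, 0]

# Bottom: the quarter with the lowest GDP between start and end.
span = after[after['YearQ'] <= end]
bottom = span.loc[span['GDP'].idxmin(), 'YearQ']

print(start, bottom, end)  # q2 q5 q7
```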


### Convert Housing Data to Quarters

Converts the Zillow housing research data from months to quarters. The data is indexed by State and RegionName, and the columns are for each “YearQ” from 2000q1 through 2016q3. The values in the dataset are median home sales prices.

```python
# This dictionary maps state acronyms to state names.
states_acronyms_to_names = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois', 'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho', 'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin', 'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam', 'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas', 'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri', 'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana', 'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands', 'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}
# This dictionary maps month numbers to the appropriate quarter.
month_names_to_qtr = {'01':'q1','02':'q1','03':'q1','04':'q2','05':'q2','06':'q2','07':'q3','08':'q3','09':'q3','10':'q4','11':'q4','12':'q4'}
h = pd.read_csv('data/City_Zhvi_AllHomes.csv')
h['State'] = h['State'].map(states_acronyms_to_names)

# Reshape the monthly price columns into rows, one per region and month.
h = pd.DataFrame(h.drop(columns=['Metro', 'CountyName', 'SizeRank'])
                  .set_index(['RegionID', 'State', 'RegionName'])
                  .stack()
                  .reset_index()
                  .rename(columns={'level_3': 'Year-Mo',
                                   0: 'MedianPrice'}))
h['YearQtr'] = h['Year-Mo'].str[0:4] + (h['Year-Mo'].str[-2:]
                                        .map(month_names_to_qtr))
h = h[h['YearQtr'] >= '2000q1'].copy()
# Average the months of each quarter, then pivot quarters back to columns.
h = (h.drop(columns=['Year-Mo'])
      .groupby(['RegionID', 'State', 'RegionName', 'YearQtr'])
      .mean()
      .unstack())
h.index = h.index.droplevel()      # drop the RegionID index level
h.columns = h.columns.droplevel()  # drop the 'MedianPrice' column level
h.columns.name = None              # clear the columns' name
h[['2000q1', '2000q2', '2000q3', '2000q4', '2001q1']].head()
```
```
                               2000q1         2000q2         2000q3         2000q4         2001q1
State      RegionName
New Jersey Absecon      113500.000000  117633.333333  118033.333333  118833.333333  121033.333333
Maryland   Accokeek               NaN            NaN            NaN            NaN            NaN
Oklahoma   Ada                    NaN            NaN            NaN            NaN            NaN
Georgia    Adairsville   85966.666667   87300.000000   85100.000000   86033.333333   86466.666667
Tennessee  Adams        109833.333333  111100.000000  112133.333333  113100.000000  114500.000000
```
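As an aside (not the approach used above), pandas can derive the same quarter labels directly with `PeriodIndex`, which would avoid the hand-built `month_names_to_qtr` dictionary; a small sketch:

```python
import pandas as pd

# Month column labels like '2000-01' become quarter labels like '2000q1'.
months = ['2000-01', '2000-06', '2016-09']
quarters = pd.PeriodIndex(months, freq='Q').strftime('%Yq%q')
print(list(quarters))  # ['2000q1', '2000q2', '2016q3']
```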

### Run T-Test

Finally, this step performs the t-test using the data cleaned and consolidated in the previous steps.

```python
housing = h[[rec_start_year_q, rec_bottom_year_q]].copy()
housing['Ratio'] = housing[rec_bottom_year_q] / housing[rec_start_year_q]
housing = housing[housing['Ratio'].notna()].copy()
# Label each town by merging with the university-towns list; towns
# absent from that list become 'Non'.
housing = (pd.merge(housing, t,
                    on=['State', 'RegionName'],
                    how='left')
             .fillna('Non'))
print(housing['Type'].value_counts())
housing.head()
```
```
Non                9599
University Town     257
Name: Type, dtype: int64

        State   RegionName         2008q2         2009q2     Ratio             Type
0  New Jersey      Absecon  231866.666667  225066.666667  0.970673              Non
1     Georgia  Adairsville  114000.000000   97000.000000  0.850877              Non
2   Tennessee        Adams  157300.000000  153233.333333  0.974147              Non
3    Illinois      Addison  263000.000000  237866.666667  0.904436              Non
4    Michigan       Albion   71000.000000   65033.333333  0.915962  University Town
```
```python
uni_towns = housing[housing['Type'] == 'University Town'].copy()
non_uni_towns = housing[housing['Type'] != 'University Town'].copy()

ttest_pval = ttest_ind(uni_towns['Ratio'], non_uni_towns['Ratio'])[1]
print('    University Towns mean price ratio : ' + str(uni_towns['Ratio'].mean())[0:6])
print('Non-University Towns mean price ratio : ' + str(non_uni_towns['Ratio'].mean())[0:6])
print('                       T-test P-Value : ' + str(ttest_pval)[0:6])
```
```
    University Towns mean price ratio : 0.9549
Non-University Towns mean price ratio : 0.9387
                       T-test P-Value : 0.0033
```
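Since the two groups differ substantially in size (257 vs. 9,599) and possibly in variance, a common robustness check is Welch's t-test, which drops the equal-variance assumption that `ttest_ind` makes by default. A sketch with made-up ratio samples (not the real data):

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical price ratios standing in for the two groups above.
uni = np.array([0.96, 0.95, 0.97, 0.94, 0.955])
non_uni = np.array([0.93, 0.94, 0.92, 0.95, 0.935, 0.94])

# equal_var=False selects Welch's t-test, which does not assume
# the two groups share a common variance.
t_stat, p_val = ttest_ind(uni, non_uni, equal_var=False)
print(round(float(t_stat), 3), round(float(p_val), 3))
```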


Note that:

• The P-Value is less than the threshold of 0.01, and
• The mean price ratio for University Towns is greater than the mean price ratio for Non-University Towns.

Given these results, we “reject the null hypothesis” and conclude that home prices in university towns appear to be less affected by recessions.