Analyze A/B Test Results

Ryan Wingate
Completed: February 12, 2018

Introduction

A/B tests are very commonly performed by data analysts and data scientists.

For this project, I work to understand the results of a hypothetical A/B test run by an e-commerce website. The goal is to determine whether the company should implement the new page, keep the old page, or gather more information by running the experiment longer.

Part 1 - Probability

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import norm
%matplotlib inline
np.random.seed(42)  # seed numpy's RNG, which generates all of the random draws below
/Users/ryanwingate/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
In [2]:
df = pd.read_csv('ab_data.csv')
df.head()
Out[2]:
   user_id                   timestamp      group landing_page  converted
0   851104  2017-01-21 22:11:48.556739    control     old_page          0
1   804228  2017-01-12 08:01:45.159739    control     old_page          0
2   661590  2017-01-11 16:55:06.154213  treatment     new_page          0
3   853541  2017-01-08 18:28:03.143765  treatment     new_page          0
4   864975  2017-01-21 01:52:26.210827    control     old_page          1
In [3]:
print('Row count = ' + str(df.shape[0]))
Row count = 294478
In [4]:
unique_users = df.user_id.nunique()
print('Unique users in the dataset = ' + str(unique_users))
Unique users in the dataset = 290584
In [5]:
un_conv_users = df[df.converted == 1].user_id.nunique()
un_users_conv_prop = un_conv_users/unique_users
print('unique converted user count = {}'\
      .format(un_conv_users))
print('   % unique users converted = {:.1f}'\
      .format(un_users_conv_prop*100))
unique converted user count = 35173
   % unique users converted = 12.1
In [6]:
df.isnull().sum()
Out[6]:
user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

There are no null values in the dataframe.

Next, determine the number of rows where new_page and treatment don't coincide.

In [7]:
df.groupby(['group', 'landing_page']).count()
Out[7]:
                        user_id  timestamp  converted
group     landing_page
control   new_page         1928       1928       1928
          old_page       145274     145274     145274
treatment new_page       145311     145311     145311
          old_page         1965       1965       1965
In [8]:
user_count_treatment_and_old = \
    df[(df.group == 'treatment') & \
       (df.landing_page == 'old_page')].shape[0]
user_count_control_and_new = \
    df[(df.group == 'control') & \
       (df.landing_page == 'new_page')].shape[0]

print('# times treatment group receives old_page: {}'\
      .format(user_count_treatment_and_old))
print('  # times control group receives new_page: {}'\
      .format(user_count_control_and_new))
# times treatment group receives old_page: 1965
  # times control group receives new_page: 1928

Drop the rows where treatment is not aligned with new_page or control is not aligned with old_page, because I cannot be sure what page these users received.

In [9]:
drop_index = df[((df.group == 'treatment') & \
                 (df.landing_page == 'old_page')) | \
                ((df.group == 'control') & \
                 (df.landing_page == 'new_page'))].index

users = df.drop(drop_index)
In [10]:
user_count_treatment_and_old = \
    users[(users.group == 'treatment') & \
          (users.landing_page == 'old_page')].shape[0]
user_count_control_and_new = \
    users[(users.group == 'control') & \
          (users.landing_page == 'new_page')].shape[0]

print('# times treatment group receives old_page: {}'\
      .format(user_count_treatment_and_old))
print('  # times control group receives new_page: {}'\
      .format(user_count_control_and_new))
# times treatment group receives old_page: 0
  # times control group receives new_page: 0
In [11]:
print(' Unique user count = {}'\
      .format(users.user_id.nunique()))
print('Total user entries = {}'\
      .format(users.shape[0]))
 Unique user count = 290584
Total user entries = 290585

Eliminate any duplicate users.

In [12]:
users[users.duplicated(subset='user_id', keep=False)]
Out[12]:
      user_id                   timestamp      group landing_page  converted
1899   773192  2017-01-09 05:37:58.781806  treatment     new_page          0
2893   773192  2017-01-14 02:55:59.590927  treatment     new_page          0
In [13]:
users = users.drop(
    users[users.duplicated(subset='user_id', keep='last')].index)
In [14]:
users[users.duplicated(subset='user_id', keep=False)]
Out[14]:
user_id timestamp group landing_page converted

Other Probabilities

In [15]:
print('Probability of an individual converting \n'\
      + '   regardless of the page they receive:                {:.2f}%'\
      .format(users.converted.mean()*100))

control_conv = users[users['group'] == 'control'].converted.mean()
print('Given that an individual was in the \n'\
      + '   control group, the probability they convert:        {:.2f}%'\
      .format(control_conv*100))

treatment_conv = users[users['group'] == 'treatment'].converted.mean()
print('Given that an individual was in the \n'\
      + '   treatment group, the probability they convert:      {:.2f}%'\
      .format(treatment_conv*100))
new_page_prop = users[users.landing_page == 'new_page'].shape[0]\
    / users.shape[0]

print('Probability that an individual received the new page:  {:.2f}%'\
      .format(new_page_prop*100))
Probability of an individual converting 
   regardless of the page they receive:                11.96%
Given that an individual was in the 
   control group, the probability they convert:        12.04%
Given that an individual was in the 
   treatment group, the probability they convert:      11.88%
Probability that an individual received the new page:  50.01%

At this point in the analysis, there does not appear to be sufficient evidence to conclude that the new treatment page produces more conversions than the current control page. The probability that a user who receives the treatment page converts is actually slightly lower than the probability that a user who receives the control page converts.

Part 2 - A/B Test

Because there is a time stamp associated with each user event, it is theoretically possible that a hypothesis test, similar to this one, could be run continuously in real time.

In this hypothetical situation, the hard questions become when to stop the test and how to reach a decision: As soon as one page is considered significantly better than the other? Once a certain page is better than another for a certain period of time? How long must the experiment be run before considering the test inconclusive? How much evidence is sufficient?

These are the difficult questions associated with A/B tests in general.

For the hypothetical A/B test considered in this section, the null and alternative hypotheses are as follows:

$$H_0: p_{new} - p_{old} \leq 0$$ $$H_1: p_{new} - p_{old} > 0$$

Or, verbally:

$H_{0}$: The likelihood of conversion for a user receiving the new page is less than or equal to the likelihood of conversion for a user receiving the old page.

$H_{1}$: The likelihood of conversion for a user receiving the new page is greater than the likelihood of conversion for a user receiving the old page.

For the next section, assume the null hypothesis. That is, assume $p_{new}$ and $p_{old}$ both have "true" success rates equal to the converted success rate regardless of page.

What are the conversion rates $p_{new}$ and $p_{old}$ under the null?

In [16]:
p_new = users.converted.mean()
p_old = users.converted.mean()
print('p_new = {:.2f}%'.format(p_new*100))
print('p_old = {:.2f}%'.format(p_old*100))
p_new = 11.96%
p_old = 11.96%

Also, use sample sizes for each page equal to those in ab_data.csv.

What are $n_{new}$ and $n_{old}$?

In [17]:
n_new = users[users.landing_page == 'new_page'].user_id.count()
n_old = users[users.landing_page == 'old_page'].user_id.count()
print('n_new = {}'.format(n_new))
print('n_old = {}'.format(n_old))
n_new = 145310
n_old = 145274

Next, construct the sampling distribution for the difference in converted between the two pages by calculating 10,000 simulated estimates under the null.

Simulate $n_{new}$ transactions with a conversion rate of $p_{new}$ under the null. Store these $n_{new}$ 1's and 0's in new_page_converted.

Simulate $n_{old}$ transactions with a conversion rate of $p_{old}$ under the null. Store these $n_{old}$ 1's and 0's in old_page_converted.

In [18]:
new_page_converted = \
    np.random.choice([0, 1], size=n_new, p=[(1-p_new), p_new])
In [19]:
old_page_converted = \
    np.random.choice([0, 1], size=n_old, p=[(1-p_old), p_old])

Next, find $p_{new}$ - $p_{old}$ for those simulated values.

In [20]:
new_page_converted.mean() - old_page_converted.mean()
Out[20]:
0.00030752980994586121

Next, simulate 10,000 $p_{new}$ - $p_{old}$ values using this same process, and store all 10,000 values in a numpy array called p_diffs. Rather than looping, the cell below draws 10,000 binomial counts at once and divides by the sample sizes, which is equivalent and much faster.

In [21]:
new_converted_simulation = \
    np.random.binomial(n_new, p_new, 10000)/n_new
old_converted_simulation = \
    np.random.binomial(n_old, p_old, 10000)/n_old
p_diffs = new_converted_simulation - old_converted_simulation
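
For reference, the vectorized binomial draws above are equivalent to repeating the single-draw approach of the two previous cells 10,000 times. A minimal loop-based sketch of that same process (much slower, but conceptually identical):

p_diffs_loop = []
for _ in range(10000):
    # Redraw both groups under the null and record the difference
    # in sample conversion rates.
    new_sample = np.random.choice([0, 1], size=n_new, p=[1-p_new, p_new])
    old_sample = np.random.choice([0, 1], size=n_old, p=[1-p_old, p_old])
    p_diffs_loop.append(new_sample.mean() - old_sample.mean())
p_diffs_loop = np.array(p_diffs_loop)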

Plot a histogram of the p_diffs.

This boils down to a computation of the "spread" of the data, assuming that the probability of converting a given user is the same whether they see the treatment page or the control page.

In [22]:
obs_diff = treatment_conv - control_conv

plt.hist(p_diffs);
plt.axvline(x=obs_diff, color='red');

What proportion of the p_diffs are greater than the actual difference observed in ab_data.csv?

In [23]:
p_value = (np.array(p_diffs) > obs_diff).mean()
print('p_value = {:.3f}'.format(p_value))
p_value = 0.906

The histogram plotted above shows the sampling distribution under the null hypothesis, namely, that the conversion rate of the treatment group equals the conversion rate of the control group. The foregoing cell calculates what proportion of the simulated conversion rate differences is greater than the actual observed difference, which was calculated from the conversion data. The name given to the proportion of values in the null distribution that are greater than the observed difference is the p-value.

A low p-value (specifically, less than our alpha of 0.05) indicates that the null hypothesis is unlikely to be true. Since the p-value here is very large, at roughly 0.90, the observed statistic is entirely consistent with the null distribution, and we fail to reject the null hypothesis. Ultimately, this indicates that it would be best for the hypothetical e-commerce company to keep the current page.
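
In code, the decision rule amounts to a simple comparison of the p-value against alpha (a minimal sketch, using the alpha of 0.05 stated above):

alpha = 0.05
if p_value < alpha:
    print('Reject the null: evidence the new page converts better.')
else:
    print('Fail to reject the null: keep the old page.')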

Built-In Stats Model

I could also use a built-in statsmodels method to achieve similar results. Though the built-in might be easier to code, the portions above walk through the ideas that are critical to correctly thinking about statistical significance.

convert_old and convert_new refer to the number of conversions for each page.

In [24]:
convert_old = users[users['group'] == 'control'].converted.sum()
convert_new = users[users['group'] == 'treatment'].converted.sum()
In [25]:
z_score, p_value = \
    sm.stats.proportions_ztest([convert_new, convert_old],\
                               [n_new, n_old],\
                               alternative='larger')

print('Z-score critical value (95% confidence, one-tailed) to \n'\
      + '    reject the null: '\
      + str(norm.ppf(1-0.05)))
    
print('z_score = ' + str(z_score))
print('p_value = ' + str(p_value))
Z-score critical value (95% confidence, one-tailed) to 
    reject the null: 1.64485362695
z_score = -1.31092419842
p_value = 0.905058312759

Since the z-score of -1.31 does not exceed the one-tailed critical value of 1.645, I fail to reject the null hypothesis, namely, that the conversion rate of the new page is less than or equal to the conversion rate of the old page.

Additionally, since the p_value of 0.90 (approximately the same value as was calculated manually) is larger than the alpha value of 0.05, we fail to reject the null hypothesis.

Thus, for both the foregoing reasons, the built-in method leads to the same conclusion as the manual method.
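
As a cross-check, the one-tailed p-value can be recovered directly from the z-score: with alternative='larger', it is the upper-tail probability of the z-score under the standard normal (a quick sketch using the norm import from In [1]):

# Survival function, i.e., 1 - norm.cdf(z_score).
print(norm.sf(z_score))  # ~0.905, matching proportions_ztest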

Part 3 - Regression Approach

This final part demonstrates that the result achieved in the previous A/B test can also be achieved by performing regression.

Logistic regression is the type of regression that should be performed in this situation, since each row is either a conversion or no conversion.
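
Concretely, the model fit below estimates the log-odds of conversion as a linear function of the page indicator (the standard logit formulation, where $\beta_1$ captures the effect of receiving the new page):

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{ab\_page}$$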

In this section, I utilize statsmodels to fit a logistic regression model to see if there is a significant difference in conversion depending on which page a customer receives.

First, I add a column for the intercept, called intercept. Then I create a dummy variable column for which page each user received. That column is called ab_page. It is 1 when a user receives the treatment page and 0 when the user receives the control page.

In [26]:
users['intercept'] = 1

users[['drop', 'ab_page']] = pd.get_dummies(users['group'])
users = users.drop(['drop'], axis=1)
users.head()
Out[26]:
   user_id                   timestamp      group landing_page  converted  intercept  ab_page
0   851104  2017-01-21 22:11:48.556739    control     old_page          0          1        0
1   804228  2017-01-12 08:01:45.159739    control     old_page          0          1        0
2   661590  2017-01-11 16:55:06.154213  treatment     new_page          0          1        1
3   853541  2017-01-08 18:28:03.143765  treatment     new_page          0          1        1
4   864975  2017-01-21 01:52:26.210827    control     old_page          1          1        0

The following line instantiates the model.

In [27]:
logit_mod = sm.Logit(users['converted'], users[['intercept', 'ab_page']])

The following line fits the model using the intercept and ab_page columns. The model predicts whether or not a user converts.

In [28]:
results = logit_mod.fit()
Optimization terminated successfully.
         Current function value: 0.366118
         Iterations 6

The following line prints a summary of the model.

In [29]:
results.summary()
Out[29]:
                           Logit Regression Results
==============================================================================
Dep. Variable:              converted   No. Observations:               290584
Model:                          Logit   Df Residuals:                   290582
Method:                           MLE   Df Model:                            1
Date:                Tue, 08 May 2018   Pseudo R-squ.:               8.077e-06
Time:                        19:37:21   Log-Likelihood:            -1.0639e+05
converged:                       True   LL-Null:                   -1.0639e+05
                                        LLR p-value:                    0.1899
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     -1.9888      0.008   -246.669      0.000      -2.005      -1.973
ab_page       -0.0150      0.011     -1.311      0.190      -0.037       0.007
==============================================================================
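
As with the country coefficients later in this section, the ab_page coefficient can be read as an odds ratio by exponentiating it (a quick sketch using the rounded coefficient from the summary above):

print(np.exp(-0.0150))  # ~0.985: treatment users have ~1.5% lower odds of converting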

The p-value associated with ab_page in this regression model is 0.19. The p-value that was returned from the built-in ztest method was ~0.90. The p-value that I calculated manually was also ~0.90.

The null hypothesis associated with the ab_page coefficient in this logistic regression is that the coefficient is zero, meaning there is no relationship between which page a user is shown and whether the user converts. The alternative hypothesis is therefore two-sided: there is a relationship of some sort, in either direction.

The null hypothesis from part 2 is that the likelihood of conversion for a user receiving the new page is less than or equal to the likelihood of conversion for a user receiving the old page. The alternative hypothesis from part 2 is that the likelihood of conversion for a user receiving the new page is greater than the likelihood of conversion for a user receiving the old page.

The factor that accounts for the large difference in the p-values is that part 2 used a one-tailed test, hypothesizing that one specific page (the new_page received by the treatment group) would lead to more conversions than the other. This is different from the two-sided hypotheses of part 3, which merely predicted a difference of some sort.
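
To make that connection explicit: because the ab_page coefficient is negative (the direction opposite the part 2 alternative), the regression's two-tailed p-value converts to the one-tailed p-value as one minus half its value (a quick sketch using the rounded value from the summary):

p_two_tailed = 0.190  # ab_page p-value from the regression summary
print(1 - p_two_tailed/2)  # ~0.905, consistent with part 2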

Including additional factors may make the model more predictive, yielding greater understanding. It may also produce business insights that would not have been evident in this simpler analysis. For example, it would be possible to serve different versions of the website in different locations, and it is plausible that people from different countries have different tastes in website layout.

Possible disadvantages of additional factors include an increased risk of human error, especially misinterpretation, as well as the possibility of obscuring the real signal in the data (decreasing the so-called signal-to-noise ratio).

Additional Factors

Next, I add an additional factor for the country in which a user lives. I read in an additional file called countries.csv and merge it with the users dataframe.

Finally, I generate dummy variables for these country columns.

In [30]:
countries_df = pd.read_csv('countries.csv')
users_country = countries_df.set_index('user_id')\
    .join(users.set_index('user_id'), how='inner')
In [31]:
users_country.head()
Out[31]:
        country                   timestamp      group landing_page  converted  intercept  ab_page
user_id
834778       UK  2017-01-14 23:08:43.304998    control     old_page          0          1        0
928468       US  2017-01-23 14:44:16.387854  treatment     new_page          0          1        1
822059       UK  2017-01-16 14:04:14.719771  treatment     new_page          1          1        1
711597       UK  2017-01-22 03:14:24.763511    control     old_page          0          1        0
710616       UK  2017-01-16 13:14:44.000513  treatment     new_page          0          1        1
In [32]:
users_country[['CA', 'UK', 'US']] = \
    pd.get_dummies(users_country['country'])
In [33]:
logit_mod_country = sm.Logit(users_country['converted'],\
    users_country[['intercept', 'ab_page', 'US', 'UK']])
results_country = logit_mod_country.fit()
results_country.summary()
Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6
Out[33]:
                           Logit Regression Results
==============================================================================
Dep. Variable:              converted   No. Observations:               290584
Model:                          Logit   Df Residuals:                   290580
Method:                           MLE   Df Model:                            3
Date:                Tue, 08 May 2018   Pseudo R-squ.:               2.323e-05
Time:                        19:37:22   Log-Likelihood:            -1.0639e+05
converged:                       True   LL-Null:                   -1.0639e+05
                                        LLR p-value:                    0.1760
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     -2.0300      0.027    -76.249      0.000      -2.082      -1.978
ab_page       -0.0149      0.011     -1.307      0.191      -0.037       0.007
US             0.0408      0.027      1.516      0.130      -0.012       0.093
UK             0.0506      0.028      1.784      0.074      -0.005       0.106
==============================================================================

Because this is logistic regression, the coefficients must be exponentiated to be interpreted: each exponentiated coefficient gives the multiplicative change in the odds of conversion.

In [34]:
US_coeff = 0.0408
UK_coeff = 0.0506

print('Exponentiating the US coefficient {} yields {:.3f}'.\
     format(US_coeff, np.exp(US_coeff)))
print('Exponentiating the UK coefficient {} yields {:.3f}'.\
     format(UK_coeff, np.exp(UK_coeff)))
Exponentiating the US coefficient 0.0408 yields 1.042
Exponentiating the UK coefficient 0.0506 yields 1.052

The interpretation of the foregoing variables is somewhat counterintuitive. Canada is the baseline, since it is the one of the three country dummies that was not included in the regression. We would say that US users have 1.04 times the odds (roughly 4% higher odds) of converting compared to Canadian users. Similarly, UK users have 1.05 times the odds (roughly 5% higher odds) of converting compared to Canadian users.
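
To ground these odds ratios, a quick sketch converting the fitted log-odds back to conversion probabilities (using the rounded coefficients from the summary above; US_coeff and UK_coeff are defined in In [34]):

def logit_to_prob(log_odds):
    # Inverse logit: convert log-odds to a probability.
    return 1 / (1 + np.exp(-log_odds))

intercept = -2.0300
print('CA (baseline): {:.3f}'.format(logit_to_prob(intercept)))             # ~0.116
print('US:            {:.3f}'.format(logit_to_prob(intercept + US_coeff)))  # ~0.120
print('UK:            {:.3f}'.format(logit_to_prob(intercept + UK_coeff)))  # ~0.121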

These effects are not statistically significant, given the fairly large p-values. Even if they were, it is not clear that such small differences between countries would be practically significant.

Interaction Terms

Having examined the impact of the individual factors of country and page on conversion, I turn my attention to the interaction between page and country.

These new columns indicate whether a given user both received the new page and lived in the indicated country.

In [35]:
users_country['CA_trt'] = users_country['ab_page'] * users_country['CA']
users_country['UK_trt'] = users_country['ab_page'] * users_country['UK']
users_country['US_trt'] = users_country['ab_page'] * users_country['US']
In [36]:
users_country.head()
Out[36]:
        country                   timestamp      group landing_page  converted  intercept  ab_page  CA  UK  US  CA_trt  UK_trt  US_trt
user_id
834778       UK  2017-01-14 23:08:43.304998    control     old_page          0          1        0   0   1   0       0       0       0
928468       US  2017-01-23 14:44:16.387854  treatment     new_page          0          1        1   0   0   1       0       0       1
822059       UK  2017-01-16 14:04:14.719771  treatment     new_page          1          1        1   0   1   0       0       1       0
711597       UK  2017-01-22 03:14:24.763511    control     old_page          0          1        0   0   1   0       0       0       0
710616       UK  2017-01-16 13:14:44.000513  treatment     new_page          0          1        1   0   1   0       0       1       0
In [37]:
users_country[['country', 'group', 'CA', 'UK', 'US', \
               'CA_trt', 'UK_trt', 'US_trt']].head()
Out[37]:
        country      group  CA  UK  US  CA_trt  UK_trt  US_trt
user_id
834778       UK    control   0   1   0       0       0       0
928468       US  treatment   0   0   1       0       0       1
822059       UK  treatment   0   1   0       0       1       0
711597       UK    control   0   1   0       0       0       0
710616       UK  treatment   0   1   0       0       1       0
In [38]:
log_mod2 = sm.Logit(users_country['converted'], \
    users_country[['intercept', 'ab_page', \
                  'US', 'US_trt', 'UK', 'UK_trt']])
results_log2 = log_mod2.fit()
results_log2.summary()
Optimization terminated successfully.
         Current function value: 0.366109
         Iterations 6
Out[38]:
                           Logit Regression Results
==============================================================================
Dep. Variable:              converted   No. Observations:               290584
Model:                          Logit   Df Residuals:                   290578
Method:                           MLE   Df Model:                            5
Date:                Tue, 08 May 2018   Pseudo R-squ.:               3.482e-05
Time:                        19:37:23   Log-Likelihood:            -1.0639e+05
converged:                       True   LL-Null:                   -1.0639e+05
                                        LLR p-value:                    0.1920
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     -2.0040      0.036    -55.008      0.000      -2.075      -1.933
ab_page       -0.0674      0.052     -1.297      0.195      -0.169       0.034
US             0.0175      0.038      0.465      0.642      -0.056       0.091
US_trt         0.0469      0.054      0.872      0.383      -0.059       0.152
UK             0.0118      0.040      0.296      0.767      -0.066       0.090
UK_trt         0.0783      0.057      1.378      0.168      -0.033       0.190
==============================================================================

The foregoing presents the logistic regression results with the interaction terms included. The interaction terms are not especially predictive of conversion, given the high p-values shown in the results.
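
For scale, even the largest interaction coefficient (UK_trt, from the summary above) implies only a modest shift in odds, and its p-value of 0.168 is well above alpha:

print(np.exp(0.0783))  # ~1.081: ~8% higher odds for UK treatment users, not significant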

Conclusions

Somewhat anticlimactically, the conclusion of this study has to be that the new page does not have a statistically significant impact on user conversions.

Absent additional testing providing information to the contrary, the e-commerce company should keep the old web page.