Ryan Wingate

Completed: February 12, 2018

A/B tests are very commonly performed by data analysts and data scientists.

For this project, I work to understand the results of a hypothetical A/B test run by an e-commerce website. The goal is to determine whether the company should implement the new page, keep the old page, or gather more information by running the experiment longer.

In [1]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import norm
%matplotlib inline
np.random.seed(42)  # the simulations below use numpy's RNG, so seed numpy
```

In [2]:

```
df = pd.read_csv('ab_data.csv')
df.head()
```

Out[2]:

In [3]:

```
print('Row count = ' + str(df.shape[0]))
```

In [4]:

```
unique_users = df.user_id.nunique()
print('Unique users in the dataset = ' + str(unique_users))
```

In [5]:

```
un_conv_users = df[df.converted == 1].user_id.nunique()
un_users_conv_prop = un_conv_users / unique_users
print('unique converted user count = {}'.format(un_conv_users))
print(' % unique users converted = {:.1f}'.format(un_users_conv_prop * 100))
```

In [6]:

```
df.isnull().sum()
```

Out[6]:

There are no null values in the dataframe.

Next, determine the number of times `new_page` and `treatment` don't coincide.

In [7]:

```
df.groupby(['group', 'landing_page']).count()
```

Out[7]:

In [8]:

```
user_count_treatment_and_old = df[(df.group == 'treatment') &
                                  (df.landing_page == 'old_page')].shape[0]
user_count_control_and_new = df[(df.group == 'control') &
                                (df.landing_page == 'new_page')].shape[0]
print('# times treatment group receives old_page: {}'
      .format(user_count_treatment_and_old))
print(' # times control group receives new_page: {}'
      .format(user_count_control_and_new))
```

Drop the rows where `treatment` is not aligned with `new_page` or `control` is not aligned with `old_page`, because I cannot be sure what page these users received.

In [9]:

```
drop_index = df[((df.group == 'treatment') &
                 (df.landing_page == 'old_page')) |
                ((df.group == 'control') &
                 (df.landing_page == 'new_page'))].index
users = df.drop(drop_index)
```

In [10]:

```
user_count_treatment_and_old = users[(users.group == 'treatment') &
                                     (users.landing_page == 'old_page')].shape[0]
user_count_control_and_new = users[(users.group == 'control') &
                                   (users.landing_page == 'new_page')].shape[0]
print('# times treatment group receives old_page: {}'
      .format(user_count_treatment_and_old))
print(' # times control group receives new_page: {}'
      .format(user_count_control_and_new))
```

In [11]:

```
print(' Unique user count = {}'.format(users.user_id.nunique()))
print('Total user entries = {}'.format(users.shape[0]))
```

Eliminate any duplicate users.

In [12]:

```
users[users.duplicated(subset='user_id', keep=False)]
```

Out[12]:

In [13]:

```
users = users.drop(
    users[users.duplicated(subset='user_id', keep='last')].index)
```
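
For what it's worth, pandas' built-in `drop_duplicates` would achieve the same result in a single call; a minimal equivalent, assuming the `users` dataframe above:

```
# Keep only the last occurrence of each user_id
users = users.drop_duplicates(subset='user_id', keep='last')
```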

In [14]:

```
users[users.duplicated(subset='user_id', keep=False)]
```

Out[14]:

In [15]:

```
print('Probability of an individual converting \n'
      ' regardless of the page they receive: {:.2f}%'
      .format(users.converted.mean() * 100))
control_conv = users[users['group'] == 'control'].converted.mean()
print('Given that an individual was in the \n'
      ' control group, the probability they convert: {:.2f}%'
      .format(control_conv * 100))
treatment_conv = users[users['group'] == 'treatment'].converted.mean()
print('Given that an individual was in the \n'
      ' treatment group, the probability they convert: {:.2f}%'
      .format(treatment_conv * 100))
new_page_prop = users[users.landing_page == 'new_page'].shape[0] / users.shape[0]
print('Probability that an individual received the new page: {:.2f}%'
      .format(new_page_prop * 100))
```

At this point of the analysis, there does not appear to be sufficient evidence to conclude that the new treatment page produces more conversions than the current control page. The probability that a user who receives the treatment page will convert is actually slightly less than the probability that a user who receives the control page will convert.

Because there is a time stamp associated with each user event, it is theoretically possible that a hypothesis test, similar to this one, could be run continuously in real time.

In this hypothetical situation, the hard questions become when to stop the test and how to reach a decision: As soon as one page is considered significantly better than the other? Once a certain page is better than another for a certain period of time? How long must the experiment be run before considering the test inconclusive? How much evidence is sufficient?

These are the difficult questions associated with A/B tests in general.

For the hypothetical A/B test considered in this section, the null and alternative hypotheses are as follows:

$$H_0: p_{new} - p_{old} \leq 0$$

$$H_1: p_{new} - p_{old} > 0$$

Or, verbally:

$H_{0}$: The likelihood of conversion for a user receiving the new page is less than or equal to the likelihood of conversion for a user receiving the old page.

$H_{1}$: The likelihood of conversion for a user receiving the new page is greater than the likelihood of conversion for a user receiving the old page.

For the next section, assume the null hypothesis. That is, assume $p_{new}$ and $p_{old}$ both have "true" success rates equal to the **converted** success rate regardless of page.

What are the **conversion rates** $p_{new}$ and $p_{old}$ under the null?

In [16]:

```
# Under the null, both pages share the overall observed conversion rate
p_new = users.converted.mean()
p_old = users.converted.mean()
print('p_new = {:.2f}%'.format(p_new*100))
print('p_old = {:.2f}%'.format(p_old*100))
```

Also, use a sample size for each page equal to the ones in **ab_data.csv**.

What are $n_{new}$ and $n_{old}$?

In [17]:

```
n_new = users[users.landing_page == 'new_page'].user_id.count()
n_old = users[users.landing_page == 'old_page'].user_id.count()
print('n_new = {}'.format(n_new))
print('n_old = {}'.format(n_old))
```

Next, simulate the sampling distribution for the difference in **converted** rates between the two pages, using 10,000 iterations of estimating each rate under the null.

Simulate $n_{new}$ transactions with a convert rate of $p_{new}$ under the null. Store these $n_{new}$ 1's and 0's in **new_page_converted**.

Simulate $n_{old}$ transactions with a convert rate of $p_{old}$ under the null. Store these $n_{old}$ 1's and 0's in **old_page_converted**.

In [18]:

```
new_page_converted = np.random.choice([0, 1], size=n_new, p=[1 - p_new, p_new])
```

In [19]:

```
old_page_converted = np.random.choice([0, 1], size=n_old, p=[1 - p_old, p_old])
```

Next, find $p_{new}$ - $p_{old}$ for those simulated values.

In [20]:

```
new_page_converted.mean() - old_page_converted.mean()
```

Out[20]:

Next, simulate 10,000 $p_{new}$ - $p_{old}$ values using this same process. Store all 10,000 values in a numpy array called **p_diffs**.

In [21]:

```
new_converted_simulation = np.random.binomial(n_new, p_new, 10000) / n_new
old_converted_simulation = np.random.binomial(n_old, p_old, 10000) / n_old
p_diffs = new_converted_simulation - old_converted_simulation
```

Plot a histogram of the **p_diffs**.

This amounts to visualizing the spread of the null distribution, under the assumption that the probability of converting a given user is the same whether they see the treatment page or the control page.

In [22]:

```
# Observed difference in conversion rates, marked in red on the histogram
obs_diff = treatment_conv - control_conv
plt.hist(p_diffs);
plt.axvline(x=obs_diff, color='red');
```

What proportion of the **p_diffs** are greater than the actual difference observed in **ab_data.csv**?

In [23]:

```
p_value = (p_diffs > obs_diff).mean()
print('p_value = {:.3f}'.format(p_value))
```

The histogram plotted above shows the sampling distribution under the null hypothesis, namely, that the conversion rate of the control group is equal to the conversion rate of the treatment group. The computation that follows it finds the proportion of simulated conversion-rate differences greater than the actual difference observed in the data. The special name given to this proportion of values in the null distribution that exceed the observed difference is the "p-value."

A low p-value (specifically, one below our alpha of 0.05) would indicate that the observed difference is unlikely under the null hypothesis. Since the p-value here is very large, at roughly 0.90, our statistic is entirely consistent with the null, and we therefore fail to reject the null hypothesis. Ultimately, this indicates that it would be best for the hypothetical e-commerce company to keep the current page.

I could also use a built-in function, statsmodels' `proportions_ztest`, to achieve similar results. Though the built-in is easier to code, the portions above walk through the ideas that are critical to thinking correctly about statistical significance.

`convert_old` and `convert_new` refer to the number of conversions for each page.

In [24]:

```
convert_old = users[users['group'] == 'control'].converted.sum()
convert_new = users[users['group'] == 'treatment'].converted.sum()
```

In [25]:

```
# One-sided two-proportion z-test: the alternative is p_new > p_old
z_score, p_value = sm.stats.proportions_ztest([convert_new, convert_old],
                                              [n_new, n_old],
                                              alternative='larger')
print('Z-score critical value (95% confidence, one-sided) to \n'
      ' reject the null: ' + str(norm.ppf(1 - 0.05)))
print('z_score = ' + str(z_score))
print('p_value = ' + str(p_value))
```
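
As a sanity check, the same statistic can be computed by hand from the pooled conversion rate. This is a minimal sketch, reusing the `convert_new`, `convert_old`, `n_new`, and `n_old` variables defined above; it should reproduce the z-score and p-value, since the two-sample `proportions_ztest` uses the pooled proportion by default:

```
# Pooled two-proportion z-test, one-sided ('larger') alternative
p_pooled = (convert_new + convert_old) / (n_new + n_old)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_new + 1 / n_old))
z_manual = (convert_new / n_new - convert_old / n_old) / se
p_manual = 1 - norm.cdf(z_manual)
print('manual z_score = {:.2f}'.format(z_manual))
print('manual p_value = {:.3f}'.format(p_manual))
```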

Since the z-score of about -1.31 falls below the one-sided critical value of 1.645, I fail to reject the null hypothesis, namely, that the treatment group's conversion rate is no greater than the control group's.

Additionally, since the p_value of 0.90 (note, approximately the same value as was calculated manually) is larger than the alpha value of 0.05, we fail to reject the null hypothesis.

Thus, for both the foregoing reasons, the built-in method leads to the same conclusion as the manual method.

This final part demonstrates that the result achieved in the previous A/B test can also be achieved by performing regression.

Logistic regression is the appropriate type of regression in this situation, since the outcome in each row is binary: conversion or no conversion.
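
For reference, the model being fit has the standard logistic form (the notation here is mine: $b_0$ is the intercept and $b_1$ is the coefficient on **ab_page**):

$$P(\mathrm{converted} = 1) = \frac{1}{1 + e^{-(b_0 + b_1 \cdot \mathrm{ab\_page})}}$$

Exponentiating $b_1$ therefore yields the odds ratio for the treatment page relative to the control page, a point that becomes relevant when interpreting the coefficients below.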

In this section, I utilize **statsmodels** to fit a logistic regression model to see if there is a significant difference in conversion depending on which page a customer receives.

First, I add a column for the intercept, called **intercept**. Then I create a dummy variable column for which page each user received. That column is called **ab_page**. It is 1 when a user receives **treatment** and 0 if the user receives **control**.

In [26]:

```
users['intercept'] = 1
# get_dummies returns columns in alphabetical order ('control', 'treatment'),
# so the first column is a throwaway and the second becomes ab_page
users[['drop', 'ab_page']] = pd.get_dummies(users['group'])
users = users.drop(['drop'], axis=1)
users.head()
```

Out[26]:
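
Incidentally, the same dummy column could be built directly, without relying on the column order that `pd.get_dummies` happens to return; a minimal sketch, assuming the same `users` dataframe:

```
# Direct construction of the treatment indicator
users['ab_page'] = (users['group'] == 'treatment').astype(int)
```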

The following line instantiates the model.

In [27]:

```
logit_mod = sm.Logit(users['converted'], users[['intercept', 'ab_page']])
```

The following line fits the model using the **intercept** and **ab_page** columns. The model predicts whether or not a user converts.

In [28]:

```
results = logit_mod.fit()
```

The following line prints a summary of the model.

In [29]:

```
results.summary()
```

Out[29]:

The p-value associated with ab_page in this regression model is 0.19. The p-value that was returned from the built-in ztest method was ~0.90. The p-value that I calculated manually was also ~0.90.

The null hypothesis associated with the **ab_page** coefficient in this logistic regression is that the coefficient is zero, that is, that there is no relationship between which page a user is shown and the conversion rate. The alternative hypothesis is therefore that there is a relationship of some sort.

The null hypothesis from part 2 is that the likelihood of conversion for a user receiving the new page is less than or equal to the likelihood of conversion for a user receiving the old page. The alternative hypothesis from part 2 is that the likelihood of conversion for a user receiving the new page is greater than the likelihood of conversion for a user receiving the old page.

The factor that accounts for the large difference in the p-values is that part 2 hypothesized a direction: the new page received by the treatment group would lead to more conversions than the old page, making it a one-sided test. This is different from the hypotheses of part 3, which merely predict a difference of some sort in either direction, making the regression test two-sided.
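
Because the z-distribution is symmetric, the one-sided and two-sided p-values are directly related; a quick check using the approximate one-sided p-value from above:

```
# Two-sided p-value implied by the one-sided p-value of ~0.905
p_one_sided = 0.905
p_two_sided = 2 * (1 - p_one_sided)
print('implied two-sided p-value = {:.2f}'.format(p_two_sided))  # ~0.19
```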

Including additional factors may make the model more predictive, yielding greater understanding. It may also surface business insights that would not have been evident in this simpler analysis. For example, it would be possible to serve different versions of the website in different locations, since people in different countries may well have different tastes in website layout.

Possible disadvantages of additional factors include increased risk of human error, especially misinterpretation, as well as possibly obscuring the message the data is really trying to tell (decreasing the so-called signal-to-noise ratio).

Next, I add an additional factor for the country in which a user lives. I read in an additional file called **countries.csv** and merge it with the users dataframe.


Finally, I generate dummy variables for these country columns.

In [30]:

```
countries_df = pd.read_csv('countries.csv')
users_country = (countries_df.set_index('user_id')
                 .join(users.set_index('user_id'), how='inner'))
```

In [31]:

```
users_country.head()
```

Out[31]:

In [32]:

```
users_country[['CA', 'UK', 'US']] = pd.get_dummies(users_country['country'])
```

In [33]:

```
logit_mod_country = sm.Logit(users_country['converted'],
                             users_country[['intercept', 'ab_page', 'US', 'UK']])
results_country = logit_mod_country.fit()
results_country.summary()
```

Out[33]:

It is necessary to exponentiate these coefficients to interpret them, since logistic regression fits coefficients on the log-odds scale; the exponentiated coefficients are odds ratios.

In [34]:

```
# Coefficients transcribed from the regression summary above
US_coeff = 0.0408
UK_coeff = 0.0506
print('Exponentiating the US coefficient {} yields {:.3f}'
      .format(US_coeff, np.exp(US_coeff)))
print('Exponentiating the UK coefficient {} yields {:.3f}'
      .format(UK_coeff, np.exp(UK_coeff)))
```
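
Rather than transcribing coefficients from the summary table by hand, the fitted parameters can also be exponentiated directly from the results object; a one-line sketch, assuming the `results_country` object fit above:

```
# Odds ratios for all fitted coefficients
np.exp(results_country.params)
```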

The interpretation of the foregoing variables is somewhat counterintuitive. In this case, Canada is the baseline, since it is the one country of the three that was not included in the regression. We would say that the odds of conversion for US users are 1.04 times the odds for Canadian users (about 4% higher). Similarly, the odds of conversion for UK users are 1.05 times the odds for Canadian users (about 5% higher).

The effect is not statistically significant, given the fairly large p-values. Even if it were, it is not clear that such a small difference between countries would be practically significant.

Having examined the impact of the individual factors of country and page on conversion, I turn my attention to the interaction between page and country.

These new columns indicate whether a given user both received the new page and lived in the indicated country.

The summary results, and the conclusions based on them, follow.

In [35]:

```
# Interaction terms: treatment indicator times each country dummy
users_country['CA_trt'] = users_country['ab_page'] * users_country['CA']
users_country['UK_trt'] = users_country['ab_page'] * users_country['UK']
users_country['US_trt'] = users_country['ab_page'] * users_country['US']
```

In [36]:

```
users_country.head()
```

Out[36]:

In [37]:

```
users_country[['country', 'group', 'CA', 'UK', 'US',
               'CA_trt', 'UK_trt', 'US_trt']].head()
```

Out[37]:

In [38]:

```
log_mod2 = sm.Logit(users_country['converted'],
                    users_country[['intercept', 'ab_page',
                                   'US', 'US_trt', 'UK', 'UK_trt']])
results_log2 = log_mod2.fit()
results_log2.summary()
```

Out[38]:

The foregoing presents the logistic regression with interaction terms included. The interaction terms are not especially predictive of conversion, given the high p-values shown in the results.

Somewhat anticlimactically, the conclusion of this study has to be that the new page does not have a statistically significant impact on user conversions.

Absent additional testing providing information to the contrary, the e-commerce company should keep the old web page.