In this section, we discuss a few more issues we examined while designing Etsy’s solutions to peeking.
We found that when using a p-value curve tuned for a 5% false positive rate, our early stopping threshold does not materially increase the false positive rate, and we can be confident of a directional change.
Previously in our A/B testing tool UI, we displayed statistical data as shown in the table below on the left side. The “observed” column indicates results for the control, and there is a “% Change” column for each treatment variant. When hovering over a number in the “% Change” column, a popover table appears, showing the observed and actual effect size, confidence level, p-value, and the number of days we could expect to need in order to have enough data to power the experiment based on our expected effect size.
Now, you might already have guessed that the simplest way to solve the problem is to fix a sample size in advance and run the experiment to the end before checking the significance level. However, this requires strictly enforced separation between the design and analysis of experiments, which can have large repercussions throughout the experimental process. In the early stages of an experiment, we might miss a bug in the setup or in the feature being tested that will invalidate our results later. If we don’t catch these early, our experimental process slows down unnecessarily, leaving less time for iterations and real site changes. Another problem is that it can be hard to predict the effect size product teams would like to obtain before the experiment, which makes it difficult to optimize the sample size ahead of time. Even supposing we set up our experiment perfectly, there are down-the-line implications. When an experiment is impacting a metric negatively, we want to know as soon as possible so that we do not harm our users’ experience. These considerations become even more pronounced when we are running an experiment on a small population or in a less trafficked part of the site, where it can take months to reach the target sample size. Across teams, we want to be able to iterate quickly without sacrificing the integrity of our results.
This work is a collaboration between Callie McRee and Kelly Shen from the Analytics and Analytics Engineering teams. We’d like to thank Gerald van den Berg, Emily Robinson, Evan D’Agostini, Anastasia Erbe, Mossab Alsadig, Lushi Li, Allison McKnight, Alexandra Pappas, David Schott and Robert Xu.
 Display a visualization of the C.I. and colour the bar red when the C.I. is entirely negative to indicate a significant decrease, green when the C.I. is entirely positive to indicate a significant increase, and grey when the C.I. spans 0.
 Display different messages in the “% Change” column and hover table to indicate the different stages the experiment metric is currently in, depending on its power, p-value and calculated flexible p-value threshold. In the “% Change” column, possible messages include “Waiting on info”, “Not enough data”, “No change” and “+/- X%” (to show a significant increase/decrease). In the hover table, possible headers include “metric isn’t powered”, “there is no detectable change”, “we’re confident we detected a change”, and “directional change is correct but magnitude might be inflated” when early stopping is reached but the metric isn’t powered yet.
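The colouring rule for the C.I. bar is simple enough to write down directly. The following is a minimal sketch, not our production UI code; the function name `ci_bar_color` is hypothetical, and treating a bound that sits exactly on 0 as "spanning 0" is an assumption.

```python
def ci_bar_color(ci_lower, ci_upper):
    """Pick the bar colour for a confidence interval on the % change."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if ci_upper < 0:
        return "red"    # entirely negative: significant decrease
    if ci_lower > 0:
        return "green"  # entirely positive: significant increase
    return "grey"       # interval spans 0 (assumption: touching 0 counts as spanning)


if __name__ == "__main__":
    for interval in [(-3.1, -0.4), (0.2, 2.5), (-0.8, 1.9)]:
        print(interval, ci_bar_color(*interval))
```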
One of the downfalls of stopping experiments early, however, is that with an effect size under ~5%, we tend to overestimate the impact and widen the confidence interval. We developed a formula to apply to experiments that we opt to end early, so that we can attribute increases in metrics to experiment wins more accurately. Furthermore, we offset some of these differences by setting a standard of running experiments for at least 7 days to account for weekend and weekday trends.
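To see why early stopping inflates the measured effect, here is a small simulation sketch. It is an illustration under stated assumptions, not our analysis code: it uses the publicly described rule of stopping when the treatment leads the control by `2 * sqrt(N)` converted visits, assumes equal traffic in both variants, and the name `run_sequential` and all parameter values are hypothetical.

```python
import random
from math import sqrt


def run_sequential(true_lift, max_conversions, rng):
    """Simulate converted visits one at a time.

    Returns the observed relative lift if the experiment stops early,
    or None if it runs to max_conversions without crossing the threshold.
    """
    threshold = 2 * sqrt(max_conversions)
    # With equal traffic, each new conversion belongs to treatment with this probability.
    p_treatment = (1 + true_lift) / (2 + true_lift)
    conv_t = conv_c = 0
    while conv_t + conv_c < max_conversions:
        if rng.random() < p_treatment:
            conv_t += 1
        else:
            conv_c += 1
        if conv_t - conv_c >= threshold:
            # max(..., 1) only guards the (vanishingly rare) zero-control case.
            return conv_t / max(conv_c, 1) - 1
    return None


if __name__ == "__main__":
    rng = random.Random(7)
    results = [run_sequential(0.02, 2500, rng) for _ in range(300)]
    early = [lift for lift in results if lift is not None]
    print(f"stopped early in {len(early)} of {len(results)} runs")
    if early:
        print(f"true lift 2.0%, mean observed lift {100 * sum(early) / len(early):.1f}%")
```

Because an experiment can only stop early when the gap is already large, every early-stopped run in this sketch reports a lift well above the true one. The direction is right, but the magnitude is inflated.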
We researched a few methods, including O’Brien-Fleming, Pocock, and sequential testing using the difference in successful observations. We ultimately settled on the last approach. Using the difference in successful observations, we look at the raw difference in converted visits and stop an experiment when this difference becomes large enough. The difference threshold is only valid until we reach a set total number of converted visits. This method is good at detecting changes and does so quickly, which makes it most suitable for our needs. Nevertheless, we did consider a few cons this method presented. Traditional power and significance calculations use the proportion of successes, whereas looking at the difference in converted visits does not take into account the overall population size. Because of this, with high baseline target metrics, we are more likely to reach the total number of converted visits before we see a large enough difference in converted visits. This means we are more likely to miss a true change in these scenarios. Furthermore, it requires extra setup when an experiment is not evenly split across variants. We opted to use this technique so that we could improve our rate of detecting real changes between experimental groups.
Accurate interpretation of statistical data is vital in making informed decisions about product development. When online experiments have to be run efficiently to save time and cost, we inevitably encounter dilemmas unique to our context, and peeking is just one of them. In researching and designing solutions, we examined some more rigorous theoretical work. However, the characteristics and priorities of online experimentation make its direct application difficult. Our approach outlined in this post, even though simple, addresses the root cause of the peeking problem. Looking ahead, we believe the balance between statistical rigor and practical constraints is what makes online experimentation intriguing and enjoyable to work on, and we at Etsy are excited about tackling the interesting problems awaiting us.
Figure 6: User interface after changes.
Etsy relies heavily on experimentation to improve our decision-making process. We leverage our A/B testing tool when we launch new features, polish the look and feel of our site, or make changes to our search and recommendation algorithms. As our experimentation platform scales and the velocity of experimentation increases rapidly across the company, we also face a number of new challenges. In this post, we explore one of these challenges: how to peek at experimental results early, in order to increase the velocity of our decision-making, without sacrificing the integrity of our results.
Even after making these UI changes, deciding when to stop an experiment and whether or not to launch it is not always simple. Generally, some things we advise our stakeholders to consider are:
At Etsy, the volume of traffic and the intent of visitors vary from weekdays to weekends. This is not a concern for the sequential testing approach we ultimately chose. However, it would be a problem for some other methods that require equal daily data sample sizes. We looked into ways to handle the inconsistency in our daily data sample size. We found that the GroupSeq package in R, which enables the construction of group sequential designs and has various alpha spending functions to choose from, is a good way to account for it.
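To give a feel for what an alpha spending function does with uneven daily samples, here is a sketch in Python of the two standard Lan-DeMets spending functions (O’Brien-Fleming type and Pocock type). This is not the GroupSeq API, and the daily visit counts are made up; the point is only that the alpha "spent" at each look depends on the fraction of total data seen so far, so looks do not need to be equally spaced.

```python
from math import e, log, sqrt
from statistics import NormalDist


def obrien_fleming_spent(alpha, info_fraction):
    """Cumulative alpha spent (O'Brien-Fleming type) once info_fraction of data is in."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(info_fraction)))


def pocock_spent(alpha, info_fraction):
    """Cumulative alpha spent (Pocock type) once info_fraction of data is in."""
    return alpha * log(1 + (e - 1) * info_fraction)


def information_fractions(daily_visits):
    """Running share of the planned total sample after each day."""
    total = sum(daily_visits)
    running, fractions = 0, []
    for visits in daily_visits:
        running += visits
        fractions.append(running / total)
    return fractions


if __name__ == "__main__":
    # Hypothetical traffic: two busier weekdays followed by a quieter weekend.
    for t in information_fractions([120, 100, 60, 60]):
        print(f"t={t:.2f}  OBF={obrien_fleming_spent(0.05, t):.5f}  "
              f"Pocock={pocock_spent(0.05, t):.5f}")
```

Both functions spend the full alpha once all data is in. The O’Brien-Fleming type spends very little early on, while the Pocock type spends it more evenly.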
Our Approach
 Do we have statistically significant results that support our hypothesis?
 Do we have statistically significant results that are positive but are not what we anticipated?
 If we do not have enough data yet, can we simply keep the experiment running, or is it blocking other experiments?
 Is there anything broken in the product experience that we want to correct, even if the metrics don’t show anything negative?
 If we have enough data on the main metrics overall, do we have enough information to iterate? For example, if we want to look at the impact on a particular segment, which could be 50% of the traffic, then we will need to run the experiment for roughly twice as long.
In A/B testing, we are looking to determine whether a metric we care about (e.g., the percentage of visitors who make a purchase) differs between the control and treatment groups. But when we detect a change in the metric, how do we know whether it is real or due to random chance? We can look at the p-value of our statistical test, which indicates the probability that we would see a difference at least as large as the one detected between groups, assuming there is no true difference. When the p-value falls below the significance level threshold, we say that the result is statistically significant and we reject the hypothesis that the control and treatment are the same.
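As a concrete illustration, the p-value for a difference in conversion rates can be computed with a standard two-proportion z-test. This is a minimal sketch, not the code behind our testing tool; the function name and the visit counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_control, n_control, conv_treatment, n_treatment):
    """Two-sided p-value for H0: control and treatment share one conversion rate."""
    p_control = conv_control / n_control
    p_treatment = conv_treatment / n_treatment
    # Pooled rate: the best estimate of the shared rate if H0 is true.
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if std_err == 0:
        return 1.0  # nobody (or everybody) converted: no evidence of a difference
    z = (p_treatment - p_control) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical counts: 500 of 10,000 control visits convert vs 560 of 10,000.
    p = two_proportion_p_value(500, 10_000, 560, 10_000)
    print(f"p-value = {p:.4f}")
```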
Sequential Testing with Difference in Converting Visits
Tradeoff Between Power and Significance
To validate this approach, we tested it on results from experimental simulations with various baselines and effect sizes under mock experimental conditions. Before implementing, we wanted to know:
We hope that these UI changes will help our stakeholders make better informed decisions while still letting them discover cases where they have changed something more dramatically than expected and can therefore stop the experiment earlier.
Second, the confidence interval (CI) is the range of values within which we are confident the true value of a particular metric falls. In the context of A/B testing, for example, if we ran the experiment thousands of times, the true value of the effect size would fall within the 90% CI 90% of the time. There are three things that we care about in relation to the confidence interval of an effect in an experiment:
In this section, we dive into the approach that we have designed and adapted to address the peeking problem: transitioning from traditional, fixed-horizon testing to sequential testing, and preventing peeking behaviors through user interface changes.
In our experimental testing tool, we wanted stakeholders to have access to the metrics and calculations we measure throughout the duration of the experiment. Teams at Etsy often have to coordinate experiments on the same page, so it is important for teams to have some idea of how long an experiment will have to run assuming no early stopping. We do this by running an experiment until we reach a set power.
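A rough sketch of how a run-time estimate can be derived from a target power follows. It uses the textbook normal-approximation sample size formula for comparing two proportions; the function names, the default significance level and power, and the traffic figures are assumptions for illustration, not the values or code used in our tool.

```python
from math import ceil
from statistics import NormalDist


def visits_per_variant(baseline, relative_lift, alpha=0.10, power=0.80):
    """Visits needed in each variant to detect the lift with a two-sided z-test."""
    norm = NormalDist()
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.inv_cdf(power)           # quantile matching the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_power(baseline, relative_lift, daily_visits_per_variant, **kwargs):
    """Days of traffic needed to collect the required visits in each variant."""
    needed = visits_per_variant(baseline, relative_lift, **kwargs)
    return ceil(needed / daily_visits_per_variant)


if __name__ == "__main__":
    # Hypothetical: 5% baseline conversion, 5% relative lift, 5,000 visits/day/variant.
    print(days_to_power(0.05, 0.05, 5_000), "days at full traffic")
    print(days_to_power(0.05, 0.05, 2_500), "days for a segment with half the traffic")
```

Halving the eligible traffic roughly doubles the run time, which is why analysing a segment takes longer than analysing overall impact.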
However, always displaying numerical results in the “% Change” column could lead to stakeholders peeking at data and making an incorrect inference about the success of the experiment. Therefore, we added a row in the hover table to show the power of the test (assuming some fixed effect size), and made the following changes to our user interface:
Figure 1: Chances of accepting that A and B are different, with A and B both converting at 50%.
It is worth noting that the peeking problem has been studied by many, including industry veterans^{1, 2}, developers of large-scale commercial A/B testing platforms^{3, 4} and academic researchers^{5}. What’s more, it is hardly a challenge exclusive to online A/B testing. The peeking problem has troubled the medical field for a long time; for example, medical scientists could peek at the results and stop a clinical trial early because of initial positive results, leading to flawed interpretations of the data^{6, 7}.
Let’s look at an example to get a more concrete view of the problem. Suppose we run an experiment where there is no true change between the control and the experimental variant, and both have a baseline target metric of 50%. If we are using a significance level of 0.1 and there is no peeking (in other words, the sample size needed before a decision is made is determined in advance), then the rate of false positives is 10%. But if we do peek and check the significance level at every observation, then after 500 observations there is over a 50% chance of incorrectly stating that the treatment is different from the control (Figure 1).
Further Discussion
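The effect is easy to reproduce with a small simulation. The sketch below is an illustration under assumptions, not the code behind Figure 1: it runs A/A experiments at a 50% conversion rate, uses a pooled two-proportion z-test at a 0.1 significance level, interprets "500 observations" as 500 per group, and starts peeking at the 10th observation. All function names are hypothetical.

```python
import random
from math import sqrt

Z_CRIT = 1.6448536269514722  # two-sided critical z for a 0.1 significance level


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n observations in each group."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    return std_err > 0 and abs(conv_a - conv_b) / n / std_err > Z_CRIT


def peeking_trial(n_per_group, rate, rng, first_peek=10):
    """True if ANY peek declares a difference although both groups share `rate`."""
    conv_a = conv_b = 0
    for n in range(1, n_per_group + 1):
        conv_a += rng.random() < rate
        conv_b += rng.random() < rate
        if n >= first_peek and is_significant(conv_a, conv_b, n):
            return True
    return False


def fixed_horizon_trial(n_per_group, rate, rng):
    """True if the single test at the planned sample size declares a difference."""
    conv_a = sum(rng.random() < rate for _ in range(n_per_group))
    conv_b = sum(rng.random() < rate for _ in range(n_per_group))
    return is_significant(conv_a, conv_b, n_per_group)


def false_positive_rate(trial, n_sims, n_per_group=500, rate=0.5, seed=0):
    """Share of simulated A/A experiments in which `trial` reports a difference."""
    rng = random.Random(seed)
    return sum(trial(n_per_group, rate, rng) for _ in range(n_sims)) / n_sims


if __name__ == "__main__":
    print("fixed horizon:", false_positive_rate(fixed_horizon_trial, 1000))
    print("peek at every observation:", false_positive_rate(peeking_trial, 1000))
```

The fixed-horizon rate should land near the nominal 10%, while the peeking rate comes out several times higher. The exact figure depends on when peeking starts and how often we look.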
Closing Thoughts
So we can just stop the experiment when the hypothesis test for the metric we care about has a p-value of less than 0.05, right? Wrong. In order to draw the strongest conclusions from the p-value in the context of an A/B test, we have to fix the sample size of an experiment in advance and make a decision on the hypothesis test only once. Peeking at data regularly and stopping an experiment as soon as the p-value dips below 0.05 increases the rate of Type I errors, or false positives, because the false positive chance of each test compounds, increasing the overall probability that you will see a false result.
 Whether the CI includes zero, because this maps exactly to the decision we would make with the p-value: if the 90% CI includes zero, then the p-value is greater than 0.1; conversely, if it doesn’t include zero, then the p-value is less than 0.1;
 The smaller the CI, the better the estimate of the parameter we have;
 The farther away from zero the CI is, the more confident we can be that there is a true difference.
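The first point can be checked numerically. Below is a minimal sketch, assuming a Wald (normal-approximation) interval for the difference in conversion rates and a z-test that uses the same unpooled standard error, so that the two agree exactly. The function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci_and_p_value(conv_c, n_c, conv_t, n_t, confidence=0.90):
    """Return ((ci_low, ci_high), p_value) for the difference treatment - control."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Same unpooled standard error for the CI and the test, so they always agree.
    std_err = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    if std_err == 0:
        raise ValueError("degenerate data: standard error is zero")
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (diff - z_crit * std_err, diff + z_crit * std_err)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / std_err))
    return ci, p_value


if __name__ == "__main__":
    for conv_t in (530, 560):  # hypothetical treatment conversions out of 10,000
        ci, p = diff_ci_and_p_value(500, 10_000, conv_t, 10_000)
        excludes_zero = ci[0] > 0 or ci[1] < 0
        print(f"CI=({ci[0]:+.4f}, {ci[1]:+.4f})  p={p:.3f}  excludes zero: {excludes_zero}")
```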
There is a tradeoff between Type I (false positive) and Type II (false negative) errors: when we decrease the likelihood of one of these errors, the likelihood of the other will increase. For a more thorough explanation, please see this short article. This translates into a tradeoff between the p-value threshold and power, because if we require stronger evidence to reject the null hypothesis (i.e., a smaller p-value threshold), then there is a smaller chance that we will correctly reject a false null hypothesis, a.k.a. decreased power. The different messages we display in the user interface balance this issue to some degree. In the end, it is a choice that we have to make based on our own priorities and focus in experimentation.
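The tradeoff can be made concrete by holding the sample size and true effect fixed and varying only the significance level. This is a sketch using the usual normal approximation for a two-sided two-proportion test; the function name and the example rates and sample size are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treatment, n_per_group, alpha):
    """Approximate power of a two-sided z-test for a difference in proportions."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    std_err = sqrt(p_control * (1 - p_control) / n_per_group
                   + p_treatment * (1 - p_treatment) / n_per_group)
    shift = abs(p_treatment - p_control) / std_err
    # Probability that the test statistic lands beyond either critical value.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


if __name__ == "__main__":
    # Hypothetical: 5.00% vs 5.25% conversion with 50,000 visits per variant.
    for alpha in (0.10, 0.05, 0.01):
        print(f"alpha={alpha:.2f}  power={power_two_proportions(0.05, 0.0525, 50_000, alpha):.3f}")
```

With everything else fixed, a stricter threshold always yields lower power.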
The sequential sampling method that we have designed is a straightforward form of a stopping rule, altered to best suit our needs and circumstances. It is indeed a rather interesting problem!

With this in mind, we need to come up with statistical methodology that gives reliable inference while still giving product teams the ability to continuously monitor experiments, particularly our long-running experiments. At Etsy, we tackle this challenge from two sides: user interface and methods. We made a few user interface changes to our A/B testing tool to prevent our stakeholders from drawing false conclusions, and we implemented a flexible p-value threshold in our platform, which takes inspiration from the sequential testing concept in statistics.
 Sequential Generalized Likelihood Ratio Tests for Vaccine Safety Evaluation by Shih, M.-C., Lai, T. L., Heyse, J. F. and Chen, J. (2010), Statistics in Medicine, 29: 2698-2708
Our implementation of this method is influenced by the approach Evan Miller described here. This method sets a threshold for the difference between the control and treatment converted visits based on the minimal detectable effect and the target false positive and false negative rates. If the experiment reaches or passes this threshold, we allow the experiment to end early. If this difference is not reached, we assess our results using the standard approach of a power analysis. The combination of these methods creates a continuous p-value threshold curve: when the p-value is under the curve, we can safely stop an experiment. This threshold is lower near the beginning of an experiment and converges to our significance level as the experiment reaches our targeted power. This allows us to detect changes more quickly with low baselines while not missing smaller changes for experiments with higher baseline target metrics.
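For readers who want to see the basic stopping rule, here is a sketch of the simple one-sided rule of thumb from Evan Miller’s published write-up on sequential A/B testing: pick a maximum number of converted visits N up front, stop and declare the treatment the winner once it leads by `2 * sqrt(N)` conversions, and stop with no winner once N total conversions are reached. This is that published rule only, not our adapted implementation or our p-value threshold curve; the function name is hypothetical.

```python
from math import sqrt


def sequential_decision(conv_control, conv_treatment, max_conversions):
    """Apply the difference-in-conversions stopping rule to the counts so far."""
    threshold = 2 * sqrt(max_conversions)
    if conv_treatment - conv_control >= threshold:
        return "stop: treatment wins"
    if conv_treatment + conv_control >= max_conversions:
        return "stop: no winner"
    return "continue"


if __name__ == "__main__":
    n = 10_000  # planned maximum number of converted visits (hypothetical)
    for control, treatment in [(100, 150), (100, 300), (4_990, 5_010)]:
        print(control, treatment, "->", sequential_decision(control, treatment, n))
```

Note that the rule looks only at raw counts of converted visits, which is why it assumes traffic is split evenly between variants.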
Other Types of Designs
It does so by computing the probabilities of false positives at each potential stopping point using dynamic programming, assuming that our test statistic is normally distributed. Since we can compute these probabilities, we can then adjust the test’s p-value threshold, which in turn changes the false positive chance, at every step so that the total false positive rate is below the threshold that we desire. Thus, sequential testing enables concluding experiments as soon as the data justifies it, while keeping our false positive rate in check.
We tested this method with a series of simulations and saw that for experiments that would take 3 weeks to run assuming a standard power analysis, we could save at least a week in most cases where there was a real change between variants. This helped us feel confident that, even with a slight overestimation of effect size, it was worth the time savings for teams with low baseline target metrics that struggle with long experimental run times.
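The dynamic programming idea can be illustrated on the difference-in-conversions rule. The sketch below makes a simplifying assumption that differs from the normal approximation mentioned above: under no true difference and an even traffic split, each new converted visit is equally likely to come from either variant, so the running difference is an exact random walk. We then track, step by step, how much probability has already crossed the stopping threshold. The function name and the example values are hypothetical.

```python
def false_positive_rate(max_conversions, threshold):
    """Probability that treatment - control ever reaches `threshold` within
    `max_conversions` converted visits when there is no true difference."""
    alive = {0: 1.0}  # difference -> probability of being there without having stopped
    stopped = 0.0     # probability mass that has already crossed the threshold
    for _ in range(max_conversions):
        nxt = {}
        for diff, prob in alive.items():
            for step in (1, -1):  # next conversion is treatment (+1) or control (-1)
                new_diff = diff + step
                if new_diff >= threshold:
                    stopped += prob * 0.5
                else:
                    nxt[new_diff] = nxt.get(new_diff, 0.0) + prob * 0.5
        alive = nxt
    return stopped


if __name__ == "__main__":
    n = 400
    for threshold in (30, 40, 50):
        rate = false_positive_rate(n, threshold)
        print(f"N={n}  threshold={threshold}  false positive rate={rate:.4f}")
```

Raising the threshold lowers the total false positive rate, so a threshold can be chosen to keep that rate below a desired level. With N = 400, a threshold of 40 (that is, `2 * sqrt(N)`) keeps it just under 5%.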