Since we already have a clear understanding of what A/B tests are and what we can achieve with them, let’s look at the process of carrying out an A/B test on our website:

  • Identify which web pages we want to analyze and the metrics we will use to evaluate the impact of the changes we will make on each page.

There are no restrictions on the number of A/B tests we want to run simultaneously, as long as they are on different web pages.

Likewise, each test can be evaluated with the metric we consider appropriate, independently of the rest.

  • Identify which element of each page we want to evaluate and what change we will make.

There can only be one change per page, affecting only that single element, so that any variation in the metrics analyzed can be attributed correctly to that change.

  • Create the corresponding variations of each page, identical to the original ones in all respects except for the elements identified in the previous point.
  • Determine the duration of the experiment, either in number of visits or for a specific period of time.

In both cases, it must be large enough to collect the data needed for further analysis of the metrics.

  • Run the test, splitting the visits randomly 50/50 between both versions of each page.

The behavior of users is recorded separately for each version, according to the metrics we have associated with that page.

  • Stop the experiment and analyze the metrics of both versions of each page to determine the impact and extent of the change made on the original page.

This is done once the number of visits or the time period established as the duration of the experiment has been reached.

  • An experiment should not be carried out indefinitely.

Google supports A/B testing, but sets out some recommendations and conditions of use.

Once you have finished the A/B test, if you see better values for the chosen metric in the page variation, congratulations, you have managed to improve your website. But don’t rest on your laurels: the process doesn’t end here, it opens the door for more A/B tests.

This time on the new version of the page, so you can continue to improve other aspects of it, for the same or other metrics.

If, on the other hand, the metrics of the variation have turned out worse, don’t be embarrassed, because you have learned an important lesson: what doesn’t attract or please your audience.

And that may well inspire you to make other changes that they do like.

Even when the results of an A/B test do not show an improvement in the target metric, they help us get to know our audience better.

In this process, note that all the points depend exclusively on our decisions, except the point dedicated to distributing the visits between the control page and its variation.

This task must be implemented on the web server itself and therefore requires a technical solution, to which we will devote the last section of this article.
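
As a small preview of that section, here is a minimal Python sketch of how the 50/50 assignment could be made. The function name, the experiment label and the idea of using a cookie value as the visitor id are illustrative assumptions, not part of any particular tool:

```python
import hashlib

# Minimal sketch: assign each visit to "control" or "variation" with a 50/50
# split. Hashing a stable visitor id (for example, a cookie value) keeps the
# assignment consistent across repeat visits instead of re-randomizing it.
def assign_version(visitor_id: str, experiment: str) -> str:
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return "control" if int(digest, 16) % 2 == 0 else "variation"

# The same visitor always gets the same version of a given experiment.
print(assign_version("visitor-12345", "homepage-cta-test"))
```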

How long should an A/B test last?

If you look back at the step of the A/B testing process dedicated to the duration of the experiment, it said that it should be “large enough to collect the necessary data”.

But how much is “large enough”?

Depending on the context, when I hear or read the expression “large enough” I usually translate it as “nobody knows” or “it depends”.

To understand it better, let’s flip a coin several times and write down how many times it comes up heads and how many times tails. I’ll do it for you.


If we flip the coin ten times, we will most likely get 6 heads and 4 tails, or 4 heads and 6 tails.

We would also get 7 heads and 3 tails, or the other way around, quite a few times, but only rarely will we get exactly 5 heads and 5 tails.

Now suppose we flip the coin a hundred times.

Most of the time we would not get a 50 heads, 50 tails split either, but we would almost always get values in the range 45-55.

The same thing would happen if we threw it a thousand times.

It is very unlikely to get exactly 500 heads and 500 tails but, most of the time, the number of heads and the number of tails would each be between 475 and 525.

That is to say, in a completely random 50% process, such as tossing a coin, repeating it many times will not produce an exact 50% split of heads and tails, but a small range around that percentage, and that does not make us think the coin is biased.
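
If you would rather check this than take my word for it, a few lines of Python are enough to simulate the experiment. The 10,000 repetitions are an arbitrary choice, just enough for the typical ranges to show up:

```python
import random

# Flip a fair coin n_flips times and count how many heads come up.
def count_heads(n_flips: int) -> int:
    return sum(random.random() < 0.5 for _ in range(n_flips))

# Repeat each experiment many times to see the typical spread of results.
for n_flips in (10, 100, 1000):
    results = sorted(count_heads(n_flips) for _ in range(10_000))
    exact_half = sum(1 for heads in results if heads == n_flips // 2)
    middle_90 = (results[500], results[9_499])  # roughly the 5th to 95th percentile
    print(
        f"{n_flips} flips: 90% of runs gave between {middle_90[0]} and "
        f"{middle_90[1]} heads; an exact 50/50 split happened in only "
        f"{exact_half / len(results):.1%} of runs"
    )
```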

Maybe you’re starting to see where I’m going: the distribution of visits between the control page and the variation page in an A/B test is also done randomly at 50%, so that each page receives approximately the same number of visits over time.

But, as we have seen in the example of coins, randomness can be capricious and introduce small fluctuations, completely at random, into this distribution.

An A/B test performed on a small number of visits will never give reliable results in the evaluation of the target metric.

Therefore, when it is said that the duration of the A/B test should be “long enough”, it means that we must ensure that there are enough visits to counteract those distributions that, purely by chance, send more qualified visits to one version of the page than to the other.
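
The same kind of simulation makes the point for conversions. In the sketch below both versions of the page are identical and convert at the same true rate of 1% (the rate from our earlier example); the traffic figures are arbitrary, chosen only to contrast small and large samples:

```python
import random

TRUE_RATE = 0.01  # both versions convert at exactly the same true rate

# Count how many of `visits` visits convert, purely at random.
def simulate_conversions(visits: int) -> int:
    return sum(random.random() < TRUE_RATE for _ in range(visits))

for visits_per_version in (500, 5_000, 50_000):
    control = simulate_conversions(visits_per_version)
    variation = simulate_conversions(visits_per_version)
    print(
        f"{visits_per_version:>6} visits per version: "
        f"control {control / visits_per_version:.2%} vs "
        f"variation {variation / visits_per_version:.2%}"
    )
```

Run it a few times: with 500 visits per version, one of the two identical pages will often appear to “win” comfortably, while with 50,000 both measured rates settle very close to the true 1%.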

How to analyze the results in A/B testing?

However, having a high number of visits is not enough to eliminate the effect of randomness when analyzing the test results; we also have to take into account how much the metric we have chosen for the test changes.

Let’s see it with an example.

In our initial example, we had considered 10,000 visits with a conversion rate of 1%.

This means 100 conversions per month.

But let’s take another case: a page with 1,000 visits and a conversion rate of 10%; that would also mean 100 conversions per month.

Suppose we run an A/B test on both pages for one month, getting the same result: 95 conversions on the control page and 105 conversions on the variation page.

What conclusions could we draw from these results? Is the variation page better in both cases?

First of all, it should be noted that the conversion rate is itself an average value, which means that it will also show small fluctuations over time.

With this consideration in mind, the conversions obtained in the first example translate into conversion rates of 0.95% and 1.05% respectively, while in the second example they translate into rates of 9.5% and 10.5%.
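
The arithmetic is trivial, but spelling it out makes the contrast obvious. The short sketch below follows the same convention as the text, computing each rate over the page’s monthly visits, and uses exactly the figures above:

```python
# Same absolute result (95 vs 105 conversions), very different traffic.
scenarios = [
    ("Page with 10,000 visits/month", 10_000),
    ("Page with 1,000 visits/month", 1_000),
]

for name, monthly_visits in scenarios:
    control_rate = 95 / monthly_visits
    variation_rate = 105 / monthly_visits
    gap = (variation_rate - control_rate) * 100  # in percentage points
    print(
        f"{name}: control {control_rate:.2%} vs variation {variation_rate:.2%} "
        f"(a gap of {gap:.2f} percentage points)"
    )
```

In the first case the gap is a tenth of a percentage point, well within the normal fluctuation of the rate; in the second it is a full point, which is why the variation looks more credible there.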

It doesn’t seem so clear anymore that the variation page is better in both cases, does it?

Forced to choose, I would keep the control page in the first case, since the variation does not introduce any significant change in the conversion rate.

In the second case, I would keep the variation page, although I would also analyze the evolution of the conversion rate during the previous months to check to what extent the observed improvement is not part of the fluctuation of this metric.

As you can see, there is no “magic number” for the duration of an A/B test or an “infallible recipe” to analyze the results, but we must choose them carefully according to the number of visits, the change in the metric chosen during the test and the fluctuation of this metric in previous periods.

We must be aware that “spectacular improvements” in a target metric will only be achieved step by step, test by test.

Of course, if the metric analyzed shows a huge margin of improvement (from 10% to 50%, say), there is no doubt which page to choose but, as you will see when you run your own tests, such results are rare.

Fortunately, for practical purposes we will not have to worry too much about the length of A/B tests or how to analyze their results; what matters is that we understand why we should not cut them short.

Does SEO get along with A/B Tests?

Perhaps you have already wondered: if we have two versions of the same page, will it not negatively affect its organic positioning, since the visits are divided between them?

What’s more, couldn’t Google even penalize us for having two different pages with the same address, mistaking it for a cloaking attempt?

First of all: no, Google will neither penalize you nor affect your organic positioning.

In fact, Google has published on its official blog that it fully accepts A/B tests, as well as other types of tests, since the aim of these tests is to improve the user experience and make the content more attractive and useful to the user.

A/B tests do not hurt a page’s SEO; in fact, Google encourages running them to improve the user experience.

In the same article, Google recommends a set of good practices to make sure we do not violate its Webmaster Guidelines and do not get penalized, which I summarize below:

  • Do not use the test as a pretext for cloaking. We are talking about very similar pages, with very small and localized differences, served regardless of who (or what, in the case of the Google crawler) is browsing the page.
  • If the control page and its variant use different URLs, add the rel="canonical" attribute to the variant, pointing to the control page, to inform Google that the two pages are related.
  • Do not use the noindex meta tag on the variant page, as it may confuse Google and lead it to treat the variant as canonical, to the detriment of the original page, dropping the original page’s address from the search indexes.
  • If the control page and its variant share the same URL, with the web server performing the redirect to show the variant, use 302 redirects to tell Google that the redirect is temporary and that the relevant URL is the original one, not the redirect target, thus preventing Google from indexing the latter to the detriment of the original.
  • Establish a time limit for the test run. By definition, these tests are performed to differentiate which version of a page provides the best results and to publish it as final. They cannot therefore be implemented indefinitely. If Google detects a test that has been running “too long”, it may interpret it as an attempt to manipulate search results and penalize it. Google doesn’t say how much “too long” it is, because it depends on many factors, but the number of visits can be a reference: the more web traffic, the less need there is for the test to last longer.

From a practical point of view, we only have to worry about the last point, making sure the A/B test does not run indefinitely, since the rest depends on the test being correctly implemented in the web server or the tool we use.
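
To make those practical concerns concrete, here is a minimal sketch of how a server-side split honoring them could look, written with Flask purely for illustration; the routes, the end date and the page contents are hypothetical, not a reference implementation of any particular tool:

```python
import random
from datetime import date

from flask import Flask, redirect

app = Flask(__name__)

# Illustrative end date: the test must never run indefinitely.
TEST_END_DATE = date(2024, 6, 30)

@app.route("/landing")
def landing():
    # Once the test is over, always serve the original page.
    if date.today() > TEST_END_DATE:
        return "<h1>Original landing page</h1>"
    # 50/50 split; a 302 (temporary) redirect tells Google that the
    # original URL is still the relevant one.
    if random.random() < 0.5:
        return redirect("/landing-variant", code=302)
    return "<h1>Original landing page</h1>"

@app.route("/landing-variant")
def landing_variant():
    # In the real page, a <link rel="canonical"> tag in the <head>
    # would point back to /landing, as recommended above.
    return "<h1>Variant landing page</h1>"
```

In practice the version served to each visitor would also be remembered in a cookie, as in the earlier hash-based sketch, so that returning visitors do not bounce between versions on every request.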

Advantages and limitations of A/B Tests

Throughout the explanation of the A/B tests and their execution process we have seen some of their advantages and also their limitations.

Let us now review them all together to get a better perspective of their scope and possibilities:

Advantages

  • The two versions of the page receive a similar number of visits during the same period of time, so any outside influence will affect both equally and will not contaminate the interpretation of the results.
  • The cost of lost profit is minimized if the change to the page turns out to hurt the value of the chosen metric, as it only affects 50% of the qualified visits (those that saw the variant page).
  • By analyzing only one very specific change in the original page, the evolution of the metric analyzed will be exclusively due to that change, facilitating the choice of one version or another.
  • The duration of the experiment can be variable, adjusting to the volume of visits needed to obtain relevant results in the selected metric.
  • Cumulative tests can and should be run, one after the other: first on one element, selecting the winning page, and then continuing successively with other specific changes to the page.
  • Experience gained with an A/B test, whether positive or negative, can be used to design future pages with similar objectives.

Limitations

  • You can only see the effect of a very localized change within the whole page, so it provides no guidance for the initial design of a page or for new campaigns.
  • At first, it is not known which change may be more favourable (e.g. which colour, which text or which image), so other types of analysis need to be used alongside the test to identify it (A/B tests usually complement heat maps very well, for example).
  • It may require a high number of visits in order to accumulate sufficient data for the analysis of the target metric to be relevant and reliable.
  • It only allows you to compare two versions of a page. Although it can be extended to incorporate more versions, the test duration should also be longer to collect enough data to analyze the effects on the metric.
  • The distribution of visits between the versions of the page remains constant at 50% throughout the duration of the test, even when the metric starts to show a clearly favourable trend towards one of the versions.