A recent discussion on a user experience forum I participate in turned to the topic of A/B testing. I really enjoyed the conversation, so I wanted to reiterate some of the points I made and expand on them a little as well. My goal here is not to define A/B testing but to share my opinion on its use. I believe that even though A/B testing can be extremely valuable in identifying the best iteration of a site or a particular page, it should never be used in isolation.
Since A/B testing is relatively cheap to do and the results are so compelling, companies are in danger of adopting a “test and learn” culture where pages are simply A/B tested with no additional user input. That would be the wrong way to go. A/B testing shouldn’t be used on its own to make decisions; it should always be used in conjunction with other research methods — both qualitative (such as usability testing and ethnography) and quantitative (such as desirability studies).
A/B testing is an important method in the research toolkit because it can give you information that usability testing on its own cannot. The main goal of A/B testing is to see how business metrics move up and down depending on the version of the page — click-through rates, checkout rates, purchase rates, and so on. You can’t see that with usability testing alone. But as Kohavi et al. point out in their paper Practical Guide to Controlled Experiments on the Web, A/B testing has some major limitations:
- Quantitative Metrics, but No Explanations. It is possible to know which variant is better, and by how much, but not why. In user studies, for example, behavior is often augmented with users’ comments, and hence usability labs can be used to augment and complement controlled experiments.
- Short Term vs. Long Term Effects. Controlled experiments measure effects during the experimentation period, typically a few weeks. It is wise to look at delayed conversion metrics, where there is a lag between the time a user is exposed to something and the time they take action. These are sometimes called latent conversions.
- Primacy and Newness Effects. These are opposite effects that need to be recognized. If you change the navigation on a web site, experienced users may be less efficient until they get used to the new navigation, thus giving an inherent advantage to the Control. Conversely, when a new design or feature is introduced, some users will investigate it, click everywhere, and thus introduce a “newness” bias.
- Features Must Be Implemented. A live controlled experiment needs to expose some users to a Treatment different from the current site (Control). The feature may be a prototype that is being tested against a small portion of traffic, or may not cover all edge cases. Nonetheless, the feature must be implemented and be of sufficient quality to expose users to it.
- Consistency. Users may notice they are getting a different variant than their friends and family. It is also possible that the same user will see multiple variants when using different computers (with different cookies).
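To make the quantitative side of this concrete, here is a minimal sketch of how the results of an A/B test are typically compared: a two-proportion z-test on conversion counts from a Control and a Treatment. The function name `ab_test_ztest` and all the numbers are hypothetical, and this is only one common approach, not the specific method Kohavi et al. prescribe.

```python
import math

def ab_test_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test comparing conversion rates.

    Returns (z, p_value). A positive z means variant B converted
    better than variant A (the Control) during the experiment."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical traffic: Control converts at 2.0%, Treatment at 2.6%
z, p = ab_test_ztest(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Note what this calculation can and cannot tell you, which is exactly the first limitation above: a small p-value says the Treatment moved the metric, but nothing about why users behaved differently.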
As with most things, it is important to use A/B testing responsibly. Since every research and testing method comes with its own limitations, a combination of methods is the only way to get the full picture and make the right decisions.