By Simon Jackson

What are AA tests and why you should run heaps of them!


✋ “Umm, what’s an AA test?”


This was THE FIRST question I ever asked as a data scientist.


Ever since, I’ve been constantly explaining what AAs are and recommending that product and growth teams run more of them!


Classic example from earlier this week:




So today I’m going to get it all out and, hopefully, convince you too.


Because, as you’ll learn here, when you start to run AA tests, the foundations of your experiment programs go from quicksand to concrete.


So, let’s dive into this often underappreciated topic.


What is an AA test?

The "AA" is a play on “AB tests.”


Most readers will know this but just in case: the “A” and “B” in AB testing refer to two different versions of a product being shown to different groups of customers. Here’s what an example might look like from me doing a search on Google using (A) my logged-in account and (B) an incognito window:


Side-by-side comparison of two Google search results pages. The one on the right is slightly different.
Example of an AB test (not AA)

Side by side, you can see that the company icons are being rendered larger and in a higher-contrast way in the version on the right.


So if “A” and “B” are different, then “A” and “A” are… the same!


An AA test is like an AB test in every way, except customers in both groups see the exact same thing. No change at all!
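
To make that concrete, here’s a minimal sketch of what a hash-based AA assignment could look like in Python. This is my own illustration (the experiment name is a made-up placeholder), not any specific platform’s implementation; the key point is that both groups get an identical experience and only the label differs:

```python
import hashlib

def assign_group(user_id: str, experiment_name: str = "homepage_aa_2024") -> str:
    """Deterministically bucket a user into one of two identical 'A' groups."""
    # Hash the user + experiment so assignment is stable across sessions.
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0-99
    return "A1" if bucket < 50 else "A2"  # 50/50 split, same experience in both

# Example: every user lands in A1 or A2, but both see the unchanged product.
print(assign_group("user_42"))
print(assign_group("user_1337"))
```

Because both buckets are treated exactly the same downstream, any “difference” you measure between them can only come from noise or from a problem in your tooling.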


Seems boring, right?


Time to dive into why I care so much about them!


Reason 1 to run more: Making sure things work


AB testing is pretty complicated. Seriously.


It seems easy at a high level to “just give different customers different versions.” But the technology that powers online AB testing is complicated. It’s why I’ve worked with teams ranging from 20 to 200 highly-skilled engineers (depending on the company) to build, maintain, and improve AB testing platforms.


So how do we know that something so complicated is working? We run the damn thing!


But there’s a problem. When you run an AB test, you don’t know in advance what impact going from A to B will have on customers 😬


This brings us to the first major reason to use AA tests: Because we know exactly what’s going on (nothing), we know what results to expect - nothing 😜


Running AA tests is the safest and fastest way to check that all sorts of things are working in your experimentation software, such as (a rough sketch of a couple of these checks follows this list):

  • Am I starting to enrol participants as expected?

  • Are there any signs of sample ratio mismatch?

  • Are the statistical results being computed correctly?

  • Over time, do I see significant results (false positives) about as often as I’d expect (e.g., 5% or 10% of the time)?

  • etc!
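
As an illustration, here’s a rough sketch of how a couple of these checks could run on AA results in Python. The counts, metric values, and alert thresholds are all made-up placeholders; the point is that in an AA test you know what the answer should be:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical AA data: enrolment counts and a per-user metric for each group.
n_a1, n_a2 = 50_312, 49_705              # participants enrolled in each group
metric_a1 = rng.normal(10.0, 3.0, n_a1)  # e.g. sessions per user in group A1
metric_a2 = rng.normal(10.0, 3.0, n_a2)  # same product experience in group A2

# 1) Sample ratio mismatch: does the observed split deviate from the planned 50/50?
expected = [(n_a1 + n_a2) / 2] * 2
_, srm_p = stats.chisquare([n_a1, n_a2], f_exp=expected)
print(f"SRM p-value: {srm_p:.3f} (investigate if very small, e.g. < 0.001)")

# 2) Metric comparison: with no real difference, this should come up significant
#    only about alpha (e.g. 5%) of the time across many AA tests.
_, aa_p = stats.ttest_ind(metric_a1, metric_a2, equal_var=False)
print(f"AA t-test p-value: {aa_p:.3f}")
```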


Perfect example from my time at Meta: we had constant AA tests running to monitor the false positive rate of our testing. One day we noticed a massive spike, meaning all running experiment results were rendered invalid - oh crap! It turned out to be caused by a different platform team introducing a breaking change to a service we depended on. Monitoring this enabled us to notify all experimenting teams about the problem before they got too excited about their positive results (like baby Yoda down here). We then got to work identifying and resolving the issue.


MEME: Top panel is happy baby Yoda with the text, "Third significant AB test in a row!" Bottom panel is sad baby Yoda with the text, "Finding your AA test is always significant too."
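
Behind that story, the monitoring itself is simple enough to sketch: across a batch of AA tests, count how often results come up “significant” and check whether that rate is consistent with your alpha. The p-values below are invented purely for illustration:

```python
from scipy import stats

# Hypothetical monitoring input: p-values from a recent batch of AA tests.
alpha = 0.05
aa_p_values = [0.41, 0.03, 0.88, 0.12, 0.67, 0.002, 0.55, 0.73, 0.09, 0.31,
               0.94, 0.26, 0.048, 0.61, 0.83, 0.17, 0.72, 0.39, 0.06, 0.50]

n_significant = sum(p < alpha for p in aa_p_values)

# Is the observed false positive rate consistent with the expected ~5%?
check = stats.binomtest(n_significant, n=len(aa_p_values), p=alpha)
print(f"{n_significant}/{len(aa_p_values)} AA tests significant; "
      f"p-value against the expected rate: {check.pvalue:.3f}")
# A tiny p-value here is the 'oh crap' moment: the experimentation stack,
# not the product, is the likely culprit.
```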


The point is, running AA tests is often your best way to make sure everything is working before you start running AB tests! And because customers aren’t getting anything different, if something goes wrong, you know the problem is in your experiment tech and nothing else.


Reason 2 to run more: Baselines for Planning

OK, you’ve now run a couple of AA tests and know everything is working. Happy days!

Time to forget about them?


NOOOoooooo!


Knowing things are working is just table stakes. Now you get to focus on the juicy stuff – running great experiments!


But what’s one of the hardest things about running good experiments?


Experiment Design.


The fastest way to throw AB testing money down the drain is to just run experiments without any thought or planning to ensure you can detect the signals you care about.


To design a great experiment you need to plan for things like:

  • How many participants will I have available?

  • How do participants enrol over time?

  • What are the baseline metric scores I should expect?

  • etc


Now, you could get a data person to scrape your web data, spend hours mindlessly pulling it together, and try to figure out the answers to these questions.


OR – better idea – if you’ve run an AA test in the past, you’ll have all of this just waiting for you!
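
For instance, here’s a minimal sketch of how baselines pulled from a past AA test could feed straight into planning the next AB test, assuming a simple conversion metric and using statsmodels for the power calculation. The baseline numbers are placeholders you’d replace with your own AA results:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Baselines you'd read straight off a past AA test (placeholder values).
baseline_conversion = 0.042      # observed conversion rate in the AA groups
weekly_enrolment = 30_000        # participants enrolled per week in this area

# Planning inputs for the upcoming AB test.
minimum_detectable_lift = 0.10   # want to detect a relative lift of 10%
target_conversion = baseline_conversion * (1 + minimum_detectable_lift)

effect_size = proportion_effectsize(target_conversion, baseline_conversion)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)

weeks_needed = 2 * n_per_group / weekly_enrolment
print(f"~{n_per_group:,.0f} participants per group, "
      f"roughly {weeks_needed:.1f} weeks of traffic")
```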


But don’t stop there...


Answers to these questions are usually highly dependent on WHERE you’re running experiments, WHO you’re running experiments on, and even WHEN you run them.


So, one of the best things you can do…


Run AA tests everywhere, all the time!


OK, maybe I’m going a bit overboard BUT the point is this: regularly running AA tests in many areas of your product will mean you constantly have access to the data you need to plan great experiments.


No more guessing. No more data people tantrums. No more uncertainty. Just a nice, hot, bowl of fresh data and experiment planning confidence.


Reason 3 to run more: New-Feature Releases


My third and final reason was prompted by Bertil's comment on my recent post:

Screenshot of a reply to my LinkedIn post from earlier. The reply mentions large companies using AA tests to check metrics before release.

As you start to run larger experimentation programs, it's common to want to release new features... and I don't mean new customer-facing features.


I mean new features for your experiment platform to help teams experiment better and faster. For example, new statistical techniques, results UIs, new metrics, and so on.


Just as there are various gates for safely releasing code to end users of a product (with AB testing typically being the final one), you can use AA tests to validate the behaviour and quality of new platform features such as new experiment metrics.


For example, six years ago I introduced CUPED at Booking.com as a statistical technique to boost power and help run smaller, faster experiments. I did so by creating a new CUPED-adjusted metric that could be added to experiments. Everything panned out in my analyses, but you never really know until something hits production. So I added the new metric to a bunch of AA tests and was able to monitor it on live data. Similar to the above, I could now answer questions like (there's a rough sketch of the adjustment itself after this list):


  • Did the metric display in an understandable way on the platform?

  • Was it producing data in a way I expected?

  • Did it maintain an expected false positive rate?

  • Did it (as we hoped in that context) yield 30-40% lower variance than its unadjusted version?

  • etc
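
If you haven't met CUPED before, the adjustment itself is small: use a pre-experiment covariate to soak up variance in the in-experiment metric. Here's a rough sketch on simulated data (my own illustration, not Booking.com's implementation), with the simulated correlation tuned so the variance reduction lands in that 30-40% ballpark:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-user data: a pre-experiment covariate that correlates with
# the in-experiment metric (e.g. pre-period activity vs in-period activity).
n = 100_000
pre_metric = rng.gamma(shape=2.0, scale=1.5, size=n)
metric = 0.8 * pre_metric + rng.normal(0.0, 2.3, size=n)

# CUPED: y_adj = y - theta * (x_pre - mean(x_pre)),
# with theta = cov(y, x_pre) / var(x_pre) to minimise variance.
cov_matrix = np.cov(metric, pre_metric)
theta = cov_matrix[0, 1] / cov_matrix[1, 1]
metric_cuped = metric - theta * (pre_metric - pre_metric.mean())

reduction = 1 - np.var(metric_cuped) / np.var(metric)
print(f"theta = {theta:.3f}, variance reduction ≈ {reduction:.0%}")
# How much you gain depends entirely on how well the pre-experiment
# covariate predicts the in-experiment metric.
```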


These are critical checks to run before releasing new experimentation features, like metrics, which make their way into everyone's hands and have an amplified impact (positive or negative).


Run more AA tests. You won’t regret it.

Well, this has been a long one today but I hope you now know exactly what AA tests are and why they're so useful.


So get out there and start running heaps of AA tests... You won’t regret it!


Until next time, thanks for reading! 👋

If you found this useful, subscribe below to be notified of our next one 👇
