
Bayesian A/B Testing

Why Bayesian methods can be more interesting and intuitive than frequentist ones when it comes to A/B testing.

Thaís Steinmuller

May 13, 2025
14 min read

When people decide to do an A/B test, they usually want to know one thing: "Is A better than B?"

So they go ahead and paint a sign-up button green, and another one blue. Then, they display them to 500 users each, and look at the results. Which one gets more clicks?

After 500 visitors see each button, the numbers are in:

  • Green: 27 sign-ups, a 5.4% conversion rate.
  • Blue: 23 sign-ups, a 4.6% conversion rate.

This is where many people stop. They look at the numbers, and declare a winner. Green is better than blue. Case closed.

The problem with this approach is that it only answers one question: is A better than B? But it skips over a really important part of the problem: how confident are we that this is true?

Imagine this: if you flipped a coin 10 times and it landed on heads 7 times, would you be ready to bet your savings that the coin is biased? Probably not, because with such a small sample, randomness can easily skew the result.

The same logic applies here. Just because the green button got 5.4% and the blue button got 4.6% doesn't mean green is definitely better. It just means it looks better based on this tiny snapshot of users. If you ran the experiment again, there's a real chance those numbers could flip.

And here's the kicker: what if those sign-ups directly impact revenue? You switch to green, but in reality, blue might have been better. That 0.8% difference could vanish (or even reverse) when more people interact with it. Consequently, your revenue drops, you miss Q4 targets, and there goes your bonus. I guess the Maldives are out of the question this year.

In a high-stakes environment, shipping a change based on a misleading snapshot isn't just risky; it can lead to costly, irreversible mistakes.

That's what makes A/B testing hard. It's not just about the numbers you see; it's about the uncertainty behind them.

So how do we start measuring that uncertainty? How do we quantify the risk of making a decision based on a small sample size?


Measuring uncertainty the usual way

The textbook way to measure uncertainty is to ask how likely it is that the difference we saw (5.4% for Green vs. 4.6% for Blue) is actually real and not just a fluke.

Think of it this way: if we ran this exact experiment 1,000 times, how often would Green actually outperform Blue? Would it win 900 times? 700? Or just 400? If Green wins less than half the time, then it's not really a winner, it just got lucky the first time we ran the test.

In frequentist statistics, we handle this with something called a p-value.

A p-value doesn't tell you the probability that Green is better than Blue. Instead, it measures this:

If Green and Blue were actually equally good, how likely is it that we'd see a difference as big as 0.8% (or more) just by random chance?

In other words, the p-value is the probability of observing a difference as extreme as the one we saw (0.8%) if both buttons were actually equally good.

A low p-value, like 0.01, means that seeing a result like 5.4% for Green and 4.6% for Blue would be rare if the buttons were actually equally good. It might happen just 1 in 100 times, purely due to chance. It's like saying: "If there's no real difference, this result shouldn't be showing up... and yet it did."

That's why a low p-value is treated as a signal that something might be going on. It doesn't prove that Green is better, but it tells you: "This outcome looks too unusual to chalk up to randomness." And that's often enough for people to say, "Let's go with Green."

But if the p-value is high (say, p = 0.5), that means a result like 5.4% vs. 4.6% would be totally normal even if the buttons were identical. In that case, there's nothing unusual to explain, so there's no strong reason to believe that Green is actually better than Blue.

In a way, looking for a p-value is like asking: "how often could random chance alone make Green look better than Blue?" Still, a p-value doesn't tell you how likely it is that Green is actually better. It only tells you how much your result challenges the assumption that there's no real difference.
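For concreteness, here's a minimal sketch of that calculation on our numbers, using a standard two-proportion z-test (one common frequentist choice; the scipy and numpy usage below is an illustration, not something specified by the original test):

```python
import numpy as np
from scipy.stats import norm

# Observed data: sign-ups out of 500 visitors for each button
conv_green, n_green = 27, 500
conv_blue, n_blue = 23, 500

p_green = conv_green / n_green  # 5.4%
p_blue = conv_blue / n_blue     # 4.6%

# Pooled rate under the null hypothesis that both buttons are equally good
p_pool = (conv_green + conv_blue) / (n_green + n_blue)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_green + 1 / n_blue))

# Two-sided p-value: how surprising is a gap this large if there is no real difference?
z = (p_green - p_blue) / se
p_value = 2 * norm.sf(abs(z))

print(f"difference: {p_green - p_blue:.1%}, z = {z:.2f}, p-value = {p_value:.2f}")
```

With these numbers the p-value comes out around 0.56, meaning a gap of 0.8% would be entirely unremarkable if the buttons were identical.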

This approach is the frequentist approach. It treats the hypothesis as fixed and the data as random. In other words, the hypothesis is a fixed statement about the world, and the data is just a sample from that world. So you can't assign probabilities to the hypothesis itself.

The idea is: "The truth is out there. We just saw one noisy sample from it." The method doesn't allow you to say "there's an X% chance that Green is better than Blue". That's because in this view, it either is or isn't.


Measuring uncertainty the Bayesian way

Bayesian thinking flips that around.

The Bayesian approach treats the data as fixed. It says: "We saw this data, and now we want to know how likely different hypotheses are given that data."

We've seen what we've seen: 27 sign-ups vs. 23. What's uncertain is the hypothesis: maybe Green is better, maybe not. The question becomes:

Given this data, what's the probability that Green is truly better than Blue?

That's what Bayesian A/B testing is designed to answer.

The Bayesian approach focuses directly on the data you observed. Instead of a p-value, you get a probability that tells you "there's a 71% chance that Green is actually better than Blue, based on the data we have."

It says:

"Given the data I have, here's the full range of conversion rates that are possible, and here's how likely each one is."

We're not just building a fence (a single interval) around 5.4%. Instead, we're drawing a full probability curve that shows where the real rate is most likely to be.

This is a much more useful way to think about the problem. It gives you a complete picture of the uncertainty, based on what we already know about it, rather than just a single number and a range.

To model uncertainty in a Bayesian way, we need a distribution that captures this range of plausible values for the conversion rate we're estimating.

Bayes' Rule, the foundation of Bayesian inference, lets us do that. It looks like this:

Posterior = (Prior × Likelihood) / Evidence

Where:

  • Prior is what we believed about the conversion rate before seeing the data. Maybe we had a general sense of how many people usually sign up based on past experience or similar tests.

  • Likelihood is how well different possible conversion rates explain what we saw in the test. For example, if Green got 27 sign-ups out of 500, then rates like 5% or 6% make sense. Rates like 1% or 15% don't.

  • Posterior is our updated belief after combining what we thought before (the prior) with what the data shows (the likelihood). This is what we use to estimate how likely it is that Green is actually better than Blue.

  • Evidence (also called the marginal likelihood) is a scaling factor that makes sure all the probabilities add up to 100%.

In practice, we rarely compute the denominator directly. The evidence is constant with respect to the parameter we're estimating (in our case, the conversion rate), so we can simplify the formula:

Posterior ∝ Prior × Likelihood

This tells us: What we now believe = What we believed × What the data tells us

To figure out what we now believe about a button's true conversion rate, we first need to model what we believed before the test. This means we need a mathematical tool that can express that belief.

In the Bayesian approach, we begin with a belief. A belief is a sense of which outcomes are more or less plausible before we see any data. What do we already know about our site's conversion rates? Are they usually around 5%? Closer to 10%? How likely is it that a new feature converts higher than 6%?

We capture that belief with a prior distribution. That's a curve that assigns different degrees of plausibility to different conversion rates. It doesn't predict what will happen. It simply quantifies what we consider likely before we collect new evidence.

We know each visitor either converts or doesn't—success or failure, yes or no. When we run an experiment with a fixed number of visitors and record how many convert, we're counting the number of successes across a series of independent trials. That's precisely the kind of process the binomial distribution is meant to describe: it tells us the probability of observing a certain number of conversions, given a fixed number of trials and a known conversion rate.
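As a quick illustration (the candidate rates below are arbitrary, and scipy is just one convenient tool; neither comes from the original text), the binomial pmf shows which conversion rates make 27 sign-ups out of 500 plausible:

```python
from scipy.stats import binom

# How well does each candidate conversion rate explain 27 sign-ups out of 500?
for rate in [0.01, 0.05, 0.054, 0.06, 0.15]:
    likelihood = binom.pmf(27, 500, rate)  # P(27 conversions | 500 visitors, this rate)
    print(f"rate = {rate:.3f} -> likelihood = {likelihood:.4f}")
```

Rates near the observed 5.4% give the data a reasonably high probability; rates like 1% or 15% make it vanishingly unlikely.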

But, in practice, we don't know the conversion rate. Instead, we're trying to estimate it.

To estimate that, we need a distribution over the space of possible conversion rates: a curve that captures how likely each value seems in light of the evidence.

That's what the Beta distribution gives us. It's defined over the interval from 0 to 1, and it's flexible enough to model everything from total uncertainty to strong conviction. Here's what it can look like in different scenarios:
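A rough matplotlib sketch (the parameter choices below are purely illustrative) gives a feel for those shapes:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

x = np.linspace(0, 1, 500)

# A few illustrative shapes: flat uncertainty, a weak belief near 10%, a strong belief near 10%
shapes = [
    (1, 1, "Beta(1, 1): total uncertainty"),
    (2, 18, "Beta(2, 18): weak belief around 10%"),
    (20, 180, "Beta(20, 180): strong belief around 10%"),
]

for a, b, label in shapes:
    plt.plot(x, beta.pdf(x, a, b), label=label)

plt.xlabel("conversion rate")
plt.ylabel("density")
plt.legend()
plt.show()
```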

Even better, it works naturally with the binomial distribution. When we update a Beta prior with binomial data, we get a Beta posterior. This property, called conjugacy, makes the updating process mathematically seamless. Our beliefs can change but the form of our model stays the same.

Say your experience as a product manager tells you that most new features tend to "convert" at around 10% of users. Rather than discarding that knowledge, Bayesian inference lets you incorporate it directly into your model by shaping your prior distribution to reflect it.

To incorporate that knowledge into our model, we use the Beta distribution, which is controlled by two parameters: α and β. These parameters shape the curve and determine how confident we are in different conversion rates. The mean of a Beta distribution is:

mean = α / (α + β)

So if we believe the true rate is likely around 10%, we can choose values of α and β that center the distribution at 0.10. For example, Beta(2,18) has a mean of 2/(2+18) = 0.10.

This is a way of saying: "Before seeing any data, I believe a 10% conversion rate is most likely, and I'm about as confident in that belief as if I had already seen 20 users, 2 of whom converted."

The notation Beta(α,β) simply tells us how that belief is distributed across the 0-1 range. Higher values of α and β mean stronger prior confidence; smaller values mean more uncertainty.
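If you want to sanity-check what a prior like Beta(2, 18) actually says, a minimal scipy sketch (the library choice is an assumption, not part of the original) does the job:

```python
from scipy.stats import beta

prior = beta(2, 18)

print(prior.mean())          # 0.10: the prior is centered on a 10% conversion rate
print(prior.interval(0.95))  # the central 95% range of rates this prior considers plausible
```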

What the data tells us is that 27 out of 500 visitors converted for the Green button. To update our prior belief, represented by Beta(2,18), we simply incorporate the new evidence.

We do that by adding 27 to α (for the conversions) and 473 to β (for the non-conversions). This gives us an updated posterior distribution:

Posterior = Beta(2+27, 18+473) = Beta(29,491)

Here's what that posterior looks like:

This posterior distribution combines what we believed before the test with the new evidence from the data. It now represents our current belief about the true conversion rate of the Green button.

We model the Blue button the same way.

Starting with the same prior belief, that Beta(2, 18), centered at a 10% conversion rate, we now update it based on the observed data: 23 conversions out of 500 visitors.

Adding 23 to α for the conversions and 477 to β for the non-conversions gives us the updated posterior for the Blue button:

Posterior = Beta(2+23, 18+477) = Beta(25,495)

Graphically:

This new distribution reflects what we now believe about the Blue button's true conversion rate, combining both our prior knowledge and the test data.

At this point, we've done something more powerful than produce two conversion rates. We've built two full probability distributions, one for each button. Each curve shows not just a single estimate, but a range of possible conversion rates, weighted by how plausible each one is, given both our prior belief and the observed data.
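One way to see both those estimates and their uncertainty at once (a sketch, reusing the matplotlib/scipy style from above) is to draw the two posterior curves on the same axes:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

x = np.linspace(0, 0.12, 500)

# Posteriors from the updates above
plt.plot(x, beta.pdf(x, 29, 491), label="Green: Beta(29, 491)")
plt.plot(x, beta.pdf(x, 25, 495), label="Blue: Beta(25, 495)")

plt.xlabel("conversion rate")
plt.ylabel("density")
plt.legend()
plt.show()
```

The two curves peak in different places, but they overlap substantially, and that overlap is exactly the uncertainty the next question has to deal with.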

So instead of asking, "Is 5.4% greater than 4.6%?", we're asking a richer question: What do we actually believe about the true rate behind each button?

And more specifically:

Given everything we know, how likely is it that Green is actually better than Blue?

This isn't something we can answer by comparing averages. Each distribution captures uncertainty, and those uncertainties overlap. What we really want to know is: if we drew one plausible value from each distribution, how often would Green outperform Blue?

There's no simple formula for that, but the logic is straightforward. We can simulate the scenario: draw 10,000 samples from each distribution, compare them one by one, and count the fraction of times Green wins.

In our case:
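Here's a minimal numpy sketch of that simulation (the seed and the exact sampling code are arbitrary choices, not the original implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # the 10,000 draws described above

# Draw plausible conversion rates from each posterior
green = rng.beta(29, 491, n)
blue = rng.beta(25, 495, n)

# Fraction of draws in which Green's rate beats Blue's
prob_green_better = (green > blue).mean()
print(f"P(Green > Blue) ≈ {prob_green_better:.2f}")  # roughly 0.71 with these posteriors
```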

This gives you a direct, interpretable answer: Based on the data and your prior knowledge, there's a 71% chance that Green truly outperforms Blue.

But real decisions rarely turn on the question of which variant is probably better. That's only the start. Once we have full posterior distributions, we can ask what matters more: how much better, how confidently, how safely, and relative to what?

If we choose Green, what kind of improvement should we expect? This is the expected uplift: the mean difference between Green and Blue.

A small uplift, say 0.2%, may not justify the cost of switching, even if it's likely real. Better isn't always enough. Sometimes, it has to be worth it.
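Reusing the same sampled posteriors (still just a sketch under the assumptions above), the expected uplift is simply the mean of the sampled differences:

```python
import numpy as np

rng = np.random.default_rng(42)
green = rng.beta(29, 491, 100_000)  # Green posterior from above
blue = rng.beta(25, 495, 100_000)   # Blue posterior from above

uplift = green - blue
print(f"expected uplift: {uplift.mean():.2%}")  # around 0.8 percentage points here
```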

Or consider the opposite: what if we're wrong? Even with 71% odds in Green's favor, there's still a 29% chance we're backing the loser. So we ask: how bad might that mistake be? Will I still be able to pay my rent if I choose Green? What about the Maldives?

That's where Value at Risk comes in. By looking at the 5th percentile of the uplift distribution (the bottom tail of all plausible differences) we can see how far behind Green could plausibly fall. It's a way of quantifying regret before you commit.
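In code, that regret number is just the lower tail of the sampled uplift (same assumed posteriors as above):

```python
import numpy as np

rng = np.random.default_rng(42)
uplift = rng.beta(29, 491, 100_000) - rng.beta(25, 495, 100_000)

# 5th percentile of the uplift: how far behind Green could plausibly end up
downside = np.percentile(uplift, 5)
print(f"5th percentile of uplift: {downside:.2%}")  # negative here: Green could plausibly be worse
```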

Or maybe you're not just asking which variant is better. You're asking whether Green clears the bar your team already set: a minimum acceptable conversion rate, say 6%. In that case, the relevant quantity isn't:

P(Green > Blue)

but rather:

P(Green > 6%)
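With the sampled posterior for Green, that question is a one-liner (still a sketch under the same assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
green = rng.beta(29, 491, 100_000)  # Green posterior from above

# Probability that Green clears the 6% bar, not just that it beats Blue
print(f"P(Green > 6%) ≈ {(green > 0.06).mean():.2f}")
```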

You can also ask about scale. Not every win is a win worth acting on. Suppose you're planning a redesign that touches multiple teams and will take weeks to implement. You don't just want to know whether Green is better. You want to know whether it's significantly better, quite literally. If your minimum bar is a one-point improvement, the question becomes: "Is Green better enough to justify the cost of switching?"

Now you're not just testing whether there's a difference. You're testing whether it's big enough to justify a decision.
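The "big enough" version is just another threshold, this time on the sampled uplift (same sketch, same assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
uplift = rng.beta(29, 491, 100_000) - rng.beta(25, 495, 100_000)

# Probability that Green beats Blue by at least one full percentage point
print(f"P(uplift > 1 point) ≈ {(uplift > 0.01).mean():.2f}")
```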

None of these questions can be answered within a frequentist framework, at least not without awkward workarounds. In Bayesian analysis, they fall out naturally from the shape of the posterior distribution itself.

Once you have that, you're not just asking which variant won. Instead, you're answering how sure you are, how good the upside is, and how painful the downside could be.

And that's the actual job.

Written by Thaís Steinmuller

Content Engineer

Passionate about making complex data accessible and building tools that help teams collaborate effectively around their data.