Building an Experiment Scorecard

This article is part of our series on Experimentation Program Governance.

In this article, you will learn where experiments go wrong in the execution stage and how that can send you down the wrong path. You will also learn how to create a scorecard that gives you a predictable way to evaluate the quality of your experiments, so that you run high-quality tests regardless of the outcome.

Experiments fail

An important factor to consider in every Experimentation Program is that experiments can fail. That is the very nature of running an experiment. You don’t know the actual outcome. You might have a rough idea or hope for what the outcome could be, but other than that, it’s not in your control.

An experiment could fail because the hypothesis you originally created was not the right one, and that is okay.

An experiment can also fail because of other issues, and these might well be within your control.

Not all experiments are created equal

If you compare any two experiments, various factors could tell you that they weren't created alike. This is even more apparent in organizations where testing programs grow in an ad hoc way.

1. Variances in the capability of the team or person who set up the test

As an Experimentation Program grows within a company, the people creating experiments will have varying capabilities. Product managers and others who have never been part of the experimentation process are tasked with running experiments. Expecting them to run tests to the same quality as the core team is an exercise in naiveté.

Will they come up with a strong hypothesis?
Will they run meaningful tests?
Will they run more complex tests?

A lot of it comes down to how these new individuals are onboarded and trained. (We have an entire playbook on that coming soon.)

2. Poor or no QA can still mess up a test

If a test goes live without thorough QA, the results could be skewed. The decisions made based on those results could be wrong, yet the team that set up and analyzed the test may be unaware of it. The corners cut could have dire consequences.

3. Lack of good practices

This could be linked to onboarding and governance, but in many cases, we have seen it stem from organizations expecting teams to chase vanity metrics (we talked about them before – you can read part 1 and part 2 here).

In our research, we have uncovered how some CROs add details to a test later in the process, redefine the hypothesis long after the test has finished, or alter important information. The motivating factor is to ensure that the tests are seen as a success. Unfortunately, the outcome is that the company makes decisions based on flawed outputs.

Defining the qualities of a good experiment

To create an Experiment Scorecard, you first need to define the qualities of a good experiment. The criteria need to be objective to make for an easy review: if two people were to score a test based on the scorecard, they should arrive at the same conclusion.

Here are some good places to start when building a scorecard:

  1. Source of the experiment idea – Is this a random idea or did it come from research insights? Can you connect the dots?
  2. Strength of the hypothesis – Has the experiment been created with a strong hypothesis? Is it a random statement or has it been built using a toolkit like this?
  3. Prioritization – Is this experiment running because it meets clear prioritization criteria, or did it bypass them?
  4. Ties in with business initiatives – Is the experiment aligned with business goals or is it just a random test?
  5. Details pre-live – Did the experiment setup capture all the required information?
  6. Pipeline – Did the experiment follow all necessary steps in the pipeline, or did it skip steps or jump back and forth?
  7. Details post-live – Was additional information added or amended after the experiment was paused or stopped?
  8. Clear analysis – Did it meet the standards set for analysis of the metrics?
  9. Insights & actionable steps – Does the experiment have clear insights from the analysis? Are there any actionable steps? The lack of these could indicate that the experiment wasn't reviewed properly after the reports came in.

Building the scorecard

Once you have picked the variables that define your scorecard, you need to give them weighted values.

Assign each variable a positive or negative value that reflects its impact on the experiment: the higher the impact, the higher the value. Then clearly define the scoring criteria for each variable.

Let’s take an example – Strength of the hypothesis

We could create four breakpoints to score this:

  1. No hypothesis present: SCORE -25
  2. Hypothesis present but statement doesn’t constitute a hypothesis: SCORE -10
  3. Hypothesis present but no toolkit used: SCORE 0
  4. Hypothesis present and toolkit used correctly: SCORE 25

By creating a weighted model, you remove the ambiguity in the scoring process.
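
To make this concrete, here is a minimal sketch of how a weighted scorecard could be expressed in code (Python, purely for illustration). The criteria names, breakpoint labels, and score values below are assumptions built from the examples above, not prescribed weights – swap in your own rubric.

```
# A minimal experiment scorecard sketch. All criteria names, breakpoint
# labels, and score values are illustrative assumptions -- tune them to
# your own program's standards.

# Each criterion maps its breakpoints to weighted scores.
# Higher-impact criteria span a wider score range.
SCORECARD = {
    "hypothesis_strength": {
        "no_hypothesis": -25,
        "statement_not_a_hypothesis": -10,
        "hypothesis_without_toolkit": 0,
        "hypothesis_with_toolkit": 25,
    },
    "idea_source": {
        "random_idea": -15,
        "linked_to_research_insights": 15,
    },
    "qa": {
        "no_qa": -20,
        "thorough_qa": 20,
    },
}


def score_experiment(ratings: dict[str, str]) -> int:
    """Sum the weighted scores for each criterion rating.

    `ratings` maps a criterion name to the breakpoint the reviewer
    selected. Unknown criteria or breakpoints raise KeyError, so a
    reviewer cannot score outside the agreed rubric.
    """
    return sum(
        SCORECARD[criterion][breakpoint]
        for criterion, breakpoint in ratings.items()
    )


# Example review of a single experiment.
ratings = {
    "hypothesis_strength": "hypothesis_with_toolkit",
    "idea_source": "linked_to_research_insights",
    "qa": "no_qa",
}
print(score_experiment(ratings))  # 25 + 15 - 20 = 20
```

Because every breakpoint carries a fixed value, two independent reviewers working from the same rubric will arrive at the same total – exactly the objectivity the scorecard is meant to guarantee.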

Putting the scorecard to work

The Experiment Scorecard is an essential tool if you want to ensure that your Experimentation Program operates at the highest level. It removes subjective thinking and allows for clear evaluation.

However, this can only be implemented if you also do the following:

  1. Implement a peer review pre-mortem and post-mortem – This means the scoring is done at two separate points, which reduces the chances of low-quality experiments going out (a minimal sketch of this follows the list).
  2. Peer review must be independent – The person conducting the peer review must have no stake in the experiment or its outcome, which avoids bias in the process.
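
As a hypothetical sketch of how the two review points might be recorded, the snippet below stores a pre-mortem and a post-mortem score per experiment and flags any test whose pre-mortem score falls below a launch threshold. The ExperimentReview structure, the threshold value, and the gating rule are all assumptions for illustration, not a prescribed workflow.

```
from dataclasses import dataclass
from typing import Optional

# Assumption for illustration: experiments scoring below this at the
# pre-mortem review should not go live.
LAUNCH_THRESHOLD = 0


@dataclass
class ExperimentReview:
    """Scores from the two independent peer-review points."""
    experiment_id: str
    pre_mortem_score: int                     # scored before the test goes live
    post_mortem_score: Optional[int] = None   # scored after analysis is done

    def cleared_for_launch(self) -> bool:
        return self.pre_mortem_score >= LAUNCH_THRESHOLD


review = ExperimentReview("exp-042", pre_mortem_score=-10)
if not review.cleared_for_launch():
    print(f"{review.experiment_id}: fix quality issues before launch")
```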

Manuel da Costa

A passionate evangelist of all things experimentation, Manuel da Costa founded Effective Experiments to help organizations make experimentation a core part of every business. On the blog, he talks about experimentation as a driver of innovation, experimentation program management, change management, and building better practices in A/B testing.