Chartbeat Headline Testing uses the Thompson Sampling methodology. Here's how...
What's in an Algorithm?
Choosing between two or more different options is a very general problem in statistics. There are a number of different algorithms available, each with its own advantages and disadvantages.
One fundamental consideration in designing an algorithm is the tradeoff between exploration and exploitation. When you're running a test between different outcomes, the more trials one outcome gets, the more confident you can be about its true value. On one hand (exploration), you would like to give a sufficient number of trials to all of the different options so that you can definitively declare that one option is better than the others (or that options are relatively equal). On the other (exploitation), you want to be playing the best option as often as possible.
One simple and widely used approach is to give an equal number of trials to each option in the experiment. You can determine statistically, in advance, how many trials you need to disambiguate your options. This method is heavily exploration-driven: until the experiment is over, you cannot reap the benefits of knowing which option is best. Nevertheless, this approach has been used successfully by companies like Optimizely.
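To make the "determine in advance" step concrete, here is a sketch of a standard per-variant sample-size calculation for comparing two click-through rates with a two-proportion z-test. The function name and defaults (5% significance, 80% power, equal allocation) are our own illustrative choices, not anything specific to the products mentioned above.

```python
from statistics import NormalDist

def required_sample_size(p_baseline, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate per-variant sample size for a two-sided two-proportion z-test.

    Illustrative sketch: assumes equal allocation between the two variants.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for significance
    z_beta = NormalDist().inv_cdf(power)            # critical value for power
    p_alt = p_baseline + min_detectable_lift
    p_bar = (p_baseline + p_alt) / 2                # pooled rate under the null
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_baseline * (1 - p_baseline)
                             + p_alt * (1 - p_alt)) ** 0.5) ** 2
    return int(numerator / min_detectable_lift ** 2) + 1

# e.g. a 2% baseline CTR, hoping to detect an absolute lift of 0.5%
n = required_sample_size(0.02, 0.005)
```

Note how quickly the required sample grows as the detectable lift shrinks: this is exactly why a fixed-allocation test on a low-traffic page can take a long time to conclude.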
Another algorithm, which skews more heavily toward exploitation, is the epsilon-greedy bandit algorithm, an example of a more general class called multi-armed bandit algorithms. A value of epsilon is chosen (say, 10%); 10% of the time an option is chosen at random, and the other 90% of the time the option that is currently performing best is chosen. While this algorithm is guaranteed to find the best arm eventually, in practice it may take a long time for the experiment to conclude.
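The epsilon-greedy rule fits in a few lines. This is a minimal sketch, assuming we track per-arm click and trial counts; the function name and data layout are ours, not from any particular library.

```python
import random

def epsilon_greedy_choice(clicks, trials, epsilon=0.1):
    """Pick an arm index: explore with probability epsilon, otherwise exploit.

    clicks[i] / trials[i] is the observed click-through rate of arm i.
    Illustrative sketch only.
    """
    arms = list(range(len(trials)))
    if random.random() < epsilon:
        # explore: pick any arm uniformly at random
        return random.choice(arms)
    # exploit: pick the arm with the highest observed click-through rate
    rates = [clicks[i] / trials[i] if trials[i] else 0.0 for i in arms]
    return max(arms, key=lambda i: rates[i])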
When we are choosing an algorithm to use for the specific problem of headline testing, we have a few considerations and design objectives in mind:
- Headlines may only live on the home page for a short amount of time, so we need experiments to be able to finish in a reasonable amount of time.
- The click-through rate (CTR) for a given headline may depend both on the headline itself and on its position on the page, so we need the algorithm to perform well across a wide range of traffic levels.
- We want to maximize the amount of engagement with the article.
Breaking Down Thompson Sampling
For these reasons, we selected an algorithm called Thompson Sampling (also known as a Bayesian Bandit algorithm) in order to intelligently split the difference between exploration and exploitation.
In Thompson Sampling, we use Bayesian statistics to determine our confidence that each headline is currently the best, and we then serve the headlines in proportion to that confidence. For instance, imagine we are running an experiment with three headlines: A, B, and C. Based on the data so far, we believe there is a 50% chance that headline A is the best, a 30% chance for headline B, and a 20% chance for headline C. We will then serve headline A 50% of the time, headline B 30% of the time, and headline C 20% of the time. As new data comes in, we update our beliefs about the different headlines and repeat. One nice property of this process is that inferior headlines are shown less and less as we become more and more sure they are inferior; conversely, when a clear winner starts to emerge, we play that winner more and more. Companies like Google have also implemented Thompson Sampling algorithms to great success.
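For click-through rates, this confidence-weighted serving falls out naturally from sampling each headline's Beta posterior. The sketch below assumes uniform Beta(1, 1) priors and our own function name; it illustrates the general technique, not Chartbeat's production code.

```python
import random

def thompson_choice(clicks, trials):
    """Thompson Sampling for click-through rates with Beta(1, 1) priors.

    Draw one sample from each headline's posterior and serve the headline
    whose sample is largest. Over many requests, each headline is served
    in proportion to the probability that it is the best one.
    """
    samples = [
        random.betavariate(1 + clicks[i], 1 + trials[i] - clicks[i])
        for i in range(len(trials))
    ]
    return max(range(len(trials)), key=lambda i: samples[i])
```

Because the choice is made by sampling rather than by a fixed schedule, no separate "serve A 50% of the time" bookkeeping is needed; the posterior draws produce those proportions on their own.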
One downside of pure Thompson Sampling is that it is fundamentally a regret-minimization algorithm: if a headline test has two options that are about equal, it can take a very long time to complete. From the perspective of getting clicks and engagement on your article, this is acceptable, because you are doing equally well with either headline. Practically, however, we want experiments to finish in a reasonable amount of time. For this reason, in practice we use a slight generalization of Thompson Sampling (also used by Google, for instance) that allows the experiment to finish once we are sufficiently confident that there is little benefit to continuing.
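One way to implement such a stopping rule is to estimate, by Monte Carlo, how much CTR could still be gained over the current leader; when that "value remaining" is small, the test can end even if two arms are nearly tied. This is a sketch of that general idea under our own assumptions (Beta(1, 1) priors, leader chosen by posterior mean, 95th-percentile gain), not Chartbeat's exact criterion.

```python
import random

def value_remaining(clicks, trials, n_samples=2000, quantile=0.95):
    """Monte-Carlo estimate of the potential value remaining in a test.

    For each posterior draw, compute the relative CTR gain the best arm in
    that draw offers over the current leader. If a high quantile of this
    gain is below a small threshold (say, 1%), there is little benefit to
    continuing the experiment. Illustrative sketch only.
    """
    k = len(trials)
    # current leader by posterior mean under Beta(1, 1) priors
    leader = max(range(k), key=lambda i: (1 + clicks[i]) / (2 + trials[i]))
    gains = []
    for _ in range(n_samples):
        theta = [random.betavariate(1 + clicks[i], 1 + trials[i] - clicks[i])
                 for i in range(k)]
        gains.append((max(theta) - theta[leader]) / theta[leader])
    gains.sort()
    return gains[int(quantile * n_samples)]
```

With a clear winner the value remaining collapses toward zero quickly; with two near-equal arms it stays positive but bounded, which is exactly the case where stopping early costs almost nothing.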
What Makes Chartbeat Unique
One Chartbeat-specific innovation is that we incorporate reader engagement into our experiment results. Instead of using CTR as the sole determinant of whether one headline is better than another, we use a combination of CTR and engagement data to reward headlines that encourage readers to read the articles they click on, and to discourage clickbait. We think this gives you a better way to assess the true value of a headline.
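Chartbeat's actual scoring is not public, but a purely hypothetical reward along these lines might discount clicks that are not followed by real engagement. The threshold and weights below are invented for illustration.

```python
def headline_reward(clicked, engaged_seconds, min_engaged=15.0):
    """Hypothetical combined reward for one pageview.

    A click only counts fully when the reader actually engages with the
    article; the 15-second threshold and 0.5 discount are invented here,
    not Chartbeat's real parameters.
    """
    if not clicked:
        return 0.0
    return 1.0 if engaged_seconds >= min_engaged else 0.5
```

Feeding a reward like this into the bandit, instead of raw clicks, means a headline that wins many clicks but loses readers immediately no longer looks like the best option.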