A few years back, a writer friend who worked at a prominent blog explained to me that her employer's sensationalist, run-on headlines were crafted with the help of a cutting-edge A/B testing system. Each article would go live with multiple versions of the same headline; the version that drew the most clicks would become the canonical version. This, I thought, this is the future: algorithms drawing on user feedback to adjust our texts on the fly for maximum impact.
In 2012, I was among the millions of Americans who noticed that Barack Obama and his campaign staff had started writing some kinda weird messages to me. They'd arrive with subject lines like "Hey" or "Would Love to Meet You." They got talked about at dinner parties. After Election Day, explanations were offered in the press. These subject lines weren't the brainchild of some mad creative marketing genius. No, they were carefully tested beforehand among multiple variations sent to smaller groups of supporters, and only the most successful versions made it to the full campaign email list. Obama's campaign raised hundreds of millions of dollars through email direct marketing.
Both of these examples of data-backed writing were early indicators of today's textual landscape, in which a great deal of the writing we see online—especially things like headlines and instructions and marketing copy—has been tested for efficacy before we read it, or is being tested on us as we read it. Our written language is changing, as any Viral Nova headline makes clear, and the assumption is that it is leveraging user data to more perfectly cater to readers' desires.
This is the assumption, anyway. But a new report by Martin Goodson of Qubit, emphatically titled "Most Winning A/B Test Results are Illusory," argues that the methodology behind much of the analytics-driven decision-making on the web is flawed. It warns against common pitfalls such as multiple testing, in which many A/B tests are performed simultaneously. The report states, "there is always likely to be a feel-good factor from a few wins if you try many tests." My grasp on this isn't perfect, but that sounds a lot like what the aforementioned blog and the Obama campaign were both doing. If so, the most immediate effect of these tests would have been to force writers to experiment, coming up with weirdo alternatives to feed into the system. Some real stinkers likely made it through the testing process simply because they yielded false positives, but at the same time, simply encouraging copywriters to innovate and improvise also probably led to some big hits. Hey.
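The multiple-testing pitfall is easy to demonstrate with a quick simulation. The sketch below is my own illustration, not Qubit's methodology: it runs many A/B tests in which both headline variants are, by construction, identical in appeal, then applies a standard two-proportion z-test at the conventional 5% significance level. Even with no real difference anywhere, a handful of tests will declare a "winner" by chance. (The conversion rate, sample sizes, and test count are invented for illustration.)

```python
import math
import random

def ab_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # survival function of the standard normal, doubled for a two-sided test
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(42)
TRUE_RATE = 0.03   # both headlines convert identically -- there is no real winner
VISITORS = 2000    # visitors shown each variant, per test
NUM_TESTS = 100    # e.g. one headline test per article

false_wins = 0
for _ in range(NUM_TESTS):
    # simulate clicks for two variants drawn from the SAME underlying rate
    a = sum(random.random() < TRUE_RATE for _ in range(VISITORS))
    b = sum(random.random() < TRUE_RATE for _ in range(VISITORS))
    if ab_test_pvalue(a, VISITORS, b, VISITORS) < 0.05:
        false_wins += 1

print(f"{false_wins} of {NUM_TESTS} tests found a 'winning' headline by chance alone")
```

At a 5% threshold you'd expect roughly five spurious winners per hundred tests, which is the "feel-good factor" the report describes. This is also why Showalter's note below, about trying to reproduce results before elevating them to best practices, is the textbook remedy.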
In its skewering of commonly used testing methodologies, the report serves as a useful reminder that the algorithms that shape online cultural production, writing included, are themselves cultural agents. Like a human author, they can carry with them certain flaws and biases, which then get passed on into the text. As the practice of writing continues to change, it will be of critical importance to understand it as an interplay among authorial voices, readers' behaviors, and the algorithms that mediate between the two.
For now, we at Rhizome will continue to fly blind in our headline-writing and fundraising email-sending. Algorithms of the future, we look forward to working with you.
Update: Amelia Showalter replied to our initial inquiry after this was published, writing:
yes, there were certainly lots of test results that could have been false positives. That was one reason we only rarely elevated any single positive test result into "best practice" status, and often would try to reproduce results to verify things -- or we'd just move on to other tests (it was a pretty frenetic atmosphere). It also didn't hurt that we had such enormous sample sizes! Working with a large national email list definitely had its perks. :)