When Conducting A/B Tests Isn't Advisable
One of the key dilemmas for young PMs and product teams is whether they should test every product change to validate it. The short answer is no: a/b tests aren't advisable in all situations. In fact, they sometimes hurt the teams that are most bullish on them.
So in this post, I'll cover when conducting a/b tests isn't advisable. Before we start, an announcement!
The applications for the 3rd cohort of Product-Led Growth are now open! It’s highly recommended for those who want to learn advanced skills in Product Growth with a super-smart peer group of existing PMs and Founders.
Outcomes
The course is heavily focussed on practical learning that you can apply on the job. A tangible outcome for existing cohort members has been moving to senior product roles. For founders, the outcome of the course is creating/modifying their growth strategy for the product.
Here are a few testimonials about the course to check out 🤓
Why is this course different?
Read on for more context.
First, most courses around growth are focussed on marketing. This one is a product-first course. PMs and Founders need product-led growth because spending money to grow isn't always a viable option.
Second, courses usually focus on learning but not on creating immediate impact in your current job. The content, tools, and templates provided in this course are such that many cohort members built their own product growth plan. Building your own product growth plan is a part of the curriculum, and that's proof of value right there :)
Third, the course has 200+ objective questions on frameworks and case studies that help you evaluate whether you really understood the concepts and can apply them.
Add to that a capstone project, case studies, interview prep for FAANG, and a community of super-smart folks 🔥
You can learn more about the course at https://www.pmcurve.com/
I will be shortlisting applications for this cohort by July 15th, and seats are limited. So apply today if you are considering it.
Back to the post,
Let's start with two major limitations of a/b tests, which explain why they aren't a perfect tool.
Limitations of a/b tests
The long-term impact of features is hard to predict through a/b tests. I wrote about this in What Goes Wrong in A/B Tests.
It is hard to know the long-term behavior of users. Since most a/b tests run for only a couple of weeks, they almost always capture short-term behavior, not long-term behavior.
An example of this, which all of us have faced, is the sales pitch. Sometimes a company will use a sales pitch to create a sense of urgency, make false claims, and convert users into paid customers. Within a couple of business cycles, metrics like conversion and revenue go up.
But give it enough time, and conversion problems will start appearing due to the bad brand perception the sales team has created. You can guard against this by tracking counter-metrics like customer satisfaction alongside conversion rates.
While a mature product team can avoid focussing on short-term wins, the unpredictability of long-term behavior is inevitable in many cases. A novel feature can get the attention of users, which can translate into other positive behaviors like engagement and transactions. However, once users get used to it, the feature stops paying off.
An example of this is adding confetti after a successful task completion or transaction. Users may notice it the first time, and that can improve recall and retention of the app. But the novelty wears off after a while and the feature stops paying returns.
Another major limitation of a/b tests is that they don't account for future cohorts of users. You can't test a change on future user segments who haven't downloaded or started using your app yet. So when a product grows and attracts a new set of users, a previously successful feature may lack feature-market fit for the new segment.
Overcoming the Limitations via Backtesting
To overcome these limitations and measure the long-term impact of features, we would have to run a/b tests for a longer period of time. The catch with this approach is that it limits the total number of tests we can run on the user base.
The solution arises in the form of reverse a/b testing, where we make the new experience the default and let a small test group experience the old one. It's also known as backtesting.
In backtesting, we allocate only a small percentage of the user base, say 1-5%, to the backtest group. None of the new features are visible to this group. At the end of every quarter, you can compare the backtest group's key metrics against the rest of the users, who are experiencing all the features the team shipped during the quarter.
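To make this concrete, here is a minimal sketch of what that quarterly comparison could look like. The file name, column names, and metrics below are hypothetical placeholders, not a prescribed setup; the only point is to compare the holdback (backtest) group against everyone else on the same metrics.

```python
import pandas as pd

# Hypothetical export: one row per user with their group assignment and quarterly metrics.
# "holdback" users kept the old experience; "default" users saw every feature shipped this quarter.
df = pd.read_csv("quarterly_user_metrics.csv")  # assumed columns: user_id, group, orders, revenue, retained_d90

# Average each key metric per group
summary = df.groupby("group")[["orders", "revenue", "retained_d90"]].mean()
print(summary)

# Relative lift of the quarter's shipped features over the 1-5% holdback group
lift_pct = (summary.loc["default"] / summary.loc["holdback"] - 1) * 100
print(lift_pct.round(1))
```

If the default group isn't meaningfully ahead of the holdback on these metrics, the quarter's shipped features probably aren't adding the value the individual a/b tests suggested.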
When to Avoid a/b Tests
Even if we overcome the limitations via backtesting, there are still scenarios where we should avoid a/b tests. There are five such instances:
When there is no clear hypothesis
When it is used as a conflict management tool
When the user base is small
When changes are substantial or strategic
When testing could lead to negative experiences
Let’s cover them one by one!
When there is no clear hypothesis
It is important to have a clear hypothesis about why a change should work. Without one, we won't know why the test worked or didn't work.
Further, if the test doesn't work, we won't have learnt anything from it, because we never knew why we were running the test in the first place.
In the long run, this approach wastes valuable resources and promotes activity over outcomes in the PM org.
When it is used as a conflict management tool
Sometimes, PM teams start using a/b tests as a tool to resolve conflicts over whether a change should be made. The conflict is often a symptom of something else, such as a lack of a clear hypothesis or friction between two people.
It’s important to understand the source of conflict before using a/b tests.
When the user base is small
Statistical significance is a key factor in the validity of A/B testing results. If you have too small a user base, it will take a long time to reach statistically significant results.
This situation often occurs with new products that are still seeking Product-Market Fit (PMF), or with certain parts of the user funnel where user numbers are limited.
Pre-PMF Products: When a product is in its early stages, the user base may be quite small. In this case, conducting experiments and expecting statistically significant results is challenging.
Parts of the Funnel: Certain stages in a user's journey or funnel may have a limited number of users. For instance, an eCommerce app can have a large number of site visitors (top of the funnel), but only a small percentage of them might reach the checkout stage (bottom of the funnel) or the support page. Experimenting on the checkout or support page has the same problem as pre-PMF products.
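To see why this bites, here is a back-of-the-envelope sample-size sketch. The baseline conversion rate, detectable lift, and weekly traffic numbers are illustrative assumptions, not figures from any real product.

```python
# Rough sample-size estimate for a two-proportion a/b test (illustrative numbers only)
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.05          # assumed checkout conversion rate
mde = 0.01               # smallest absolute lift we care to detect (5% -> 6%)
effect = proportion_effectsize(baseline + mde, baseline)

# Users needed per variant at 5% significance and 80% power
n_per_variant = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)

weekly_checkout_users = 800  # hypothetical bottom-of-funnel traffic, split across 2 variants
weeks_needed = (2 * n_per_variant) / weekly_checkout_users
print(f"~{n_per_variant:,.0f} users per variant, roughly {weeks_needed:.0f} weeks to run")
```

With these made-up numbers, the answer comes out to roughly 4,000 users per variant, which is around ten weeks of checkout traffic for a one-point lift. That is exactly when the alternatives below become the better use of your time.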
So what should you do when the number of users becomes an issue? Rely on qualitative feedback through user interviews, and look at industry benchmarks and competitors to build conviction.
When changes are substantial or strategic
For major redesigns or strategic shifts, a/b testing isn't an effective tool. In such cases, user testing or market research may be more appropriate.
You should also build strong conviction before making these major changes. Otherwise, you may face backlash like Snapchat did, when it had to redesign its redesign. From Vox:
Snap decided to redesign the app after concluding that it was difficult for people to use, preventing adoption by a wider audience. CEO Evan Spiegel also wanted to separate personal content from public content, so the redesign moved stuff from brands and celebrities to one side of the app, and left private friend posts on the other.
Many users hated it.
More than a million people signed a Change.org petition asking the company to restore the old app. Celebrities like Kylie Jenner complained on Twitter about not using Snap any longer. Most potently, Snap added fewer users and made less in ad revenue than expected, citing the redesign as a culprit.
Now Snap is redesigning its redesign to feel a little more like it was before
Compare this to what happened when Facebook launched News Feed in 2006. They also faced backlash from users. From TechCrunch:
There has been an overwhelmingly negative public response to Facebook’s launch of two new products yesterday. The products, called News Feed and Mini Feed, allow users to get a quick view of what their friends are up to, including relationship changes, groups joined, pictures uploaded, etc., in a streaming news format.
Many tens of thousands of Facebook users are not happy with the changes. Frank Gruber notes that a Facebook group has been formed called “Students Against Facebook News Feed”. A commenter in our previous post said the group was closing in on 100,000 members as of 9:33 PM PST, less than a day after the new features were launched. There are rumours of hundreds of other Facebook groups calling for a removal of the new features.
So what did Facebook do? Facebook didn’t roll back the News Feed. In fact, it was one of the best strategic calls they ever made. It allowed them to monetise by introducing ads. They did launch additional privacy controls for News Feed and Mini-Feed to address user concerns.
So when changes are substantial or strategic, the focus should be on building conviction. A/B tests aren't the answer.
When testing could lead to negative experiences
LinkedIn ran social experiments on more than 20 million users over five years. The study found that relatively weak social connections were more helpful in finding jobs than stronger social ties.
From the NYT:
LinkedIn ran experiments on more than 20 million users over five years that, while intended to improve how the platform worked for members, could have affected some people’s livelihoods, according to a new study.
In experiments conducted around the world from 2015 to 2019, Linkedin randomly varied the proportion of weak and strong contacts suggested by its “People You May Know” algorithm — the company’s automated system for recommending new connections to its users. Researchers at LinkedIn, M.I.T., Stanford and Harvard Business School later analyzed aggregate data from the tests in a study published this month in the journal Science.
Many experts pointed out that the findings suggest some users had meaningfully better access to job opportunities than others. This created negative PR for LinkedIn. Many people also objected to the experiment because they felt it was another case of a large social platform not gaining informed consent from users before experimenting on them.
An experiment that may negatively impact some users needs to be thought through, as it can create a PR disaster.
To summarise, it is important to understand the limitations of a/b tests and the situations where they aren't advisable. Experiments have a cost in time and resources, and sometimes the benefits don't outweigh the costs.
Another update I wanted to share: I will be doing a free Zoom event on 'ChatGPT for Product Managers' on 8th July. 500+ folks have already registered. You can register for the event on LinkedIn.
That's all for this post. Have a good day,
Deepak