How to experiment in the world of AI

Written on 11/30/2024
Bandan Jot Singh

In the evolving world of AI and Generative AI, the role of the PM becomes even more vital in bridging technical innovation with business goals through experimentation.

Many Product Managers and leaders are facing the shift from traditional product management (where the methodologies of experimentation were pretty standard and well developed) to building AI-enabled or AI-first products, where the same rules of experimentation no longer apply.

Today, we take a deep dive into how to approach experimentation in the emerging world of traditional AI and GenAI.




Fundamental shifts in approaching experimentation

So far on Productify, I have written about how companies like Booking.com and Netflix approach experimentation, but the primary focus was on how these companies formed a hypothesis and validated it in controlled environments, which is characteristic of the traditional experimentation approach:

ideation → prototyping → testing → feedback → iteration

In the traditional sense, you as a Product Manager would form a null hypothesis (H0) that there is no difference between the control and experimental groups and an alternative hypothesis (H1) suggesting that a difference exists.

And then you would roll out an A/B test (or multivariate) presenting the two versions (A and B) to different user segments (randomly distributed to ensure unbiased results).

Eventually, over a period of time and given enough traffic, you would collect quantitative data on your primary and secondary metrics to determine whether the differences between A and B are statistically significant, with a p-value low enough to rule out that the results are happening purely by chance.
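To make this concrete, here is a minimal sketch (in Python, with made-up conversion counts) of the kind of significance check your data team might run on a two-variant test, using a two-proportion z-test:

```python
# Minimal sketch: checking whether B's conversion rate differs from A's.
# The counts are illustrative; your analytics pipeline supplies the real numbers.
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]        # conversions in A (control) and B (variant)
exposures = [10_000, 10_000]    # users exposed to each variant

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# A low p-value (commonly < 0.05) suggests the difference between A and B
# is unlikely to be due to chance alone.
if p_value < 0.05:
    print("Difference is statistically significant.")
else:
    print("Not enough evidence that B differs from A.")
```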

Only once you have results from one test can you learn from them and move on to the next. This approach can lead to longer timelines, as each variant must be tested independently before making decisions.

However, in the world of AI, three things change majorly for you as a Product Manager:


First, your circle of collaboration changes. You still collaborate with designers and developers, but you also increasingly involve data scientists and machine learning engineers due to the technical complexity of AI models.

Second, instead of just two variants, AI experiments can involve multiple models or multiple configurations of those models (at the parameter level or the training-set level) tested simultaneously (multivariate testing).

This is useful for exploring various algorithms, hyperparameters, or feature sets concurrently.
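As an illustration, here is a minimal sketch of comparing several candidate configurations offline in one go, assuming a scikit-learn-style setup; the load_click_features() helper is hypothetical and stands in for your own data pipeline:

```python
# Minimal sketch: comparing several model configurations offline before
# any of them is exposed to users. Data loading is hypothetical.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_click_features()  # hypothetical: features and click labels

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}

# Every combination is evaluated with cross-validation; n_jobs=-1 runs them in parallel.
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print("Best configuration:", search.best_params_, "AUC:", round(search.best_score_, 3))
```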

Third, experiments in AI or with AI are more iterative than sequential. PMs can run multiple experiments simultaneously (e.g., testing different algorithms or hyperparameters) and refine models based on ongoing feedback.

Continuous evaluation is possible, allowing for rapid adjustments to models based on real-time data rather than waiting for a complete experiment cycle to conclude, as in the traditional approach.
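A rough sketch of what such continuous evaluation could look like; fetch_recent_events() and retrain_model() are hypothetical placeholders for your own pipeline, and the alert threshold is purely illustrative:

```python
# Minimal sketch: continuous evaluation over a rolling window of live traffic.
# fetch_recent_events() and retrain_model() are hypothetical placeholders.
import time

CTR_FLOOR = 0.02  # assumed alert threshold for this sketch

while True:
    events = fetch_recent_events(window_minutes=60)  # impressions with click flags
    ctr = sum(e["clicked"] for e in events) / max(len(events), 1)
    print(f"rolling CTR over the last hour: {ctr:.3%}")

    if ctr < CTR_FLOOR:
        # React between full experiment cycles: adjust, retrain, or roll back.
        retrain_model()

    time.sleep(15 * 60)  # re-evaluate every 15 minutes
```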

Note: So far, we have only talked about differences between traditional and AI-enabled experiments. We are not yet differentiating between AI and GenAI; that comes later.

You need new ways to measure the success of an experiment when testing AI-enabled features

One aspect of product management stays consistent regardless of which technologies arrive: the customer problem always remains the key thing to be addressed. You cannot band-aid AI onto a customer problem. You still have to measure whether the customer is happier using your product after you apply an incremental change (AI-enabled or not). Hence, knowing your customers well is always the first step.

Now, specifically when setting up experiments, the metrics used to evaluate success differ significantly between traditional product management and AI/Generative AI (GenAI) contexts.

While the ultimate goal in both cases remains focused on user metrics, for example:

  • User Engagement: Time spent on the feature.

  • Conversion Rates: Percentage of users completing desired actions.

  • Customer Satisfaction: Surveys or Net Promoter Score (NPS).

the methodologies and specific metrics tracked in AI experiments can vary widely.

Let us understand this with some context. Say you're a PM responsible for the recommendation system on an e-commerce platform.

And you find out that, given the customer and business needs, you need to improve the metrics below (computed in the sketch that follows the list):

  • Click-Through Rate (CTR) - Primary Metric

    • Measures how often users click on recommended items compared to how often those items are shown.

  • Conversion Rate (CR) - Secondary Metric

    • The percentage of users who take a desired action (e.g., making a purchase) after interacting with recommended items.

  • Revenue per User (RPU) - Secondary Metric

    • Measures the average revenue generated per user as a result of recommendations.

  • Engagement Metrics - Secondary Metric

    • Metrics such as time spent on site, session length, and retention rate measure user interaction with the platform post-recommendation.

  • Satisfaction Metrics - Secondary Metric

    • User satisfaction can be gauged through surveys, Net Promoter Score (NPS), or Customer Satisfaction Score (CSAT), which assess how well recommendations meet user needs.
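As a rough illustration, here is how these metrics might be computed from a flat event log; the file name and column names (user_id, shown, clicked, purchased, revenue) are assumptions made for the sketch, not a prescribed schema:

```python
# Minimal sketch: computing the metrics above from a flat event log.
# Column names (user_id, shown, clicked, purchased, revenue) are assumptions.
import pandas as pd

events = pd.read_csv("recommendation_events.csv")  # hypothetical export

ctr = events["clicked"].sum() / events["shown"].sum()             # primary metric
conversion_rate = events["purchased"].sum() / events["clicked"].sum()
revenue_per_user = events.groupby("user_id")["revenue"].sum().mean()

print(f"CTR: {ctr:.2%}")
print(f"Conversion rate: {conversion_rate:.2%}")
print(f"Revenue per user: {revenue_per_user:.2f}")
```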

Now, you would work with your data scientists and machine learning engineers to evaluate the effectiveness of different AI models and configurations for showing recommended items. To do that, it's essential to track a variety of metrics that reflect both user engagement and the performance of the recommendation algorithms.

PMs act as translators between technical data science concepts and business needs. They ensure that the data science team understands the business context and goals.

Further, PMs help identify potential risks associated with deploying AI models, including ethical considerations and model bias, ensuring that these issues are addressed proactively.

Given that the primary user-centric metric is the Click-Through Rate (CTR), i.e., how often recommended items are clicked relative to how often they are shown, here's a detailed breakdown of the internal metrics to track for the effectiveness of different AI models:

(Let us say users see the top K items in their recommendations; a quick sketch of computing these follows the list.)

  • Precision: Measures how many of the top K recommended items were actually relevant to users.

  • Recall: Assesses how many relevant items were included in the top K recommendations.

  • F1 Score: Balances precision and recall for a comprehensive view of recommendation quality.
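For a single user, these can be computed directly from the items the model recommended and the set of items the user actually found relevant; the item IDs below are made up for illustration:

```python
# Minimal sketch: precision@K, recall@K, and F1 for one user.
# The recommended and relevant item IDs are illustrative.
recommended = ["sku_12", "sku_7", "sku_33", "sku_2", "sku_91"]  # top K = 5
relevant = {"sku_7", "sku_2", "sku_40"}  # items the user actually engaged with

k = len(recommended)
hits = sum(1 for item in recommended if item in relevant)

precision_at_k = hits / k            # how much of the top K was relevant
recall_at_k = hits / len(relevant)   # how much of what was relevant made the top K
f1 = (2 * precision_at_k * recall_at_k / (precision_at_k + recall_at_k)
      if (precision_at_k + recall_at_k) else 0.0)

print(precision_at_k, recall_at_k, round(f1, 3))
```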

Hence, you as a PM would work with data teams to assess different AI models and configurations for recommending items on an e-commerce platform. It is vital to track a comprehensive set of metrics that includes both user-centric measures like CTR and conversion rates and internal model-performance metrics like precision and recall.

By balancing these perspectives through continuous monitoring and iterative improvement, product managers can enhance recommendation systems that not only drive clicks but also improve overall customer satisfaction and revenue generation.

But, how do GenAI experiments differ from AI experiments?

In our recommendation-system example above, we focused on predicting user preferences based on historical data and user interactions. The output was typically a ranked list of recommended items tailored to individual users.

However, GenAI experiments involve creating new content or responses based on input prompts. The output can be text, images, or other media generated dynamically.

For example, on the same e-commerce platform, a generative AI chatbot that creates responses to user queries by synthesizing information from a vast knowledge base is a good use case for GenAI-enabled feature experimentation.

So let us see how the setup of a GenAI-enabled feature would differ. Imagine you are the Product Manager supporting a customer service platform that wants to implement a generative AI chatbot to help users find information quickly. You would typically go through the steps below (a small feature-gate sketch follows them):

  • Along with data teams, you help develop the chatbot using a generative model like GPT-4 and implement it in a controlled environment first (internal testing).

  • Use feature gates to gradually roll out the chatbot to select groups of users while monitoring engagement and satisfaction metrics.

  • Collect real-time feedback from users interacting with the chatbot and iterate on prompts or model parameters based on this feedback.
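As an illustration of the feature-gate step, here is a minimal sketch of a deterministic gate that exposes the chatbot to a gradually increasing share of users; the hashing scheme and rollout percentage are assumptions, not a reference to any specific feature-flag product:

```python
# Minimal sketch: a deterministic feature gate for a gradual chatbot rollout.
# ROLLOUT_PERCENT would normally live in a config or feature-flag service.
import hashlib

ROLLOUT_PERCENT = 10  # start with ~10% of users, increase as metrics hold up

def chatbot_enabled(user_id: str) -> bool:
    # Hash the user ID so the same user always gets the same experience.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT

print(chatbot_enabled("user_4211"))  # True for roughly 10% of user IDs
```

Because the bucket is derived from the user ID, the same user keeps the same experience across sessions, which keeps the exposed and unexposed groups stable while you monitor engagement and satisfaction.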

As in the case of recommendation systems, your primary metrics remain user focused, but for GenAI experiments with a chatbot you would additionally track the following (see the small sketch after this list):

  • User Satisfaction Scores: Collected through surveys or feedback mechanisms assessing how well generated content meets user needs.

  • Engagement Metrics: Time spent interacting with generated content or the number of follow-up questions asked by users.

  • Content Relevance: Evaluated through qualitative assessments of generated outputs against user expectations.
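A minimal sketch of aggregating such feedback into trackable numbers; the structure of each feedback record is an assumption made for illustration:

```python
# Minimal sketch: aggregating chatbot feedback into the metrics above.
# The shape of each feedback record is assumed for illustration.
feedback = [
    {"satisfaction": 4, "follow_up_questions": 1, "marked_relevant": True},
    {"satisfaction": 2, "follow_up_questions": 3, "marked_relevant": False},
    {"satisfaction": 5, "follow_up_questions": 0, "marked_relevant": True},
]

n = len(feedback)
avg_satisfaction = sum(f["satisfaction"] for f in feedback) / n
avg_follow_ups = sum(f["follow_up_questions"] for f in feedback) / n
relevance_rate = sum(f["marked_relevant"] for f in feedback) / n

print(f"avg satisfaction: {avg_satisfaction:.1f}/5")
print(f"avg follow-up questions: {avg_follow_ups:.1f}")
print(f"relevance rate: {relevance_rate:.0%}")
```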

While traditional AI experiments rely heavily on metrics like accuracy, precision, and F1 score to evaluate model performance, GenAI experiments prioritize user-centric metrics that assess the quality and relevance of generated content. This distinction reflects the different objectives and outputs associated with each type of experimentation.