Create Retention Curves
Retention curves (opens in a new tab) allow you to see how many users are still active at a given point in time and compare different cohorts of users to see how well your product retains engagement. This article doesn't dedicate too much time to how to analyze retention curves since there is plenty of content dedicated to this out there.
The approach is very similar to the general approach to data modeling: you'll prepare a set of Measures that define the percent of users that are still active in a month and add attributes that help you understand the variation. See the article Data Modeling Best Practices
For retention curves, it's oftentimes useful to add experiments as attributes: did you try something in the product to retain users more effectively? Maybe a new onboarding flow, a new user flow?
For this walk-through, we're going to look at the user engagement on the tech news aggregation website Hacker News. The dataset is available on BigQuery's public datasets and is kept up to date.
As with all tools you build in Hashboard, retention curves will be as powerful as the data model you produce for them. For large user tables, it probably makes sense to prepare this data in your data warehouse, but it's also possible to calculate retention curves on the fly using a SQL-based data model. See Add Data Model
To prepare our curves, we are going to track whether a user is active or not in any given month. So that all of our users are being compared fairly, we'll also track the number of months since a user signed up - this will be the x-axis of our retention curve.
You should be targeting a data model that looks something like the following table:
|userid||user signup month||activity month||months since signup||is_active|
|carlos@hashboard||August 1, 2020||August 1, 2020||0||1|
|carlos@hashboard||August 1, 2020||September 1, 2020||1||1|
|carlos@hashboard||August 1, 2020||October 1, 2020||2||0|
|carlos@hashboard||August 1, 2020||November 1, 2020||3||0|
|nathaniel@hashboard||July 1, 2020||July 1, 2020||0||1|
|nathaniel@hashboard||July 1, 2020||August 1, 2020||1||1|
|nathaniel@hashboard||July 1, 2020||September 1, 2020||2||1|
|nathaniel@hashboard||July 1, 2020||October 1, 2020||3||1|
|nathaniel@hashboard||July 1, 2020||November 1, 2020||4||1|
For the Hacker News dataset the query looks something like this (using BigQuery syntax):
with all_months as ( select distinct timestamp_trunc(timestamp, month) as month from `bigquery-public-data.hacker_news.full` ), active_months as ( select distinct `by` as userid , timestamp_trunc(timestamp, month) as month from`bigquery-public-data.hacker_news.full` ), users as ( select `by` as userid , min(timestamp_trunc(timestamp, month)) as first_month , max(timestamp_trunc(timestamp, month)) as last_month from `bigquery-public-data.hacker_news.full` group BY userid ) select u.userid , m.month , first_month as cohort_month , cast(first_month as string) as cohort_month_str , date_diff(DATE(m.month), DATE(u.first_month), MONTH) as month_number -- could add cohort -- , experiment_cohort as experiment , if(m.month <= u.last_month, 1.0, 0.0) as unbounded_active , if(a.month IS NOT NULL, 1.0, 0.0) as active from users as u cross join all_months as m left join active_months as a on a.userid = u.userid and a.month = m.month where m.month >= u.first_month ;
A few notes on the resulting columns:
- userid - self-explanatory, this is string representing our user
- month - this is a date representation of the month in which an activity happens
- cohort_month - this is the month a user joined the site
- cohort_month_str - this is a bit of a workaround: you will cast your cohort month to a string so you can use it as an attribute and cut by it in Hashboard
- month_number - this is the number of months since a user joined the platform, this will be the x axis of our curve
- experiment - this is a randomly assigned experiment for each user. Note that in this column and for this model, each user can only be assigned to a single experimental cohort.
- active - either 1 or 0 for each user in a month. A value of 1 means the user was active in that month and zero means inactive.
- unbounded_active - either 1 or 0 for each user in a month. A value of 1 means that a user was either active in that month or in any month after. This will help us answer the question "of users in this month, what percent of users abandoned us and have not returned since"
From the data exploration page, click Add Data Model
Either select your prepared retention table from your data warehouse or click
SQL Queryand enter a SQL query similar to the one above as the basis for your model. Click
Set up your data model in Hashboard:
Make sure that
monthis selected as your primary date, by selecting the gear next to it.
month_numberas attributes along with any experimental cohort attributes you prepared in your data.
Add two columns as Measures:
- Unbounded Active Retention - is defined as the
- Active Retention - is defined as the
- Active Users (Optional) - if you want to create a retention pivot, you should also add a measure for unique active users in a month:
COUNT(DISTINCT IF(active = 1, userid, null))
- Unbounded Active Retention - is defined as the
- Change the chart type to a
month_numberas the x-axis variable
Unbounded Active Retentionas your measure
- Select some
cohort_month_stras breakouts to see how different cohort months retain users in your product. Alternatively, you can filter for a specific cohort month and breakout by an experimental variable
Note that not breaking out (or filtering) by
cohort_month_str has some
implications for your analysis. This is because not every cohort month has an
equal number of future months. To correct for this, it's possible to use a
Kaplan-Meier estimator which will be left as a separate exercise.
- Make sure you add the Active Users measure specified above.
- Select a
pivot tableas your visualization type from the chart dropdown on the top left.
cohortas your rows.
month_numberas your columns.
- Click into the
controltab on the right-hand side and make sure you sort rows by
cohortdescending and sort the columns by
- As the measure, select
Active Usersas the calculation, select
Percentage of Row Maxand finally as the Color scale, select
Percentage Color Scale.