Create Retention Curves

Retention curves (opens in a new tab) allow you to see how many users are still active at a given point in time and compare different cohorts of users to see how well your product retains engagement. This article doesn't dedicate too much time to how to analyze retention curves since there is plenty of content dedicated to this out there.

The approach is very similar to the general approach to data modeling: you'll prepare a set of Measures that define the percent of users that are still active in a month and add attributes that help you understand the variation. See the article Data Modeling Best Practices

For retention curves, it's oftentimes useful to add experiments as attributes: did you try something in the product to retain users more effectively? Maybe a new onboarding flow, a new user flow?

For this walk-through, we're going to look at the user engagement on the tech news aggregation website Hacker News. The dataset is available on BigQuery's public datasets and is kept up to date.

Prepare curves in your data warehouse

As with all tools you build in Hashboard, retention curves will be as powerful as the data model you produce for them. For large user tables, it probably makes sense to prepare this data in your data warehouse, but it's also possible to calculate retention curves on the fly using a SQL-based data model. See Add Data Model

To prepare our curves, we are going to track whether a user is active or not in any given month. So that all of our users are being compared fairly, we'll also track the number of months since a user signed up - this will be the x-axis of our retention curve.

You should be targeting a data model that looks something like the following table:

userid	user signup month	activity month	months since signup	is_active
carlos@hashboard	August 1, 2020	August 1, 2020	0	1
carlos@hashboard	August 1, 2020	September 1, 2020	1	1
carlos@hashboard	August 1, 2020	October 1, 2020	2	0
carlos@hashboard	August 1, 2020	November 1, 2020	3	0
nathaniel@hashboard	July 1, 2020	July 1, 2020	0	1
nathaniel@hashboard	July 1, 2020	August 1, 2020	1	1
nathaniel@hashboard	July 1, 2020	September 1, 2020	2	1
nathaniel@hashboard	July 1, 2020	October 1, 2020	3	1
nathaniel@hashboard	July 1, 2020	November 1, 2020	4	1

For the Hacker News dataset the query looks something like this (using BigQuery syntax):

with all_months as (
  select
    distinct timestamp_trunc(timestamp, month) as month
  from `bigquery-public-data.hacker_news.full`
),
active_months as (
  select
    distinct `by` as userid
    , timestamp_trunc(timestamp, month) as month
  from`bigquery-public-data.hacker_news.full`
),
users as (
  select
    `by` as userid
    , min(timestamp_trunc(timestamp, month)) as first_month
    , max(timestamp_trunc(timestamp, month)) as last_month
  from `bigquery-public-data.hacker_news.full`
  group BY userid
)
select
  u.userid
  , m.month
  , first_month as cohort_month
  , cast(first_month as string) as cohort_month_str
  , date_diff(DATE(m.month), DATE(u.first_month), MONTH) as month_number
-- could add cohort
--  , experiment_cohort as experiment
  , if(m.month <= u.last_month, 1.0, 0.0) as unbounded_active
  , if(a.month IS NOT NULL, 1.0, 0.0) as active
from users as u
  cross join all_months as m
  left join active_months as a
    on a.userid = u.userid and a.month = m.month
  where m.month >= u.first_month
;

A few notes on the resulting columns:

userid - self-explanatory, this is string representing our user
month - this is a date representation of the month in which an activity happens
cohort_month - this is the month a user joined the site
cohort_month_str - this is a bit of a workaround: you will cast your cohort month to a string so you can use it as an attribute and cut by it in Hashboard
month_number - this is the number of months since a user joined the platform, this will be the x axis of our curve
experiment - this is a randomly assigned experiment for each user. Note that in this column and for this model, each user can only be assigned to a single experimental cohort.
active - either 1 or 0 for each user in a month. A value of 1 means the user was active in that month and zero means inactive.
unbounded_active - either 1 or 0 for each user in a month. A value of 1 means that a user was either active in that month or in any month after. This will help us answer the question "of users in this month, what percent of users abandoned us and have not returned since"

Create the model in Hashboard

From the data exploration page, click Add Data Model
Either select your prepared retention table from your data warehouse or click SQL Query and enter a SQL query similar to the one above as the basis for your model. Click Create Model
Set up your data model in Hashboard:
- Make sure that month is selected as your primary date, by selecting the gear next to it.
- Add cohort_month_str, month_number as attributes along with any experimental cohort attributes you prepared in your data.
- Add two columns as Measures:
  - Unbounded Active Retention - is defined as the average of column unbounded_active
  - Active Retention - is defined as the average of column active
  - Active Users (Optional) - if you want to create a retention pivot, you should also add a measure for unique active users in a month:
```
COUNT(DISTINCT IF(active = 1, userid, null))
```
Click Publish

Create retention curves

Change the chart type to a line
Select month_number as the x-axis variable
Select Unbounded Active Retention as your measure
Select some cohort_month_str as breakouts to see how different cohort months retain users in your product. Alternatively, you can filter for a specific cohort month and breakout by an experimental variable

Note that not breaking out (or filtering) by cohort_month_str has some implications for your analysis. This is because not every cohort month has an equal number of future months. To correct for this, it's possible to use a Kaplan-Meier estimator which will be left as a separate exercise.

GraphQL API Prod and Staging Environments