Survival and retention analysis using MySQL

I’m working my way through this book, Data Analysis Using SQL and Excel, at the moment and chapter 6 is all about survival and retention modelling. I learn best when I can attack real life problems, so I took some of the lessons from this chapter and applied them to the MailChimp email data I already had in a MySQL database.

Previously I looked at setting up a database and doing some basic MySQL analysis on MailChimp data, as well as some more in-depth analysis such as creating a histogram of email subscribers.

This post takes it a step further by looking at subscriber behaviour within the email campaign data. What can we say about how long people remain active subscribers? For a subscriber who has been active for a given length of time, how likely are they to continue being active?

Learning more about the subscribers on the email list

My first step was to create a list of the subscribers who opened emails in 2014, how many they opened, when they first opened, last opened and if they unsubscribed, what date they did.

I created a table showing these fields and also added tenure (which I defined as the number of days between the first and last opened emails). For subscribers that were still active the unsubscribe date field was null.

The following query does this:

select c.email_add, email_count, first_email, last_email, email_tenure, unsub_date
from
(
     select email_add,
          count(*) as email_count,
          min(campaign_date) as first_email,
          max(campaign_date) as last_email,
          datediff(max(campaign_date),min(campaign_date)) as email_tenure
     from campaign
     group by email_add
) c
     left join
(
     select email_add, max(c_unsub_date) as unsub_date
     from campaign_unsubs
     group by email_add
) c_u
     on c.email_add = c_u.email_add
order by 5 desc

This query creates two intermediary tables, joins them together and then returns relevant data from this joined table.

The first table selects data from the campaign table, including a count of emails opened, first open date (using the min function) and last open date (using the max function) and calculates the difference between the two using the datediff function.

The second table selects email addresses and unsubscribe dates from the unsubscribers table.

The two tables are joined using a left join where the email addresses in both tables match up. This ensures all the records from the first table are included, not just those with an unsubscribe date in the second table. For subscribers in table one that do not appear in table two, they will have a null value in the unsubscribe date column.

The output of this query is a table as follows:

Email table output
Email table output

This shows, for example, the email address entry in the first line has opened 41 emails so far in 2014 (woohoo!), opened their first email on 1/7/2014 and opened their most recent email on 11/13/2014 (my data is now a couple of weeks old). This results in an email tenure, or active period, of 310 days. The null in the final column, unsub_date, shows that this subscriber has not unsubscribed, i.e. they are still active.

Creating a histogram by tenure (number of days between first opened and last opened emails)

The following query will group the data from the previous table by tenure. In other words, it counts how many subscribers have identical number of days between their first opened and last opened emails (i.e. identical tenures). I also added a count of how many unsubscribes there were for each tenure value.

The code to do this is as follows:

select email_tenure, count(*) as num_count,
     sum(case when email_unsub is not null then 1 else 0 end) as count_unsub
from
(
     select c.email_add, email_count, first_email, last_email, email_tenure, email_unsub
     from
     (
     select email_add,
          count(*) as email_count,
          min(campaign_date) as first_email,
          max(campaign_date) as last_email,
          datediff(max(campaign_date),min(campaign_date)) as email_tenure
     from campaign
     group by email_add
     ) c
          left join
     (
          select email_add, max(c_unsub_date) as email_unsub
          from campaign_unsubs
          group by email_add
     ) c_u
          on c.email_add = c_u.email_add
) s
group by email_tenure

The inner table is simply the table from the previous query. The sum(case when . . . is not null then 1 else 0 end) expression counts how many times the unsubscribe date column is not null for each given tenure. I added a group-by-tenure clause as the last line to get the data into tenure “buckets”.

The output of this query is as follows:

Tenure histogram
Tenure histogram

The first row shows there were 2,309 subscribers whose first recorded and last recorded email open events were the same day. This represents people who have only opened one email through the whole of 2014. It probably includes a number of new subscribers who are opening the most recent email for the first time. The first line also shows that 28 of these people aso unsubscribed after opening their first email.

Survival analysis of this tenure histogram

Survival analysis is concerned with the time-to-some-event, in this case unsubscribing from the newsletter.

Next step was to copy the MySQL table above into Excel (into columns A, B and C) so I could run some further calculations and create a survival curve for email subscribers.

I used the following Excel formulas in columns D, E and F, as follows:

To get a cumulative sum of subscribers, I added the following formula in column D: =SUM($B2:$B$128)

In column E, I calculate the hazard ratio, which is the number of unsubscribes divided by the cumulative population, using the formula: =C2/D2

In column F, the survival is then calculated as one minus the hazard, multiplied by the prior survival: =F2*(1-E2)

The Excel table then looks like this:

Survival Excel table
Survival Excel table

I then used the Excel charting function to create a survival chart, plotting the survival, s, against tenure:

Survival Curve 100 scale
Survival Curve on 1 to 100% scale

This is kinda difficult to see, so I cut the y-axis to have a minimum value of 86%, which makes it easier to read % survival values for given tenures.

Survival Curve
Survival Curve

This shows that the probability of an email subscriber being active for 300+ days, i.e. still be opening emails, is around 91%. This sounds incredibly high to me, so I have to caveat this analysis because of the underlying data that I’ve used. This analysis was done using the campaign data, which consists of those who opened at least one email in 2014, hence are necessarily the more active group of subscribers.

Retention analysis of the email list

Another way to understand the subscriber dynamics is to look at retention, i.e. for subscribers who opened their first email X days ago, how many are still active (have not unsubscribed).

I do this by putting subscribers into cohorts based on how many days ago they opened their first email (called days_ago in the code below), then counting how many are still active (the numactives in the code below) and what proportion are still active (the retention in the code below).

I used the following query to calculate retention:

select -datediff(first_email,’2014-11-13′) as days_ago, count(*) as numstarts,
     sum(case when email_unsub is null then 1 else 0 end) as numactives,
     avg(case when email_unsub is null then 1 else 0 end) as retention
from
(
select c.email_add, email_count, first_email, last_email, tenure, email_unsub
from
(
     select email_add,
          count(*) as email_count,
          min(campaign_date) as first_email,
          max(campaign_date) as last_email,
          datediff(max(campaign_date),min(campaign_date)) as tenure
     from campaign
     group by email_add
) c
     left join
(
     select email_add, max(c_unsub_date) as email_unsub
     from campaign_unsubs
     group by email_add
) c_u
          on c.email_add = c_u.email_add
) s
group by days_ago
order by 1

The output is a MySQL table as follows:

Retention MySQL table
Retention MySQL table

which I copied into Excel to create a retention curve. The data was exactly the same as the MySQL table, as follows:

Retention Excel table
Retention Excel table

I then plotted these retention values in Excel:

Retention curve
Retention curve

You notice certain things: it starts at 100% as per the Survival curve, it generally decreases over time (we’ve retained fewer older subscribers) but it’s jagged, unlike the survival curve. This represents the different age cohorts of subscribers behaving differently, or rather, having different qualities. In other words, certain age cohorts have worse retention rates than others (compare the troughs with the spikes).

Combining the Survival and Retention curves

As a final touch, I combined the survival and retention curves on to the same chart:

Survival and Retention curve
Survival and Retention curves

Leave a Reply

Your email address will not be published. Required fields are marked *