Creating a histogram from MailChimp data using MySQL

This post digs deeper into the MailChimp data which I analyzed in my previous post. In this analysis, I’ve focused on the distribution of subscribers and what was the most recent date they opened an email.

Specifically, I wanted to answer:

  • What is the distribution of subscribers by campaign date that opened an email in 2014?
  • And what are their email addresses?

The way I approached this problem was to break it down into its constituent parts, tackle each of those and then build that back up into a single query.

Step 1: Find the email addresses that didn’t open an email at all in 2014

This is the first cohort. Subscribers who haven’t opened an email in 2014 at all. This means they won’t be in the campaign data table, so the query to identify this cohort is straightforward. First, I used a sub-query to identify those who are in the campaign data table. Then, in the outer query, I selected all subscribers who were not in the sub-query table.

The code is as follows:

select email_id, email_add, 0 as last_open
from email_list
where email_id not in
     (select email_id from campaign)

Running this query produces a list of email addresses for subscribers that are not not in the campaign data table.

Step 2: Count how many distinct emails there are in the campaign data table

This represents everyone who has opened at least one email through 2014. To do this we use the distinct operator, as follows:

select count(distinct email_add) from campaign

The output of this query is a single number, but it gives us a good starting point for the analysis of the campaign data.

Step 3: Create list of email addresses ordered by date they most recently opened an email

We can do this relatively easily by combining the max() function with a group by clause to find the maximum date for each email address, using the following code:

select email_id, email_add, max(campaign_date) as last_open
from campaign
group by email_id
order by 3

The result of this query is a list of email addresses and the most recent date that subscriber opened an email. The last line of the MySQL query orders the list by the date column, from oldest to newest, so the least active subscribers are at the top of the list.

Step 4: Combine the tables from Step 1 and Step 3 using the UNION operator

From Step 1 and Step 3 we have the two tables of email data that we are interested in. Now we can combine them into a single table using the UNION operator, which combines rows. It’s necessary to have the same columns of data in each table, hence why I added the 0 as last_open code in Step 1.

The code is as follows:

select email_id, email_add, 0 as last_open
from email_list
where email_id not in
     (select email_id from campaign)
union
     select email_id, email_add, max(campaign_date) as last_open
     from campaign
     group by email_id
order by 3

Running this query results in a list of email addresses with the date of their most recently opened email in 2014, or, if they haven’t opened an email at all in 2014, a 0.

Step 5: Creating the histogram in MySQL

First I looked at the campaign email for 2014 and determined the distribution there, grouped by date of the most recently opened campaign email. This is a count of how many people opened the email for each campaign, from the campaign data table.

This is done with the following code:

select last_open, count(*) from
(
     select email_id, email_add, max(campaign_date) as last_open
     from campaign
     group by email_id
) a
group by last_open
order by 1

This can then be charted in Excel as follows:

Histogram of 2014 email opens

Histogram of 2014 email opens

Unsurprisingly this shows many people on the list have opened the most recent campaign emails. This is good, it shows there is a large and highly engaged group, whose most recent email open was this November.

It’s still too granular however, so I grouped the campaign dates into months and then created a histogram showing the monthly distributions.

This is done easily by grouping together the campaign dates into cohorts, using the following piece of code in the first select statement:

extract(year_month from last_open) as cohort

and then grouping by cohort.

In Excel, the chart is then as follows:

Histogram of email opens in 2014 by month

Histogram of last email open in 2014 by month

This is definitely easier to read than the first chart.

Step 6: Bringing together the data into one histogram

We combine the two tables from Steps 1 and 5 to achieve this, using the UNION operator, as follows:

/* union of “No emails opened in 2014” and histogram for monthly opens in 2014 */
select “2013 last open” as cohort, count(email_add)
from email_list
where email_id not in
     (select email_id from campaign)
union
select extract(year_month from last_open) as cohort, count(*) from
(
     select email_id, email_add, max(campaign_date) as last_open
     from campaign
     group by email_id
) a
group by cohort
order by 1

From this, we can create the full MailChimp histogram showing which month users most recently opened an email, as follows:

Full email histogram

Full email histogram

So we can see exactly where the problem lies.

The roughly 3,000 emails on the list who haven’t opened an email this year need to be resolved, and either removed from the list or encouraged to engage with the brand again.Their email addresses were identified in steps 1 and 3 of this post.

Any questions or thoughts on the analysis? Please leave a comment below.

Leave a Reply

Your email address will not be published. Required fields are marked *