This post shows how I analyzed a MailChimp email list, including all the data from weekly newsletters for 2014, using MySQL. It was inspired by this excellent tutorial: Performing cohort analysis by Micheal Herman. The email list I’m working with consists of about 18,000 subscribers.
I wanted to answer questions such as:
- How many emails do active subscribers open on average?
- How active are the subset of users who bought a product during the digital flash sale in March of this year?
- Who hasn’t opened an email at all in 2014?
I’m using MySQL so I can go beyond the standard MailChimp reporting dashboard and answer very specific questions. There is too much data for Excel so MySQL is the perfect tool for this analysis.
Editor’s note: this post is intended to give a flavour of this analysis, rather than a comprehensive step-by-step walk through. I’m still learning MySQL and MailChimp, but I know enough to do some useful analysis. If you have any questions leave a comment or shoot me an email.
Getting the raw data from MailChimp
To answer the questions posed above, I needed data. Lots of it.
From MailChimp, I downloaded the current subscriber list, the segment who purchased a product in 2014, and lists of all subscribers who opened at least one of the emails from each newsletter campaign in 2014. This required a fair bit of digging around in MailChimp and lots of downloads!
I downloaded the lists as CSV files, which I could open in Excel and clean up before pulling into MySQL.
MailChimp has written about how to download an email list or segment in this article, so I won’t re-invent the wheel here.
However, I’ll show you how to find the list of subscribers who opened a particular campaign, as this took a little bit more digging. It’s in the Reports section, behind the Opened number, shown in red in the following diagram (you can then export the list):
After downloading I opened the CSV files in Excel to check and clean the data. I had to remove comma’s from people’s names, as these were causing issues when importing the CSV file. For example, there were a small number of names in the data written as Joe Bloggs, MD or Joe Bloggs, III. Since the comma delineates the columns in a CSV file, these names were being split in two on import into MySQL.
Also, the dates needed some formatting so that MySQL would recognize them as dates automatically when I imported them. Initially, the dates in the CSV files were formatted as 7/25/2014 for example, which then kept appearing as 0000-00-00 00:00:00 when imported into MySQL.
After several futile attempts to re-format the date field in MySQL, I realized it would be much easier to simply re-format the dates in Excel so they matched the SQL format: yyyy-mm-dd. (I realized this when reading this Stack Overflow post.)
I also deleted the columns of data I didn’t require and added an email_id column — a unique, sequential number for each email address that would act as my primary key in MySQL for the subscriber list. Using an Excel vlookup formula against the email addresses, I transferred this into the campaign and buyers tables, to act as the foreign key.
Setting up the database and importing the data
I needed the following tables in the database:
1. Current active subscribers
2. Campaign history data (list of subscribers who opened emails in 2014)
3. Buyers — the segment of subscribers who purchased a product
First, I needed to create the database in MySQL, which I called email_database. I created it using the following code:
create database email_database
Next I added three tables: the email_list table (current subscribers), the campaign table (campaign details) and the buyers table (people who purchased a product). The MySQL code to create the current subscribers table was as follows:
email_id int(11) primary key,
email_name varchar(50) not null,
email_add varchar(150) not null,
sub_date datetime not null
The tables for campaign and buyers were similar.
At this stage, it’s a good idea to check that these worked as expected. This was done using show tables to see the tables listed, and then with describe email_list to see the details for each table in turn (in this example, the email_list table).
Then I imported the data in the CSV files into MySQL. For example to import the CSV containing the campaign data (called upload.csv) into the campaign table, I used the following code:
INTO TABLE campaign
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’
IGNORE 1 ROWS
This code tells MySQL that the table fields are separated by commas, and that a carriage return in the CSV file (the ‘\n’) denotes a new row in the MySQL table. The last line of this code tells MySQL to ignore the header row in the CSV file when it’s being imported.
Reviewing individual newsletter campaigns
Approximately 3,000 to 4,000 people from this list open a newsletter. I wanted to review the composition of these subscribers and work out whether they were mostly new sign-ups or long-time readers.
As the email list has grown significantly this year, the results are skewed towards the newer subscribers, who are more likely to still be opening emails.
The MySQL code puts the subscribers into cohorts based on which month they signed up and then relates that to a specific newsletter campaign to see the breakdown, as follows:
count(email_list.email_add) as subs
join campaign on email_list.email_id = campaign.email_id
group by cohort
order by cohort
The output from MySQL is a table showing the campaign date (campaign_date in the table) versus subscriber sign-up months (cohort in the table), with a count of how many subscribers are in each bucket (subs in the table), as follows:
We can then chart this in Excel. For example, let’s look at the four newsletter campaigns from August. The chart shows which months the subscribers signed up in.
Whilst it’s interesting to see this breakdown, I don’t feel it gives us anything really concrete to act on. So let’s keep digging!
What’s the average number of emails active subscribers have opened in 2014
This is a fairly easy question to answer since we have all the campaign data for 2014 available. The query is as follows:
Recall the campaign list consists of everyone who opened a newsletter in 2014, i.e. the 1,000 email addresses who opened the first newsletter on 1/7/2014, the 1,300 who opened the second newsletter on 1/14/2014, etc. There are a lot of duplicates in there obviously (representing people who have opened multiple newsletters).
Running this query counts the total number of emails opened in 2014 and divides it by the number of unique email addresses in the campaign table, to calculate how many emails are opened per email address on average in 2014. (Everyone on the campaign list opened at least one email.)
The result: active subscribers opened on average 9 emails each in 2014.
Analysis of subscribers who haven’t opened an email at all in 2014
Let’s take a look at the subscribers who haven’t opened an email at all in 2014. First off, I put them in cohorts by sign up month to see when they signed up and if there was any discernible trend:
count(email_id) as subs
where email_id not in (
group by cohort
I ran this query, copied the results table into Excel and then represented it graphically as follows:
It’s interesting to see this visually — there are lots of people signed up, even in recent months, who haven’t opened an email. So what if we want to find those people’s email addresses and try to engage them? Or perhaps remove them from the list?
To get this list of emails is pretty simple: fetch all of the subscribers whose email addresses do not appear in the campaign table, using the following query:
where email_id not in (
This is a useful list. We can potentially email these people and ask them if they still want to be subscribers. Perhaps make them opt-back in, to re-engage them. If they’re really not interested then we can remove them from the list, thereby saving money by not having inactive users on our list (MailChimp charges based on how many users).
Analysis of the product buyers segment
There were 374 email addresses on the product buyers list. I wanted to see how many opened the weekly newsletter and when, and also create a list of buyers from most-engaged to least-engaged (in terms of how many newsletters they open).
I checked how many of the buyers had opened an email in 2014, using the code:
where email_id in
select email_id from campaign
The answer? 332.
So, 42 of the 374 buyers haven’t opened an email in 2014. This sounds like a good group to reach out to with a personal email.
To find the email addresses of these 42, I modify the previous script to return email addresses, rather than counting them. I also need to choose emails that are not in the 2014 campaign data, as follows:
where email_id not in
select email_id from campaign
This gives me a list of 42 email addresses.
What about how many product buyers open the weekly newsletter?
I used the following code to group the campaigns into cohorts, and the subset of those where their email is also in the buyers table:
count(email_id) as buyers_count
where email_id in
select email_id from buyers
group by cohort
The results of this analysis is a table as follows:
which we can then represent graphically in Excel as follows:
So over time, the trend is for slightly fewer of the buyers opening emails. (The product sale was a 3-day flash sale which occurred in March, hence the big jump shown in the chart for that month.)
So let’s find the least engaged ones, find out when they last opened an email in 2014 and then rank them in order from least-engaged to most-engaged.
select buyers.email_id, buyers.email_add, date_format((campaign_date), ‘%Y/%m/%d’) as recent_email
join campaign on buyers.email_id = campaign.email_id
group by buyers.email_id
order by recent_email
This gives the following output (email addresses grayed out and showing only the first 7), where recent_email in the table represents the date of the last newsletter email they opened:
From this list, we can reach out to these users, maybe with a special offer to re-engage them.
This is really the tip of the iceberg. Using MySQL allows you to really dig through the data and uncover insights. If you’ve used SQL to analyze email or other marketing data, or have any thoughts on what I’ve written, please share your experience in the comments.