Deep dive into Mailchimp email data using MySQL

This post shows how I analyzed a MailChimp email list, including all the data from weekly newsletters for 2014, using MySQL. It was inspired by this excellent tutorial: Performing cohort analysis by Micheal Herman. The email list I’m working with consists of about 18,000 subscribers.

I wanted to answer questions such as:

  • How many emails do active subscribers open on average?
  • How active are the subset of users who bought a product during the digital flash sale in March of this year?
  • Who hasn’t opened an email at all in 2014?

I’m using MySQL so I can go beyond the standard MailChimp reporting dashboard and answer very specific questions. There is too much data for Excel so MySQL is the perfect tool for this analysis.

Editor’s note: this post is intended to give a flavour of this analysis, rather than a comprehensive step-by-step walk through. I’m still learning MySQL and MailChimp, but I know enough to do some useful analysis. If you have any questions leave a comment or shoot me an email.

Getting the raw data from MailChimp

To answer the questions posed above, I needed data. Lots of it.

From MailChimp, I downloaded the current subscriber list, the segment who purchased a product in 2014, and lists of all subscribers who opened at least one of the emails from each newsletter campaign in 2014. This required a fair bit of digging around in MailChimp and lots of downloads!

I downloaded the lists as CSV files, which I could open in Excel and clean up before pulling into MySQL.

MailChimp has written about how to download an email list or segment in this article, so I won’t re-invent the wheel here.

However, I’ll show you how to find the list of subscribers who opened a particular campaign, as this took a little bit more digging. It’s in the Reports section, behind the Opened number, shown in red in the following diagram (you can then export the list):

MailChimp list extract
MailChimp list extract

After downloading I opened the CSV files in Excel to check and clean the data. I had to remove comma’s from people’s names, as these were causing issues when importing the CSV file. For example, there were a small number of names in the data written as Joe Bloggs, MD or Joe Bloggs, III. Since the comma delineates the columns in a CSV file, these names were being split in two on import into MySQL.

Also, the dates needed some formatting so that MySQL would recognize them as dates automatically when I imported them. Initially, the dates in the CSV files were formatted as 7/25/2014 for example, which then kept appearing as 0000-00-00 00:00:00 when imported into MySQL.

After several futile attempts to re-format the date field in MySQL, I realized it would be much easier to simply re-format the dates in Excel so they matched the SQL format: yyyy-mm-dd. (I realized this when reading this Stack Overflow post.)

I also deleted the columns of data I didn’t require and added an email_id column — a unique, sequential number for each email address that would act as my primary key in MySQL for the subscriber list. Using an Excel vlookup formula against the email addresses, I transferred this into the campaign and buyers tables, to act as the foreign key.

Once the raw MailChimp CSV data was ready, I setup a relational database in MySQL.

Setting up the database and importing the data

I needed the following tables in the database:

1. Current active subscribers
2. Campaign history data (list of subscribers who opened emails in 2014)
3. Buyers — the segment of subscribers who purchased a product

First, I needed to create the database in MySQL, which I called email_database. I created it using the following code:

create database email_database

Next I added three tables: the email_list table (current subscribers), the campaign table (campaign details) and the buyers table (people who purchased a product). The MySQL code to create the current subscribers table was as follows:

create table email_list
email_id int(11) primary key,
email_name varchar(50) not null,
email_add varchar(150) not null,
sub_date datetime not null

The tables for campaign and buyers were similar.

At this stage, it’s a good idea to check that these worked as expected. This was done using show tables to see the tables listed, and then with describe email_list to see the details for each table in turn (in this example, the email_list table).

Then I imported the data in the CSV files into MySQL. For example to import the CSV containing the campaign data (called upload.csv) into the campaign table, I used the following code:

LOAD DATA LOCAL INFILE ‘C:/Users/Ben/Desktop/SQL/upload.csv’
INTO TABLE campaign

This code tells MySQL that the table fields are separated by commas, and that a carriage return in the CSV file (the ‘\n’) denotes a new row in the MySQL table. The last line of this code tells MySQL to ignore the header row in the CSV file when it’s being imported.

Reviewing individual newsletter campaigns

Approximately 3,000 to 4,000 people from this list open a newsletter. I wanted to review the composition of these subscribers and work out whether they were mostly new sign-ups or long-time readers.

As the email list has grown significantly this year, the results are skewed towards the newer subscribers, who are more likely to still be opening emails.

The MySQL code puts the subscribers into cohorts based on which month they signed up and then relates that to a specific newsletter campaign to see the breakdown, as follows:

select campaign_date, extract(year_month from sub_date) as cohort,
count(email_list.email_add) as subs
from email_list
join campaign on email_list.email_id = campaign.email_id
group by cohort
order by cohort

The output from MySQL is a table showing the campaign date (campaign_date in the table) versus subscriber sign-up months (cohort in the table), with a count of how many subscribers are in each bucket (subs in the table), as follows:

Campaign subscriber counts

We can then chart this in Excel. For example, let’s look at the four newsletter campaigns from August. The chart shows which months the subscribers signed up in.

August newsletter subscriber signups

Whilst it’s interesting to see this breakdown, I don’t feel it gives us anything really concrete to act on. So let’s keep digging!

What’s the average number of emails active subscribers have opened in 2014

This is a fairly easy question to answer since we have all the campaign data for 2014 available. The query is as follows:

select count(email_id)/count(distinct email_add) as average_emails_opened from campaign

Recall the campaign list consists of everyone who opened a newsletter in 2014, i.e. the 1,000 email addresses who opened the first newsletter on 1/7/2014, the 1,300 who opened the second newsletter on 1/14/2014, etc. There are a lot of duplicates in there obviously (representing people who have opened multiple newsletters).

Running this query counts the total number of emails opened in 2014 and divides it by the number of unique email addresses in the campaign table, to calculate how many emails are opened per email address on average in 2014. (Everyone on the campaign list opened at least one email.)

The result: active subscribers opened on average 9 emails each in 2014.

Analysis of subscribers who haven’t opened an email at all in 2014

Let’s take a look at the subscribers who haven’t opened an email at all in 2014. First off, I put them in cohorts by sign up month to see when they signed up and if there was any discernible trend:

select extract(year_month from sub_date) as cohort,
count(email_id) as subs
from email_list
where email_id not in (
select email_id
from campaign
group by cohort

I ran this query, copied the results table into Excel and then represented it graphically as follows:

Subscribers with no opens in 2014

It’s interesting to see this visually — there are lots of people signed up, even in recent months, who haven’t opened an email. So what if we want to find those people’s email addresses and try to engage them? Or perhaps remove them from the list?

To get this list of emails is pretty simple: fetch all of the subscribers whose email addresses do not appear in the campaign table, using the following query:

select email_id, email_add, sub_date
from email_list
where email_id not in (
select email_id
from campaign

This is a useful list. We can potentially email these people and ask them if they still want to be subscribers. Perhaps make them opt-back in, to re-engage them. If they’re really not interested then we can remove them from the list, thereby saving money by not having inactive users on our list (MailChimp charges based on how many users).

Analysis of the product buyers segment

There were 374 email addresses on the product buyers list. I wanted to see how many opened the weekly newsletter and when, and also create a list of buyers from most-engaged to least-engaged (in terms of how many newsletters they open).

I checked how many of the buyers had opened an email in 2014, using the code:

select count(buyers.email_add) from buyers
where email_id in
select email_id from campaign

The answer? 332.

So, 42 of the 374 buyers haven’t opened an email in 2014. This sounds like a good group to reach out to with a personal email.

To find the email addresses of these 42, I modify the previous script to return email addresses, rather than counting them. I also need to choose emails that are not in the 2014 campaign data, as follows:

select buyers.email_add from buyers
where email_id not in
select email_id from campaign

This gives me a list of 42 email addresses.

What about how many product buyers open the weekly newsletter?

I used the following code to group the campaigns into cohorts, and the subset of those where their email is also in the buyers table:

select campaign_date as cohort,
count(email_id) as buyers_count
from campaign
where email_id in
select email_id from buyers
group by cohort

The results of this analysis is a table as follows:

Buyer cohort analysis

which we can then represent graphically in Excel as follows:

Buyers cohort analysis chart

So over time, the trend is for slightly fewer of the buyers opening emails. (The product sale was a 3-day flash sale which occurred in March, hence the big jump shown in the chart for that month.)

So let’s find the least engaged ones, find out when they last opened an email in 2014 and then rank them in order from least-engaged to most-engaged.

/* buyers bundle most recent opened email, then rank from least active */
select buyers.email_id, buyers.email_add, date_format((campaign_date), ‘%Y/%m/%d’) as recent_email
from buyers
join campaign on buyers.email_id = campaign.email_id
group by buyers.email_id
order by recent_email

This gives the following output (email addresses grayed out and showing only the first 7), where recent_email in the table represents the date of the last newsletter email they opened:

Buyers email activity 2014

From this list, we can reach out to these users, maybe with a special offer to re-engage them.

This is really the tip of the iceberg. Using MySQL allows you to really dig through the data and uncover insights. If you’ve used SQL to analyze email or other marketing data, or have any thoughts on what I’ve written, please share your experience in the comments.

12 thoughts on “Deep dive into Mailchimp email data using MySQL”

    1. Hi John,

      The recent_email field is an alias field that I’ve created using the “as” syntax, so won’t be in your csv files. It’s based off the campaign_date, which is a field in my campaign table. Here’s the syntax (in bold) where I created the recent_email alias:

      SELECT buyers.email_id, buyers.email_add, date_format((campaign_date), ‘%Y/%m/%d’) AS recent_email
      FROM buyers
      JOIN campaign ON buyers.email_id = campaign.email_id
      GROUP BY buyers.email_id
      ORDER BY recent_email

      For more info see:

      Thanks, Ben

  1. great article and iam totally new to this and would like to query into mailchimp data as well. but Iam having difficulty. I downloaded the dev art excel add in and i wanted to query into a table called “listmergefields”.

    My question iam trying to to run a sql query to return members for a particular store. And my problem is the “localstore” is stored in the listmergefields table and is not a column. So, how do i retrieve the data out from the table seeing that the list of local store is not a column by itself.

    usually if local_store is a column and if i want to see member who is in a particular local store i could just put something like “where localstore=’store A'” but since it is not a column so iam confuse how to do i .

    1. Hey Anisur,

      The scripts are all listed in the grey boxes in the article. The data is my own MailChimp data, so I can’t share that publicly. The article explains how to download your own CSV of mailchimp data and create a database table from that.


      1. Hi Ben,

        Just found this site, awesome resources.

        I’m thinking that if there was a way to grab the data of opens/clicks that can be downloaded manually from each report through the API, you would have mentioned it, but I might as well ask.

        Is there any way to access this data through the API to avoid having to download each CSV? I’m interested in looking at the dates that each subscriber opens or clicks an email to compare it to the date they join the list, which I could get how you instruct in this article, but I’m kind of lazy and there are many emails to download the reports for.

        Thank you!

  2. Hi!
    Great article and this is the biggest reason why I need to learn MySQL!

    Do you think that this could be achieved without MySQL knowledge using a CRM/CDP/DMP?


  3. Great article Ben! It looks like it’s very helpful to get that kind of analysis done on your email subscribers. I might have to do the same thing for my email subscribers.
    It sounds like there was a bit of manipulation needed to the CSV files to get them to load into MySQL. Is this process something you’ll look to repeat in the future?

Leave a Reply

Your email address will not be published. Required fields are marked *