2015 was a challenging year for humanity. The plight of refugees was never far from the headlines, as multiple catastrophes unfolded across the globe.
This case study has two aims: first, to show facts on the numbers and details of refugees resettled into the U.S. in 2015, and, second, to discuss the methodology to help and inspire others to explore the data.
- Where From?
- Where To?
- What Religious Affiliation?
- Appendix 1: Sourcing the Raw Data
- Appendix 2: Setting up the Data in SQL
- Appendix 3: Analysis in SQL
- Appendix 4: Visualizing in Tableau
The Refugee Processing Center (RPC) provides the technical systems to process refugees resettling in the U.S. and they release public data on the numbers, countries of origin, religious affiliations and destinations of refugees resettled into the U.S.. I’ve investigated the 2015 data and present the analysis and methodology below.
I came across the RPC datasets through the excellent Data Is Plural newsletter – a weekly list of curated data sources.
I wanted to look at the data and answer some basic questions about the origins, destinations and religious affiliations of the refugees.
In the second half of this post, I delve into the methodology for this case study, from sourcing the data to creating the final charts in Tableau.
In 2015, the U.S. accepted a total of 66,517 refugees from 74 different countries.
Here’s a chart showing a breakdown by country, ordered by largest to smallest number (I’m only showing the first 50 of the 74 countries, to fit on the screen):
And here are just the 20 largest, to better see the details:
What’s interesting is that the first 5 countries accounted for almost 75% of the total refugees resettled in the U.S. in 2015, the first 10 countries for just over 90% of the refugees and the first 20 countries for nearly 99% of the total number. The remaining 54 countries therefore accounted for only around 1% of the total.
The following table shows the 10 countries from where the largest number of refugees were resettled in the U.S. in 2015:
|#||Country of Origin||Number of Refugees||Total||% of Total||Cumulative %|
|3||Dem. Rep. Congo||8,206||66,517||12.3||53.9|
To me, Bhutan was the most surprising country on this list. It’s famed for measuring Gross National Happiness and appears to be getting happier, so the large number of resettled refugees seems to contradict the idea of a mountain paradise, a veritable Shangri-La. It seems likely that at least some of these refugees being resettled are from Bhutan’s forgotten people: those of Nepalese ethnicity that lost their citizenship in the 1990’s and have been living in limbo since.
This map shows the Lower 48 U.S. states shaded according to how many refugees they resettled in 2015. The darker colors correspond to larger numbers of refugees, the lighter shades to fewer numbers (so Texas resettled the most, and Montana and Wyoming, did not resettle any).
Note: Alaska and Hawaii were not shown on the map to achieve a better fit on the screen, but both did resettle a small number of refugees in 2015.
The following chart shows the numbers by state:
Zooming in to show just the larger states:
As you’d expect, it’s the largest and most populous U.S. states resettling the most number of refugees.
By combining the state and origin data into a single visualization, I created this tiled map showing which states in the U.S. resettled refugees, from the largest nine origin countries (by count). As before, the darker the shade of the map, the larger the number of refugees resettled.
See the visualization at a larger scale on Tableau here.
What Religious Affiliation?
The following chart shows the major religious groups of the resettled refugees, as listed in the RPC dataset:
At first glance it looks like approximately equal numbers of refugees claim Christian and Moslem affiliations, but look a little closer and you see there are other sub-groups of Moslems and Christians listed (e.g. Moslems who explicitly list Suni or Shiite). So let’s classify the religions into broader groups and see what the totals look like:
|Group||Count||Total||% of Total|
So in fact, the largest group of refugees by religious affiliation were Christian, representing 46% of the refugees resettled. Not far behind, around 40% of refugees identified with a branch of Islam. That left approximately 14% who identified with various other different religious groups, or were atheists, or the information was not available.
Case Study Methodology Notes
Appendix 1: Sourcing the Raw Data
I originally found out about this dataset via the excellent Data is Plural newsletter, which is a weekly collection of interesting and esoteric datasets and recommended to anyone interested in data.
The raw data for refugee numbers is publicly available from the Refugee Processing Center website.
Go to the Reports > Interactive Reporting menu, where you’ll see the following menu:
For this case study, I downloaded datasets 2 and 3 out of the 4 options shown above. Clicking each in turn takes you to the next menu where you can choose the specifics for the dataset. To get the 2015 data, I entered the start and end dates of 2015, selected all of the Nationalities and all of the Religions:
This will create the report in your browser, from where you can download it to your desktop. Tip: Download as CSV, not Excel, because the format is tabular and therefore much easier to work with off the bat.
Rename the two CSV files to remove the spaces, as these can cause issues when we try to import into PostgreSQL later. I went with:
At this stage, you should now have two CSV files: i) Arrivals by Destination and Nationality, and ii) Arrivals by Nationality and Religion:
As a final clean-up step before importing to PostgreSQL, delete the top 3 rows of each Excel sheet (you can see them in the image above). The table column headings start in row 4, so we want to get rid of what’s above.
Appendix 2: Setting up the data in SQL
Let’s now look at importing and analyzing that raw data in SQL. In this case, I’ve used PosgreSQL, although any other SQL environment would suffice.
First, I need to create two tables which I will load my data into. I’ve added id numbers to both (since we didn’t have one in the CSV files) and set all of the fields to “text” datatype for the time being. We’ll adjust this as required in the next step. The code:
CREATE TABLE ref_destination_data ( id SERIAL NOT NULL, Textbox87 text, Textbox82 text, nat_definition4 text, region_name_3 text, textbox37 text, Category3 text, textbox39 text, Assur_DestinationCity1 text, Cases3 text, Cases4 text ); CREATE TABLE ref_religion_data ( id SERIAL NOT NULL, Textbox126 text, Textbox127 text, nat_definition3 text, region_name_2 text, textbox36 text, relig_definition1 text, Cases text, Cases2 text );
creates two new tables in your database. They will show in the object browser window when you refresh.
Next, let’s import the two CSV files, using the following code (replace the filepath with your own, pointing to the CSV files you downloaded in Step 1):
-- import data from CSV COPY ref_destination_data(Textbox87,Textbox82,nat_definition4,region_name_3,textbox37,Category3,textbox39,Assur_DestinationCity1,Cases3,Cases4) FROM '/Users/benlcollins/Documents/mx_arrivals_destination_nationality.csv' CSV HEADER DELIMITER ','; COPY ref_religion_data(Textbox126,Textbox127,nat_definition3,region_name_2,textbox36,relig_definition1,Cases,Cases2) FROM '/Users/benlcollins/Documents/mx_arrivals_nationality_religion.csv' CSV HEADER DELIMITER ',';
Check your data has imported correctly with the following code:
SELECT * FROM ref_destination_data; SELECT * FROM ref_religion_data;
You should get an output looking like this now:
It looks great at first glance!
But take a look down at rows 2845 and 2846 of the ref_destination_data table or rows 336 and 337 of the ref_religion_data table. Something doesn’t look right. Turns out the reports from the Refugee Processing Center each contain two tables, which are duplicated but with different ordering (for example first ordered by Nationality, then by Destination).
So let’s delete these duplicate records from each table, as follows:
-- delete the duplicate data DELETE FROM ref_destination_data WHERE ref_dest_id > 2844; DELETE FROM ref_religion_data WHERE ref_rel_id > 334;
Great, that fixed that issue. Now our data tables are in much better shape, but not quite ready for analysis.
Before we can start, there’s a couple of final preparation steps we need to do. Firstly, we need to convert the datatypes of the number columns from “text” to a suitable number datatype, so that we may perform mathematical operations on them. Secondly, we’ll rename the column headings to something more meaningful.
So, let’s deal with the datatype issue. I need to remove the commas from the text values, with the REPLACE function. Then I cast the column as an integer with the USING clause, because the text columns cannot automatically be cast to integer. The code is as follows:
-- change datatype for numerical columns in destination data ALTER TABLE ref_destination_data ALTER COLUMN textbox37 SET DATA TYPE integer USING (replace(textbox37, ',','')::integer); ALTER TABLE ref_destination_data ALTER COLUMN textbox39 SET DATA TYPE integer USING (replace(textbox39, ',','')::integer); ALTER TABLE ref_destination_data ALTER COLUMN cases3 SET DATA TYPE integer USING (replace(cases3, ',','')::integer); ALTER TABLE ref_destination_data ALTER COLUMN cases4 SET DATA TYPE integer USING (replace(cases4, ',','')::integer); -- change datatype for numerical columns in religion data ALTER TABLE ref_religion_data ALTER COLUMN textbox36 SET DATA TYPE integer USING (replace(textbox36, ',','')::integer); ALTER TABLE ref_religion_data ALTER COLUMN cases SET DATA TYPE integer USING (replace(cases, ',','')::integer); ALTER TABLE ref_religion_data ALTER COLUMN cases2 SET DATA TYPE integer USING (replace(cases2, ',','')::integer);
Lastly, let’s change the column headings to something more meaningful, with the following code:
-- change the column headings for destination data table ALTER TABLE ref_destination_data RENAME textbox87 TO start_date; ALTER TABLE ref_destination_data RENAME textbox82 TO end_date; ALTER TABLE ref_destination_data RENAME nat_definition4 TO state; ALTER TABLE ref_destination_data RENAME region_name_3 TO period; ALTER TABLE ref_destination_data RENAME textbox37 TO num_state; ALTER TABLE ref_destination_data RENAME category3 TO origin; ALTER TABLE ref_destination_data RENAME textbox39 TO num_state_dest; ALTER TABLE ref_destination_data RENAME assur_destinationcity1 TO destination_city; ALTER TABLE ref_destination_data RENAME cases3 TO num_unique; ALTER TABLE ref_destination_data RENAME cases4 TO total; -- change the column headings for religion data table ALTER TABLE ref_religion_data RENAME textbox126 TO start_date; ALTER TABLE ref_religion_data RENAME textbox127 TO end_date; ALTER TABLE ref_religion_data RENAME nat_definition3 TO origin; ALTER TABLE ref_religion_data RENAME region_name_2 TO period; ALTER TABLE ref_religion_data RENAME textbox36 TO num_origin; ALTER TABLE ref_religion_data RENAME relig_definition1 TO religion; ALTER TABLE ref_religion_data RENAME cases TO num_unique; ALTER TABLE ref_religion_data RENAME cases2 TO total;
Well, that’s all the prep we need to do with the tables, so let’s start the analysis.
Appendix 3: Analysis in SQL
To calculate refugee numbers by Country of Origin, we use a SQL query and group by the Country of Origin, as follows:
-- summarize by origin, from highest to lowest SELECT origin , sum(num_unique) AS number_refugees , total , round((sum(num_unique)*1.0/total)*100,1) AS refugee_ratio FROM ref_destination_data GROUP BY origin, total ORDER BY 2 DESC;
which gives an output like this:
To calculate a cumulative % total (so I can easily see how “big” the top 10 or 20 countries are) I used this more complex query, which uses a WITH CTE and a cross-join (cartesian product) to create a running total:
WITH ref_stats AS ( SELECT origin , sum(num_unique) AS number_refugees , total , round((sum(num_unique)*1.0/total)*100,1) AS refugee_ratio FROM ref_destination_data GROUP BY origin, total ORDER BY 2 DESC ) SELECT a.origin, a.number_refugees, a.total, a.refugee_ratio,SUM(b.refugee_ratio) FROM ref_stats a, ref_stats b WHERE b.number_refugees >= a.number_refugees GROUP BY a.origin, a.number_refugees, a.total, a.refugee_ratio ORDER BY 2 DESC LIMIT 20;
The LIMIT clause just restricts the output table to the first 20 results, which is what I’m interested in here. The output from this query is this table shown at the top of the blog post here.
To find the number of refugees resettled into each State, we use a similar SQL query to the first one above, grouping by state this time, as follows:
-- summarize data by state SELECT state , num_state AS state_refs , total AS total_refugees , round((num_state*1.0/total)*100,2) AS refugee_ratio FROM ref_destination_data GROUP BY state, num_state, total ORDER BY 1,2 ASC;
We can modify this slightly to show the states ordered by number of refugees:
SELECT state , num_state AS state_refs , total AS total_refugees , round((num_state*1.0/total)*100,2) AS refugee_ratio FROM ref_destination_data GROUP BY state, num_state, total ORDER BY 2 DESC;
which gives an output like this:
Finally, if we want, we can easily calculate the mean and standard deviation for the number of refugees admitted to each state, and again for the number of refugees from each country of origin:
-- summary statistics by state SELECT round(avg(number_refugees)) as average_by_state , round(stddev(number_refugees)) as std_deviation_by_state FROM ( SELECT state, sum(num_unique) AS number_refugees FROM ref_destination_data GROUP BY state ) t1; -- Average refugees into each state: 1357 -- Standard deviation: 1439 -- summary statistics by Country of Origin SELECT round(avg(number_refugees)) as average_origin , round(stddev(number_refugees)) as std_deviation_origin FROM ( SELECT origin, sum(num_unique) AS number_refugees FROM ref_destination_data GROUP BY origin ) t2; -- Average refugees from each country of Origin: 899 -- Standard deviation: 2717
From the Religion data, we can summarize the data to see which are the most common religions amongst the refugees:
-- summary by religion, high to low SELECT religion , sum(num_unique) as num_religion , total , round((sum(num_unique)*1.0/total)*100,1) AS refugee_ratio FROM ref_religion_data GROUP BY religion, total ORDER BY 2 DESC;
The output is:
SELECT religious_group , SUM(num_religion) , total , round((sum(num_religion)*1.0/total)*100,1) AS group_ratio FROM ( SELECT religion, CASE religion WHEN 'Ahmadiyya' THEN 'Islam' WHEN 'Animist' THEN 'Other' WHEN 'Atheist' THEN 'Other' WHEN 'Bahai' THEN 'Other' WHEN 'Baptist' THEN 'Christian' WHEN 'Buddhist' THEN 'Other' WHEN 'Catholic' THEN 'Christian' WHEN 'Chaldean' THEN 'Christian' WHEN 'Christian' THEN 'Christian' WHEN 'Coptic' THEN 'Christian' WHEN 'Evangelical Christian' THEN 'Christian' WHEN 'Greek Orthodox' THEN 'Christian' WHEN 'Hindu' THEN 'Other' WHEN 'Jehovah Witness' THEN 'Christian' WHEN 'Jewish' THEN 'Other' WHEN 'Kaaka''i' THEN 'Other' WHEN 'Kirat' THEN 'Other' WHEN 'Lutheran' THEN 'Christian' WHEN 'Methodist' THEN 'Christian' WHEN 'Moslem' THEN 'Islam' WHEN 'Moslem Ismaici' THEN 'Islam' WHEN 'Moslem Shiite' THEN 'Islam' WHEN 'Moslem Suni' THEN 'Islam' WHEN 'No Religion' THEN 'Other' WHEN 'Orthodox' THEN 'Christian' WHEN 'Other Religion' THEN 'Other' WHEN 'Pentecostalist' THEN 'Christian' WHEN 'Protestant' THEN 'Christian' WHEN 'Sabeans-Mandean' THEN 'Other' WHEN 'Seventh Day Adventist' THEN 'Christian' WHEN 'Ukr Orthodox' THEN 'Christian' WHEN 'Uniate' THEN 'Christian' WHEN 'Unknown' THEN 'Other' WHEN 'Yazidi' THEN 'Other' WHEN 'Zoroastrian' THEN 'Other' ELSE 'Other' END AS religious_group , sum(num_unique) as num_religion , total FROM ref_religion_data GROUP BY religion, total ) t GROUP BY religious_group, total ORDER BY 2 DESC;
View the entire SQL file here on GitHub.
That’s all the analysis in SQL, so let’s jump over to Tableau and chart this data.
Appendix 4: Visualizing in Tableau
Firstly, we need to connect Tableau Desktop to our PostgreSQL database so we can get hold of the data to chart.
So open up Tableau and click PostgreSQL on the menu item to get to the following Server Connection menu:
For me, the credentials I used were:
Database: Test (but use whatever name you’ve given your database in PostgreSQL)
and then whatever username and password I use in PostgreSQL.
The first chart I created was the column chart showing the breakdown by country of origin. This is based off the ref_destination_data dataset.
I put the “Origin” dimension into the Columns shelf, added the measure “Num Unique” to the Rows shelf and made sure it was set to “SUM”. This gives me a basic column chart, so I changed the color to red, changed the axis names, added data labels and added a title. The following screenshot shows the chart:
I created the column charts for the breakdown by state and by religion the same way.
To create the more detailed column charts showing only the 10 or 20 largest results, I added a filter. Drag the Dimension onto the Filter shelf, click on it to bring up the drop down menu and select Filter…. Next select Top from the four menu items across the top of the pop-up pane and set the conditions to select only the Top 20 based on the value field (Num Unique), as follows:
Next, create a map of the U.S. states showing number of resettled refugees.
Drag Longitude(generated) onto the Columns shelf, and Latitude(generated) onto the Rows shelf. Then drag the State Dimension onto the bottom of the Marks shelf and use the drop down menu to select “Filled Map”, which should give you a map of the U.S. with states shaded blue.
Next add State Dimension to the Filter shelf and filter out Alaska and Hawaii, to focus on the lower 48. Drag the Num Unique Measure onto the Color icon within the Marks shelf, and finally change the color to a red scale, to get the following map:
I created the small-multiples chart for the Top 9 Countries of Origin to show in which states the refugees found new homes. Hat tip to Ben Jones at Data Remixed for his excellent tutorial, which served as a launchpad for my own explorations.
I won’t repeat all of the steps here (I encourage you to check out the Data Remixed tutorial above), but in essence what I did was create a duplicate of the basic map I created above, and then add Calculated Column and Row fields to divide the chart into small multiples.
The calculated fields are added by right-clicking in the Measures pane, then selecting Create Calculated Field… from the menu.
The code for the Row Calculated Field is:
CASE [Origin] WHEN "Burma" THEN 1 WHEN "Somalia" THEN 1 WHEN "Syria" THEN 1 WHEN "Iraq" THEN 2 WHEN "Bhutan" THEN 2 WHEN "Sudan" THEN 2 WHEN "Dem. Rep. Congo" THEN 3 WHEN "Iran" THEN 3 WHEN "Eritrea" THEN 3 END
And the corresponding code for the Column Calculated Field is:
CASE [Origin] WHEN "Burma" THEN 1 WHEN "Iraq" THEN 1 WHEN "Dem. Rep. Congo" THEN 1 WHEN "Somalia" THEN 2 WHEN "Bhutan" THEN 2 WHEN "Iran" THEN 2 WHEN "Syria" THEN 3 WHEN "Sudan" THEN 3 WHEN "Eritrea" THEN 3 END
It’s critical when you add these two calculated fields to their respective Column and Row shelves, that they go before the Longitude and Latitude fields, and that they are set to be Dimensions and Discrete (they should have blue shading).
The following screenshot shows the final chart and the Calculated Row code:
The final step was to create a blank dashboard sheet, then copy this tiled map into it. Then I added titles for each tile, simply as manual text boxes.
This Tableau chart is available for viewing and downloading here.
Phew, it’s a long post, covering a lot of ground. Even so, we’ve only done a relatively basic analysis of the datasets and there is much more we could do.
It would be interesting to download prior year data and compare and contrast findings year-on-year. It would be interesting to cross-reference these findings against the dates of natural or humanitarian crises to calculate the length of time from crisis to resettlement, and see if this is changing with time.
There is also a demographic dataset, including Age, Gender, Ethnicity, Language and Education, available on the RPC website too, so it would be interesting to download and analyze that data.
Thoughts, comments or corrections? Leave a comment below.