For this project I will investigate a sample set of data from the 2011 UK census. Below is a summary of the analysis implemented.
I will now start the analysis of the dataset by importing the modules that will be needed.
import pandas as pd
pd.set_option('display.max_rows', 130)
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('style.mplstyle')
import folium
from ipywidgets import interact, widgets
from IPython.display import display
import numpy as np
import os.path
from clean import clean_data # own module for cleaning data
from textualize import textualize_data # own module to replace numerical data with textual data
from analyse import get_val_counts # own module to analyse a dataframe
from models import three_dim_model
Below are the two variables which contain the path to the raw data and the path to the cleaned data (if it has been created).
raw_path = '../data/census2011.csv'
clean_path = '../data/census2011_cleaned.csv'
First we define a function load_data(path) which loads the raw data from a CSV file into a pandas DataFrame.
def load_data(path):
    try:
        df = pd.read_csv(path)
        return df
    except IOError as e:
        print("There was an error reading data: " + str(e))
        return None
df_raw = load_data(raw_path)
df_raw
Now that the data has been loaded we need to clean it so that it can be used for analysis. This will be done using the clean_data function defined inside clean.py. This file contains a number of functions which can be used to clean data, as well as a function which applies all the different cleaning steps to the data. All these functions take arguments, which allows them to be used for different datasets.
First we will declare some variables which are needed in order to clean the data.
expected_types = {'Region': 'str', 'Residence Type': 'str'} # the expected data types of the given columns
admissible_values = {
"Region": ["E12000001", "E12000002", "E12000003", "E12000004", "E12000005", "E12000006", "E12000007", "E12000008", "E12000009", "W92000004"],
"Residence Type": ["C", "H"],
"Family Composition": [1, 2, 3, 4, 5, 6, -9],
"Population Base": [1, 2, 3],
"Sex": [1, 2],
"Age": [1, 2, 3, 4, 5, 6, 7, 8],
"Marital Status": [1, 2, 3, 4, 5],
"Student": [1, 2],
"Country of Birth": [1, 2, -9],
"Health": [1, 2, 3, 4, 5, -9],
"Ethnic Group": [1, 2, 3, 4, 5, -9],
"Religion": [1, 2, 3, 4, 5, 6, 7, 8, 9, -9],
"Economic Activity": [1, 2, 3, 4, 5, 6, 7, 8, 9, -9],
"Occupation": [1, 2, 3, 4, 5, 6, 7, 8, 9, -9],
"Industry": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -9],
"Hours worked per week": [1, 2, 3, 4, -9],
"Approximated Social Grade": [1, 2, 3, 4, -9]
} # The accepted values for each column in the dataset (If the data is not restricted then the column can simply be left out)
Now we need to call the function using these variables and the raw data that has been loaded.
The data has been cleaned in the following ways (a sketch of these steps is shown after the list):
The data type of each column is compared to the expected data type. If there is not a match, an attempt is made to convert the data; if the conversion cannot be done then nothing is returned and a message is shown.
The values in every column are checked to make sure they are admissible. Any row containing an inadmissible value is removed from the dataframe.
Duplicate data is removed.
The index of the dataframe is also changed to the specified column.
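As an indication of how these steps might fit together, a minimal sketch is shown below. The actual implementation lives in clean.py, so the details here (names, defaults and exact behaviour) are assumptions rather than the real code.
# Illustrative sketch only -- not the real clean.py implementation.
def clean_data_sketch(df, expected_types, admissible_values, index_col):
    df = df.copy()
    # 1. Convert each listed column to its expected type, abandoning the clean if this fails
    for col, dtype in expected_types.items():
        try:
            df[col] = df[col].astype(dtype)
        except (ValueError, TypeError) as e:
            print("Could not convert column '" + col + "' to " + dtype + ": " + str(e))
            return None
    # 2. Drop rows whose value falls outside the admissible set for any restricted column
    for col, allowed in admissible_values.items():
        df = df[df[col].isin(allowed)]
    # 3. Remove all rows sharing a duplicated identifier, as the correct record is unknown
    df = df[~df[index_col].duplicated(keep=False)]
    # 4. Use the identifier column as the index
    return df.set_index(index_col)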
df_clean = clean_data(df_raw, expected_types, admissible_values, 'Person ID')
df_clean.to_csv(clean_path);
The cleaned data is now stored in the df_clean variable and has also been saved for future use at the path specified.
Below are the records that have been removed from the data:
pd.concat([df_raw.set_index('Person ID'),df_clean]).drop_duplicates(keep=False)
As can be seen from the above data, the only issue in the original data is a single duplicated Person ID. Both rows are removed because they have different regions, so it is not possible to determine which is correct or whether the Person ID is in fact the issue.
Now that the data has been loaded and cleaned so that it is suitable for analysis, we need to replace the categorical codes stored in the dataset with textual descriptions so that the results of any analysis are meaningful. This is done using the textualize_data function defined inside textualize.py. This function takes a map of column names onto maps of categorical code to textual representation, allowing it to be used for different datasets.
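For reference, a simplified sketch of how such a function might work for DataFrames is shown below; the real implementation lives in textualize.py (and also handles other inputs used later in the notebook), so treat this as an assumption rather than the actual code.
# Illustrative sketch only -- replace categorical codes with their textual labels.
def textualize_data_sketch(df, mapping):
    df = df.copy()
    # For each column that has a mapping, swap the codes for their textual labels
    for col, code_to_text in mapping.items():
        if col in df.columns:
            df[col] = df[col].map(code_to_text)
    return df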
We first define a map containing the textual representations, called text_rep. This mapping allows the textualisation to be reused in different areas of the notebook and across different datasets, provided an appropriate mapping is given. We then call textualize_data on the cleaned data and store the result in the df_text variable.
text_rep = {
"Region": {
"E12000001": "North East",
"E12000002": "North West",
"E12000003": "Yorkshire and the Humber",
"E12000004": "East Midlands",
"E12000005": "West Midlands",
"E12000006": "East of England",
"E12000007": "London",
"E12000008": "South East",
"E12000009": "South West",
"W92000004": "Wales"
},
"Residence Type": {
"C": "Resident in a communal establishment",
"H": "Not resident in a communal establishment"
},
"Family Composition": {
1: "Not in a family",
2: "Married/same-sex civil partnership couple family",
3: "Cohabiting couple family",
4: "Lone parent family (male head)",
5: "Lone parent family (female head)",
6: "Other related family",
-9: "Not required"
},
"Population Base": {
1: "Usual resident",
2: "Student living away from home",
3: "Short-term resident"
},
"Sex": {
1: "Male",
2: "Female"
},
"Age": {
1: "0 to 15",
2: "16 to 24",
3: "25 to 34",
4: "35 to 44",
5: "45 to 54",
6: "55 to 64",
7: "65 to 74",
8: "75 and over"
},
"Marital Status": {
1: "Single (never married)",
2: "Married",
3: "Seperated but still legally married",
4: "Divorced",
5: "Widowed"
},
"Student": {
1: "Yes",
2: "No"
},
"Country of Birth": {
1: "UK",
2: "Non UK",
-9: "Not required"
},
"Health": {
1: "Very good",
2: "Good",
3: "Fair",
4: "Bad",
5: "Very bad",
-9: "Not required"
},
"Ethnic Group": {
1: "White",
2: "Mixed",
3: "Asian",
4: "Black",
5: "Chinese or other group",
-9: "Not required"},
"Religion": {
1: "None",
2: "Christian",
3: "Buddhist",
4: "Hindu",
5: "Jewish",
6: "Muslim",
7: "Sikh",
8: "Other",
9: "None stated",
-9: "Not required"
},
"Economic Activity": {
1: "Economically active: Employeed",
2: "Economically active: Self-employeed",
3: "Economically active: Unemployed",
4: "Economically active: Full-time student",
5: "Economically inactive: Retired",
6: "Economically inactive: Student",
7: "Economically inactive: Looking after home or family",
8: "Economically inactive: Long-term sick or disabled",
9: "Economically inactive: Other",
-9: "Not required"
},
"Occupation": {
1: "Managers, Directors and Senior Officials",
2: "Professional Occupations",
3: "Associate Professional and Technical Occupations",
4: "Administrative and Secretarial Occupations",
5: "Skilled Trades Occupations",
6: "Caring, Leisure and Other Service Occupations",
7: "Sales and Customer Service Occupations",
8: "Process, Plant and Machine Operatives",
9: "Elementary Occupations",
-9: "Not required"
},
"Industry": {
1: "Agriculture, forestry and fishing",
2: "Mining and quarrying; Manufacturing",
3: "Construction",
4: "Wholesale and retail trade",
5: "Accommodation and food service activities",
6: "Transport and storage; Information and communication",
7: "Financial and insurance activities",
8: "Real estate; Professional, scientific and technical; Administrative and support",
9: "Public administration and defence; compulsory social security",
10: "Education",
11: "Human health and social work activities",
12: "Other community, social and personal service activities",
-9: "Not required"
},
"Hours worked per week": {
1: "<=15",
2: "16-30",
3: "31-48",
4: ">=48",
-9: "Not required"
},
"Approximated Social Grade": {
1: "AB",
2: "C1",
3: "C2",
4: "DE",
-9: "Not required"
}
}
df_num = df_clean.copy()
df_text = textualize_data(df_clean, text_rep)
From this point on the variables df_text and df_num can be used to analyse the textual and numeric data respectively.
To save time, these variables should be reused after the data has been loaded, as reading from the CSV is quite time consuming and there is no reason to reload the data.
First we will look at the size of the dataset. Using len() on the dataframe we can see that there are 569739 records in the clean dataset.
len(df_text)
The table below shows the data type of each variable in the dataset.
df_num.dtypes.to_frame('Datatype')
Region and Residence Type are both strings; however, they are stored as generic objects, as it would not be possible to store arbitrarily sized strings and pandas cannot guarantee that the data will in fact be a string, so it does not specify a string type. All other variables in the dataset are integers. All the data in the dataset is categorical, and the values stored represent a certain category for each variable.
Each of the following subsections shows the different values for each variable as well as how many of each value appear in the dataset. By modifying the sort_val_counts variable below you can set whether or not the value counts should be sorted into descending order.
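As a rough idea of what this helper does, a sketch is shown below; the real code is in analyse.py, so the details are assumptions.
# Illustrative sketch only -- count how often each value of a column occurs.
def get_val_counts_sketch(df, column, sort=True):
    counts = df[column].value_counts()   # descending by count by default
    if not sort:
        counts = counts.sort_index()     # otherwise order by the category labels
    return counts.to_frame('Count')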
sort_val_counts = True
get_val_counts(df_text, 'Region', sort=sort_val_counts)
print("Data type: " + str(df_text['Region'].dtype))
We will also use a bar chart and a map to visualise the distribution of population across the different regions. The map is very useful for comparing which areas are most populous, whereas the bar chart is better for comparing actual values.
Region_Counts = df_text['Region'].value_counts()
plotRegion = Region_Counts.plot(kind='barh')
plotRegion.set_xlabel('Number of Records', fontsize=15)
plotRegion.set_ylabel('Region', fontsize=15)
plotRegion.set_title('Number of Records for Each Region')
regionMap = folium.Map(location=[53,-1.6], tiles='cartodbpositron', zoom_start=6)
folium.Choropleth(
geo_data = '../data/NUTS1.geojson',
data = Region_Counts,
key_on='feature.properties.nuts118nm',
name = 'choropleth',
fill_color = 'GnBu',
fill_opacity = 0.8,
line_opacity = 1
).add_to(regionMap)
regionMap
get_val_counts(df_text, 'Residence Type', sort=sort_val_counts)
get_val_counts(df_text, 'Family Composition', sort=sort_val_counts)
get_val_counts(df_text, 'Population Base', sort=sort_val_counts)
get_val_counts(df_text, 'Sex', sort=sort_val_counts)
get_val_counts(df_text, 'Age', sort=sort_val_counts)
The pie chart below shows the distribution of ages in the population. It is worth noting when looking at this chart that the lowest age range covers more years than the other ranges (excluding 75 and over).
plot = df_text['Age'].value_counts().sort_index().plot(kind='pie', title='Age distribution', autopct='%1.1f%%')
plot.set_ylabel('')
get_val_counts(df_text, 'Marital Status', sort=sort_val_counts)
print("Data type: " + str(df_text['Residence Type'].dtype))
get_val_counts(df_text, 'Student', sort=sort_val_counts)
get_val_counts(df_text, 'Country of Birth', sort=sort_val_counts)
get_val_counts(df_text, 'Health', sort=sort_val_counts)
get_val_counts(df_text, 'Ethnic Group', sort=sort_val_counts)
get_val_counts(df_text, 'Religion', sort=sort_val_counts)
get_val_counts(df_text, 'Economic Activity', sort=sort_val_counts)
plot = df_text['Economic Activity'].value_counts().plot(kind='pie', title='Economic Activity', autopct='%1.1f%%'\
, labeldistance=None)
plot.set_ylabel('')
plot.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1, bbox_to_anchor=(1.5, 0.75), loc='upper right',\
prop={'size': 8})
get_val_counts(df_text, 'Occupation', sort=sort_val_counts)
We will now visualise this data in a bar chart. Please note that the 'Not required' option has been removed from the data for visualisation.
plotOccupation = df_text['Occupation'].value_counts().drop('Not required').plot(kind='barh')
plotOccupation.set_xlabel('Number of Records')
plotOccupation.set_ylabel('Occupation')
plotOccupation.set_title('Number of Records for Each Occupation')
get_val_counts(df_text, 'Industry', sort=sort_val_counts)
get_val_counts(df_text, 'Hours worked per week', sort=sort_val_counts)
get_val_counts(df_text, 'Approximated Social Grade', sort=sort_val_counts)
We are now going to explore how industry varies between regions.
ind_by_region = df_text.groupby(['Industry', 'Region'])['Industry'].count().drop('Not required').unstack('Region')
ind_by_region
We will now look at three different charts to help visualise this data. The first two are stacked bar charts which show the same data but with the axes transposed, followed by a 3D bar chart.
plot = ind_by_region.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
plot = ind_by_region.transpose().plot(kind='barh', stacked=True)
plot.legend(title='Industry', prop={'size': 9}, bbox_to_anchor=(1.9, 0.65), loc='upper right')
plot.set_xlabel('People')
Whilst both of these bar charts display the same data, depending on what you are trying to visualise one can be much more useful than the other. For example, if you want to see what the largest industry is in a given region the second chart is much more useful, whereas if you want to compare the total size of each industry then the first is much better.
three_dim_model(df_text,'Region','Industry')
Whilst it is difficult to read exact values, the 3D chart is still a good visualisation tool for seeing all the data at once and comparing different regions or different industries to each other.
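The three_dim_model function lives in models.py; as an illustration of how such a 3D bar chart could be produced with matplotlib, a hedged sketch is given below. The binning, sizing and styling here are assumptions, not the actual implementation (np and plt are the modules imported at the top of the notebook).
# Illustrative sketch only -- plot counts of (x_col, y_col) pairs as 3D bars.
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

def three_dim_model_sketch(df, x_col, y_col):
    counts = df.groupby([x_col, y_col]).size().unstack(fill_value=0)
    xs, ys = np.meshgrid(np.arange(counts.shape[0]), np.arange(counts.shape[1]), indexing='ij')
    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')
    # One bar per (x_col, y_col) combination, with height equal to the record count
    ax.bar3d(xs.ravel(), ys.ravel(), np.zeros(counts.size), 0.8, 0.8, counts.values.ravel())
    ax.set_xticks(np.arange(counts.shape[0]))
    ax.set_xticklabels(counts.index, fontsize=7)
    ax.set_yticks(np.arange(counts.shape[1]))
    ax.set_yticklabels(counts.columns, fontsize=7)
    ax.set_zlabel('People')
    plt.show()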
sg_by_occ = df_text.groupby(['Occupation', 'Approximated Social Grade'])['Occupation'].count().unstack('Approximated Social Grade')
sg_by_occ = sg_by_occ.drop('Not required').drop('Not required', axis=1)
sg_by_occ
plot = sg_by_occ.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
plot.legend(title='Approximated Social Grade', prop={'size': 9}, bbox_to_anchor=(1, 0.65), loc='upper right')
three_dim_model(df_text,'Occupation','Approximated Social Grade')
We are now going to investigate the distribution of religions across the different regions.
region_to_religion = df_text.groupby(['Region', 'Religion'])['Region'].count()\
.unstack('Religion')
region_to_religion = region_to_religion.drop("Not required", axis=1)
region_to_religion
def drawRegRelPlot(bud, chri, hin, jew, mus, none, nonestated, oth, sikh):
    plot_data = region_to_religion
    # drop the religions that have been deselected in the widgets
    if not bud:
        plot_data = plot_data.drop("Buddhist", axis=1)
    if not chri:
        plot_data = plot_data.drop("Christian", axis=1)
    if not hin:
        plot_data = plot_data.drop("Hindu", axis=1)
    if not jew:
        plot_data = plot_data.drop("Jewish", axis=1)
    if not mus:
        plot_data = plot_data.drop("Muslim", axis=1)
    if not none:
        plot_data = plot_data.drop("None", axis=1)
    if not nonestated:
        plot_data = plot_data.drop("None stated", axis=1)
    if not oth:
        plot_data = plot_data.drop("Other", axis=1)
    if not sikh:
        plot_data = plot_data.drop("Sikh", axis=1)
    plot = plot_data.plot(kind='barh', stacked=True)
    plot.set_xlabel('People')
    plot.legend(title='Religion', prop={'size': 9}, bbox_to_anchor=(1, 1), loc='upper right')
# create widgets
bud = widgets.Checkbox(value=True, description='Buddhist')
chri = widgets.Checkbox(value=True, description='Christian')
hin = widgets.Checkbox(value=True, description='Hindu')
jew = widgets.Checkbox(value=True, description='Jewish')
mus = widgets.Checkbox(value=True, description='Muslim')
none = widgets.Checkbox(value=True, description='None')
nonestated = widgets.Checkbox(value=True, description='None stated')
oth = widgets.Checkbox(value=True, description='Other')
sikh = widgets.Checkbox(value=True, description='Sikh')
# place widgets in container
container2 = widgets.GridBox([bud,chri,hin,jew,mus,none,nonestated,oth,sikh],
layout=widgets.Layout(grid_template_columns="repeat(3, 2in)"))
graph = widgets.interactive_output(drawRegRelPlot, {'bud' : bud, 'chri':chri,\
'hin':hin, 'jew':jew, 'mus':mus, 'none':none,\
'nonestated':nonestated, 'oth':oth, 'sikh':sikh})
display(container2,graph)
The use of widgets in this graph makes it very easy to compare data that would otherwise be difficult to compare: because Christianity makes up such a large proportion, some of the other data is hard to read when everything is plotted together.
We will now visualise this data using a map showing, for each religion, the percentage of each region's population that it represents. The religion can be changed using the drop-down list above the map.
# Convert the counts to a percentage of each region's population
religion_perc = region_to_religion.copy()
for reg in religion_perc.index:
    reg_size = len(df_text[df_text['Region'] == reg])
    religion_perc.loc[reg] = region_to_religion.loc[reg].map(lambda x: x / reg_size * 100)
def generateReligionMap(religion):
    data = religion_perc[religion]
    mapl = folium.Map(location=[53, -1.6], tiles='cartodbpositron', zoom_start=6)
    folium.Choropleth(
        geo_data='../data/NUTS1.geojson',
        data=data,
        key_on='feature.properties.nuts118nm',
        name='choropleth',
        fill_color='GnBu',
        fill_opacity=0.8,
        line_opacity=1,
        legend_name="Percentage of population (%)"
    ).add_to(mapl)
    display(mapl)
res = df_text['Religion'].unique()
res = [x for x in res if x != 'Not required']
interact(generateReligionMap, religion=res)
This map gives a very good visualisation of where a certain religion is more prominent within the population. For example, when looking at the map for Buddhism it becomes very clear that London is by far the region with the highest proportion of Buddhists.
#Use numeric data and then textualize result
dfEconActive = df_num.query('`Economic Activity` < 5 and `Economic Activity` > 0').copy(deep=True)
dfEconActive = textualize_data(dfEconActive, text_rep)
GroupRegAndEcon = dfEconActive.groupby(['Region','Economic Activity'])['Region'].count().unstack()
GroupReg = dfEconActive.groupby(['Region']).size().reset_index(name='counts')
GroupReg
plot = GroupRegAndEcon.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
plot.legend(title='Economic status', prop={'size': 11}, bbox_to_anchor=(1.6, 0.7), loc='upper right')
The above graph and table show that whilst the number of economically active people in each region differs, the proportion of each category is quite similar.
Using the result from the query for economic activity we can now group by age to find economic activity by age.
groupAge = dfEconActive.groupby(['Age']).size().reset_index(name='counts')
groupAgeAndEcon = dfEconActive.groupby(['Age','Economic Activity'])['Age'].count().unstack()
groupAge
plot = groupAgeAndEcon.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
We will now see if anybody in the data has said that they are a student but does not have an economic activity related to being a student.
dfStudentOnBoth = df_num.query('Student == 1 and (`Economic Activity` == -9 or `Economic Activity` == 4 or `Economic Activity` == 6)')
dfStudents = df_num.query('Student == 1')
print("Number of individuals with student status of 1 and with economic activity related to student: ", len(dfStudentOnBoth))
print("Number of individuals with student status of 1: ", len(dfStudents))
print("\nDiscrepancies: ", len(dfStudents) - len(dfStudentOnBoth))
As can be seen there are no such discrepancies in the data.
dfEconAct4and6 = df_num.query('Student == 1 and (`Economic Activity` == 4 or `Economic Activity` == 6)').copy(deep=True)
groupHWPW = dfEconAct4and6.groupby(['Hours worked per week']).size().sort_index()
groupHWPW = textualize_data(groupHWPW, text_rep)
groupHWPW.to_frame()
plot = groupHWPW.drop('Not required').plot(kind='barh', title='Hours worked per week by students')
plot.set_xlabel('Students')
Students who answered 'Not required' to this question have been removed from the data but can easily be added back by removing the drop method call from the cell above.
The first optimisation changed the way the data types of the columns are checked. Rather than specifying the type for each and every column, the program loops through the columns and assumes a specified default type; the few columns that have a different type are indicated to the program using a mapping of column name to data type. This optimisation allows the cleaning to scale: if a column with a different data type is added, the user only needs to specify it in one place and the program will still work. It also allows the function to be applied to different datasets, as long as a different mapping is provided.
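A sketch of this idea is given below; the chosen default type and exact behaviour are assumptions, as the real code lives in clean.py.
# Illustrative sketch only -- assume a default type unless a column is listed in expected_types.
DEFAULT_TYPE = 'int64'  # assumption: most census columns hold integer codes

def check_types_sketch(df, expected_types):
    for col in df.columns:
        target = expected_types.get(col, DEFAULT_TYPE)
        try:
            df[col] = df[col].astype(target)
        except (ValueError, TypeError) as e:
            print("Could not convert column '" + col + "' to " + target + ": " + str(e))
            return None
    return df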
The next major optimisation was the addition of a large dictionary that maps values in the raw data onto text. This allows all the methods to provide textual representations by looping through the dictionary and replacing the values in each column. It is used throughout the notebook to change values to text so that the models can be read with ease, and it allows the functions to be applied to any dataset as long as a dictionary with the text mappings is provided. If the user wants to change the text mapping for a column, all they need to do is change the value in the dictionary and it will be reflected everywhere. I chose a dictionary for the value mappings because Python dictionaries are implemented as hash maps, which allows us to associate a name (the column name) with a set of values and retrieve those values in O(1) time.
One of the most time consuming parts of analysing the data is cleansing it. To improve efficiency and reduce the time taken, it is recommended that once the data has been loaded and cleansed a clean version is saved, so that in future the clean data can be loaded directly and the cleansing only needs to be performed again if the data changes or if additional cleansing is added.
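For example, a hedged sketch of this load-or-clean pattern, using the paths and functions defined earlier in the notebook:
# Illustrative sketch only -- reuse the saved clean file when it already exists.
if os.path.isfile(clean_path):
    df_clean = pd.read_csv(clean_path).set_index('Person ID')
else:
    df_clean = clean_data(load_data(raw_path), expected_types, admissible_values, 'Person ID')
    df_clean.to_csv(clean_path)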
Removing rows which contain inadmissible values from the data can be very time consuming; however, this process was made much faster by first checking the unique values in each column, because if these are all admissible then there is no need to try to remove any rows. This performance benefit is only seen if the data is mostly admissible.
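A sketch of that fast path is shown below; again this is an assumption rather than the exact code in clean.py.
# Illustrative sketch only -- skip the row-wise filter when every value is already admissible.
def drop_inadmissible_sketch(df, admissible_values):
    for col, allowed in admissible_values.items():
        if not set(df[col].unique()).issubset(allowed):
            df = df[df[col].isin(allowed)]
    return df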
The main focus of many of the optimisations was to improve the modularity of the code in the notebook. These changes included moving the code into separate modules and then making the functions more general, so that they can be applied to other datasets simply by changing the values passed into them, avoiding hard-coded plots and analysis.
The whole notebook has been written in a way that makes it reproducible. For example, if the next UK census contained the same variables then the same notebook could be used by simply changing the path to the new file containing the data.
The cleansing functionality is completely reusable and is designed to work with any dataframe, as long as the correct arguments describing the dataset are passed in. This is also true of the function which provides textual versions of the data.
There are some values in the dataset that have to be hard coded because they are specific to the dataset being used.
The data used for the UK NUTS1 boundaries comes from: https://data.gov.uk/dataset/37edc0ad-ffff-47c9-a01c-cb8d6123ec79/nuts-level-1-january-2018-ultra-generalised-clipped-boundaries-in-the-united-kingdom
This data has been modified to remove Scotland and Northern Ireland, and some names have been changed to match the region names in this dataset.
Some influence has been taken from https://towardsdatascience.com/choropleth-maps-with-folium-1a5b8bcdd392 for how to create a choropleth map.