2011 UK Census Analysis

Python

Data Analysis with Pandas


Introduction

For this project I will investigate a sample set of data from the 2011 UK census. Below is a summary of the analysis implemented:

  • All data is cleaned to ensure that it is readable, of the correct format, and that all values are valid and as expected.
  • Descriptive analysis of the dataset is performed, finding the total number of records, the type of each variable, and a list of the values taken by each column together with the frequency of each value (except Person ID, as this is unique).
  • I have included the following plots under the relevant variable sections:
    • Bar chart showing the number of records per region
    • Bar chart showing the number of records for each occupation
    • Pie chart showing the distribution of the sample by age
    • Pie chart showing the distribution of the sample by economic activity
  • All analysis provides textual interpretations for the raw values contained in the dataset.
  • Tables have been produced which show the following:
    • The number of records in the sample for each region and industry combination
    • The number of records in the sample for each occupation and social grade combination
  • Pandas queries are used to investigate:
    • The number of economically active people per region
    • The number of economically active people by age
    • If there are any discrepancies between the student status and economic activity response
    • The number of hours worked per week for students
  • 3D plots are used to show:
    • The relationship between region and industry
    • The relationship between occupation and social grade
  • Some graphs use ipywidgets to allow the user to choose which data to display (Religion by Region).
  • A virtual environment has been used, and a requirements.txt is included in the submission so that the user can quickly install all required packages.
  • The performance of certain parts of the analysis has been investigated and recommendations are included on how to use the solutions efficiently.
  • Stacked bar charts have been used alongside the 3D plots
  • Unit testing of cleaning functionality

I will now start the analysis of the dataset, beginning by importing the modules that will be needed.

To install all the modules used, run pip3 install -r requirements.txt.
In [1]:
import pandas as pd
pd.set_option('display.max_rows', 130) 
In [2]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('style.mplstyle')
In [3]:
import folium
In [4]:
from ipywidgets import interact, widgets
from IPython.display import display
In [5]:
import numpy as np
import os.path
In [6]:
from clean import clean_data # own module for cleaning data
from textualize import textualize_data # own module to replace numerical data with textual data
from analyse import get_val_counts # own module to analyse a dataframe
from models import three_dim_model

Below are the two variables containing the path to the raw data and the path to the cleaned data (once it has been created).

In [7]:
raw_path = '../data/census2011.csv'
clean_path = '../data/census2011_cleaned.csv'

Loading and cleaning the data

First we define a function load_data(path) which loads raw data from a CSV into a pandas DataFrame.

In [8]:
def load_data(path):
    try:
        df = pd.read_csv(path)
        return df
    except IOError as e:
        print("There was an error reading data: " + str(e))
        return
In [9]:
df_raw = load_data(raw_path)
df_raw
Out[9]:
Person ID Region Residence Type Family Composition Population Base Sex Age Marital Status Student Country of Birth Health Ethnic Group Religion Economic Activity Occupation Industry Hours worked per week Approximated Social Grade
0 7394816 E12000001 H 2 1 2 6 2 2 1 2 1 2 5 8 2 -9 4
1 7394745 E12000001 H 5 1 1 4 1 2 1 1 1 2 1 8 6 4 3
2 7395066 E12000001 H 3 1 2 4 1 2 1 1 1 1 1 6 11 3 4
3 7395329 E12000001 H 3 1 2 2 1 2 1 2 1 2 1 7 7 3 2
4 7394712 E12000001 H 3 1 1 5 4 2 1 1 1 2 1 1 4 3 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
569736 7946020 W92000004 H 1 1 1 5 1 2 1 4 1 9 1 8 8 3 3
569737 7944310 W92000004 H 3 1 1 3 1 2 1 2 1 1 1 7 4 3 4
569738 7945374 W92000004 H 3 1 1 1 1 1 1 1 1 2 -9 -9 -9 -9 -9
569739 7944768 W92000004 H 1 1 2 8 5 2 1 3 1 9 5 9 2 -9 4
569740 7944959 W92000004 H 2 1 2 2 2 2 1 2 1 1 1 7 4 1 4

569741 rows × 18 columns

Now that the data has been loaded, we need to clean it so that it can be used for analysis. This will be done using the clean_data function defined inside clean.py. This file contains a number of functions which can be used to clean data, as well as a function which applies all the different cleaning steps to the data. All these functions take arguments which allow them to be used for different datasets.

First we will declare some variables which are needed in order to clean the data.

In [10]:
expected_types = {'Region': 'str', 'Residence Type': 'str'} # the expected data types of the given columns
In [11]:
admissible_values = {
    "Region": ["E12000001", "E12000002", "E12000003", "E12000004", "E12000005", "E12000006", "E12000007", "E12000008", "E12000009", "W92000004"],
    "Residence Type": ["C", "H"],
    "Family Composition": [1, 2, 3, 4, 5, 6, -9],
    "Population Base": [1, 2, 3],
    "Sex": [1, 2],
    "Age": [1, 2, 3, 4, 5, 6, 7, 8],
    "Marital Status": [1, 2, 3, 4, 5],
    "Student": [1, 2],
    "Country of Birth": [1, 2, -9],
    "Health": [1, 2, 3, 4, 5, -9],
    "Ethnic Group": [1, 2, 3, 4, 5, -9],
    "Religion": [1, 2, 3, 4, 5, 6, 7, 8, 9, -9],
    "Economic Activity": [1, 2, 3, 4, 5, 6, 7, 8, 9, -9],
    "Occupation": [1, 2, 3, 4, 5, 6, 7, 8, 9, -9],
    "Industry": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -9],
    "Hours worked per week": [1, 2, 3, 4, -9],
    "Approximated Social Grade": [1, 2, 3, 4, -9]
} # The accepted values for each column in the dataset (If the data is not restricted then the column can simply be left out)

Now we need to call the function using these variables and the raw data that has been loaded.

The data is cleaned in the following ways:

  • The data type of each column is compared to the expected data type. If they do not match then an attempt is made to convert the data. If this conversion cannot be done then nothing is returned and a message is shown.

  • The values in every column are checked to make sure that each is an admissible value. Any row containing an inadmissible value is removed from the DataFrame.

  • Duplicate data is removed:

    • If there are two identical rows in the data then one will be kept.
    • If there are two rows with the same index (Person ID) but with any difference in the values being stored then both rows are removed.

The index of the dataframe will also be changed to the specified column.
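
As a rough illustration, a function with this behaviour could be structured as follows (a simplified sketch, not the actual contents of clean.py):

def clean_data_sketch(df, expected_types, admissible_values, index_col):
    df = df.copy()
    # Try to convert any column whose dtype does not match the expected type
    for col, expected in expected_types.items():
        if str(df[col].dtype) != expected:
            try:
                df[col] = df[col].astype(expected)
            except (ValueError, TypeError) as e:
                print("Could not convert column '" + col + "': " + str(e))
                return None
    # Drop any row containing a value outside the admissible set for its column
    for col, allowed in admissible_values.items():
        df = df[df[col].isin(allowed)]
    # Keep one copy of fully identical rows, then remove every row whose index
    # value is duplicated with conflicting data
    df = df.drop_duplicates()
    df = df.set_index(index_col)
    return df[~df.index.duplicated(keep=False)]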

In [12]:
df_clean = clean_data(df_raw, expected_types, admissible_values, 'Person ID')
df_clean.to_csv(clean_path);

The cleaned data is now stored in the df_clean variable and has also been saved to the specified path for future use.

Below shows the records that have been removed from the data:

In [13]:
pd.concat([df_raw.set_index('Person ID'),df_clean]).drop_duplicates(keep=False)
Out[13]:
Region Residence Type Family Composition Population Base Sex Age Marital Status Student Country of Birth Health Ethnic Group Religion Economic Activity Occupation Industry Hours worked per week Approximated Social Grade
Person ID
7959238 E12000008 H 2 1 1 6 2 2 2 2 3 4 1 8 3 3 4
7959238 W92000004 H 2 1 1 6 2 2 2 2 3 4 1 8 3 3 4

As can be seen from the above, the only issue in the original data is a single duplicated Person ID. Both rows are removed as they have different regions, so it is not possible to determine which is correct or whether the Person ID is in fact the issue.

Textual Representation of Data

Now that the data has been loaded and cleaned so that it is suitable for analysis, for the results of any analysis to be meaningful we need to replace the categorical codes stored in the dataset with textual descriptions. This is done using the textualize_data function defined inside textualize.py. This function takes a map from column names to maps of categorical code to textual representation, allowing it to be used for different datasets.

We first define a map containing the textual representations, called text_rep. This mapping allows the same code to be reused in different areas of the notebook and across different datasets, provided an appropriate mapping is given. We then call textualize_data on the clean data and store the result in the df_text variable.
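
For reference, a minimal sketch of what a function like textualize_data could do (an assumption for illustration, not the code actually used in textualize.py):

def textualize_sketch(data, text_rep):
    # For a Series (e.g. a grouped count), rename its index labels using the
    # mapping for the column that the index represents
    if isinstance(data, pd.Series):
        mapping = text_rep.get(data.name or data.index.name, {})
        return data.rename(index=mapping)
    # For a DataFrame, replace the coded values column by column
    data = data.copy()
    for col, mapping in text_rep.items():
        if col in data.columns:
            data[col] = data[col].map(mapping).fillna(data[col])
    return data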

In [14]:
text_rep = {
    "Region": {
      "E12000001": "North East",
      "E12000002": "North West",
      "E12000003": "Yorkshire and the Humber",
      "E12000004": "East Midlands",
      "E12000005": "West Midlands",
      "E12000006": "East of England",
      "E12000007": "London",
      "E12000008": "South East",
      "E12000009": "South West",
      "W92000004": "Wales"
    },
    "Residence Type": {
      "C": "Resident in a communal establishment",
      "H": "Not resident in a communal establishment"
    },
    "Family Composition": {
      1: "Not in a family",
      2: "Married/same-sex civil partnership couple family",
      3: "Cohabiting couple family",
      4: "Lone parent family (male head)",
      5: "Lone parent family (female head)",
      6: "Other related family",
      -9: "Not required"
    },
    "Population Base": {
      1: "Usual resident",
      2: "Student living away from home",
      3: "Short-term resident"
    },
    "Sex": {
      1: "Male",
      2: "Female"
    },
    "Age": {
      1: "0 to 15",
      2: "16 to 24",
      3: "25 to 34",
      4: "35 to 44",
      5: "45 to 54",
      6: "55 to 64",
      7: "65 to 74",
      8: "75 and over"
    },
    "Marital Status": {
      1: "Single (never married)",
      2: "Married",
      3: "Seperated but still legally married",
      4: "Divorced",
      5: "Widowed"
    },

    "Student": {
      1: "Yes",
      2: "No"
    },
    "Country of Birth": {
      1: "UK",
      2: "Non UK",
      -9: "Not required"
    },
    "Health": {
      1: "Very good",
      2: "Good",
      3: "Fair",
      4: "Bad",
      5: "Very bad",
      -9: "Not required"
    },
    "Ethnic Group": {
      1: "White",
      2: "Mixed",
      3: "Asian",
      4: "Black",
      5: "Chinese or other group",
      -9: "Not required"},

    "Religion": {
      1: "None",
      2: "Christian",
      3: "Buddhist",
      4: "Hindu",
      5: "Jewish",
      6: "Muslim",
      7: "Sikh",
      8: "Other",
      9: "None stated",
      -9: "Not required"
    },
    "Economic Activity": {
      1: "Economically active: Employeed",
      2: "Economically active: Self-employeed",
      3: "Economically active: Unemployed",
      4: "Economically active: Full-time student",
      5: "Economically inactive: Retired",
      6: "Economically inactive: Student",
      7: "Economically inactive: Looking after home or family",
      8: "Economically inactive: Long-term sick or disabled",
      9: "Economically inactive: Other",
      -9: "Not required"
    },

    "Occupation": {
      1: "Managers, Directors and Senior Officials",
      2: "Professional Occupations",
      3: "Associate Professional and Technical Occupations",
      4: "Administrative and Secretarial Occupations",
      5: "Skilled Trades Occupations",
      6: "Caring, Leisure and Other Service Occupations",
      7: "Sales and Customer Service Occupations",
      8: "Process, Plant and Machine Operatives",
      9: "Elementary Occupations",
      -9: "Not required"
    },

    "Industry": {
      1: "Agriculture, forestry and fishing",
      2: "Mining and quarrying; Manufacturing",
      3: "Construction",
      4: "Wholesale and retail trade",
      5: "Accommodation and food service activities",
      6: "Transport and storage; Information and communication",
      7: "Financial and insurance activities",
      8: "Real estate; Professional, scientific and technical; Administrative and support",
      9: "Public administration and defence; compulsory social security",
      10: "Education",
      11: "Human health and social work activities",
      12: "Other community, social and personal service activities",
      -9: "Not required"
    },

    "Hours worked per week": {
      1: "<=15",
      2: "16-30",
      3: "31-48",
      4: ">=48",
      -9: "Not required"
    },
    "Approximated Social Grade": {
      1: "AB",
      2: "C1",
      3: "C2",
      4: "DE",
      -9: "Not required"
    }
}
In [15]:
df_num  = df_clean.copy()
df_text = textualize_data(df_clean, text_rep)

From this point on the variables df_text and df_num can be used to analyse the textual/numeric data respectively.

To save time, these variables should be reused once the data has been loaded, as reading from the CSV is quite time consuming and there is no reason for the data to be reloaded.

Descriptive Analysis of the Dataset

First we will look at the size of the dataset. Using len() on the dataframe we can see that there are 569739 records in the clean dataset.

In [16]:
len(df_text)
Out[16]:
569739

Data Types

The table below shows the data type of each variable in the dataset.

In [17]:
df_num.dtypes.to_frame('Datatype')
Out[17]:
Datatype
Region object
Residence Type object
Family Composition int64
Population Base int64
Sex int64
Age int64
Marital Status int64
Student int64
Country of Birth int64
Health int64
Ethnic Group int64
Religion int64
Economic Activity int64
Occupation int64
Industry int64
Hours worked per week int64
Approximated Social Grade int64

Region and Residence Type are both strings; however, these are stored as generic objects because it is not possible to store arbitrarily sized strings in a fixed-width type, and pandas cannot guarantee that the data will in fact be strings, so it does not specify a string dtype. All other variables in the dataset are integers. All the data in the dataset is categorical, and the values stored represent a particular category for each variable.
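
If an explicit string type were preferred, recent versions of pandas (1.0 and later) do offer an opt-in nullable string dtype, which is not used here but could be applied as follows:

# Optional: pandas' dedicated, nullable string dtype (not applied in this notebook)
df_num[['Region', 'Residence Type']].astype('string').dtypes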

Variables in data

Each of the following subsections shows the values taken by each variable as well as how many times each value occurs in the dataset. By modifying the sort_val_counts variable below you can set whether or not the value counts should be sorted into descending order.

In [18]:
sort_val_counts = True
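
The counts in these subsections come from get_val_counts in analyse.py. A plausible sketch of such a helper (an assumption for illustration, not the exact implementation) is:

def get_val_counts_sketch(df, column, sort=True):
    counts = df[column].value_counts()   # descending by frequency
    if not sort:
        counts = counts.sort_index()     # otherwise order by the value itself
    return counts.to_frame('Count')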

Region

In [19]:
get_val_counts(df_text, 'Region', sort=sort_val_counts)
Out[19]:
Count
South East 88083
London 83582
North West 71436
East of England 59411
West Midlands 56875
South West 53774
Yorkshire and the Humber 53471
East Midlands 45782
Wales 30976
North East 26349
In [20]:
print("Data type: " + str(df_text['Region'].dtype))
Data type: object

We will also use a bar chart and a map to visualise the distribution of the sample across the different regions. The map is very useful for comparing which areas are most populous, whereas the bar chart is better for comparing actual values.

In [21]:
Region_Counts = df_text['Region'].value_counts()
plotRegion = Region_Counts.plot(kind='barh')
plotRegion.set_xlabel('Number of Records', fontsize=15)
plotRegion.set_ylabel('Region', fontsize=15)
plotRegion.set_title('Number of Records for Each Region')
Out[21]:
Text(0.5, 1.0, 'Number of Records for Each Region')
In [22]:
regionMap = folium.Map(location=[53,-1.6], tiles='cartodbpositron', zoom_start=6)
folium.Choropleth(
    geo_data = '../data/NUTS1.geojson',
    data = Region_Counts,
    key_on='feature.properties.nuts118nm',
    name = 'choropleth',
    fill_color = 'GnBu',
    fill_opacity = 0.8,
    line_opacity = 1
).add_to(regionMap)
regionMap
Out[22]:
[Interactive choropleth map showing the number of records in each region]

Residence Type

In [23]:
get_val_counts(df_text, 'Residence Type', sort=sort_val_counts)
Out[23]:
Count
Not resident in a communal establishment 559085
Resident in a communal establishment 10654

Family Composition

In [24]:
get_val_counts(df_text, 'Family Composition', sort=sort_val_counts)
Out[24]:
Count
Married/same-sex civil partnership couple family 300960
Not in a family 96690
Cohabiting couple family 72641
Lone parent family (female head) 64519
Not required 18851
Lone parent family (male head) 9848
Other related family 6230

Population Base

In [25]:
get_val_counts(df_text, 'Population Base', sort=sort_val_counts)
Out[25]:
Count
Usual resident 561038
Student living away from home 6730
Short-term resident 1971

Sex

In [26]:
get_val_counts(df_text, 'Sex', sort=sort_val_counts)
Out[26]:
Count
Female 289172
Male 280567

Age

In [27]:
get_val_counts(df_text, 'Age', sort=sort_val_counts)
Out[27]:
Count
0 to 15 106832
35 to 44 78641
45 to 54 77388
25 to 34 75948
16 to 24 72785
55 to 64 65664
65 to 74 48777
75 and over 43704

The pie chart below shows the distribution of ages in the sample. It is worth noting when reading this chart that the lowest age range (0 to 15) covers more years than the other ranges (excluding 75 and over, which is open-ended).

In [28]:
plot = df_text['Age'].value_counts().sort_index().plot(kind='pie', title='Age distribution', autopct='%1.1f%%')
plot.set_ylabel('')
Out[28]:
Text(0, 0.5, '')

Marital Status

In [29]:
get_val_counts(df_text, 'Marital Status', sort=sort_val_counts)
Out[29]:
Count
Single (never married) 270999
Married 214178
Divorced 40713
Widowed 31898
Seperated but still legally married 11951
In [30]:
print("Data type: " + str(df_text['Residence Type'].dtype))
Data type: object

Student

In [31]:
get_val_counts(df_text, 'Student', sort=sort_val_counts)
Out[31]:
Count
No 443202
Yes 126537

Country of Birth

In [32]:
get_val_counts(df_text, 'Country of Birth', sort=sort_val_counts)
Out[32]:
Count
UK 485645
Non UK 77290
Not required 6804

Health

In [33]:
get_val_counts(df_text, 'Health', sort=sort_val_counts)
Out[33]:
Count
Very good 264971
Good 191742
Fair 74480
Bad 24558
Very bad 7184
Not required 6804

Ethnic Group

In [34]:
get_val_counts(df_text, 'Ethnic Group', sort=sort_val_counts)
Out[34]:
Count
White 483477
Asian 42710
Black 18786
Mixed 12209
Not required 6804
Chinese or other group 5753

Religion

In [35]:
get_val_counts(df_text, 'Religion', sort=sort_val_counts)
Out[35]:
Count
Christian 333481
None 141658
None stated 40613
Muslim 27240
Hindu 8212
Not required 6804
Sikh 4215
Jewish 2572
Buddhist 2538
Other 2406

Economic Activity

In [36]:
get_val_counts(df_text, 'Economic Activity', sort=sort_val_counts)
Out[36]:
Count
Economically active: Employeed 216023
Not required 112618
Economically inactive: Retired 97480
Economically active: Self-employeed 40632
Economically inactive: Student 24756
Economically active: Unemployed 18109
Economically inactive: Long-term sick or disabled 17991
Economically inactive: Looking after home or family 17945
Economically active: Full-time student 14117
Economically inactive: Other 10068
In [37]:
plot = df_text['Economic Activity'].value_counts().plot(kind='pie', title='Economic Activity', autopct='%1.1f%%'\
                                                                   , labeldistance=None)
plot.set_ylabel('')
plot.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1, bbox_to_anchor=(1.5, 0.75), loc='upper right',\
           prop={'size': 8})
Out[37]:
<matplotlib.legend.Legend at 0x7fe78b4e6e10>

Occupation

In [38]:
get_val_counts(df_text, 'Occupation', sort=sort_val_counts)
Out[38]:
Count
Not required 149984
Professional Occupations 64111
Elementary Occupations 58483
Administrative and Secretarial Occupations 53254
Skilled Trades Occupations 48546
Associate Professional and Technical Occupations 44937
Managers, Directors and Senior Officials 39788
Sales and Customer Service Occupations 38523
Caring, Leisure and Other Service Occupations 37297
Process, Plant and Machine Operatives 34816

We will now visualise this data in a bar chart. Please note that the 'Not required' category has been removed from the data for visualisation.

In [39]:
plotOccupation = df_text['Occupation'].value_counts().drop('Not required').plot(kind='barh')
plotOccupation.set_xlabel('Number of Records')
plotOccupation.set_ylabel('Occupation')
plotOccupation.set_title('Number of Records for Each Occupation')
Out[39]:
Text(0.5, 1.0, 'Number of Records for Each Occupation')

Industry

In [40]:
get_val_counts(df_text, 'Industry', sort=sort_val_counts)
Out[40]:
Count
Not required 149984
Wholesale and retail trade 68878
Mining and quarrying; Manufacturing 53433
Real estate; Professional, scientific and technical; Administrative and support 49960
Human health and social work activities 49345
Education 40560
Transport and storage; Information and communication 35240
Construction 30706
Accommodation and food service activities 25736
Public administration and defence; compulsory social security 24908
Other community, social and personal service activities 20256
Financial and insurance activities 16776
Agriculture, forestry and fishing 3957

Hours worked per week

In [41]:
get_val_counts(df_text, 'Hours worked per week', sort=sort_val_counts)
Out[41]:
Count
Not required 302321
31-48 153936
16-30 52133
>=48 35573
<=15 25776

Approximated Social Grade

In [42]:
get_val_counts(df_text, 'Approximated Social Grade', sort=sort_val_counts)
Out[42]:
Count
C1 159642
Not required 124103
DE 123738
AB 82320
C2 79936

Distribution of industry by region

We are now going to explore how industry varies between regions.

In [43]:
ind_by_region = df_text.groupby(['Industry', 'Region'])['Industry'].count().drop('Not required').unstack('Region')
ind_by_region
Out[43]:
Region East Midlands East of England London North East North West South East South West Wales West Midlands Yorkshire and the Humber
Industry
Accommodation and food service activities 1936 2336 4054 1300 3355 3566 2712 1641 2281 2555
Agriculture, forestry and fishing 429 616 55 132 357 510 697 403 396 362
Construction 2500 3508 3700 1574 3778 4895 3033 1730 2960 3028
Education 3280 4291 5373 1836 4890 6593 4039 2313 3984 3961
Financial and insurance activities 844 2199 4016 524 1597 3059 1510 594 1149 1284
Human health and social work activities 3952 4977 6109 2524 6764 7344 5085 2887 4848 4855
Mining and quarrying; Manufacturing 6057 5302 3266 2851 7726 6651 5012 3504 7108 5956
Other community, social and personal service activities 1414 2111 3735 848 2299 3316 1977 992 1828 1736
Public administration and defence; compulsory social security 1751 2514 3015 1498 3096 4145 2973 1714 2061 2141
Real estate; Professional, scientific and technical; Administrative and support 3299 5465 9968 1883 5822 8708 4475 2010 4284 4046
Transport and storage; Information and communication 2518 3841 6570 1438 3981 6502 2961 1500 3141 2788
Wholesale and retail trade 6133 7503 8204 3087 9016 10548 6899 3751 7067 6670

We will now look at three different graphs to help visualise this data. The first two are stacked bar charts which both show the same data but with the table transposed, followed by a 3D bar chart.

In [44]:
plot = ind_by_region.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
Out[44]:
Text(0.5, 0, 'People')
In [45]:
plot = ind_by_region.transpose().plot(kind='barh', stacked=True)
plot.legend(title='Industry', prop={'size': 9}, bbox_to_anchor=(1.9, 0.65), loc='upper right')
plot.set_xlabel('People')
Out[45]:
Text(0.5, 0, 'People')

Whilst both of these bar charts display the same data, depending on what you are trying to visualise one can be much more useful than the other. For example, if you want to see which industry is largest in a given region the second graph is much more useful, whereas if you want to compare the total size of each industry then the first is much better.

In [46]:
three_dim_model(df_text,'Region','Industry')

Whilst it is difficult to read exact values, the 3D chart is still a good visualisation tool for seeing all the data at once and comparing different regions or different industries to each other.
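
For reference, a chart of this kind could be produced with matplotlib's 3D toolkit roughly as follows; this is an illustrative sketch, not the actual implementation of models.three_dim_model:

from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

def three_dim_sketch(df, col_a, col_b):
    counts = df.groupby([col_a, col_b])[col_a].count().unstack(col_b).fillna(0)
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    xs, ys = np.meshgrid(np.arange(counts.shape[1]), np.arange(counts.shape[0]))
    # One bar per (col_a, col_b) combination, with height equal to the count
    ax.bar3d(xs.ravel(), ys.ravel(), 0, 0.5, 0.5, counts.values.ravel())
    ax.set_xticks(np.arange(counts.shape[1]))
    ax.set_xticklabels(counts.columns, fontsize=7)
    ax.set_yticks(np.arange(counts.shape[0]))
    ax.set_yticklabels(counts.index, fontsize=7)
    ax.set_zlabel('People')
    return ax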

Relationship between occupation and social grade

In [47]:
sg_by_occ = df_text.groupby(['Occupation', 'Approximated Social Grade'])['Occupation'].count().unstack('Approximated Social Grade')
sg_by_occ = sg_by_occ.drop('Not required').drop('Not required', axis=1)
sg_by_occ
Out[47]:
Approximated Social Grade AB C1 C2 DE
Occupation
Administrative and Secretarial Occupations 3000 44922 2353 2252
Associate Professional and Technical Occupations 7050 35435 647 986
Caring, Leisure and Other Service Occupations 1061 6343 15555 13860
Elementary Occupations 902 7010 6500 42433
Managers, Directors and Senior Officials 19190 18555 584 967
Process, Plant and Machine Operatives 413 1719 11157 21086
Professional Occupations 48104 13223 891 1009
Sales and Customer Service Occupations 964 12184 2997 21347
Skilled Trades Occupations 585 2464 37190 7629
In [48]:
plot = sg_by_occ.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
plot.legend(title='Approximated Social Grade', prop={'size': 9}, bbox_to_anchor=(1, 0.65), loc='upper right')
Out[48]:
<matplotlib.legend.Legend at 0x7fe77af8dd90>
In [49]:
three_dim_model(df_text,'Occupation','Approximated Social Grade')

Relationship between region and religion

We are now going to investigate the distribution of religions across the different regions.

In [50]:
region_to_religion = df_text.groupby(['Region', 'Religion'])['Region'].count()\
                    .unstack('Religion')

region_to_religion = region_to_religion.drop("Not required", axis=1)
region_to_religion
Out[50]:
Religion Buddhist Christian Hindu Jewish Muslim None None stated Other Sikh
Region
East Midlands 139 26733 896 29 1386 12399 3074 174 429
East of England 246 35051 546 323 1476 16374 4205 240 185
London 819 39992 4136 1498 10316 17062 7045 451 1266
North East 65 17537 59 50 460 6176 1630 77 47
North West 214 47316 388 298 3602 14168 4436 185 80
South East 447 52165 906 164 1977 23798 6334 419 547
South West 189 31942 172 61 496 15638 4187 305 62
Wales 103 17612 109 17 442 9860 2358 116 24
West Midlands 158 33727 726 43 3772 12494 3735 275 1334
Yorkshire and the Humber 158 31406 274 89 3313 13689 3609 164 241
In [51]:
def drawRegRelPlot(bud,chri,hin,jew,mus,none,nonestated,oth,sikh):
    plot_data = region_to_religion
    # drop individual religions based on the widget input
    if(not bud):
        plot_data = plot_data.drop("Buddhist", axis=1)
    if(not chri):
        plot_data = plot_data.drop("Christian", axis=1)
    if(not hin):
        plot_data = plot_data.drop("Hindu", axis=1)
    if(not jew):
        plot_data = plot_data.drop("Jewish", axis=1)
    if(not mus):
        plot_data = plot_data.drop("Muslim", axis=1)
    if(not none):
        plot_data = plot_data.drop("None", axis=1)
    if(not nonestated):
        plot_data = plot_data.drop("None stated", axis=1)
    if(not oth):
        plot_data = plot_data.drop("Other", axis=1)
    if(not sikh):
        plot_data = plot_data.drop("Sikh", axis=1)
        
        
    plot = plot_data.plot(kind='barh', stacked=True)
    
        
    plot.set_xlabel('People')
    plot.legend(title='Religion', prop={'size': 9}, bbox_to_anchor=(1, 1), loc='upper right')
    
In [52]:
# create widgets
bud = widgets.Checkbox(value=True, description='Buddhist')
chri = widgets.Checkbox(value=True, description='Christian')
hin = widgets.Checkbox(value=True, description='Hindu')
jew = widgets.Checkbox(value=True, description='Jewish')
mus = widgets.Checkbox(value=True, description='Muslim')
none = widgets.Checkbox(value=True, description='None')
nonestated = widgets.Checkbox(value=True, description='None stated')
oth = widgets.Checkbox(value=True, description='Other')
sikh = widgets.Checkbox(value=True, description='Sikh')

# place widgets in container
container2 = widgets.GridBox([bud,chri,hin,jew,mus,none,nonestated,oth,sikh],
                            layout=widgets.Layout(grid_template_columns="repeat(3, 2in)"))

graph = widgets.interactive_output(drawRegRelPlot, {'bud' : bud, 'chri':chri,\
                                   'hin':hin, 'jew':jew, 'mus':mus, 'none':none,\
                                   'nonestated':nonestated, 'oth':oth, 'sikh':sikh})
display(container2,graph)

The use of widgets in this graph makes it very easy to compare data that would otherwise be hard to read, because Christianity makes up such a large proportion of the total that some of the other categories are difficult to see when everything is plotted together.

We will now visualise this data using a map showing, for each religion, the percentage of each region's sample that it represents. The religion can be changed using the drop-down list above the map.

In [53]:
# Convert to percentage of region
religion_perc = region_to_religion.copy()
for reg in religion_perc.index:
    reg_size = len(df_text[df_text['Region']==reg])
    religion_perc.loc[reg] = region_to_religion.loc[reg].map(lambda x: x / reg_size * 100 )
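
The same conversion can also be written without an explicit loop, dividing the whole table by the region sizes in one vectorised step (an equivalent alternative, shown for comparison):

region_sizes = df_text['Region'].value_counts()
religion_perc_vec = region_to_religion.div(region_sizes, axis=0) * 100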
In [54]:
def generateReligionMap(religion):
    data = religion_perc[religion]
    mapl = folium.Map(location=[53,-1.6], tiles='cartodbpositron', zoom_start=6)
    folium.Choropleth(
        geo_data = '../data/NUTS1.geojson',
        data = data,
        key_on='feature.properties.nuts118nm',
        name = 'choropleth',
        fill_color = 'GnBu',
        fill_opacity = 0.8,
        line_opacity = 1,
        legend_name="Percentage of population (%)"
    ).add_to(mapl)
    display(mapl)
In [55]:
res = df_text['Religion'].unique()
res = [x for x in res if x != 'Not required']
In [56]:
interact(generateReligionMap, religion=res)
Out[56]:
<function __main__.generateReligionMap(religion)>

This map gives a very good visualisation of where a certain religion is more prominent within the population. For example, when looking at the map for Buddhism it becomes very clear that London is by far the region with the highest proportion of Buddhists.


Queries

Economic activity by region

In [57]:
#Use numeric data and then textualize result
dfEconActive = df_num.query('`Economic Activity` < 5 and `Economic Activity` > 0').copy(deep=True)
dfEconActive = textualize_data(dfEconActive, text_rep)

GroupRegAndEcon = dfEconActive.groupby(['Region','Economic Activity'])['Region'].count().unstack()
GroupReg = dfEconActive.groupby(['Region']).size().reset_index(name='counts')

GroupReg
Out[57]:
Region counts
0 East Midlands 23106
1 East of England 30568
2 London 44454
3 North East 12897
4 North West 35204
5 South East 45550
6 South West 27453
7 Wales 14876
8 West Midlands 27930
9 Yorkshire and the Humber 26843
In [58]:
plot = GroupRegAndEcon.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
plot.legend(title='Economic status', prop={'size': 11}, bbox_to_anchor=(1.6, 0.7), loc='upper right')
Out[58]:
<matplotlib.legend.Legend at 0x7fe77a9221d0>

The above graph and table show that, whilst the number of economically active people in each region differs, the proportion in each category is quite similar.
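
One quick way to confirm this from the table above is to normalise each region's counts into row proportions:

# Proportion of each economic activity category within each region
GroupRegAndEcon.div(GroupRegAndEcon.sum(axis=1), axis=0).round(3)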

Economic activity by age

Using the result from the query for economic activity we can now group by age to find economic activity by age.

In [59]:
groupAge = dfEconActive.groupby(['Age']).size().reset_index(name='counts')
groupAgeAndEcon = dfEconActive.groupby(['Age','Economic Activity'])['Age'].count().unstack()
groupAge
Out[59]:
Age counts
0 16 to 24 41663
1 25 to 34 64326
2 35 to 44 67050
3 45 to 54 65736
4 55 to 64 40582
5 65 to 74 8022
6 75 and over 1502
In [60]:
plot = groupAgeAndEcon.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
Out[60]:
Text(0.5, 0, 'People')

Discrepancies in student economic activity

We will now see if anybody in the data has said that they are a student but does not have an economic activity response relating to being a student.

In [61]:
dfStudentOnBoth = df_num.query('Student == 1 and (`Economic Activity` == -9 or `Economic Activity` == 4 or `Economic Activity` == 6)')
dfStudents = df_num.query('Student == 1')

print("Number of individuals with student status of 1 and with economic activity related to student: ", len(dfStudentOnBoth))
print("Number of individuals with student status of 1: ", len(dfStudents))
print("\nDiscrepancies: ", len(dfStudents) - len(dfStudentOnBoth))
Number of individuals with student status of 1 and with economic activity related to student:  126537
Number of individuals with student status of 1:  126537

Discrepancies:  0

As can be seen there are no such discrepancies in the data.

Hours worked per week by students

In [62]:
dfEconAct4and6 = df_num.query('Student == 1 and (`Economic Activity` == 4 or `Economic Activity` == 6)').copy(deep=True)
groupHWPW = dfEconAct4and6.groupby(['Hours worked per week']).size().sort_index()
groupHWPW = textualize_data(groupHWPW, text_rep)
groupHWPW.to_frame()
Out[62]:
0
Hours worked per week
Not required 27192
<=15 6465
16-30 2334
31-48 1683
>=48 281
In [63]:
plot = groupHWPW.drop('Not required').plot(kind='barh', title='Hours worked per week by students')
plot.set_xlabel('Students')
Out[63]:
Text(0.5, 0, 'Students')

Students whose response to this question was 'Not required' have been removed from the chart but can easily be added back by removing the drop call from the above cell.

Optimisations

For the first optimisation of the code I changed the way the data types of the columns are checked. Rather than specifying the type of each and every column, the function loops through the columns and assumes they hold a specified default type, with the few columns that have a different type indicated via a mapping of column name to data type. This optimisation allows the code to scale: if a column with a different data type is added, the user can specify it in one place and the program will still work. It also allows the function to be applied to different datasets as long as an appropriate mapping is provided.
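
A simplified sketch of that idea (not the exact code in clean.py):

def check_types_sketch(df, expected_types, default='int64'):
    # Every column is assumed to hold the default dtype; only exceptions need
    # to be listed in expected_types (string columns surface as 'object')
    mismatches = {}
    for col in df.columns:
        expected = expected_types.get(col, default)
        if str(df[col].dtype) != expected:
            mismatches[col] = (str(df[col].dtype), expected)
    return mismatches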

The next major optimisation was the addition of a single large dictionary mapping the values in the raw data to text. This allows every function to provide textual representations by looping through the dictionary and replacing the values column by column, and it is used throughout the notebook so that the tables and plots can be read with ease. The functions can therefore be applied to any dataset as long as a dictionary with the text mappings is provided, and if the user wants to change the text for a column all they need to do is change the value in the dictionary and it will be reflected everywhere. I chose a dictionary for the value mappings because Python dictionaries are implemented as hash maps, which lets us associate each column name with its set of values and retrieve them in O(1) average time.

One of the most time-consuming parts of analysing the data is cleansing it. To improve efficiency and reduce the time taken, it is recommended that once the data has been loaded and cleansed a clean version is saved, so that in future this clean data can be loaded directly and the cleansing only needs to be performed again if the data changes or if additional cleansing steps are added.
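
Applied to this notebook, that recommendation would look roughly like the following (a sketch using the raw_path and clean_path variables defined earlier):

# Reuse the saved clean CSV if it already exists; otherwise clean and cache it
if os.path.isfile(clean_path):
    df_clean = pd.read_csv(clean_path, index_col='Person ID')
else:
    df_clean = clean_data(load_data(raw_path), expected_types, admissible_values, 'Person ID')
    df_clean.to_csv(clean_path)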

Removing rows which contain inadmissible values can be very time consuming; however, this process was made much faster by first checking the unique values in the column, because if these are all admissible then there is no need to try to remove any rows. This performance benefit is only seen if the data is mostly admissible.
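
The check itself is cheap because a categorical column has only a handful of unique values; a sketch of the idea (a hypothetical helper, not the exact code in clean.py):

def column_needs_filtering(df, column, allowed):
    # Only build a row-level mask when at least one unique value is inadmissible
    return not set(df[column].unique()).issubset(allowed)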

The main focus of many of the optimisations was to improve the modularity of the code in the notebook. These changes included moving the code into separate modules and then making the functions more general, so that they can be applied to other datasets simply by changing the values passed in, avoiding hard-coded plots and analysis.

Reproducibility and Reusability

The whole notebook has been written so that it is reproducible. For example, if the next UK census contained the same variables then the same notebook could be used by simply changing the path to the new data file.

The cleansing functionality is completely reusable and is designed to work with any DataFrame as long as the correct arguments describing the dataset are passed in. The same is true of the function which provides textual versions of the data.

Some values do have to be hard coded, as they are specific to the dataset being used.

Summary of Provenance

The data used for the UK NUTS1 boundaries comes from : https://data.gov.uk/dataset/37edc0ad-ffff-47c9-a01c-cb8d6123ec79/nuts-level-1-january-2018-ultra-generalised-clipped-boundaries-in-the-united-kingdom

This data has been modified to remove Scotland and Northern Ireland, and some names have been changed to match the location names used in this dataset.

Some influence has been taken from https://towardsdatascience.com/choropleth-maps-with-folium-1a5b8bcdd392 for how to create a choropleth map.
