For this project I will investigate a sample set of data from the 2011 UK census. Below is a summary of the analysis implemented.
I will now start the analysis of the dataset by importing the modules that will be needed.
import pandas as pd
pd.set_option('display.max_rows', 130)
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('style.mplstyle')
import folium
from ipywidgets import interact, widgets
from IPython.display import display
import numpy as np
import os.path
from clean import clean_data # own module for cleaning data
from textualize import textualize_data # own module to replace numerical data with textual data
from analyse import get_val_counts # own module to analyse a dataframe
from models import three_dim_model
Below are the two variables which contain the path to the raw data and the path to the cleaned data (if it has been created).
raw_path = '../data/census2011.csv'
clean_path = '../data/census2011_cleaned.csv'
First we define a function load_data(path) which loads the raw data from a CSV file into a pandas DataFrame.
def load_data(path):
    try:
        df = pd.read_csv(path)
        return df
    except IOError as e:
        print("There was an error reading data: " + str(e))
        return None
df_raw = load_data(raw_path)
df_raw
Now that the data has been loaded we need to clean it so that it can be used for analysis. This will be done using the clean_data function defined inside clean.py. This file contains a number of functions which can be used to clean data, as well as a function which applies all the different cleaning steps to the data. All these functions take arguments, which allows them to be used for different datasets.
First we will declare some variables which are needed in order to clean the data.
expected_types = {'Region': 'str', 'Residence Type': 'str'} # the expected data types of the given columns
admissible_values = {
"Region": ["E12000001", "E12000002", "E12000003", "E12000004", "E12000005", "E12000006", "E12000007", "E12000008", "E12000009", "W92000004"],
"Residence Type": ["C", "H"],
"Family Composition": [1, 2, 3, 4, 5, 6, -9],
"Population Base": [1, 2, 3],
"Sex": [1, 2],
"Age": [1, 2, 3, 4, 5, 6, 7, 8],
"Marital Status": [1, 2, 3, 4, 5],
"Student": [1, 2],
"Country of Birth": [1, 2, -9],
"Health": [1, 2, 3, 4, 5, -9],
"Ethnic Group": [1, 2, 3, 4, 5, -9],
"Religion": [1, 2, 3, 4, 5, 6, 7, 8, 9, -9],
"Economic Activity": [1, 2, 3, 4, 5, 6, 7, 8, 9, -9],
"Occupation": [1, 2, 3, 4, 5, 6, 7, 8, 9, -9],
"Industry": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -9],
"Hours worked per week": [1, 2, 3, 4, -9],
"Approximated Social Grade": [1, 2, 3, 4, -9]
} # The accepted values for each column in the dataset (If the data is not restricted then the column can simply be left out)
Now we need to call the function using these variables and the raw data that has been loaded.
The data has been cleaned in the following ways (a sketch of these steps is shown after the list):
The data type of each column is compared to the expected data type. If there is not a match, an attempt is made to convert the data; if the conversion cannot be done then nothing is returned and a message is shown.
The values in every column are checked to make sure they are admissible. Any row containing an inadmissible value is removed from the dataframe.
Duplicate data is removed.
The index of the dataframe is also changed to the specified column.
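As an indication of how these steps might fit together, a minimal sketch is shown below. The actual implementation lives in clean.py, so the details here (names, defaults and exact behaviour) are assumptions rather than the real code.
# Illustrative sketch only -- not the real clean.py implementation.
def clean_data_sketch(df, expected_types, admissible_values, index_col):
    df = df.copy()
    # 1. Convert each listed column to its expected type, abandoning the clean if this fails
    for col, dtype in expected_types.items():
        try:
            df[col] = df[col].astype(dtype)
        except (ValueError, TypeError) as e:
            print("Could not convert column '" + col + "' to " + dtype + ": " + str(e))
            return None
    # 2. Drop rows whose value falls outside the admissible set for any restricted column
    for col, allowed in admissible_values.items():
        df = df[df[col].isin(allowed)]
    # 3. Remove all rows sharing a duplicated identifier, as the correct record is unknown
    df = df[~df[index_col].duplicated(keep=False)]
    # 4. Use the identifier column as the index
    return df.set_index(index_col)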
df_clean = clean_data(df_raw, expected_types, admissible_values, 'Person ID')
df_clean.to_csv(clean_path);
The cleaned data is now stored in the df_clean variable and has also been saved for future use at the path specified.
Below are the records that have been removed from the data:
pd.concat([df_raw.set_index('Person ID'),df_clean]).drop_duplicates(keep=False)
As can be seen from the above data, the only issue in the original data is a single duplicated Person ID. Both rows are removed because they have different regions, so it is not possible to determine which is correct or whether the Person ID is in fact the issue.
Now that the data has been loaded and cleaned so that it is suitable for analysis, we need to replace the categorical codes stored in the dataset with textual descriptions so that the results of any analysis are meaningful. This is done using the textualize_data function defined inside textualize.py. This function takes a map of column names onto maps of categorical code to textual representation, allowing it to be used for different datasets.
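For reference, a simplified sketch of how such a function might work for DataFrames is shown below; the real implementation lives in textualize.py (and also handles other inputs used later in the notebook), so treat this as an assumption rather than the actual code.
# Illustrative sketch only -- replace categorical codes with their textual labels.
def textualize_data_sketch(df, mapping):
    df = df.copy()
    # For each column that has a mapping, swap the codes for their textual labels
    for col, code_to_text in mapping.items():
        if col in df.columns:
            df[col] = df[col].map(code_to_text)
    return df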
We first define a map containing the textual representations, called text_rep. This mapping allows the textualisation to be reused in different areas of the notebook and across different datasets, provided an appropriate mapping is given. We then call textualize_data on the cleaned data and store the result in the df_text variable.
text_rep = {
"Region": {
"E12000001": "North East",
"E12000002": "North West",
"E12000003": "Yorkshire and the Humber",
"E12000004": "East Midlands",
"E12000005": "West Midlands",
"E12000006": "East of England",
"E12000007": "London",
"E12000008": "South East",
"E12000009": "South West",
"W92000004": "Wales"
},
"Residence Type": {
"C": "Resident in a communal establishment",
"H": "Not resident in a communal establishment"
},
"Family Composition": {
1: "Not in a family",
2: "Married/same-sex civil partnership couple family",
3: "Cohabiting couple family",
4: "Lone parent family (male head)",
5: "Lone parent family (female head)",
6: "Other related family",
-9: "Not required"
},
"Population Base": {
1: "Usual resident",
2: "Student living away from home",
3: "Short-term resident"
},
"Sex": {
1: "Male",
2: "Female"
},
"Age": {
1: "0 to 15",
2: "16 to 24",
3: "25 to 34",
4: "35 to 44",
5: "45 to 54",
6: "55 to 64",
7: "65 to 74",
8: "75 and over"
},
"Marital Status": {
1: "Single (never married)",
2: "Married",
3: "Seperated but still legally married",
4: "Divorced",
5: "Widowed"
},
"Student": {
1: "Yes",
2: "No"
},
"Country of Birth": {
1: "UK",
2: "Non UK",
-9: "Not required"
},
"Health": {
1: "Very good",
2: "Good",
3: "Fair",
4: "Bad",
5: "Very bad",
-9: "Not required"
},
"Ethnic Group": {
1: "White",
2: "Mixed",
3: "Asian",
4: "Black",
5: "Chinese or other group",
-9: "Not required"},
"Religion": {
1: "None",
2: "Christian",
3: "Buddhist",
4: "Hindu",
5: "Jewish",
6: "Muslim",
7: "Sikh",
8: "Other",
9: "None stated",
-9: "Not required"
},
"Economic Activity": {
1: "Economically active: Employeed",
2: "Economically active: Self-employeed",
3: "Economically active: Unemployed",
4: "Economically active: Full-time student",
5: "Economically inactive: Retired",
6: "Economically inactive: Student",
7: "Economically inactive: Looking after home or family",
8: "Economically inactive: Long-term sick or disabled",
9: "Economically inactive: Other",
-9: "Not required"
},
"Occupation": {
1: "Managers, Directors and Senior Officials",
2: "Professional Occupations",
3: "Associate Professional and Technical Occupations",
4: "Administrative and Secretarial Occupations",
5: "Skilled Trades Occupations",
6: "Caring, Leisure and Other Service Occupations",
7: "Sales and Customer Service Occupations",
8: "Process, Plant and Machine Operatives",
9: "Elementary Occupations",
-9: "Not required"
},
"Industry": {
1: "Agriculture, forestry and fishing",
2: "Mining and quarrying; Manufacturing",
3: "Construction",
4: "Wholesale and retail trade",
5: "Accommodation and food service activities",
6: "Transport and storage; Information and communication",
7: "Financial and insurance activities",
8: "Real estate; Professional, scientific and technical; Administrative and support",
9: "Public administration and defence; compulsory social security",
10: "Education",
11: "Human health and social work activities",
12: "Other community, social and personal service activities",
-9: "Not required"
},
"Hours worked per week": {
1: "<=15",
2: "16-30",
3: "31-48",
4: ">=48",
-9: "Not required"
},
"Approximated Social Grade": {
1: "AB",
2: "C1",
3: "C2",
4: "DE",
-9: "Not required"
}
}
df_num = df_clean.copy()
df_text = textualize_data(df_clean, text_rep)
From this point on the variables df_text and df_num can be used to analyse the textual and numeric data respectively.
To save time, these variables should be reused after the data has been loaded, as reading from the CSV is quite time consuming and there is no reason to reload the data.
First we will look at the size of the dataset. Using len() on the dataframe we can see that there are 569739 records in the clean dataset.
len(df_text)
The table below shows the data type of each variable in the dataset.
df_num.dtypes.to_frame('Datatype')
Region and Residence Type are both strings; however, they are stored as generic objects, as it would not be possible to store arbitrarily sized strings and pandas cannot guarantee that the data will in fact be a string, so it does not specify a string type. All other variables in the dataset are integers. All the data in the dataset is categorical, and the values stored represent a certain category for each variable.
Each of the following subsections shows the different values for each variable as well as how many of each value appear in the dataset. By modifying the sort_val_counts variable below you can set whether or not the value counts should be sorted into descending order.
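As a rough idea of what this helper does, a sketch is shown below; the real code is in analyse.py, so the details are assumptions.
# Illustrative sketch only -- count how often each value of a column occurs.
def get_val_counts_sketch(df, column, sort=True):
    counts = df[column].value_counts()   # descending by count by default
    if not sort:
        counts = counts.sort_index()     # otherwise order by the category labels
    return counts.to_frame('Count')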
sort_val_counts = True
get_val_counts(df_text, 'Region', sort=sort_val_counts)
print("Data type: " + str(df_text['Region'].dtype))
We will also use a bar chart and a map to visualise the distribution of population across the different regions. The map is very useful for comparing which areas are most populous, whereas the bar chart is better for comparing actual values.
Region_Counts = df_text['Region'].value_counts()
plotRegion = Region_Counts.plot(kind='barh')
plotRegion.set_xlabel('Number of Records', fontsize=15)
plotRegion.set_ylabel('Region', fontsize=15)
plotRegion.set_title('Number of Records for Each Region')
regionMap = folium.Map(location=[53,-1.6], tiles='cartodbpositron', zoom_start=6)
folium.Choropleth(
geo_data = '../data/NUTS1.geojson',
data = Region_Counts,
key_on='feature.properties.nuts118nm',
name = 'choropleth',
fill_color = 'GnBu',
fill_opacity = 0.8,
line_opacity = 1
).add_to(regionMap)
regionMap
get_val_counts(df_text, 'Residence Type', sort=sort_val_counts)
get_val_counts(df_text, 'Family Composition', sort=sort_val_counts)
get_val_counts(df_text, 'Population Base', sort=sort_val_counts)
get_val_counts(df_text, 'Sex', sort=sort_val_counts)
get_val_counts(df_text, 'Age', sort=sort_val_counts)
The pie chart below shows the distribution of ages in the population. It is worth noting when looking at this chart that the lowest age range covers more years than the other ranges (excluding 75 and over).
plot = df_text['Age'].value_counts().sort_index().plot(kind='pie', title='Age distribution', autopct='%1.1f%%')
plot.set_ylabel('')
get_val_counts(df_text, 'Marital Status', sort=sort_val_counts)
print("Data type: " + str(df_text['Residence Type'].dtype))
get_val_counts(df_text, 'Student', sort=sort_val_counts)
get_val_counts(df_text, 'Country of Birth', sort=sort_val_counts)
get_val_counts(df_text, 'Health', sort=sort_val_counts)
get_val_counts(df_text, 'Ethnic Group', sort=sort_val_counts)
get_val_counts(df_text, 'Religion', sort=sort_val_counts)
get_val_counts(df_text, 'Economic Activity', sort=sort_val_counts)
plot = df_text['Economic Activity'].value_counts().plot(kind='pie', title='Economic Activity', autopct='%1.1f%%'\
, labeldistance=None)
plot.set_ylabel('')
plot.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1, bbox_to_anchor=(1.5, 0.75), loc='upper right',\
prop={'size': 8})
get_val_counts(df_text, 'Occupation', sort=sort_val_counts)
We will now visualise this data in a bar chart. Please note that the 'Not required' option has been removed from the data for visualisation.
plotOccupation = df_text['Occupation'].value_counts().drop('Not required').plot(kind='barh')
plotOccupation.set_xlabel('Number of Records')
plotOccupation.set_ylabel('Occupation')
plotOccupation.set_title('Number of Records for Each Occupation')
get_val_counts(df_text, 'Industry', sort=sort_val_counts)
get_val_counts(df_text, 'Hours worked per week', sort=sort_val_counts)
get_val_counts(df_text, 'Approximated Social Grade', sort=sort_val_counts)
We are now going to explore how industry varies between regions.
ind_by_region = df_text.groupby(['Industry', 'Region'])['Industry'].count().drop('Not required').unstack('Region')
ind_by_region
We will now look at three different charts to help visualise this data. The first two are stacked bar charts which show the same data but with the axes transposed, followed by a 3D bar chart.
plot = ind_by_region.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
plot = ind_by_region.transpose().plot(kind='barh', stacked=True)
plot.legend(title='Industry', prop={'size': 9}, bbox_to_anchor=(1.9, 0.65), loc='upper right')
plot.set_xlabel('People')
Whilst both of these bar charts display the same data, depending on what you are trying to visualise one can be much more useful than the other. For example, if you want to see what the largest industry is in a given region the second chart is much more useful, whereas if you want to compare the total size of each industry then the first is much better.
three_dim_model(df_text,'Region','Industry')
Whilst it is difficult to read exact values, the 3D chart is still a good visualisation tool for seeing all the data at once and comparing different regions or different industries to each other.
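The three_dim_model function lives in models.py; as an illustration of how such a 3D bar chart could be produced with matplotlib, a hedged sketch is given below. The binning, sizing and styling here are assumptions, not the actual implementation (np and plt are the modules imported at the top of the notebook).
# Illustrative sketch only -- plot counts of (x_col, y_col) pairs as 3D bars.
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

def three_dim_model_sketch(df, x_col, y_col):
    counts = df.groupby([x_col, y_col]).size().unstack(fill_value=0)
    xs, ys = np.meshgrid(np.arange(counts.shape[0]), np.arange(counts.shape[1]), indexing='ij')
    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')
    # One bar per (x_col, y_col) combination, with height equal to the record count
    ax.bar3d(xs.ravel(), ys.ravel(), np.zeros(counts.size), 0.8, 0.8, counts.values.ravel())
    ax.set_xticks(np.arange(counts.shape[0]))
    ax.set_xticklabels(counts.index, fontsize=7)
    ax.set_yticks(np.arange(counts.shape[1]))
    ax.set_yticklabels(counts.columns, fontsize=7)
    ax.set_zlabel('People')
    plt.show()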
sg_by_occ = df_text.groupby(['Occupation', 'Approximated Social Grade'])['Occupation'].count().unstack('Approximated Social Grade')
sg_by_occ = sg_by_occ.drop('Not required').drop('Not required', axis=1)
sg_by_occ
plot = sg_by_occ.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
plot.legend(title='Approximated Social Grade', prop={'size': 9}, bbox_to_anchor=(1, 0.65), loc='upper right')
three_dim_model(df_text,'Occupation','Approximated Social Grade')
We are now going to investigate the distribution of religions across the different regions.
region_to_religion = df_text.groupby(['Region', 'Religion'])['Region'].count()\
.unstack('Religion')
region_to_religion = region_to_religion.drop("Not required", axis=1)
region_to_religion
def drawRegRelPlot(bud, chri, hin, jew, mus, none, nonestated, oth, sikh):
    plot_data = region_to_religion
    # drop the religions that have been deselected in the widgets
    if not bud:
        plot_data = plot_data.drop("Buddhist", axis=1)
    if not chri:
        plot_data = plot_data.drop("Christian", axis=1)
    if not hin:
        plot_data = plot_data.drop("Hindu", axis=1)
    if not jew:
        plot_data = plot_data.drop("Jewish", axis=1)
    if not mus:
        plot_data = plot_data.drop("Muslim", axis=1)
    if not none:
        plot_data = plot_data.drop("None", axis=1)
    if not nonestated:
        plot_data = plot_data.drop("None stated", axis=1)
    if not oth:
        plot_data = plot_data.drop("Other", axis=1)
    if not sikh:
        plot_data = plot_data.drop("Sikh", axis=1)
    plot = plot_data.plot(kind='barh', stacked=True)
    plot.set_xlabel('People')
    plot.legend(title='Religion', prop={'size': 9}, bbox_to_anchor=(1, 1), loc='upper right')
# create widgets
bud = widgets.Checkbox(value=True, description='Buddhist')
chri = widgets.Checkbox(value=True, description='Christian')
hin = widgets.Checkbox(value=True, description='Hindu')
jew = widgets.Checkbox(value=True, description='Jewish')
mus = widgets.Checkbox(value=True, description='Muslim')
none = widgets.Checkbox(value=True, description='None')
nonestated = widgets.Checkbox(value=True, description='None stated')
oth = widgets.Checkbox(value=True, description='Other')
sikh = widgets.Checkbox(value=True, description='Sikh')
# place widgets in container
container2 = widgets.GridBox([bud,chri,hin,jew,mus,none,nonestated,oth,sikh],
layout=widgets.Layout(grid_template_columns="repeat(3, 2in)"))
graph = widgets.interactive_output(drawRegRelPlot, {'bud' : bud, 'chri':chri,\
'hin':hin, 'jew':jew, 'mus':mus, 'none':none,\
'nonestated':nonestated, 'oth':oth, 'sikh':sikh})
display(container2,graph)
The use of widgets in this graph makes it very easy to compare data that would otherwise be difficult to compare: because Christianity makes up such a large proportion, some of the other data is hard to read when everything is plotted together.
We will now visualise this data using a map showing, for each religion, the percentage of each region's population that it represents. The religion can be changed using the drop-down list above the map.
# Convert the counts to a percentage of each region's population
religion_perc = region_to_religion.copy()
for reg in religion_perc.index:
    reg_size = len(df_text[df_text['Region'] == reg])
    religion_perc.loc[reg] = region_to_religion.loc[reg].map(lambda x: x / reg_size * 100)
def generateReligionMap(religion):
    data = religion_perc[religion]
    mapl = folium.Map(location=[53, -1.6], tiles='cartodbpositron', zoom_start=6)
    folium.Choropleth(
        geo_data='../data/NUTS1.geojson',
        data=data,
        key_on='feature.properties.nuts118nm',
        name='choropleth',
        fill_color='GnBu',
        fill_opacity=0.8,
        line_opacity=1,
        legend_name="Percentage of population (%)"
    ).add_to(mapl)
    display(mapl)
res = df_text['Religion'].unique()
res = [x for x in res if x != 'Not required']
interact(generateReligionMap, religion=res)
This map gives a very good visualisation of where a certain religion is more prominent within the population. For example, when looking at the map for Buddhism it becomes very clear that London is by far the region with the highest proportion of Buddhists.
#Use numeric data and then textualize result
dfEconActive = df_num.query('`Economic Activity` < 5 and `Economic Activity` > 0').copy(deep=True)
dfEconActive = textualize_data(dfEconActive, text_rep)
GroupRegAndEcon = dfEconActive.groupby(['Region','Economic Activity'])['Region'].count().unstack()
GroupReg = dfEconActive.groupby(['Region']).size().reset_index(name='counts')
GroupReg
plot = GroupRegAndEcon.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
plot.legend(title='Economic status', prop={'size': 11}, bbox_to_anchor=(1.6, 0.7), loc='upper right')
The above graph and table show that whilst the number of economically active people in each region differs, the proportion of each category is quite similar.
Using the result from the query for economic activity we can now group by age to find economic activity by age.
groupAge = dfEconActive.groupby(['Age']).size().reset_index(name='counts')
groupAgeAndEcon = dfEconActive.groupby(['Age','Economic Activity'])['Age'].count().unstack()
groupAge
plot = groupAgeAndEcon.plot(kind='barh', stacked=True)
plot.set_xlabel('People')
We will now see if anybody in the data has said that they are a student but does not have an economic activity related to being a student.
dfStudentOnBoth = df_num.query('Student == 1 and (`Economic Activity` == -9 or `Economic Activity` == 4 or `Economic Activity` == 6)')
dfStudents = df_num.query('Student == 1')
print("Number of individuals with student status of 1 and with economic activity related to student: ", len(dfStudentOnBoth))
print("Number of individuals with student status of 1: ", len(dfStudents))
print("\nDiscrepancies: ", len(dfStudents) - len(dfStudentOnBoth))
As can be seen there are no such discrepancies in the data.
dfEconAct4and6 = df_num.query('Student == 1 and (`Economic Activity` == 4 or `Economic Activity` == 6)').copy(deep=True)
groupHWPW = dfEconAct4and6.groupby(['Hours worked per week']).size().sort_index()
groupHWPW = textualize_data(groupHWPW, text_rep)
groupHWPW.to_frame()
plot = groupHWPW.drop('Not required').plot(kind='barh', title='Hours worked per week by students')
plot.set_xlabel('Students')
Students who answered 'Not required' to this question have been removed from the data but can easily be added back by removing the drop method call from the cell above.
The first optimisation changed the way the data types of the columns are checked. Rather than specifying the type for each and every column, the program loops through the columns and assumes a specified default type; the few columns that have a different type are indicated to the program using a mapping of column name to data type. This optimisation allows the cleaning to scale: if a column with a different data type is added, the user only needs to specify it in one place and the program will still work. It also allows the function to be applied to different datasets, as long as a different mapping is provided.
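A sketch of this idea is given below; the chosen default type and exact behaviour are assumptions, as the real code lives in clean.py.
# Illustrative sketch only -- assume a default type unless a column is listed in expected_types.
DEFAULT_TYPE = 'int64'  # assumption: most census columns hold integer codes

def check_types_sketch(df, expected_types):
    for col in df.columns:
        target = expected_types.get(col, DEFAULT_TYPE)
        try:
            df[col] = df[col].astype(target)
        except (ValueError, TypeError) as e:
            print("Could not convert column '" + col + "' to " + target + ": " + str(e))
            return None
    return df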
The next major optimisation was the addition of a large dictionary that maps values in the raw data onto text. This allows all the methods to provide textual representations by looping through the dictionary and replacing the values in each column. It is used throughout the notebook to change values to text so that the models can be read with ease, and it allows the functions to be applied to any dataset as long as a dictionary with the text mappings is provided. If the user wants to change the text mapping for a column, all they need to do is change the value in the dictionary and it will be reflected everywhere. I chose a dictionary for the value mappings because Python dictionaries are implemented as hash maps, which allows us to associate a name (the column name) with a set of values and retrieve those values in O(1) time.
One of the most time consuming parts of analysing the data is cleansing it. To improve efficiency and reduce the time taken, it is recommended that once the data has been loaded and cleansed a clean version is saved, so that in future the clean data can be loaded directly and the cleansing only needs to be performed again if the data changes or if additional cleansing is added.
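For example, a hedged sketch of this load-or-clean pattern, using the paths and functions defined earlier in the notebook:
# Illustrative sketch only -- reuse the saved clean file when it already exists.
if os.path.isfile(clean_path):
    df_clean = pd.read_csv(clean_path).set_index('Person ID')
else:
    df_clean = clean_data(load_data(raw_path), expected_types, admissible_values, 'Person ID')
    df_clean.to_csv(clean_path)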
Removing rows which contain inadmissible values from the data can be very time consuming; however, this process was made much faster by first checking the unique values in each column, because if these are all admissible then there is no need to try to remove any rows. This performance benefit is only seen if the data is mostly admissible.
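A sketch of that fast path is shown below; again this is an assumption rather than the exact code in clean.py.
# Illustrative sketch only -- skip the row-wise filter when every value is already admissible.
def drop_inadmissible_sketch(df, admissible_values):
    for col, allowed in admissible_values.items():
        if not set(df[col].unique()).issubset(allowed):
            df = df[df[col].isin(allowed)]
    return df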
The main focus of many of the optimisations was to improve the modularity of the code in the notebook. These changes included moving the code into separate modules and then making the functions more general, so that they can be applied to other datasets simply by changing the values passed into them, avoiding hard-coded plots and analysis.
The whole notebook has been written in a way that makes it reproducible. For example, if the next UK census contained the same variables then the same notebook could be used by simply changing the path to the new file containing the data.
The cleansing functionality is completely reusable and is designed to work with any dataframe, as long as the correct arguments describing the dataset are passed in. This is also true of the function which provides textual versions of the data.
There are some values in the dataset that have to be hard coded because they are specific to the dataset being used.
The data used for the UK NUTS1 boundaries comes from: https://data.gov.uk/dataset/37edc0ad-ffff-47c9-a01c-cb8d6123ec79/nuts-level-1-january-2018-ultra-generalised-clipped-boundaries-in-the-united-kingdom
This data has been modified to remove Scotland and Northern Ireland, and some names have been changed to match the region names in this dataset.
Some influence has been taken from https://towardsdatascience.com/choropleth-maps-with-folium-1a5b8bcdd392 for how to create a choropleth map.