Crime Dataset Analysis

Exploring the vast crime dataset with 19M entries using the PivotPal Python package.


Table of Contents

  1. Dataset Overview: A snapshot of the crime dataset's general attributes.
  2. Missing Data: Identifying gaps in the dataset.
  3. Outcome Distribution: Analyzing the 'Last outcome category'.
  4. Crime Type Distribution: Understanding prevalent crime types.
  5. Feature Engineering: Mapping 'Reported by' to 'Geographical Region'.

1. Comprehensive Overview of the Crime Dataset with 19M Entries

Overview: Using the PivotPal Python package, we take a first look at the crime dataset and its 19 million entries to understand its structure and characteristics.

Explanation: Kaggle datasets often come with many features and entries, and the crime dataset is no exception at roughly 19 million rows. In this section, we summarize the dataset's general attributes, from column types to missing values and duplicate rows.

pp.overview(crime_data)
Description                    Count
Total Rows                     19,269,992
Total Columns                  12
Columns with Missing Values    7
Total Duplicate Rows           1,455,794
Most Frequent Data Type        object
Columns with Binary Values     0
Columns with Zero Values       0
Unique Data Types              2
Numeric Columns                3
Non-Numeric Columns            9

The table provides a snapshot of the crime dataset: 19,269,992 rows across 12 columns. The most frequent data type is 'object', with 9 non-numeric columns against 3 numeric ones. Seven of the 12 columns contain missing values, and 1,455,794 rows (about 7.6%) are duplicates. No column is reported as binary or as containing zero values.
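For readers without PivotPal at hand, several of these figures can be approximated with plain pandas. The sketch below is not PivotPal's implementation; the function name `overview_sketch` and the toy frame are hypothetical, and it assumes duplicates are counted as rows that repeat an earlier row, which may differ from how `pp.overview` counts them.

```python
import pandas as pd


def overview_sketch(df: pd.DataFrame) -> pd.Series:
    """Approximate a few overview figures with plain pandas (illustrative only)."""
    numeric_cols = df.select_dtypes(include="number").columns
    dtype_names = df.dtypes.astype(str)
    return pd.Series({
        "Total Rows": len(df),
        "Total Columns": df.shape[1],
        "Columns with Missing Values": int(df.isna().any().sum()),
        # duplicated() flags every row that repeats an earlier row
        "Total Duplicate Rows": int(df.duplicated().sum()),
        "Most Frequent Data Type": str(dtype_names.value_counts().idxmax()),
        "Unique Data Types": int(dtype_names.nunique()),
        "Numeric Columns": len(numeric_cols),
        "Non-Numeric Columns": df.shape[1] - len(numeric_cols),
    })


# Hypothetical toy data standing in for crime_data
toy = pd.DataFrame({
    "Crime type": ["Burglary", "Drugs", "Burglary", "Robbery"],
    "Longitude": [-0.1, None, -0.1, 1.2],
    "Latitude": [51.5, None, 51.5, 52.0],
})
summary = overview_sketch(toy)
print(summary)
```

On the full dataset, `df.duplicated()` has to hash all 19 million rows, so this step may take noticeably longer than the other counts.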

2. Exploring Missing Data in the Crime Dataset

Overview: A closer look at the missing data within the crime dataset using the PivotPal Python package.

Explanation: The crime dataset records a range of details about each reported crime, but not every field is populated. Here we identify which columns have missing values and how large the gaps are.

pp.missing(crime_data)
Column Name              Missing Count    Missing %
Context                  19,269,992       100.0
Last outcome category    4,356,540        23.0
Crime ID                 4,036,619        21.0
LSOA code                821,598          4.0
LSOA name                821,598          4.0
Longitude                322,599          2.0
Latitude                 322,599          2.0

The table above lists the columns with missing values. The 'Context' column is empty in all 19,269,992 rows (100%), so it carries no information. 'Last outcome category' and 'Crime ID' are missing in roughly 23% and 21% of rows, respectively. 'LSOA code' and 'LSOA name' have identical missing counts (821,598 each), as do 'Longitude' and 'Latitude' (322,599 each).
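A comparable missing-value report can be sketched with pandas alone. This is an illustration rather than PivotPal's own code: `missing_sketch` and the toy frame are hypothetical names, and rounding to one decimal place is an arbitrary choice.

```python
import pandas as pd


def missing_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """List columns with missing values, largest gap first (illustrative only)."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "Missing Count": counts,
        "Missing %": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    report = report[report["Missing Count"] > 0]
    return report.sort_values("Missing Count", ascending=False)


# Hypothetical toy data standing in for crime_data
toy = pd.DataFrame({
    "Context": [None, None, None, None],
    "Crime ID": ["a1", None, "c3", "d4"],
    "Month": ["2021-01", "2021-01", "2021-02", "2021-02"],
})
report = missing_sketch(toy)
print(report)
```

A column that is missing in every row, like 'Context' here, is usually a candidate for `df.drop(columns=...)` before further analysis.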

3. Distribution Analysis of 'Last outcome category' in the Crime Dataset

Overview: Using the PivotPal Python package, we explore the distribution of the 'Last outcome category' column across the dataset's 19 million entries.

Explanation: In a dataset this large, understanding the distribution of individual columns is crucial. Here, we focus on 'Last outcome category' to identify the most common outcomes of reported crimes.

pp.distribution(crime_data, 'Last outcome category')
Last outcome category                                  Count        %
Investigation complete; no suspect identified          6,131,104    31.82
Unable to prosecute suspect                            4,935,065    25.61
Status update unavailable                              1,126,295    5.84
Under investigation                                    857,878      4.45
Court result unavailable                               669,807      3.48
Local resolution                                       361,615      1.88
Awaiting court outcome                                 196,928      1.02
Action to be taken by another organisation             168,733      0.88
Offender given a caution                               140,223      0.73
Further investigation is not in the public interest    127,227      0.66
Formal action is not in the public interest            76,340       0.40
Further action is not in the public interest           72,329       0.38
Offender given penalty notice                          24,397       0.13
Offender given a drugs possession warning              19,000       0.10
Suspect charged as part of another case                6,511        0.03

The table shows the distribution of values in 'Last outcome category'. The most frequent outcome is 'Investigation complete; no suspect identified', with over 6.1 million occurrences (31.82% of all rows), followed by 'Unable to prosecute suspect' with nearly 5 million (25.61%). Together these two outcomes account for 57.43% of all rows. The listed percentages sum to about 77.4% because they are relative to all rows; the remainder corresponds to the roughly 23% of rows where this column is missing.
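The same kind of distribution table can be sketched with `value_counts`. This assumes, as the figures suggest (6,131,104 / 19,269,992 ≈ 31.82%), that percentages are taken over all rows rather than over non-missing rows only. The function name `distribution_sketch` and the toy data are hypothetical, and this is not PivotPal's implementation.

```python
import pandas as pd


def distribution_sketch(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count each category and express it as a share of all rows (illustrative only)."""
    counts = df[column].value_counts()  # missing values are excluded by default
    pct = (counts / len(df) * 100).round(2)  # denominator includes rows with missing values
    return pd.DataFrame({"Count": counts, "%": pct})


# Hypothetical toy data standing in for crime_data
toy = pd.DataFrame({
    "Last outcome category": [
        "Under investigation",
        "Under investigation",
        "Local resolution",
        None,
    ]
})
dist = distribution_sketch(toy, "Last outcome category")
missing_share = round(100 - dist["%"].sum(), 2)
print(dist)
print("Share of rows with no outcome recorded:", missing_share)
```

To get shares of non-missing rows instead, `df[column].value_counts(normalize=True)` could be used; the percentages would then sum to 100.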

4. Distribution Analysis of 'Crime type' in the Crime Dataset

Overview: Using the PivotPal Python package, we examine the distribution of crime types across the dataset's 19 million entries.

Explanation: With 19 million records, the crime dataset gives a broad picture of what is being reported. In this section, we focus on the 'Crime type' column to identify the most prevalent types of reported crimes.

pp.distribution(crime_data, 'Crime type')
Crime type                      Count        %
Violence and sexual offences    6,480,661    33.63
Anti-social behaviour           3,886,040    20.17
Public order                    1,577,873    8.19
Criminal damage and arson       1,525,290    7.92
Other theft                     1,318,322    6.84
Vehicle crime                   1,086,004    5.64
Shoplifting                     901,923      4.68
Burglary                        776,869      4.03
Drugs                           542,978      2.82
Other crime                     332,155      1.72
Theft from the person           268,202      1.39
Bicycle theft                   227,533      1.18
Robbery                         200,031      1.04
Possession of weapons           146,111      0.76

The table above shows the distribution of crime types in the dataset. 'Violence and sexual offences' is the most common crime type at 33.63% of the dataset, followed by 'Anti-social behaviour' (20.17%) and 'Public order' (8.19%). These three categories together account for about 62% of all reported crimes, while the least common type, 'Possession of weapons', accounts for 0.76%.

5. Feature Engineering: Mapping 'Reported by' to 'Geographical Region'

Overview: Using a mapping of police forces to geographical regions, we engineer a new feature, 'Geographical_Region', after first removing the Northern Ireland and Welsh forces from the 19-million-entry crime dataset.

Explanation: Datasets often need additional categorization to yield meaningful insights. Here we take the 'Reported by' column, which names the police force that reported the crime, and map it to a broader 'Geographical_Region'. This gives a more aggregated view of crime distribution across England and makes regional comparisons easier. British Transport Police, which is not tied to a single region, is placed in a 'Specialized/Other' group.


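# Drop the Northern Ireland and Welsh forces to focus on English regions

forces_to_drop = ['Police Service of Northern Ireland', 'Dyfed-Powys Police', 'North Wales Police', 'South Wales Police', 'Gwent Police']

# Filter the DataFrame loaded earlier as crime_data; .copy() avoids a
# SettingWithCopyWarning when the new column is added below
data = crime_data[~crime_data['Reported by'].isin(forces_to_drop)].copy()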

geographical_regions = {
    "South West England": [
        "Avon and Somerset Constabulary",
        "Devon & Cornwall Police",
        "Dorset Police",
        "Gloucestershire Constabulary",
        "Wiltshire Police"
    ],
    "South East England": [
        "Hampshire Constabulary",
        "Kent Police",
        "Surrey Police",
        "Sussex Police",
        "Thames Valley Police"
    ],
    "East of England": [
        "Bedfordshire Police",
        "Cambridgeshire Constabulary",
        "Essex Police",
        "Hertfordshire Constabulary",
        "Norfolk Constabulary",
        "Suffolk Constabulary"
    ],
    "London": [
        "Metropolitan Police Service",
        "City of London Police"
    ],
    "West Midlands": [
        "Staffordshire Police",
        "Warwickshire Police",
        "West Mercia Police",
        "West Midlands Police"
    ],
    "East Midlands": [
        "Derbyshire Constabulary",
        "Leicestershire Police",
        "Lincolnshire Police",
        "Northamptonshire Police",
        "Nottinghamshire Police"
    ],
    "North West England": [
        "Cheshire Constabulary",
        "Cumbria Constabulary",
        "Greater Manchester Police",
        "Lancashire Constabulary",
        "Merseyside Police"
    ],
    "North East England": [
        "Cleveland Police",
        "Durham Constabulary",
        "Northumbria Police",
        "North Yorkshire Police"
    ],
    "Yorkshire and the Humber": [
        "Humberside Police",
        "South Yorkshire Police",
        "West Yorkshire Police"
    ],
    "Specialized/Other": [
        "British Transport Police",
    ]
}

# Map each police force to its geographical region

def get_geographical_region(force_name):
    """Return the region whose force list contains force_name, or None if unmapped."""
    for region, forces in geographical_regions.items():
        if force_name in forces:
            return region
    return None

data['Geographical_Region'] = data['Reported by'].apply(get_geographical_region)
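# Calling a Python function once per row via apply can be slow on millions of
# rows. One possible alternative, sketched here on a hypothetical toy frame with
# an abbreviated mapping, is to invert the dictionary once and use Series.map.
# The names regions_sample, force_to_region and toy are illustrative only.
# Note that unmapped forces become NaN here rather than None.

```python
import pandas as pd

# Abbreviated stand-in for the full geographical_regions dictionary
regions_sample = {
    "London": ["Metropolitan Police Service", "City of London Police"],
    "Specialized/Other": ["British Transport Police"],
}

# Invert region -> [forces] into a flat force -> region lookup
force_to_region = {
    force: region
    for region, forces in regions_sample.items()
    for force in forces
}

toy = pd.DataFrame({
    "Reported by": [
        "City of London Police",
        "British Transport Police",
        "Unknown Force",  # not in the mapping, so it maps to NaN
    ]
})
toy["Geographical_Region"] = toy["Reported by"].map(force_to_region)
print(toy)
```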


# Crime distribution across geographical regions

pp.distribution(data, 'Geographical_Region')
        
Geographical Region         Count        %
London                      3,446,252    19.37
South East England          2,622,798    14.74
West Midlands               2,013,252    11.31
Yorkshire and the Humber    1,896,734    10.66
East of England             1,781,383    10.01
North West England          1,621,231    9.11
East Midlands               1,522,387    8.56
South West England          1,396,609    7.85
North East England          1,342,356    7.54
Specialized/Other           150,579      0.85

The table above shows the distribution of crimes across English regions for the 17,793,581 rows that remain after dropping the Northern Ireland and Welsh forces. 'London' has the highest count at 19.37%, followed by 'South East England' (14.74%) and 'West Midlands' (11.31%). 'Specialized/Other', which contains only British Transport Police, accounts for 0.85%.