Crime Dataset Analysis

Exploring the vast crime dataset with 19M entries using the PivotPal Python package.

1. Dataset Overview: A snapshot of the crime dataset's general attributes.
2. Missing Data: Identifying gaps in the dataset.
3. Outcome Distribution: Analyzing the 'Last outcome category'.
4. Crime Type Distribution: Understanding prevalent crime types.
5. Feature Engineering: Mapping police forces to geographical regions.

1. Comprehensive Overview of the Crime Dataset with 19M Entries

Overview: Using the PivotPal Python package, we summarize the structure and general characteristics of the crime dataset, which contains about 19 million entries.

Explanation: Kaggle datasets often come with many features and entries, and the crime dataset is no exception at roughly 19.3 million rows. In this section, we look at the dataset's general attributes, from column types to missing values and duplicate rows, before moving on to column-level analysis.

pp.overview(crime_data)

Description	Count
Total Rows	19,269,992
Total Columns	12
Columns with Missing Values	7
Total Duplicate Rows	1,455,794
Most Frequent Data Type	object
Columns with Binary Values	0
Columns with Zero Values	0
Unique Data Types	2
Numeric Columns	3
Non-Numeric Columns	9

The table gives a snapshot of the crime dataset: 19,269,992 rows across 12 columns. The most frequent data type is 'object', which covers the 9 non-numeric columns; the remaining 3 columns are numeric. Seven of the 12 columns contain missing values, and 1,455,794 rows (about 7.6%) are duplicates, which is worth keeping in mind for any analysis based on counts. There are no columns with binary or zero values.
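For readers who want to see how such summary figures can be derived, here is a minimal sketch in plain pandas. It is not PivotPal's implementation: the function name `overview_sketch` and the toy rows are assumptions made for illustration, and PivotPal may define some counts (for example duplicates) differently.

```python
import pandas as pd

def overview_sketch(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: summary counts computed with plain pandas."""
    numeric = df.select_dtypes(include="number")
    return pd.Series({
        "Total Rows": len(df),
        "Total Columns": df.shape[1],
        # A column counts as "missing" if at least one value is NaN/None
        "Columns with Missing Values": int(df.isna().any().sum()),
        # Rows that repeat an earlier row exactly (first occurrence not counted)
        "Total Duplicate Rows": int(df.duplicated().sum()),
        "Most Frequent Data Type": str(df.dtypes.value_counts().idxmax()),
        "Unique Data Types": df.dtypes.nunique(),
        "Numeric Columns": numeric.shape[1],
        "Non-Numeric Columns": df.shape[1] - numeric.shape[1],
    })

# Tiny made-up frame for illustration only (not rows from the crime dataset)
toy = pd.DataFrame({
    "Crime type": ["Burglary", "Drugs", "Burglary", "Robbery"],
    "Reported by": ["Kent Police", "Surrey Police", "Kent Police", "Kent Police"],
    "Longitude": [0.52, None, 0.52, 0.10],
})
summary = overview_sketch(toy)
print(summary)
```

On the full dataset the same calls work unchanged, although `duplicated()` over 19 million rows can take noticeable time and memory.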

2. Exploring Missing Data in the Crime Dataset

Overview: A deep dive into the missing data within the crime dataset using the PivotPal Python package.

Explanation: The crime dataset provides comprehensive data about various crimes reported. In this exploration, we'll focus on identifying and understanding the missing data within this dataset.

pp.missing(crime_data)

Column Name	Missing Count	Missing %
Context	19,269,992	100.0
Last outcome category	4,356,540	23.0
Crime ID	4,036,619	21.0
LSOA code	821,598	4.0
LSOA name	821,598	4.0
Longitude	322,599	2.0
Latitude	322,599	2.0

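The table above lists the columns in the crime dataset that have missing values. The 'Context' column is missing in every row (100%), so it carries no usable information. 'Last outcome category' and 'Crime ID' are each missing in roughly a fifth of the rows (about 23% and 21%). 'LSOA code' and 'LSOA name' share the same missing count, as do 'Longitude' and 'Latitude', which suggests that each pair tends to be missing together. The percentages appear to be rounded to whole numbers; for example, 4,356,540 of 19,269,992 rows is about 22.6%, shown as 23.0.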
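A missing-value report of this shape can be approximated with plain pandas. The sketch below is an assumption about how such a table might be built, not PivotPal's own code; `missing_sketch` and the toy rows are hypothetical. It also shows one way to drop a column that is entirely empty, as 'Context' is here.

```python
import pandas as pd

def missing_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column missing counts and percentages."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "Missing Count": counts,
        "Missing %": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps, largest first
    report = report[report["Missing Count"] > 0]
    return report.sort_values("Missing Count", ascending=False)

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Context": [None, None, None, None],
    "Crime ID": ["a1", None, "c3", "d4"],
    "Crime type": ["Burglary", "Drugs", "Robbery", "Drugs"],
})
report = missing_sketch(toy)
print(report)

# Columns that are empty in every row can be dropped without losing information
empty_cols = report.index[report["Missing Count"] == len(toy)].tolist()
cleaned = toy.drop(columns=empty_cols)
```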

3. Distribution Analysis of 'Last outcome category' in the Crime Dataset

Overview: Using the PivotPal Python package, we explore the distribution of the 'Last outcome category' within the crime dataset, which contains over 19 million entries.

Explanation: In large datasets like the one from Kaggle's crime records, understanding the distribution of specific columns is crucial. Here, we focus on the 'Last outcome category' to discern the most common outcomes of reported crimes.

pp.distribution(crime_data, 'Last outcome category')

Last outcome category	Count	%
Investigation complete; no suspect identified	6,131,104	31.82
Unable to prosecute suspect	4,935,065	25.61
Status update unavailable	1,126,295	5.84
Under investigation	857,878	4.45
Court result unavailable	669,807	3.48
Local resolution	361,615	1.88
Awaiting court outcome	196,928	1.02
Action to be taken by another organisation	168,733	0.88
Offender given a caution	140,223	0.73
Further investigation is not in the public interest	127,227	0.66
Formal action is not in the public interest	76,340	0.40
Further action is not in the public interest	72,329	0.38
Offender given penalty notice	24,397	0.13
Offender given a drugs possession warning	19,000	0.10
Suspect charged as part of another case	6,511	0.03

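The table shows the distribution of outcomes in the 'Last outcome category' column. The most frequent outcome is 'Investigation complete; no suspect identified' with over 6.1 million occurrences (31.82% of the dataset), followed by 'Unable to prosecute suspect' with nearly 5 million (25.61%). Together these two outcomes cover about 57% of all rows. Note that the percentages are relative to all 19,269,992 rows: the listed outcomes sum to 14,913,452 rows (about 77.4%), and the remaining 4,356,540 rows have no recorded outcome. Measured against rows with a recorded outcome only, 'Investigation complete; no suspect identified' accounts for about 41%.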
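Because the percentages in the table sum to about 77.4% rather than 100%, the choice of denominator matters when a column has missing values. The following sketch uses plain pandas to show both variants side by side; `distribution_sketch` and the toy rows are illustrative assumptions, not PivotPal's API.

```python
import pandas as pd

def distribution_sketch(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: counts with two percentage denominators."""
    counts = df[column].value_counts()  # missing values are excluded by default
    return pd.DataFrame({
        "Count": counts,
        # Share of every row, including rows where the column is missing
        "% of all rows": (counts / len(df) * 100).round(2),
        # Share of rows that have a value in this column
        "% of non-missing": (counts / counts.sum() * 100).round(2),
    })

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Last outcome category": [
        "Under investigation",
        "Under investigation",
        "Local resolution",
        None,
    ]
})
dist = distribution_sketch(toy, "Last outcome category")
print(dist)
```

With one of the four rows missing, 'Under investigation' is 50% of all rows but about two thirds of the rows that have an outcome.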

4. Distribution Analysis of 'Crime type' in the Crime Dataset

Overview: Using the PivotPal Python package, we delve into the crime dataset, which contains over 19 million entries, to understand the distribution of different crime types.

Explanation: With 19 million records, the crime dataset is large enough that a simple frequency table per column is already informative. In this section, we focus on the 'Crime type' column to identify the most prevalent types of reported crimes.

pp.distribution(crime_data, 'Crime type')

Crime type	Count	%
Violence and sexual offences	6,480,661	33.63
Anti-social behaviour	3,886,040	20.17
Public order	1,577,873	8.19
Criminal damage and arson	1,525,290	7.92
Other theft	1,318,322	6.84
Vehicle crime	1,086,004	5.64
Shoplifting	901,923	4.68
Burglary	776,869	4.03
Drugs	542,978	2.82
Other crime	332,155	1.72
Theft from the person	268,202	1.39
Bicycle theft	227,533	1.18
Robbery	200,031	1.04
Possession of weapons	146,111	0.76

The table above shows the distribution of crime types in the dataset. 'Violence and sexual offences' is the most common crime type, accounting for 33.63% of the dataset, followed by 'Anti-social behaviour' (20.17%) and 'Public order' (8.19%). The top two categories alone make up about 53.8% of all records. The 'Crime type' column has no missing values, so the counts sum to the full 19,269,992 rows.
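The percentages can be checked directly from the published counts. The sketch below re-enters the counts from the table above into a pandas Series and derives the share and a cumulative share; only the column name "Cumulative %" is an addition for illustration.

```python
import pandas as pd

# Counts copied from the 'Crime type' table above
counts = pd.Series({
    "Violence and sexual offences": 6_480_661,
    "Anti-social behaviour": 3_886_040,
    "Public order": 1_577_873,
    "Criminal damage and arson": 1_525_290,
    "Other theft": 1_318_322,
    "Vehicle crime": 1_086_004,
    "Shoplifting": 901_923,
    "Burglary": 776_869,
    "Drugs": 542_978,
    "Other crime": 332_155,
    "Theft from the person": 268_202,
    "Bicycle theft": 227_533,
    "Robbery": 200_031,
    "Possession of weapons": 146_111,
})
total_rows = 19_269_992  # "Total Rows" from the overview table

share = counts / total_rows * 100
table = pd.DataFrame({
    "Count": counts,
    "%": share.round(2),
    # Running total of the share, to see how few categories dominate
    "Cumulative %": share.cumsum().round(2),
})
print(table)
```

The cumulative column makes it easy to read off how many categories are needed to cover a given share of all records.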

5. Feature Engineering: Mapping 'Reported by' to 'Geographical Region'

Overview: Using a mapping of police forces to their respective geographical regions, we engineer a new feature, 'Geographical_Region', in the crime dataset.

Explanation: Datasets often require additional context or categorization to derive meaningful insights. In this section, we take the 'Reported by' column, which indicates the police force that reported the crime, and map it to a broader 'Geographical_Region'. Forces in Northern Ireland and Wales are dropped first so that the analysis covers the English regions, with British Transport Police kept in a separate 'Specialized/Other' group. This allows for a more aggregated view of crime distribution across England, facilitating regional comparisons and analyses.


# Drop Northern Ireland and Wales to focus on English regions

forces_to_drop = ['Police Service of Northern Ireland', 'Dyfed-Powys Police', 'North Wales Police', 'South Wales Police', 'Gwent Police']

# Work on a copy so that adding a column later does not modify crime_data
data = crime_data[~crime_data['Reported by'].isin(forces_to_drop)].copy()

geographical_regions = {
    "South West England": [
        "Avon and Somerset Constabulary",
        "Devon & Cornwall Police",
        "Dorset Police",
        "Gloucestershire Constabulary",
        "Wiltshire Police"
    ],
    "South East England": [
        "Hampshire Constabulary",
        "Kent Police",
        "Surrey Police",
        "Sussex Police",
        "Thames Valley Police"
    ],
    "East of England": [
        "Bedfordshire Police",
        "Cambridgeshire Constabulary",
        "Essex Police",
        "Hertfordshire Constabulary",
        "Norfolk Constabulary",
        "Suffolk Constabulary"
    ],
    "London": [
        "Metropolitan Police Service",
        "City of London Police"
    ],
    "West Midlands": [
        "Staffordshire Police",
        "Warwickshire Police",
        "West Mercia Police",
        "West Midlands Police"
    ],
    "East Midlands": [
        "Derbyshire Constabulary",
        "Leicestershire Police",
        "Lincolnshire Police",
        "Northamptonshire Police",
        "Nottinghamshire Police"
    ],
    "North West England": [
        "Cheshire Constabulary",
        "Cumbria Constabulary",
        "Greater Manchester Police",
        "Lancashire Constabulary",
        "Merseyside Police"
    ],
    "North East England": [
        "Cleveland Police",
        "Durham Constabulary",
        "Northumbria Police",
        "North Yorkshire Police"
    ],
    "Yorkshire and the Humber": [
        "Humberside Police",
        "South Yorkshire Police",
        "West Yorkshire Police"
    ],
    "Specialized/Other": [
        "British Transport Police",
    ]
}

# Feature engineering: map each police force to its geographical region

def get_geographical_region(force_name):
    """Return the region whose force list contains force_name, or None if unmapped."""
    for region, forces in geographical_regions.items():
        if force_name in forces:
            return region
    return None

data['Geographical_Region'] = data['Reported by'].apply(get_geographical_region)


# Crime distribution across geographical regions

pp.distribution(data, 'Geographical_Region')

Geographical Region	Count	%
London	3,446,252	19.37
South East England	2,622,798	14.74
West Midlands	2,013,252	11.31
Yorkshire and the Humber	1,896,734	10.66
East of England	1,781,383	10.01
North West England	1,621,231	9.11
East Midlands	1,522,387	8.56
South West England	1,396,609	7.85
North East England	1,342,356	7.54
Specialized/Other	150,579	0.85
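
The row-by-row lookup used for 'Geographical_Region' can also be written as a dictionary lookup with `Series.map`, which is usually faster on millions of rows. This is a sketch under assumptions rather than the method used above: the mapping is a shortened subset of `geographical_regions`, the rows are made up, and `force_to_region` and `unmapped` are hypothetical names.

```python
import pandas as pd

# Shortened subset of the geographical_regions dictionary, for illustration
geographical_regions = {
    "London": ["Metropolitan Police Service", "City of London Police"],
    "South East England": ["Kent Police", "Surrey Police"],
}

# Invert region -> [forces] into force -> region for direct lookup
force_to_region = {
    force: region
    for region, forces in geographical_regions.items()
    for force in forces
}

# Made-up rows; "Gwent Police" is deliberately absent from the mapping
toy = pd.DataFrame({
    "Reported by": [
        "Kent Police",
        "City of London Police",
        "Kent Police",
        "Gwent Police",
    ]
})
toy["Geographical_Region"] = toy["Reported by"].map(force_to_region)

# Forces missing from the mapping become NaN; list them so gaps are visible
unmapped = sorted(
    toy.loc[toy["Geographical_Region"].isna(), "Reported by"].unique()
)
print(toy)
print("Unmapped forces:", unmapped)
```

Checking `unmapped` after the mapping step is a cheap way to confirm that every remaining force has been assigned a region before the distribution is computed.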