Exploring the vast crime dataset with 19M entries using the PivotPal Python package.
1. Dataset Overview: A snapshot of the crime dataset's general attributes.
2. Missing Data: Identifying gaps in the dataset.
3. Outcome Distribution: Analyzing the 'Last outcome category'.
4. Crime Type Distribution: Understanding prevalent crime types.
Overview: Using the PivotPal Python package, we examine the crime dataset and its roughly 19 million entries to understand its structure and characteristics.
Explanation: Kaggle datasets often come with many features and entries, and the crime dataset is no exception at roughly 19 million rows. In this section, we summarize the dataset's general attributes, from column types to missing values and duplicate rows, before moving on to more detailed analysis.
pp.overview(crime_data)
Description | Count |
---|---|
Total Rows | 19,269,992 |
Total Columns | 12 |
Columns with Missing Values | 7 |
Total Duplicate Rows | 1,455,794 |
Most Frequent Data Type | object |
Columns with Binary Values | 0 |
Columns with Zero Values | 0 |
Unique Data Types | 2 |
Numeric Columns | 3 |
Non-Numeric Columns | 9 |
The table provides a snapshot of the crime dataset: 19,269,992 rows across 12 columns, of which 3 are numeric and 9 are non-numeric, with 'object' as the most frequent data type. Seven of the 12 columns contain missing values, and 1,455,794 rows (roughly 7.6% of the total) are duplicates, so both need attention before any modeling. PivotPal reports no columns with binary or zero values.
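PivotPal's internals are not shown in this article, so the following is only a rough pandas sketch of how figures like those in the table could be derived. The `quick_overview` helper and the toy DataFrame are hypothetical and are not part of PivotPal; the real `pp.overview` may compute its numbers differently.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: summarise basic structural facts about a DataFrame."""
    numeric_cols = df.select_dtypes(include="number").columns
    return pd.Series({
        "Total Rows": len(df),
        "Total Columns": df.shape[1],
        # Count columns that contain at least one missing value.
        "Columns with Missing Values": int(df.isna().any().sum()),
        # duplicated() marks every repeat of an earlier identical row.
        "Total Duplicate Rows": int(df.duplicated().sum()),
        "Numeric Columns": len(numeric_cols),
        "Non-Numeric Columns": df.shape[1] - len(numeric_cols),
    })


# Toy data standing in for the real crime dataset (values are made up).
toy = pd.DataFrame({
    "Crime type": ["Burglary", "Drugs", "Burglary", "Robbery"],
    "Reported by": ["Kent Police", "Essex Police", "Kent Police", "Surrey Police"],
    "Longitude": [0.52, None, 0.52, -0.33],
})

summary = quick_overview(toy)
print(summary)
```

On a dataset of this size, `df.duplicated()` compares entire rows, so it may be worth checking memory use before running it on all 19 million records.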
Overview: A deep dive into the missing data within the crime dataset using the PivotPal Python package.
Explanation: The crime dataset provides comprehensive data about various crimes reported. In this exploration, we'll focus on identifying and understanding the missing data within this dataset.
pp.missing(crime_data)
Column Name | Missing Count | Missing % |
---|---|---|
Context | 19,269,992 | 100.0 |
Last outcome category | 4,356,540 | 23.0 |
Crime ID | 4,036,619 | 21.0 |
LSOA code | 821,598 | 4.0 |
LSOA name | 821,598 | 4.0 |
Longitude | 322,599 | 2.0 |
Latitude | 322,599 | 2.0 |
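The table above lists the columns in the crime dataset that have missing values. The 'Context' column is empty in every one of the 19,269,992 rows (100% missing), so it carries no information and can be dropped. 'Last outcome category' and 'Crime ID' are missing in about 23% and 21% of rows respectively, while the location columns ('LSOA code', 'LSOA name', 'Longitude', 'Latitude') are missing in 2% to 4% of rows.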
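For readers without PivotPal, a comparable missing-value report can be sketched in plain pandas. The `missing_summary` helper and the toy data below are assumptions made for illustration; they are not PivotPal's implementation.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: missing count and percentage per column."""
    counts = df.isna().sum()
    # Keep only columns that actually have gaps, largest first.
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({
        "Missing Count": counts,
        "Missing %": (counts / len(df) * 100).round(1),
    })


# Toy data: 'Context' is entirely empty, 'Crime ID' has one gap.
toy = pd.DataFrame({
    "Context": [None, None, None, None],
    "Crime ID": ["a1", None, "c3", "d4"],
    "Crime type": ["Burglary", "Drugs", "Robbery", "Drugs"],
})

report = missing_summary(toy)
print(report)

# A column that is 100% missing can be removed in one call.
cleaned = toy.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```

Dropping with `how="all"` removes only columns that are completely empty, so partially populated columns such as 'Crime ID' are kept.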
Overview: Using the PivotPal Python package, we explore the distribution of the 'Last outcome category' within the crime dataset, which contains over 19 million entries.
Explanation: In large datasets like the one from Kaggle's crime records, understanding the distribution of specific columns is crucial. Here, we focus on the 'Last outcome category' to discern the most common outcomes of reported crimes.
pp.distribution(crime_data, 'Last outcome category')
Last outcome category | Count | % |
---|---|---|
Investigation complete; no suspect identified | 6,131,104 | 31.82 |
Unable to prosecute suspect | 4,935,065 | 25.61 |
Status update unavailable | 1,126,295 | 5.84 |
Under investigation | 857,878 | 4.45 |
Court result unavailable | 669,807 | 3.48 |
Local resolution | 361,615 | 1.88 |
Awaiting court outcome | 196,928 | 1.02 |
Action to be taken by another organisation | 168,733 | 0.88 |
Offender given a caution | 140,223 | 0.73 |
Further investigation is not in the public interest | 127,227 | 0.66 |
Formal action is not in the public interest | 76,340 | 0.40 |
Further action is not in the public interest | 72,329 | 0.38 |
Offender given penalty notice | 24,397 | 0.13 |
Offender given a drugs possession warning | 19,000 | 0.10 |
Suspect charged as part of another case | 6,511 | 0.03 |
The table shows the distribution of values in 'Last outcome category'. The most frequent outcome is 'Investigation complete; no suspect identified', with over 6.1 million occurrences (31.82% of the dataset), followed by 'Unable to prosecute suspect' with nearly 5 million (25.61%). Together these two outcomes cover more than half of all rows. Note that the percentages are relative to all 19,269,992 rows, including the 4,356,540 rows with no recorded outcome, which is why the column sums to about 77% rather than 100%.
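The choice of denominator matters when reading these percentages. As a quick check using only the counts reported in the tables above, the snippet below recomputes the share of the top outcome against all rows and against only the rows that have an outcome recorded. The variable names are illustrative.

```python
# Counts copied from the tables above.
total_rows = 19_269_992        # Total Rows (overview table)
missing_outcomes = 4_356_540   # missing 'Last outcome category' values
no_suspect = 6_131_104         # 'Investigation complete; no suspect identified'

# Share relative to every row, as in the distribution table.
share_of_all_rows = no_suspect / total_rows * 100

# Share relative to rows where an outcome was actually recorded.
recorded_rows = total_rows - missing_outcomes
share_of_recorded = no_suspect / recorded_rows * 100

print(f"{share_of_all_rows:.2f}% of all rows")
print(f"{share_of_recorded:.2f}% of rows with a recorded outcome")
```

Among crimes with a recorded outcome, the share of 'no suspect identified' is therefore noticeably higher than the headline figure in the table.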
Overview: Using the PivotPal Python package, we delve into the crime dataset, which contains over 19 million entries, to understand the distribution of different crime types.
Explanation: Kaggle datasets, especially those as extensive as the crime dataset with its 19 million records, offer a plethora of insights. In this section, we'll focus on the 'Crime type' column to understand the most prevalent types of reported crimes.
pp.distribution(crime_data, 'Crime type')
Crime type | Count | % |
---|---|---|
Violence and sexual offences | 6,480,661 | 33.63 |
Anti-social behaviour | 3,886,040 | 20.17 |
Public order | 1,577,873 | 8.19 |
Criminal damage and arson | 1,525,290 | 7.92 |
Other theft | 1,318,322 | 6.84 |
Vehicle crime | 1,086,004 | 5.64 |
Shoplifting | 901,923 | 4.68 |
Burglary | 776,869 | 4.03 |
Drugs | 542,978 | 2.82 |
Other crime | 332,155 | 1.72 |
Theft from the person | 268,202 | 1.39 |
Bicycle theft | 227,533 | 1.18 |
Robbery | 200,031 | 1.04 |
Possession of weapons | 146,111 | 0.76 |
The table above shows the distribution of crime types in the dataset. 'Violence and sexual offences' is the most common crime type, accounting for 33.63% of all rows, followed by 'Anti-social behaviour' (20.17%) and 'Public order' (8.19%). These three categories together make up about 62% of reported crimes, while the least common type, 'Possession of weapons', accounts for 0.76%.
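A count-and-percentage table of this shape can be approximated with pandas' `value_counts`. The `distribution` helper below is a hypothetical stand-in, not PivotPal's code, and the toy data is made up.

```python
import pandas as pd


def distribution(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: counts and share of all rows for one column."""
    # value_counts() sorts descending and skips missing values by default.
    counts = df[column].value_counts()
    # Divide by len(df) so the share is relative to all rows.
    pct = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"Count": counts, "%": pct})


toy = pd.DataFrame(
    {"Crime type": ["Drugs", "Burglary", "Drugs", "Robbery", "Drugs"]}
)

table = distribution(toy, "Crime type")
print(table)

# Combined share of the two most frequent categories.
top_two_share = table["%"].head(2).sum()
print(top_two_share)
```

Dividing by `len(df)` instead of the number of non-missing values keeps the percentages comparable with a table where gaps are counted in the total.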
Overview: Using the provided mapping of police forces to their respective geographical regions, we've engineered a new feature, 'Geographical_Region', in our crime dataset, which contains over 19 million entries.
Explanation: Datasets often require additional context or categorization to derive meaningful insights. In this section, we've taken the 'Reported by' column, which indicates the police force that reported the crime, and mapped it to a broader 'Geographical_Region'. This allows for a more aggregated view of crime distribution across the UK, facilitating regional comparisons and analyses.
# Drop Northern Ireland and Wales to focus on English regions
forces_to_drop = ['Police Service of Northern Ireland', 'Dyfed-Powys Police', 'North Wales Police', 'South Wales Police', 'Gwent Police']
data = data[~data['Reported by'].isin(forces_to_drop)]
geographical_regions = {
    "South West England": [
        "Avon and Somerset Constabulary",
        "Devon & Cornwall Police",
        "Dorset Police",
        "Gloucestershire Constabulary",
        "Wiltshire Police"
    ],
    "South East England": [
        "Hampshire Constabulary",
        "Kent Police",
        "Surrey Police",
        "Sussex Police",
        "Thames Valley Police"
    ],
    "East of England": [
        "Bedfordshire Police",
        "Cambridgeshire Constabulary",
        "Essex Police",
        "Hertfordshire Constabulary",
        "Norfolk Constabulary",
        "Suffolk Constabulary"
    ],
    "London": [
        "Metropolitan Police Service",
        "City of London Police"
    ],
    "West Midlands": [
        "Staffordshire Police",
        "Warwickshire Police",
        "West Mercia Police",
        "West Midlands Police"
    ],
    "East Midlands": [
        "Derbyshire Constabulary",
        "Leicestershire Police",
        "Lincolnshire Police",
        "Northamptonshire Police",
        "Nottinghamshire Police"
    ],
    "North West England": [
        "Cheshire Constabulary",
        "Cumbria Constabulary",
        "Greater Manchester Police",
        "Lancashire Constabulary",
        "Merseyside Police"
    ],
    "North East England": [
        "Cleveland Police",
        "Durham Constabulary",
        "Northumbria Police",
        "North Yorkshire Police"
    ],
    "Yorkshire and the Humber": [
        "Humberside Police",
        "South Yorkshire Police",
        "West Yorkshire Police"
    ],
    "Specialized/Other": [
        "British Transport Police",
    ]
}
# Feature engineering: map each police force to its geographical region
def get_geographical_region(force_name):
    for region, forces in geographical_regions.items():
        if force_name in forces:
            return region
    return None
data['Geographical_Region'] = data['Reported by'].apply(get_geographical_region)
# View the crime distribution across geographical regions
pp.distribution(data, 'Geographical_Region')
Geographical Region | Count | % |
---|---|---|
London | 3,446,252 | 19.37 |
South East England | 2,622,798 | 14.74 |
West Midlands | 2,013,252 | 11.31 |
Yorkshire and the Humber | 1,896,734 | 10.66 |
East of England | 1,781,383 | 10.01 |
North West England | 1,621,231 | 9.11 |
East Midlands | 1,522,387 | 8.56 |
South West England | 1,396,609 | 7.85 |
North East England | 1,342,356 | 7.54 |
Specialized/Other | 150,579 | 0.85 |
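The table above shows the distribution of crimes across the English regions, with British Transport Police grouped separately as 'Specialized/Other'. Because the Northern Ireland and Welsh forces were dropped, the percentages are relative to the roughly 17.8 million remaining rows. 'London' has the highest count, accounting for 19.37% of these rows, followed by 'South East England' (14.74%) and 'West Midlands' (11.31%). These are raw counts and are not adjusted for population, so they do not by themselves indicate which regions have higher crime rates.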
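The row-wise `apply` used above scans the region lists once per row, which can be slow on millions of records. One possible alternative, sketched here as an assumption rather than as the author's method, is to invert the mapping once into a force-to-region dictionary and use `Series.map`. The excerpt of the mapping and the toy DataFrame are for illustration only.

```python
import pandas as pd

# Small excerpt of the region mapping, for illustration only.
geographical_regions = {
    "London": ["Metropolitan Police Service", "City of London Police"],
    "South East England": ["Kent Police", "Surrey Police"],
}

# Invert region -> forces into force -> region, built once up front.
force_to_region = {
    force: region
    for region, forces in geographical_regions.items()
    for force in forces
}

toy = pd.DataFrame(
    {"Reported by": ["Kent Police", "City of London Police", "Unknown Force"]}
)

# map() looks each force up in the dictionary; unknown forces become NaN.
toy["Geographical_Region"] = toy["Reported by"].map(force_to_region)

# Listing unmapped forces helps catch typos or forces missing from the mapping.
unmapped = toy.loc[toy["Geographical_Region"].isna(), "Reported by"].unique()
print(toy)
print(list(unmapped))
```

Checking for unmapped forces after the lookup is a cheap safeguard, since any force name that does not match the dictionary exactly would otherwise silently drop out of the regional counts.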