This exploration walks through the Titanic dataset, a classic in the data science community, using the PivotPal Python package. The dataset describes the passengers aboard the ship, and the analysis covers its structure, missing values, unique values, and the distribution of key features such as age.
Beyond the initial exploration, the study covers feature engineering, a crucial step in data preprocessing. Extracting titles from names and grouping ages into discrete categories exposes patterns that help in understanding the passengers' survival rates, demographics, and social status. Each step shows the code, its output, and a short interpretation, illustrating how the PivotPal Python package handles common dataset summaries.
1. Dataset Overview: A snapshot of the Titanic dataset's general attributes, revealing its structure, data types, and characteristics.
2. Missing Data: A deep dive into the columns with missing values, highlighting the significance of each and potential implications for analysis.
3. Unique Data Values: A comprehensive look at the unique values present in each column, emphasizing the distribution and diversity of data within the dataset.
4. Age Distribution: An in-depth analysis of the age distribution of passengers, revealing the diverse age groups present on the ship.
5. Age Category Engineering: The transformation of continuous 'Age' values into discrete age groups, offering a more aggregated view of age distributions.
6. Title Extraction: A feature engineering step that extracts titles from names, providing insights into the social status, gender, and age group of passengers.
Overview: Utilizing the `pp.overview` function from the PivotPal Python package, we've extracted a comprehensive summary of the Titanic dataset.
Explanation: The Titanic dataset, a classic in the data science community, contains various features describing the passengers aboard the ill-fated ship. With `pp.overview`, we can quickly gain insights into the dataset's structure, missing values, data types, and more. This high-level overview is crucial for understanding the data before diving into deeper analyses.
# Using pp.overview to summarise the Titanic dataset (loaded as df)
pp.overview(df)
Description | Count |
---|---|
Total Rows | 891 |
Total Columns | 12 |
Columns with Missing Values | 3 |
Total Duplicate Rows | 0 |
Most Frequent Data Type | int64 |
Columns with Binary Values | 2 |
Columns with Zero Values | 4 |
Unique Data Types | 3 |
Numeric Columns | 7 |
Non-Numeric Columns | 5 |
The table above provides a snapshot of the Titanic dataset's structure and characteristics. With 891 rows and 12 columns, the dataset offers a mix of numeric and non-numeric data. Notably, there are three columns with missing values, which will require attention during the data preprocessing phase. The `pp.overview` function proves invaluable for such quick and informative explorations.
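To make the overview figures less of a black box, here is a minimal pandas-only sketch of how a few of them could be computed. It is not PivotPal's implementation. The `summarize` helper and the four-row `toy` frame are hypothetical stand-ins for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic data (not the real file)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, None, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Sex": ["male", "female", "female", "male"],
})


def summarize(frame: pd.DataFrame) -> dict:
    """Compute a few overview-style figures with plain pandas."""
    numeric = frame.select_dtypes(include="number")
    return {
        "Total Rows": len(frame),
        "Total Columns": frame.shape[1],
        # A column counts as "missing" if it holds at least one NaN/None
        "Columns with Missing Values": int(frame.isna().any().sum()),
        # duplicated() flags every repeat of an earlier identical row
        "Total Duplicate Rows": int(frame.duplicated().sum()),
        "Numeric Columns": numeric.shape[1],
        "Non-Numeric Columns": frame.shape[1] - numeric.shape[1],
    }


summary = summarize(toy)
for description, count in summary.items():
    print(f"{description}: {count}")
```

Running the same helper on the full dataset should give row, column, and missing-column counts comparable to the table, though PivotPal may define some rows (for example binary or zero-value columns) differently.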
Overview: A deep dive into the missing data within the Titanic dataset using the PivotPal Python package.
Explanation: The Titanic dataset is one of the most popular datasets used in data science. It contains information about the passengers onboard the Titanic, including their age, cabin, and embarkation point. In this exploration, we'll focus on identifying and understanding the missing data within this dataset.
pp.missing(df)
Column Name | Missing Count | Missing % |
---|---|---|
Cabin | 687 | 77.0 |
Age | 177 | 20.0 |
Embarked | 2 | 0.0 |
The table above lists the columns in the Titanic dataset with missing values. 'Cabin' has the most, with 687 missing entries, or about 77% of the 891 rows. 'Age' has 177 missing values, about 20% of the rows. 'Embarked' has only 2 missing values, roughly 0.2% of the rows, which the table rounds to 0.
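The percentages in the table are rounded to whole numbers. As a quick check, the short sketch below recomputes them to two decimals from the counts shown above and the 891-row total from the overview. The variable names are illustrative only.

```python
total_rows = 891  # row count reported in the dataset overview
missing_counts = {"Cabin": 687, "Age": 177, "Embarked": 2}  # counts from the table

# Share of rows with a missing value, as a percentage rounded to two decimals
missing_pct = {
    column: round(count / total_rows * 100, 2)
    for column, count in missing_counts.items()
}

print(missing_pct)  # {'Cabin': 77.1, 'Age': 19.87, 'Embarked': 0.22}
```

At this precision it is clear that 'Embarked' is not fully populated, even though its share rounds down to 0.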
Overview: A comprehensive look at the unique data values within the Titanic dataset using the PivotPal Python package.
Explanation: The Titanic dataset provides a wealth of information about the passengers onboard. One of the key steps in data exploration is understanding the uniqueness of data values. In this exploration, we'll identify and understand the unique values present in each column of the Titanic dataset.
pp.unique(df)
Column Name | Unique Count |
---|---|
PassengerId | 891 |
Name | 891 |
Ticket | 681 |
Fare | 248 |
Cabin | 147 |
Age | 88 |
SibSp | 7 |
Parch | 7 |
Pclass | 3 |
Embarked | 3 |
Survived | 2 |
Sex | 2 |
The table above highlights the unique data values in each column of the Titanic dataset. Columns like 'PassengerId' and 'Name' have unique values for each entry, while columns like 'Sex' and 'Survived' have only 2 unique values. This information is crucial for understanding the distribution and diversity of data within the dataset.
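A similar count is available in plain pandas through `DataFrame.nunique`, which skips missing values by default. The table's figure of 3 for 'Embarked', a column with 2 missing entries, is consistent with that behaviour, although whether PivotPal calls `nunique` internally is an assumption. The sketch below uses a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy rows; 'Embarked' has one missing entry
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", None, "S"],
    "Sex": ["male", "female", "female", "male", "male"],
})

# Distinct non-missing values per column, largest first
unique_counts = toy.nunique().sort_values(ascending=False)

# Counting the missing marker as its own value changes the result
unique_with_missing = toy.nunique(dropna=False)

print(unique_counts)
print(unique_with_missing)
```

Here 'Embarked' has 3 distinct values by default and 4 once the missing marker is counted, so it is worth knowing which convention a summary table follows.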
Overview: Using the PivotPal Python package, we've analyzed the age distribution of passengers in the Titanic dataset.
Explanation: The age of passengers aboard the Titanic varies widely, from infants to the elderly. Understanding this distribution can provide insights into the demographics of the ship's passengers and potentially reveal patterns related to survival rates based on age groups.
# Using PivotPal to get the 'Age' feature and its distribution
pp.distribution(df, 'Age')
Age | Count | % |
---|---|---|
24.00 | 30 | 3.37 |
22.00 | 27 | 3.03 |
18.00 | 26 | 2.92 |
19.00 | 25 | 2.81 |
28.00 | 25 | 2.81 |
... | ... | ... |
66.00 | 1 | 0.11 |
0.67 | 1 | 0.11 |
0.42 | 1 | 0.11 |
34.50 | 1 | 0.11 |
74.00 | 1 | 0.11 |
The table above showcases the age distribution of passengers aboard the Titanic. The most common age is 24, with 30 passengers (3.37% of the dataset). The dataset contains a wide range of ages, from infants (e.g., 0.42 years) to elderly passengers (e.g., 74 years). This distribution helps us understand the diverse age groups present on the ship. Now it's time to feature engineer...
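The percentages in the table match a division by all 891 rows (30 / 891 ≈ 3.37%), so rows with a missing age appear to count toward the denominator. A pandas-only sketch of that calculation follows. It uses hypothetical sample ages and is not PivotPal's own code.

```python
import pandas as pd

# Hypothetical sample of ages, including one missing value
ages = pd.Series([24, 24, 22, None, 0.42], name="Age")

counts = ages.value_counts()  # missing values are excluded from the counts
percent = (counts / len(ages) * 100).round(2)  # share of all rows, missing included

distribution = pd.DataFrame({"Count": counts, "%": percent})
print(distribution)
```

Dividing by `ages.count()` instead of `len(ages)` would give each age's share among known ages only, which is a different and slightly larger number.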
Overview: By leveraging the `pd.cut` function and the PivotPal Python package, we've engineered a new feature, 'AgeCategory', which categorizes passengers into age groups based on their age.
Explanation: The age of passengers aboard the Titanic spans a wide range. To simplify analyses and derive more meaningful insights, we've categorized the continuous 'Age' values into discrete age groups: 'Child', 'Teenager', 'Young Adult', 'Adult', and 'Senior'. This transformation allows for a more aggregated view of age distributions and can reveal patterns related to survival rates based on age groups.
# Using pd.cut to categorize 'Age' into 'AgeCategory'
df['AgeCategory'] = pd.cut(df['Age'], bins=[0, 12, 19, 30, 50, 100], labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior'])
# Using PivotPal to get a distribution of the newly engineered 'AgeCategory' feature
pp.distribution(df, 'AgeCategory')
AgeCategory | Count | % |
---|---|---|
Young Adult | 245 | 27.50 |
Adult | 241 | 27.05 |
Teenager | 95 | 10.66 |
Child | 69 | 7.74 |
Senior | 64 | 7.18 |
The table above shows the distribution of age categories among passengers aboard the Titanic. 'Young Adult' and 'Adult' are the most prevalent categories, together accounting for over half of the dataset (54.55%). The percentages sum to about 80% rather than 100% because the 177 passengers with a missing 'Age' fall into no category. The engineered 'AgeCategory' feature simplifies the age data and gives a clearer view of passenger age demographics.
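The bin edges deserve a closer look. By default `pd.cut` uses right-closed intervals, so with the bins above an age of exactly 12 is a 'Child' and an age of exactly 19 is a 'Teenager', while a missing age stays missing. The sketch below checks this on a few hand-picked ages, which are illustrative values rather than rows from the dataset.

```python
import pandas as pd

bins = [0, 12, 19, 30, 50, 100]
labels = ["Child", "Teenager", "Young Adult", "Adult", "Senior"]

# Hand-picked ages sitting on or near the bin edges, plus one missing value
sample_ages = pd.Series([0.42, 12, 12.5, 19, 30, 50, 74, None])

# Default right=True: intervals are (0, 12], (12, 19], (19, 30], (30, 50], (50, 100]
categories = pd.cut(sample_ages, bins=bins, labels=labels)

for age, category in zip(sample_ages, categories):
    print(age, "->", category)
```

Because the first interval is open on the left, an age of exactly 0 would also be left uncategorized unless `include_lowest=True` is passed. The youngest age shown in the distribution table is 0.42, so that edge does not affect the listed values.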
Overview: Utilizing a custom function and the PivotPal Python package, we've engineered a new feature, 'Title', extracted from the 'Name' column of the Titanic dataset.
Explanation: Names in the Titanic dataset contain titles that can provide insights into the social status, gender, and age group of passengers. By extracting these titles, we can categorize passengers into groups like 'Mr', 'Mrs', 'Master', 'Miss', and 'Other'. This engineered feature can be crucial for understanding patterns related to survival rates based on social status or demographics.
# Custom function to extract titles from the 'Name' column
def extract_specific_titles(name):
    # Names look like "Surname, Title. Given names": take the text
    # between the comma and the first period
    title = name.split(',')[1].split('.')[0].strip()
    if title in ['Mr', 'Mrs', 'Master', 'Miss']:
        return title
    else:
        return 'Other'
df['Title'] = df['Name'].apply(extract_specific_titles)
# Using PivotPal to get a distribution of the newly engineered 'Title' feature
pp.distribution(df, 'Title')
Title | Count | % |
---|---|---|
Mr | 517 | 58.02 |
Miss | 182 | 20.43 |
Mrs | 125 | 14.03 |
Master | 40 | 4.49 |
Other | 27 | 3.03 |
The table above showcases the distribution of titles among passengers aboard the Titanic. The most prevalent title is 'Mr', accounting for 58.02% of the dataset. The 'Other' category captures titles that are less common. This engineered 'Title' feature provides a new perspective on the passengers and can be instrumental in further analyses.
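The split-based function assumes every name contains a comma. A name without one would raise an `IndexError`. As a hedged alternative, the sketch below uses a regular expression and falls back to 'Other' when no title is found. The `extract_title` name, the pattern, and the last sample string are assumptions made for illustration.

```python
import re

COMMON_TITLES = {"Mr", "Mrs", "Master", "Miss"}

# Capture the text between the first comma and the next period
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name: str) -> str:
    """Return one of the common titles, or 'Other' if none is found."""
    match = TITLE_PATTERN.search(name)
    if match is None:
        return "Other"  # malformed name: no "comma ... period" segment
    title = match.group(1).strip()
    return title if title in COMMON_TITLES else "Other"


samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "No Title Here",  # hypothetical malformed entry
]
titles = [extract_title(name) for name in samples]
print(titles)  # ['Mr', 'Miss', 'Master', 'Other', 'Other']
```

For well-formed names the result matches the split-based version, so it could be applied with `df['Name'].apply(extract_title)` in the same way.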