Titanic Data Exploration

This exploration delves deep into the Titanic dataset, a classic in the data science community. The dataset, which contains various features describing the passengers aboard the ill-fated ship, is dissected using the PivotPal Python package. The analysis focuses on understanding the dataset's structure, missing values, unique data points, and the distribution of key features such as age.

Beyond the initial exploration, the study also emphasizes feature engineering, a crucial step in data preprocessing. By extracting titles from names and categorizing age into distinct groups, the analysis unveils patterns that can be instrumental in understanding survival rates, demographics, and social statuses of the passengers. Each step in the exploration is meticulously detailed, providing insights and showcasing the capabilities of the PivotPal Python package in handling and analyzing datasets.

1. Dataset Overview: A snapshot of the Titanic dataset's general attributes, revealing its structure, data types, and characteristics.
2. Missing Data: A deep dive into the columns with missing values, highlighting the significance of each and potential implications for analysis.
3. Unique Data Values: A comprehensive look at the unique values present in each column, emphasizing the distribution and diversity of data within the dataset.
4. Age Distribution: An in-depth analysis of the age distribution of passengers, revealing the diverse age groups present on the ship.
5. Age Category Engineering: The transformation of continuous 'Age' values into discrete age groups, offering a more aggregated view of age distributions.
6. Title Extraction: A feature engineering step that extracts titles from names, providing insights into the social status, gender, and age group of passengers.

1. Overview Analysis of the Titanic Dataset using PivotPal

Overview: Utilizing the `pp.overview` function from the PivotPal Python package, we've extracted a comprehensive summary of the Titanic dataset.

Explanation: With `pp.overview`, we can quickly gain insights into the dataset's structure, missing values, data types, and more. This high-level overview is crucial for understanding the data before diving into deeper analyses.


# Using pp.overview to summarise the Titanic dataset
pp.overview(df)

Description	Count
Total Rows	891
Total Columns	12
Columns with Missing Values	3
Total Duplicate Rows	0
Most Frequent Data Type	int64
Columns with Binary Values	2
Columns with Zero Values	4
Unique Data Types	3
Numeric Columns	7
Non-Numeric Columns	5

The table above provides a snapshot of the Titanic dataset's structure and characteristics. With 891 rows and 12 columns, the dataset offers a mix of numeric and non-numeric data. Notably, there are three columns with missing values, which will require attention during the data preprocessing phase. The `pp.overview` function proves invaluable for such quick and informative explorations.
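For readers who want to see where such figures come from, here is a minimal sketch that computes a few of the same summary rows with plain pandas. It is not PivotPal's implementation. The `sample` frame is a hypothetical four-row stand-in for the Titanic data, and the `summary` and `overview` names are introduced only for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic data (not the real file).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, None, 35.0],
    "Cabin": [None, "C85", None, None],
})

# Each entry mirrors one row of the overview table, computed with pandas only.
summary = {
    "Total Rows": len(sample),
    "Total Columns": sample.shape[1],
    "Columns with Missing Values": int(sample.isna().any().sum()),
    "Total Duplicate Rows": int(sample.duplicated().sum()),
    "Numeric Columns": sample.select_dtypes(include="number").shape[1],
    "Non-Numeric Columns": sample.select_dtypes(exclude="number").shape[1],
}

overview = pd.Series(summary, name="Count")
print(overview)
```

Applied to the full dataset, the same expressions should line up with the corresponding rows of the table, assuming PivotPal defines those rows the same way.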

2. Exploring Missing Data in the Titanic Dataset

Overview: A deep dive into the missing data within the Titanic dataset using the PivotPal Python package.

Explanation: The Titanic dataset is one of the most popular datasets used in data science. It contains information about the passengers onboard the Titanic, including their age, cabin, and embarkation point. In this exploration, we'll focus on identifying and understanding the missing data within this dataset.

# Using pp.missing to list the columns that contain missing values
pp.missing(df)

Column Name	Missing Count	Missing %
Cabin	687	77.0
Age	177	20.0
Embarked	2	0.0

The table above lists the columns in the Titanic dataset that contain missing values. The 'Cabin' column has the most, with 687 missing entries, or about 77% of the 891 rows. The 'Age' column has 177 missing values, roughly 20% of the rows. Lastly, the 'Embarked' column has only 2 missing values, about 0.2% of the rows, which the table rounds to 0.
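The counts and percentages in a table like this can be reproduced with plain pandas. The sketch below is an illustration under assumptions, not PivotPal's own code: `sample` is a hypothetical four-row frame, and the percentages are kept to two decimals so that a small share such as 2 out of 891 would show as 0.22 rather than 0.

```python
import pandas as pd

# Hypothetical stand-in data; only the column names echo the Titanic dataset.
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing entries per column and express them as a share of all rows.
missing_count = sample.isna().sum()
missing = missing_count.to_frame(name="Missing Count")
missing["Missing %"] = (missing["Missing Count"] / len(sample) * 100).round(2)

# Keep only columns that actually have gaps, largest first.
missing = missing[missing["Missing Count"] > 0]
missing = missing.sort_values("Missing Count", ascending=False)
print(missing)
```

Columns without gaps ('Fare' here) drop out of the result, which matches the shape of the table above.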

3. Exploring Unique Data in the Titanic Dataset

Overview: A comprehensive look at the unique data values within the Titanic dataset using the PivotPal Python package.

Explanation: The Titanic dataset provides a wealth of information about the passengers onboard. One of the key steps in data exploration is understanding the uniqueness of data values. In this exploration, we'll identify and understand the unique values present in each column of the Titanic dataset.

# Using pp.unique to count the distinct values in each column
pp.unique(df)

Column Name	Unique Count
PassengerId	891
Name	891
Ticket	681
Fare	248
Cabin	147
Age	88
SibSp	7
Parch	7
Pclass	3
Embarked	3
Survived	2
Sex	2

The table above highlights the unique data values in each column of the Titanic dataset. Columns like 'PassengerId' and 'Name' have unique values for each entry, while columns like 'Sex' and 'Survived' have only 2 unique values. This information is crucial for understanding the distribution and diversity of data within the dataset.
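A comparable per-column count is available in plain pandas through `DataFrame.nunique`. The sketch below uses a hypothetical three-column `sample` frame and is not taken from PivotPal. Note that `nunique` skips missing values by default; whether `pp.unique` does the same is an assumption here.

```python
import pandas as pd

# Hypothetical stand-in data with one missing 'Embarked' value.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# Distinct values per column, highest first; missing values are not counted.
unique_counts = sample.nunique().sort_values(ascending=False)
print(unique_counts)

# Passing dropna=False counts the missing marker as one more distinct value.
unique_with_missing = sample.nunique(dropna=False)
print(unique_with_missing)
```

A column whose unique count equals the row count, such as 'PassengerId', behaves like an identifier and rarely helps as a model feature on its own.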

4. Distribution Analysis of 'Age' in the Titanic Dataset using PivotPal

Overview: Using the PivotPal Python package, we've analyzed the age distribution of passengers in the Titanic dataset.

Explanation: The age of passengers aboard the Titanic varies widely, from infants to the elderly. Understanding this distribution can provide insights into the demographics of the ship's passengers and potentially reveal patterns related to survival rates based on age groups.


# Using PivotPal to get the distribution of the 'Age' feature
pp.distribution(df, 'Age')

Age	Count	%
24.00	30	3.37
22.00	27	3.03
18.00	26	2.92
19.00	25	2.81
28.00	25	2.81
...	...	...
66.00	1	0.11
0.67	1	0.11
0.42	1	0.11
34.50	1	0.11
74.00	1	0.11

The table above shows the age distribution of passengers aboard the Titanic. The most common age is 24, with 30 passengers (3.37% of all 891 rows, including those with no recorded age). The dataset contains a wide range of ages, from infants (e.g., 0.42 years) to elderly passengers (e.g., 74 years). This distribution helps us understand the diverse age groups present on the ship. The next step is feature engineering.
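The same kind of frequency table can be sketched in plain pandas with `Series.value_counts`. This is an illustration on a hypothetical `ages` series, not PivotPal's implementation. The percentage is taken over all rows, including missing ones, because that is the denominator the figures above imply (30 / 891 is about 3.37%).

```python
import pandas as pd

# Hypothetical ages, including two missing entries.
ages = pd.Series([24.0, 22.0, 24.0, None, 0.42, 22.0, 24.0, None], name="Age")

# value_counts ignores missing values and sorts by frequency, highest first.
distribution = ages.value_counts().to_frame(name="Count")

# Share of ALL rows (missing included), rounded to two decimals.
distribution["%"] = (distribution["Count"] / len(ages) * 100).round(2)
print(distribution)
```

Because missing ages are counted in the denominator but not in any row, the '%' column does not sum to 100 whenever the column has gaps.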

5. Feature Engineering: Categorizing 'Age' into 'AgeCategory' in the Titanic Dataset using PivotPal

Overview: By combining the pandas `pd.cut` function with the PivotPal Python package, we've engineered a new feature, 'AgeCategory', which assigns each passenger to an age group.

Explanation: The age of passengers aboard the Titanic spans a wide range. To simplify analyses and derive more meaningful insights, we've categorized the continuous 'Age' values into discrete age groups: 'Child', 'Teenager', 'Young Adult', 'Adult', and 'Senior'. This transformation allows for a more aggregated view of age distributions and can reveal patterns related to survival rates based on age groups.


import pandas as pd

# Using pd.cut to categorize 'Age' into 'AgeCategory' (bins are right-inclusive, e.g. 0 < Age <= 12 is 'Child')
df['AgeCategory'] = pd.cut(df['Age'], bins=[0, 12, 19, 30, 50, 100], labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior'])

# Using PivotPal to get a distribution of the newly engineered 'AgeCategory' feature
pp.distribution(df, 'AgeCategory')

AgeCategory	Count	%
Young Adult	245	27.50
Adult	241	27.05
Teenager	95	10.66
Child	69	7.74
Senior	64	7.18

The table above shows the distribution of age categories among passengers aboard the Titanic. 'Young Adult' and 'Adult' are the most prevalent categories, together accounting for about 55% of all 891 passengers. The five categories sum to 714 passengers; the remaining 177, whose 'Age' is missing, receive no category. This engineered 'AgeCategory' feature simplifies the age data and provides a clearer perspective on the age demographics of the passengers.
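Where exactly a boundary age lands depends on how `pd.cut` treats bin edges. The sketch below reuses the bins and labels from the code above on a handful of hypothetical ages chosen to sit on or near the edges. With the default `right=True`, each bin includes its upper edge and excludes its lower edge, and a missing age stays missing.

```python
import pandas as pd

bins = [0, 12, 19, 30, 50, 100]
labels = ["Child", "Teenager", "Young Adult", "Adult", "Senior"]

# Hypothetical ages placed on and around the bin edges, plus one missing value.
ages = pd.Series([0.42, 12, 12.5, 19, 30, 50, 74, None])

# Default right=True: intervals are (0, 12], (12, 19], (19, 30], (30, 50], (50, 100].
categories = pd.cut(ages, bins=bins, labels=labels)

for age, category in zip(ages, categories):
    print(age, "->", category)
```

So a 12-year-old is a 'Child' and a 19-year-old is a 'Teenager'. Passing `right=False` would shift each boundary age into the next group up, which is worth deciding deliberately before comparing survival rates across groups.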

6. Feature Engineering: Extracting 'Title' from 'Name' in the Titanic Dataset using PivotPal

Overview: Utilizing a custom function and the PivotPal Python package, we've engineered a new feature, 'Title', extracted from the 'Name' column of the Titanic dataset.

Explanation: Names in the Titanic dataset contain titles that can provide insights into the social status, gender, and age group of passengers. By extracting these titles, we can categorize passengers into groups like 'Mr', 'Mrs', 'Master', 'Miss', and 'Other'. This engineered feature can be crucial for understanding patterns related to survival rates based on social status or demographics.


# Custom function to extract titles from the 'Name' column
def extract_specific_titles(name):
    # Names follow the pattern 'Surname, Title. Given names'
    title = name.split(',')[1].split('.')[0].strip()

    if title in ['Mr', 'Mrs', 'Master', 'Miss']:
        return title
    else:
        return 'Other'

df['Title'] = df['Name'].apply(extract_specific_titles)

# Using PivotPal to get a distribution of the newly engineered 'Title' feature
pp.distribution(df, 'Title')

Title	Count	%
Mr	517	58.02
Miss	182	20.43
Mrs	125	14.03
Master	40	4.49
Other	27	3.03

The table above showcases the distribution of titles among passengers aboard the Titanic. The most prevalent title is 'Mr', accounting for 58.02% of the dataset. The 'Other' category captures titles that are less common. This engineered 'Title' feature provides a new perspective on the passengers and can be instrumental in further analyses.
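The same extraction can also be written without a row-by-row `apply`, using pandas string methods. The sketch below is an alternative to the custom function above, not a replacement prescribed by PivotPal. The `names` series holds a few illustrative strings in the dataset's 'Surname, Title. Given names' format, and `raw_title` and `title` are names introduced here for illustration.

```python
import pandas as pd

# Illustrative names in the 'Surname, Title. Given names' format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Uruchurtu, Don. Manuel E",
])

# Capture the text between the first comma and the following period.
raw_title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Keep the four common titles; anything else (or no match) becomes 'Other'.
common_titles = ["Mr", "Mrs", "Master", "Miss"]
title = raw_title.where(raw_title.isin(common_titles), "Other")
print(title.tolist())
```

One difference in behaviour: a name with no comma raises an `IndexError` in the split-based function, whereas here it yields a missing `raw_title` and falls into 'Other'.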