Surviving The Titanic Disaster
In this article, we will explore the Titanic dataset and analyze the Titanic incident from every angle. In April 1912, the Titanic, the world’s biggest ship at the time, sank on its maiden voyage. In this guide, we will try to find out the reasons and investigate every element of the ship up close.
Table of Contents
- A closer look at the Titanic
- Conducting data inspection
- Understanding the data
A Closer Look At The Titanic
The Titanic was a British passenger ship that sank on its first voyage in the North Atlantic in April 1912. The tragic event, caused by striking an iceberg, resulted in more than 1,500 fatalities: the estimate by US officials was 1,517, and by the British investigating committee it was 1,503, out of the 2,224 crew and passengers on board. Most of the casualties were part of the crew, followed by third-class passengers.
How was this possible? The Titanic was considered an unsinkable vessel when it was built using state-of-the-art technology in the early 20th century. This confidence was a recipe for disaster. As we know, it did sink, as the contact with the iceberg damaged enough watertight compartments to compromise its integrity. The ship was originally designed to carry 48 lifeboats, but only 20 were present on board, and most of those were carrying less than 60% of their full capacity when they were lowered into the water.
The Titanic was 269 meters in length and had a maximum breadth of 28 meters. It had seven decks identified with letters from A to G: A and B were for first-class passengers, C was mostly reserved for the crew, and D to G were for second- and third-class passengers. It also had two additional decks: the boat deck, from where the boats were lowered into the water, and the Orlop deck, below the waterline. Although third-class and second-class amenities were not as luxurious and comfortable as those in first class, all classes had common leisure facilities, like a library, smoking rooms and even a gymnasium. Passengers could also use open-air or indoor promenade areas. The Titanic was advanced in terms of comfort and amenities compared to other liners of the era.
The Titanic started its voyage from Southampton and had two other stops scheduled: one in Cherbourg, France, and one in Queenstown, Ireland. The passengers were shuttled with special trains from London and Paris to Southampton and Cherbourg, respectively. The crew on the Titanic consisted of around 885 people for this first trip. The majority of the crew were not sailors but stewards, who took care of the passengers, and firemen, stokers and engineers, who were in charge of the ship's engines.
Conducting Data Inspection
The story of the Titanic is fascinating. For those interested in data
exploration, the data about the tragedy is also captivating. Let’s start with a
short introduction to the competition data. The dataset from the Titanic - Machine Learning from Disaster competition contains three CSV (comma-separated values) files, as do many Kaggle competitions that you will encounter:
- train.csv
- test.csv
- sample_submission.csv
We will start by loading these files into a new notebook. You learned how to
do this earlier, in the Basic capabilities section. You can also
create a notebook by forking one that already exists. In our case, we will
start a new notebook from scratch. Usually, notebooks start with a cell in which we import packages. We will do
the same here. In one of the next cells, we would like to read the train and test data. In general, the CSV files that you need are located at paths similar to the ones in this example.
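A minimal sketch of these first cells might look like the following; the ../input/titanic/ paths are an assumption based on the usual Kaggle input layout for this competition, so adjust them if your files are stored elsewhere.

```python
# Import the packages used throughout the notebook.
import pandas as pd

# Read the train and test data. The paths below assume the usual
# Kaggle input layout for this competition; adjust them if needed.
train_df = pd.read_csv("../input/titanic/train.csv")
test_df = pd.read_csv("../input/titanic/test.csv")
```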
After we load the data, we will manually inspect it, looking at what each column contains, that is, at samples of the data. We will do this for each file in the dataset, but for now we will mostly focus on the train and test files.
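For example, a quick way to carry out this visual inspection is to print the first few rows of each DataFrame; this is only a sketch, using the train_df and test_df names defined above.

```python
# Print a few sample rows of each file to see what every column contains.
print(train_df.head())
print(test_df.head())
```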
Understanding The Data
In Figures 1.1 and 1.2, we get a glimpse of a selection of values. From this
visual inspection, we can already see some characteristics of the data.
Let’s try to summarize them. The following columns are common to both the train and test files:
- PassengerId — A unique identifier for each passenger.
- Pclass — The class in which each passenger was traveling. We know from our background information that the possible values are 1, 2 or 3, so this can be considered a categorical data type. Because the order of the class conveys meaning, we can also treat it as ordinal or numerical.
- Name — This is a text field. It is the full name of the passenger, with their family name, first name and, in some cases, their name before marriage, as well as a nickname. It also contains their title regarding social class, background, profession or, in some cases, royalty.
- Sex — This is also a categorical field. We can assume that this was important information at the time, considering that women and children were prioritized for saving.
- Age — This is a numerical field. Age was also an important feature, since children were prioritized for saving.
- SibSp — This field gives the number of siblings or spouses of each passenger. It is an indicator of the size of the family or group with which the passenger was traveling. This is important information, since we can safely assume that one would not board a lifeboat without their brothers, sisters or partner.
- Parch — This is the number of parents for child passengers, or children for parent passengers. Considering that parents would wait for all their children before boarding a lifeboat, this is also an important feature. Together with SibSp, Parch can be used to calculate the size of the family for each passenger (see the sketch after this list).
- Ticket — This is a code associated with the ticket. It is an alphanumerical field, neither categorical nor numerical.
- Fare — This is a numerical field. From the sample we see, we can observe that Fare values varied considerably (with one order of magnitude between class 3 and class 1), but we can also see that some passengers in the same class had quite different Fare values.
- Cabin — This is an alphanumerical field. From the small sample in Figures 1.1 and 1.2, we can see that some of the values are missing. In other cases, there are multiple cabins reserved for the same passenger, presumably a well-to-do passenger traveling with their family. The name of a cabin starts with a letter (C, D, E or F in this sample). We remember that there were multiple decks on the Titanic, so we can guess that the letter represents the deck and is followed by the cabin number on that deck.
- Embarked — This is a categorical field. In the sample here, we only see the letters C, S and Q, and we already know that the Titanic started from Southampton and had a stop at Cherbourg, France, and one at Queenstown (today called Cobh, the port for Cork), Ireland. We can infer that S stands for Southampton (the starting port), C stands for Cherbourg and Q for Queenstown.
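As a rough illustration of the points made above for SibSp, Parch and Cabin, the sketch below derives a family-size value and a deck letter; the FamilySize and Deck column names are our own illustrative choices, not part of the original dataset.

```python
# Derive illustrative features on both files.
for df in (train_df, test_df):
    # Family size: siblings/spouse + parents/children + the passenger themselves.
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    # A guess at the deck: the first letter of the Cabin value, where present.
    df["Deck"] = df["Cabin"].str[0]
```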
The train file contains a Survived field as well, which is the target feature.
This has a value of either 1 or 0, where 1 means the passenger survived and 0 means they sadly didn’t.
The test file does not include the target feature, as you can see in the
following sample.
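If the description above holds, the only column present in the train file but missing from the test file should be Survived; a quick sanity check, again using the train_df and test_df names from the loading step, might look like this:

```python
# Columns that appear in the train file but not in the test file.
print(set(train_df.columns) - set(test_df.columns))

# Distribution of the target values in the train file.
print(train_df["Survived"].value_counts())
```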
Once we have had a look at the columns in the train and test files, we can continue with a few additional checks to find the dimensions of the datasets and the feature distribution:
- Check the shape of each dataset, train_df and test_df, using the shape attribute. This will give us the dimensions of the train and test files, that is, the number of rows and columns.
- Run the info() method for each dataset. This will give us more detailed information, such as the number of non-null values per column and the type of the data.
- Run the describe() method for each dataset. This only applies to numerical data and will produce statistics on the data distribution, including the minimum, maximum and the 25%, 50% and 75% percentile values, as well as the mean and the standard deviation.
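Put together, these checks could be run as follows; this is a minimal sketch, again assuming the train_df and test_df names from the loading step.

```python
# Dimensions of each file as (number of rows, number of columns).
print(train_df.shape)
print(test_df.shape)

# Column types and the count of non-null values per column.
train_df.info()
test_df.info()

# Summary statistics for the numerical columns: count, mean, standard
# deviation, minimum, 25%, 50%, 75% percentiles and maximum.
print(train_df.describe())
print(test_df.describe())
```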
The preceding checks give us preliminary information on the data
distribution for the numerical values in the train and test datasets. We can
continue later in our analysis with more sophisticated and detailed tools, but
for now, you may consider these steps a general preliminary approach for
investigating any tabular dataset that you get your hands on.
Conclusion
Finally, in this section, besides outlining the competition, we introduced a systematic approach to exploratory data analysis and applied it to get familiar with the data, understand it in more detail and extract useful insights. We also gave a short introduction to the process of using the results of data analysis to build model training pipelines. Before diving into the actual data, it is useful to understand the context and, ideally, to define the possible objectives of the analysis.